7 February 2017

Lightbulb Moment with Bluemix


I have seen the light.  Yes this one.  I live in a dark street so an outside light is essential on the front of the house. This is my light. It is a basic light without any motion detection or anything clever. I simply leave it on 24 hours a day during the winter and turn it off 24 hours a day during the summer. It works but it is not very efficient. Therefore this month I finally relented and bought myself a smart lightbulb.  I chose a Hive Active Bulb as I already have a Hive central heating controller and therefore didn't need to buy another smart hub for my router.

The Hive system provides a simple scheduler for turning the light on and off at specific times. But the reason I went ‘smart’ was so that I could come up with a more sophisticated way of turning the light on and off, specifically by linking it to sunrise and sunset times.


If this then that

I did some research and found that IFTTT.com can be used to control Hive Active bulbs (and others such as the Philips Hue).  Even better there is a sunrise/sunset IFTTT event trigger provided by Weather Underground.

I wired up the IFTTT Weather Underground applets and was happy that it worked right away. However there was a slight annoyance.  Not a major annoyance I grant you but it still bugged me.  The problem was that the event that triggers the light turning on and off was often delayed. I then spotted the small print on the IFTTT website:


Admittedly the events were only 5-20 minutes delayed rather than an hour but I thought I could do better…

Custom Sunrise/Sunset Trigger


Enter IFTTTSunTimes.mybluemix.net!

Long story short I wrote my own sunrise/sunset trigger that is accurate to the nearest minute of a sunrise or sunset and hosted it on IBM Bluemix.

Check out my trigger app here:
http://iftttsuntimes.mybluemix.net


Techie Bit

As with a lot of Internet of Things based solutions there is a bit of a daisy chain of elements used to actually control the light!


For the techies out there I am using a NodeJS app hosted on IBM Bluemix.  I chose NodeJS as I knew that there are a *lot* of ready built packages of function out there.  In this case there were ready built packages for providing things such as a web API (Express), calculating sunrise/sunset times (Suncalc), integrating with a database (Cloudant), scheduling a trigger (Node-schedule), encryption of Maker keys in the database (Crypto-js) and security (Helmet). The app simply sets timers based on calculated sunrise and sunset times that then call a web service at sunrise and sunset to control the light. I still rely on IFTTT but instead of using the WU trigger I use the custom ‘Maker’ trigger instead. This allows me to trigger the light when I want. Where the Weather Underground trigger can be delayed up to an hour, my custom trigger is triggering the events consistently within a minute.
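The timer logic described above can be sketched in a few lines. This is only an illustration in Python rather than the NodeJS the app is written in; it assumes the sunrise and sunset times have already been calculated elsewhere, and the function names, event names and key are hypothetical. The URL pattern is the one used by the IFTTT Maker (Webhooks) service.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote

# IFTTT Maker (Webhooks) trigger URL pattern; event name and key are supplied by the caller.
MAKER_URL = "https://maker.ifttt.com/trigger/{event}/with/key/{key}"


def next_trigger(now, sunrise, sunset):
    """Return (event_name, seconds_to_wait) for the next sun event after `now`.

    `sunrise` and `sunset` are today's times, assumed to be calculated
    elsewhere (for example by a sun position library).
    """
    candidates = [("sunrise", sunrise), ("sunset", sunset)]
    # If an event has already passed today, approximate tomorrow's by adding 24 hours.
    upcoming = [
        (name, when if when > now else when + timedelta(days=1))
        for name, when in candidates
    ]
    name, when = min(upcoming, key=lambda pair: pair[1])
    return name, (when - now).total_seconds()


def maker_url(event, key):
    """Build the web service URL that is called when the timer fires."""
    return MAKER_URL.format(event=quote(event), key=quote(key))
```

With sunrise at 07:30 and sunset at 17:00, a call at midday returns the sunset event with a wait of 18,000 seconds. A real scheduler would recalculate the times each day rather than adding 24 hours.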

It turned out to be quite easy to write the application, but as I have said in other posts, the non-functional elements of the solution took up significantly more time than getting the core functionality to work.  Take a look at the pie chart below where I have estimated the time it took to do each feature of the application:


As you can see about 95% of the effort was handling the non-core functional activities.  Some of these activities I had never done before so there was an element of learning time and some trial and error but still it does go to show that the bits of a solution that you can see really are only the tip of the iceberg.

The IFTTTSunTimes trigger is not just for Hive lights; it will work with any IFTTT applet that you want to trigger based on the sunrise and sunset times.



I have joked that this is the most over engineered lightbulb application in existence, but it has been fun!

10 June 2016

Running Containers in a Secure Environment

It is common to see container demos or videos that demonstrate how quick and easy it is to take a container from a public registry, extend it with custom stuff and deploy it in multiple locations. A wonderful step forward in our industry to be sure. However, just as with the technology advances that came before, it is important not to get carried away and forget about the complexity that non-functional requirements bring to a solution. Often the non-functional requirements completely change the way a solution works due to the constraints that they bring. This article is going to focus on one such non-functional requirement: Security Vulnerability Management.

As you will see, this one requirement modifies the original functional solution significantly. Security vulnerability management is only one of the non-functional ‘lenses’ that an enterprise IT system needs to be reviewed against to ensure it works. See Figure 1 below for some other lenses that need to be applied to the solution to validate that it is fit for purpose.


Figure 1: Enterprise IT Non-Functional Lenses

Consider the simple use case: deployment of a container. This is the example that is seen time and time again in videos and demos, the archetypal pattern for containers. In an environment where application development and hosting are separate functions (i.e. most of the time) the process might look similar to that shown in Figure 2 below.


Figure 2: Container Deployment

The application development team is responsible for packaging up the container and the operating team is responsible for installing it on a host (or cloud platform) and, typically, managing it to achieve and retain certain service levels. The difference between the demonstration and the real world deployment is that the non-functional requirements are an absolute reality. Not only a reality, but often non-negotiable in a mature IT organisation that has experienced what happens when things go bad! The next section looks in a little more detail at what security controls might be needed to ensure that the deployed containers don’t compromise the security of the IT estate.


Container Security

Container security is mostly focussed on spotting, containing and managing security vulnerabilities. Vulnerabilities come from a variety of sources:
  • Vulnerabilities in containers we make ourselves
  • Vulnerabilities in containers we reuse from external sources
  • Vulnerabilities in the underlying host operating system
It is a common misconception that containers are inherently secure because security is built in. It is true that containers at the application level appear to be more secure as by default containers don’t open any connection ports thereby reducing the attack surface of the container. However, containers (at the time of writing) need to run at root level privileges on the host machine and there is currently very limited protection to stop a malicious container from exploiting this feature to its advantage.

One of the differences between virtual machines (VM) and containers is that VM technology has matured to a level where VM separation can be protected (though not ensured) by hardware enforced separation. In other words, computer CPUs contain instruction sets to enable VM separation and greatly reduce the attack surface between VMs on the same physical host. Containers do not have this level of maturity yet. See the section below on deployment options for more discussion on this topic.


Immutability

One of the features of containers is that once you have built them they are immutable. This means that you can move containers between environments and, except for a few environment variables, they do not need to change. This immutability means that once a container has been tested to work on the development environment then it can be moved to other environments with a higher chance of it working. There are still the standard risks of moving between environments such as:
  • External dependencies — things that the container is dependent on but not inside the container (e.g. an API) are installed at different versions between environments
  • Global settings in the environment having a material difference and causing different behaviour (e.g. OS level settings such as memory or disk access methods)
  • Data differences — typically in the areas such as reference data or interface data
Despite the above risks, containers are easier to move between environments than with previous methods.

The core concept of container immutability is that once a container is composed it must always be treated as a locked box. A container must never, ever be modified in any place other than where it was originally built. This integrity is key to container portability and can be validated via container signing. Signing is a whole different topic and not one that will be explored further here.

The immutability is an advantage from a functional perspective but it is a very large disadvantage from a security management perspective. Imagine the case where a security vulnerability is identified in one of the components inside a production deployed container. Rather than conducting the cardinal sin of patching the deployed container in situ, the container contents need to be patched at source by the application development team.

Once the patching of the container contents is completed, the container needs to be rebuilt, tested and then re-deployed. Finally, the system as a whole needs to be regression tested in a staging environment to make sure something didn’t break. This is where an automated build pipeline helps significantly as it will take a lot of the effort out of this process.

The conclusion here is that immutable containers, in a perfect world, will be frequently refreshed from source in order to make sure that the latest patches are included in the live containers. In reality it is unlikely that there would be the need or desire to do this refresh except when there is a direct need, for example when a security vulnerability has been found in a production component. Therefore it is very important that live running containers as well as development containers are frequently checked for new vulnerabilities, as new vulnerabilities may have been discovered (and patched) after the container was deployed to production.
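To make the point about rechecking concrete, here is a minimal sketch of such a check. It assumes each container has a recorded inventory of components and versions and that advisories arrive as a simple set of vulnerable (component, version) pairs; both structures and the function name are hypothetical stand-ins for a real scanner and vulnerability feed.

```python
def find_vulnerable(containers, advisories):
    """Return {container_name: [(component, version), ...]} for affected containers.

    `containers` maps a container name to its {component: version} inventory.
    `advisories` is a set of (component, version) pairs known to be vulnerable.
    Both structures are hypothetical stand-ins for a real scanner and feed.
    """
    affected = {}
    for name, inventory in containers.items():
        hits = sorted(
            (component, version)
            for component, version in inventory.items()
            if (component, version) in advisories
        )
        if hits:
            affected[name] = hits
    return affected
```

The container inventories never change because the containers are immutable, but the advisory set grows over time. Running the same check again after a new advisory is published can therefore flag a container that was clean on the day it was deployed.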


Public Containers

The other source of security risks is the re-use of containers from public container registries. Public container repositories are a very important source of innovation to the code base of systems. They allow developers to share and reuse to improve productivity and reduce functional risk. However public containers are also an excellent source of security vulnerabilities.

In order to provide an element of control it is worth considering adding a vetting process to create a trusted source repository of security checked ‘parent’ containers. The trusted source would be the master repository for deployments to controlled environments rather than the public registry. This process would provide an element of control for security but it is recognised that it can constrain the velocity of a project being delivered via agile methods. Where the middle ground is will depend on the organisation’s Architecture Entropy level (see further reading).


Container Deployment Process

Applying the suggested modifications to the process shown in Figure 2, of course, adds in complexity. The core pattern of deployment still exists but the increased controls and process dwarf the original pattern. This is illustrated in Figure 3 below.

Remember this complexity arose from just applying one non-functional lens to the deployment. Many of the other non-functional lenses will add their own complexity to the process and may fundamentally change the pattern due to the constraints the non-functional requirements bring.


Figure 3: Secure Container Deployment Process

The next section goes on to explain the deployment rules and policy aspects of determining the most appropriate hosting locations for containers.

Deployment Options

Container deployment requires careful management and control. This might be:
  • To apply organisation policy to enforce a delivery model — for example supplier X is not allowed to deploy in environment Y
  • More granular control — for example containers with data storage are not allowed to be deployed on a host situated in a DMZ network zone
  • A combination
Regardless of the situation there needs to be some policy and control. The deployment is all about balancing security and flexibility. For instance Figure 4 shows perhaps the most secure container deployment option. Clearly this is taken to the extreme but the policy decision was that no cohabitation between containers is allowed on the same host. The physical separation enforces security but also removes all of the benefits of containers.


Figure 4: High Security Container Deployment

Going to the other extreme it is feasible that all containers could be deployed on any node in any combination. In other words there are no constraints on cohabitation. Figure 5 shows this type of flexible configuration. Although it is an extreme example it is highly likely that there are a number of deployments in the real world that look like this in production. This is not a good idea for a secure system as can be seen by the variety of security threats at each level of the stack (Figure 7).


Figure 5: Highly Flexible Container Deployment

The reality is that there will be a compromise between security separation and the need for flexibility. The cohabitation rules will likely be different for different environments depending on the security levels that environment is running. The good news is that a container shouldn’t care what host it is deployed on as long as there is a network path to the containers and resources that it needs to communicate with.

To determine the rules it is important to understand, for each security level, what constraints are in place for splitting virtualisation layers across different boundaries. For instance can VMs (or container hosts) in different network zones be on the same physical node? If so, at what security classification levels is this allowed?
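Rules of this kind lend themselves to being written down as code so that they can be checked before every deployment. The sketch below encodes the two example policies mentioned earlier (a supplier barred from an environment, and no data storage containers in the DMZ). The class names, fields and the `blocked_suppliers` structure are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    name: str
    supplier: str
    stores_data: bool


@dataclass(frozen=True)
class Host:
    name: str
    environment: str
    network_zone: str


def placement_violations(container, host, blocked_suppliers):
    """Return a list of reasons why `container` may not run on `host`.

    `blocked_suppliers` maps a supplier name to the set of environments
    that supplier is not allowed to deploy into (a hypothetical policy table).
    """
    reasons = []
    if host.environment in blocked_suppliers.get(container.supplier, set()):
        reasons.append(
            f"supplier {container.supplier} may not deploy in {host.environment}"
        )
    if container.stores_data and host.network_zone == "DMZ":
        reasons.append("containers with data storage may not run in the DMZ")
    return reasons
```

An empty list means the placement is allowed. Returning the reasons rather than a simple yes or no makes it easier to explain a refused deployment to the team that requested it.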


Figure 6: Hybrid Container Deployment


Figure 7: Security Levels

Differentiating Container Security Levels

Containers themselves are immutable but their security status is constantly changing. Unchecked containers are validated, previously ok containers are found to have vulnerabilities and problems are fixed.

To keep track of the security position it is required to allocate a security status to the containers. It is proposed that the following levels are used:
  • Grey — unknown
  • Blue — validated public
  • Red — ready for production verification
  • Amber — formerly suitable for production but currently ‘in doubt’
  • Green — suitable for production operation

The relationships between the statuses are shown in Figure 8 below. The status is not a static concept and therefore the containers must be continuously validated.


Figure 8: Container Security Status Levels
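The status levels can be modelled as a small state machine. The five statuses below come from the list above, but the allowed transitions are my assumption of a plausible flow and are not taken from Figure 8, which remains the authority. The function names are hypothetical.

```python
from enum import Enum


class Status(Enum):
    GREY = "unknown"
    BLUE = "validated public"
    RED = "ready for production verification"
    AMBER = "formerly suitable for production but currently in doubt"
    GREEN = "suitable for production operation"


# Assumed transitions for illustration only; Figure 8 defines the real ones.
ALLOWED = {
    Status.GREY: {Status.BLUE, Status.RED},
    Status.BLUE: {Status.RED},
    Status.RED: {Status.GREEN},
    Status.GREEN: {Status.AMBER},
    Status.AMBER: {Status.GREEN, Status.RED},
}


def change_status(current, new):
    """Return `new` if the move is allowed, otherwise raise ValueError."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {new.name}")
    return new


def may_run_in_production(status):
    """Only green containers are suitable for production operation."""
    return status is Status.GREEN
```

Rejecting unknown transitions means a container cannot jump straight from grey to green without passing through verification.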

Security Management Process

The continuous security testing performed as part of the validation process will be altering the security state of containers. The larger the estate, the higher the frequency of the change. It is important to put in place a container security management process to keep on top of the security problems in the estate.

A very high level process is shown in Figure 9 below. The concept is that there is a continuous container security management capability sitting between the application development and the operations teams. The security management capability is responsible for ensuring that only “green” containers are running in production and that any containers that go “amber” are fixed as soon as possible.

The security management capability’s workload is helped by instrumentation that is mandated to be added to the containers. The inclusion of the instrumentation is checked in the continuous security testing process and only containers with the instrumentation implemented correctly will be allowed to go “green”.

The instrumentation helps by automatically sending callback notifications to a central deployment register. These callbacks are sent by the containers themselves at key points:
  • When a container is deployed to a node
  • When a container is started (or stopped) on a node
  • When a container is suspended on a node
The callbacks allow the deployment register, a form of CMDB, to keep up to date on what containers are deployed where. More importantly it also tracks when containers are started, stopped and suspended. This latter part is important for policy enforcement to ensure that requests to suspend or stop a container found to have a problem with it have been complied with.
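A minimal in-memory sketch of such a deployment register is shown below. The event names mirror the callback points listed above; the class, its methods and the idea of holding stop requests in a set are assumptions for illustration, and a real register would be a persistent, networked service.

```python
EVENTS = {"deployed", "started", "stopped", "suspended"}


class DeploymentRegister:
    """Keeps the last reported state of each container on each node."""

    def __init__(self):
        self._state = {}             # (container, node) -> last reported event
        self._stop_requests = set()  # containers that must not be running

    def callback(self, container, node, event):
        """Record a callback sent by a container's instrumentation."""
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        self._state[(container, node)] = event

    def request_stop(self, container):
        """Note that every instance of `container` should be stopped or suspended."""
        self._stop_requests.add(container)

    def non_compliant(self):
        """Return (container, node) pairs still running despite a stop request."""
        return sorted(
            key
            for key, event in self._state.items()
            if key[0] in self._stop_requests and event == "started"
        )
```

After a stop request, anything still listed by `non_compliant()` is an instance that has not yet reported being stopped or suspended, which is exactly the policy enforcement check described above.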



Figure 9: Container Security Management Process

Conclusion

Pulling together all of the threads mentioned in this paper makes it possible to start to understand the end-to-end aspects of secure container management. Of course, containers still need hosts to run on and so similar processes are required to manage and check the host operating systems. It is interesting to see how much complexity is added when looking through a single non-functional lens. Imagine the complexity of a design when all of the non-functional lenses have been applied to it. This is one of the reasons why enterprise IT is actually very hard to do well!

Do containers simplify everything? Not really when you look at the big picture. They certainly help simplify certain areas and of course create new areas of complexity. I am watching with interest to see if “Containers as a Service” (CaaS) takes off in the market. A CaaS host will worry about a lot of this for you. Hopefully!

Further Reading

  1. NCC Group whitepaper: “Understanding and Hardening Linux Containers”; 29 June 2016
  2. Chris Milsted: “Patching, Anti-Virus and Configuration management when adopting docker format containers”; April 2016
  3. Simon Greig blog: “Architecture Entropy”; November 2015

11 April 2016

Micro Services Will Not Change the World!

Micro services will not change the world, just as the tech trends that came before them (SOA, object oriented, object brokers, XML, client/server, etc.) didn't. That said, I think that, like those earlier trends, micro services will make an incremental improvement to how we design and build software and systems. They just won't fundamentally change it.  I don’t have anything against micro services, in fact I like the model, but this is my two pence worth.

At the moment there is a lot of talk about micro services being the answer to all known problems in IT including ‘fixing’ big, complex, hard to change enterprise systems (hereafter referred to as 'monoliths'). In my eyes these monoliths are in fact made up of micro services, except the services are called modules, packages, components and scripts.  What really makes a monolith is the fact that all of those modules, packages, components and scripts are written, integrated and tested by a single system integrator supplier in a typically opaque manner.

I suggest that the fix to the monolith is not the adoption of micro services but a change to the way the system is contracted. However, it is worth noting that those single contract delivery models, although seen as cumbersome, do provide a lot of intangible benefits.  The largest benefit is the ability to delegate making sure that the non-functional aspects of the system (performance, capacity, stability, guaranteed service levels, etc) are covered and put measurable targets in place to get a consistent level of service.


Service Contracts

Micro services will provide the ability to make changes quickly and, if the overall architecture is supportive, those changes will be isolated to small areas of the system thereby limiting the testing.  Right?

A number of years ago I worked on a large integrated system with tricky NFRs around performance and availability.  The system was based on a services oriented architecture and designed to be flexible.  There is a strong argument that says that micro services are just an evolution of SOA and that the same principles apply.  I tend to agree.  In this system we could process a change request and make the technical change in 5 minutes.  Great eh?  However it would take us up to 100 days effort to regression test and deploy a change such as this – much to everyone’s annoyance (mine included).

The problem was not that we were poor at testing; it was because the contract included service penalties of up to £1,000,000 a month if performance and availability of the system did not meet the contracted targets.  Numbers like that focus the mind and drive a risk averse behaviour when it comes to implementing changes!

Where there are risks such as the service penalties then the mitigation is to add rigorous controls and processes. The consequence is that these change processes and testing regimes impact the speed and agility with which changes can be implemented in the live environment. Couple that with industry regulation, third party assessments and licensing and it makes changing any live and business (or safety or national security) critical IT system, no matter how small the change, quite a risky undertaking.

What does this have to do with micro services?

The system referred to above was built around an SOA architecture and when we do a traditional SOA based design we aim for coarse-grained abstract services that encapsulate the complexity of the back end.  Micro services are similar but tend to be much finer grained and therefore there are going to be many more micro services in a system compared to the equivalent in an SOA platform.

In other words, micro services generate more moving parts.  Of course this code logic needs to exist regardless of it being exposed as a micro service or not but the point of the micro service is that it is visible, flexible and reusable. Therefore, if a code function is moved from being wrapped and protected by its surroundings to being visible and usable in a variety of ways, this exposure creates moving parts. As every engineer knows, the more moving parts a system has, the more fragile a system will get.

It is this fragility that will mean that the unbounded flexibility promised by micro services is at risk of not being realised.  Where fragility exists, then change uncertainty and change risk start to materialise.  The obvious mitigation to this risk is to up the amount of testing. Automated testing will help but there will still need to be an element of manual testing to mitigate the risks in high stakes systems.

Additionally, because micro services create a loosely coupled environment with potentially many alternate paths the change impacts are less understood. This is because the interactions between services are not always evident and in some cases not consistent. This uncertainty is amplified in an environment where changes are split between multiple suppliers and/or contract boundaries.  Therefore, as we saw in SOA, it will be possible to make changes quickly but will still take time and effort to assure all of the stakeholders and service management teams that those changes didn’t break something important.

I still think micro services are good however!

What micro services have done, like object oriented design, SOA and others before them, is formalise leading application architecture thinking into simple patterns that everyone can understand, debate and share. Micro services also push the envelope and increase the significance and importance of good APIs. By 'good' I mean usable and reasonably static in terms of change. API changes are the source of the most complex impact assessments. At the end of the day though, we are building complex IT systems and there is always going to be a level of risk generated from that complexity. No matter how simple it is to change the code, the higher the criticality of the system, the more rigorously the users and sponsors of the system will demand that changes are impact assessed properly and the risks are mitigated with sufficient testing.

Using DevOps means that the automation allows for small changes to be deployed more frequently, therefore reducing the risk of individual changes causing unknown outcomes and failures.  This is true but most organisations are not yet in the place where they can safely implement little and often changes with no manual test requirements.

The point I am trying to make is that technology by itself cannot be seen as the solution to all IT problems. The commercial and legislative aspects also need to be modified in order to achieve more agility. Some organisations are doing this by taking the integration responsibility back in house and essentially letting suppliers off service levels (because it is just too hard to identify root cause and therefore who is at fault).  I wouldn’t be surprised to hear these organisations claiming that it is micro services that have allowed them to be more agile when in fact it is more likely that the contractual goal posts have moved with their supply chain.

So I don’t think micro services will change the world.  Complex IT will stay complex and every time we create a new way of doing things with a new style, paradigm or technology we make it just a little bit more complicated!

5 November 2015

Architecture Entropy

A.k.a. Why Enterprise IT is never simple

I have been thinking about this theory for a while now but it has taken me some time to put it into words. There is a lot more thinking to be done but I thought I would share the current state of my thinking now rather than continue to mull over it for another 2–3 years. In a nutshell, architecture entropy attempts to define the complexity state of an enterprise architecture.

In thermodynamics, entropy is used as a measure of the disorder in a closed system: the higher the entropy value, the higher the disorder. This measurement of disorder can carry across to IT architecture. In thermodynamics, the entropy gain typically comes from an external force such as energy. In IT architecture the entropy gain comes from change.

Architecture entropy gain is the term I use to describe the slow design erosion away from a structured, governed and organised solution state towards a more disordered state as the architectural and structural integrity of the system is eroded.

The entropy gain typically comes from changes to the system. These changes, if not managed correctly, increase the architecture entropy level and therefore the level of disorder in the system. Disorder is bad in architecture as disorder drives cost.

All systems in a single organisation will eventually reach equilibrium at a similar level of entropy. Each organisation’s natural state of entropy will differ from organisation to organisation but it will always reflect the principles and attitudes of the overall organisation management.

The best way to manage entropy gain is to retain sound governance over the system and use that governance to measure and manage the complexity gains and consequences.

The ongoing governance and management of an IT estate is a complex problem and architecture entropy theory is attempting to put a name to that complexity in order to start the process of working out how to measure it. There is a lot more thinking required but I have started the process with this post.

What is “Architecture Entropy”?

The dictionary defines architecture as:

architecture
noun.
  1. The art and science of designing and erecting buildings.
  2. Buildings and other large structures
  3. A style and method of design and construction
  4. Orderly arrangement of parts; structure
  5. The overall design or structure of a computer system or microprocessor, including the hardware or software required to run it.
  6. Any of various disciplines concerned with the design or organization of complex systems

If we ignore the definitions in italics then we are left with a reasonable description of Enterprise Architecture.

The dictionary defines entropy as:

entropy 
noun. 
  1. Symbol S For a closed thermodynamic system, a quantitative measure of the amount of thermal energy not available to do work.
  2. A measure of the disorder or randomness in a closed system.
  3. A measure of the loss of information in a transmitted message.
  4. The tendency for all matter and energy in the universe to evolve toward a state of inert uniformity.
  5. Inevitable and steady deterioration of a system or society.


If we ignore the definitions in italics then we are left with a sense that entropy is the propensity for something to lean towards disorder rather than order.  Just like my desk at home!

Therefore, in dictionary terms, I define architecture entropy as:

architecture entropy
compound noun.
  1. A measure of the disorder in a computing system.
  2. The inevitable and steady deterioration of a computing system toward a state of disorder.

Architecture Entropy is a term used to describe the slow design erosion away from the structured, governed and organised towards a more disordered state. Regardless of how well designed a computer system is, it will be subjected to the laws of Architecture Entropy.

Typically, a well designed system will initially have a low entropy due to the structure and architecture of the solution.  However over time the system will be subjected to ‘entropy gain’ as the architectural and structural integrity of the system is eroded.

All systems in a single organisation will eventually reach equilibrium at a similar level of entropy. Each organisation’s natural state of entropy will differ from organisation to organisation but it will always reflect the principles and attitudes of the overall organisation management.

Architecture Entropy gain cannot be avoided but the levels of entropy gain can be minimised with appropriate governance and budgeting.

Example Architecture Entropy in Action

Consider this high level example based on real experiences. It is not based on one single enterprise but the concepts and outcomes are real.

The graphic below shows a snapshot of part of a complex enterprise estate.  It is not unusual to see many connections between many components.  This many-to-many connectivity leads to complexity and high cost to change.



Given this situation it is very common to consider the creation of a new integration bus. In the graphic below an enterprise service bus component has been added to provide simplification of connectivity.

Components D, G and H have been decommissioned and the overall vision architecture, compared to the original, is structured, organised, tidy, clean.  And expensive.
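The simplification that a bus promises can be put into rough numbers. The sketch below compares the worst case for direct point-to-point links with a design where each component connects only to the bus. The function names and the component count in the usage note are illustrative assumptions, not figures taken from the graphics.

```python
def point_to_point_links(n):
    """Maximum number of direct links between n components (a full mesh)."""
    return n * (n - 1) // 2


def bus_links(n):
    """Links needed when every component connects only to the bus."""
    return n
```

For eight components the worst case is 28 direct links against 8 links through the bus, and the gap widens as the estate grows. Real estates are rarely a full mesh, so treat this as an upper bound rather than a measurement.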



The executives are sold on the vision and delivery starts.  However, during delivery it becomes hard to justify altering legacy systems that have been running for years without issue.  In addition, some connections are rationalised but others remain for operational reasons.

As with every delivery, there are short term pressures to deliver some benefit early so an interim ‘transition architecture’ is developed to provide earlier benefit. The transition architecture is complex but a later release will ‘tidy things up’.  Eventually connections that bypass the ESB are re-established because they are quicker and cheaper in the short term. The transition architecture ends up looking like the graphic below.



The outcome of all of this was:

  • The plan was to give the business what was needed as soon as possible and then tidy up the IT in the next release.  The cost of later releases couldn’t be justified and so didn’t happen.
  • The additional IT complexity increased downstream costs and therefore “quicker” and “cheaper” alternatives to following the strategy were championed by the funding stakeholders.
  • The plan was based on rationalising and decommissioning legacy systems.  However it was discovered late on that there were many more dependencies on the legacy systems and so it was determined to be too costly to decommission all of the legacy systems.
  • The short term “tactical solution” that was only intended to be live for a few months is now many years old and requires a lot of effort to keep it running.

The result was that the enterprise estate remained complex and expensive.  Sound familiar?

Consequences of Entropy Gain

Entropy gain is directly linked to an increase in costs.  The higher the entropy gain, the higher the overall architecture entropy and the higher the architecture’s relative operational costs.

The graphic below shows the typical entropy gain causes.



At the end of the day the costs need to be balanced but there is a tension between the priorities indicated in the graphic below.  The enterprise dilemma is which one or two to focus on because it is impossible to have all three.



Low cost to operate:

  • Impact on change costs: Potential inflexibility due to the run costs being optimised around the ‘go live’ state of the system
  • Impact on build costs: Increased levels of automation that requires additional design, build and test effort

Low cost to change:

  • Impact on operate costs: An increase of overall system complexity to accommodate the flexibility features
  • Impact on build costs: Extra effort to design, build and (in particular) test the flexibility features

Low cost to build:

  • Impact on operate costs: Risk of overall system fragility if “low cost” means “corners were cut” or elements of the system were left to be performed manually
  • Impact on change costs: Possibility of functional duplication because it was cheaper to ‘copy and paste’ function than to share and reuse existing function, which increases the cost to change

The end goal is to reach what I call architectural equilibrium: the point where the architectural integrity of a system or enterprise is in balance with the costs.  Achievement of this goal is incredibly hard and arguably one of the holy grails of IT.  However, we should not give up trying to balance the two as best we can.



What can we do about Architecture Entropy?

The level of “entropy gain” is variable.  Many factors determine the level of “entropy gain” of a system:

  • Strength of technical governance
  • Size of the general investment budget
  • Business’s attitude to the complexities of enterprise IT
  • Organisational preference for ‘tactical’ vs ‘strategic’
  • The ‘background level’ of complexity already inherent in the IT estate

An amount of gain on every project is inevitable due to pressures on time and budget.  In fact, a small amount of gain may be beneficial to allow a system to reach equilibrium by taking some overall cost out for very little impact.

The amount of gain and downstream impact can be minimised with appropriate governance and management. Ultimately it is the IT department’s relationship with the business stakeholders that determines the entropy levels.

I see three steps to keep entropy in check:

  1. Measure
  2. Manage
  3. Minimise

Measure
The simplest way to measure entropy gain is to focus on the downstream costs of a particular change.  Don’t just focus the business case on the cost to implement; look also at a portfolio of common business change scenarios and the 5 year cost. Research the actual long term ‘lights on cost’ that the enterprise has accrued over time.
In addition, when comparing solution options, and in particular when weighing ‘tactical’ vs ‘strategic’, consider the average annual cost rather than the upfront cost.
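
One way to make that comparison concrete is to spread each option's build cost, run cost and change cost over the evaluation period. The sketch below does this for two made-up options. The figures, the five year horizon and the names `Option` and `average_annual_cost` are illustrative assumptions, not data from a real business case.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    build_cost: float          # one-off cost to implement
    annual_run_cost: float     # yearly 'lights on' cost
    annual_change_cost: float  # yearly cost of typical business changes


def average_annual_cost(option: Option, years: int = 5) -> float:
    """Total cost over the period divided by the number of years."""
    recurring = option.annual_run_cost + option.annual_change_cost
    total = option.build_cost + years * recurring
    return total / years


# Hypothetical figures, for illustration only.
tactical = Option("tactical", build_cost=100_000,
                  annual_run_cost=60_000, annual_change_cost=40_000)
strategic = Option("strategic", build_cost=300_000,
                   annual_run_cost=30_000, annual_change_cost=20_000)

for option in (tactical, strategic):
    print(f"{option.name}: {average_annual_cost(option):,.0f} per year")
```

With these made-up numbers the tactical option costs a third as much to build but works out at 120,000 per year over five years, against 110,000 per year for the strategic option. The point is not the specific figures but that the ranking can flip once the downstream costs are included.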

Manage
A few considerations of how to manage entropy gain:

  • Strengthen governance of system change to minimise the risk of short term changes causing long term costs.
  • Create a change checklist to ensure that solution designers are considering the full life cycle changes.
  • Keep focus on the cost case for the solution.
  • Tightly manage deviations and exceptions from the solution architecture as if the system were being created from new.

Minimise
A few suggestions on how to minimise entropy gain:

  • Make sure that each solution release provides value to the business and is not ‘just’ IT benefit
  • Use established facts based on history and current costs
  • Use ‘tactical solutions’ with caution
  • Have a strong exit plan to get off the tactical solution
  • Calculate the full lifecycle costs of the tactical solution
  • Overall though, be pragmatic!
  • Every solution has an equilibrium point where the balance between the architecture purity and the overall costs is met

Conclusion
Aiming for low entropy is a good thing.  To do this we need to create strong business and technical governance that looks at the full lifecycle design and total cost of ownership when making decisions.  There will always be exceptions and short term urgency, so there needs to be a managed exception process so that exceptions to the standards can be granted with managed consequences.

Conversely, slipping into a high entropy state is a bad thing.  The consequences are that medium to long term operational costs increase and it becomes incrementally slower and more expensive to change systems.  When the entropy gain gets out of hand there is a real risk of fragility in the enterprise as systems get more and more unstable.  Finally, the higher the entropy gain, the more it costs to ‘keep the lights on’ in the data centre.

To summarise:

  • Architecture Entropy will always exist
  • Nothing can be done to prevent entropy gain
  • Awareness of the existence of Architecture Entropy should help to minimise entropy gain
  • Invest effort to measure the impacts of decisions, especially in the longer term
  • Use the measurements to manage better outcomes
  • Minimise short term behaviours that can negatively impact an enterprise’s Architecture Entropy



Most of this thinking is captured in the slide deck I have put onto SlideShare and embedded below.