Cloud sustainability guidance
The NHS has set several targets on carbon and environmental sustainability which need to be considered when choosing the type of technology deployment. Sustainability should also be considered when it comes to the ability to maintain and support a product or solution.
Carbon sustainability reporting is still very immature. Most organisations have data on what percentage of their power requirements are provided by renewable and carbon-neutral sources, but all other data on the carbon footprint associated with their hardware (scope 3) is currently an estimate. Industry reporting is expected to mature rapidly over the next couple of years, but making comparisons between solutions and vendors from a carbon footprint perspective is currently difficult.
Additionally, producing semiconductors is by its nature an energy and carbon-intensive process, albeit one that drives performance and efficiency improvements. This means there are substantial cross-industry challenges in making IT truly carbon neutral.
The practical approach to minimising carbon footprint comes from the efficient use of IT. While this does not reach the sustainable destination on its own, it minimises the (as yet unmeasurable) carbon footprint of services. It also minimises the cost of running those services and is captured by vendor best practice frameworks.
Sustainability, from the perspective of building services that remain maintainable, means meeting changing customer needs while keeping the cost to serve as low as possible. These goals are not new for IT teams, but the changes in delivery approach supported by public cloud adoption make them easier to achieve.
Sustainability targets
Overall, the priority around efficiency should be to ensure any service or infrastructure is only running when it is required, is sized appropriately for demand, and has a well-understood impact on client devices. Ideally, the energy cost of individual transactions will be measured.
There are a number of facets to understand when reducing the carbon footprint of services.
Understand the value of the service
When minimising carbon impact, the first thing to consider is what productivity improvement is delivered by the product or service and what overlaps there are with other products or services. Migration of a workload to the cloud should be preceded by an assessment that looks not only at how a move can happen technically, but also at why specific services should be migrated or whether they can be decommissioned. When identifying new use cases which require a technology solution, clear decisions should be made on whether to create new services or build features into existing ones.
Every product team should be able to quantify the carbon/sustainability cost of their service in relation to the quantifiable benefit it brings to the customer. This should be backed by KPIs which measure this impact, at the highest level in a format that compares carbon produced against effort saved.
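As an illustration of the kind of KPI this could produce, the sketch below compares estimated carbon produced against staff effort saved. It is a hypothetical example: the function name and input figures are invented, and real values would come from vendor carbon reporting and the product team's own benefits analysis.

```python
# Illustrative sketch only: all figures are hypothetical and would come from
# vendor carbon reporting tools and the team's own benefits analysis.

def carbon_per_hour_saved(monthly_transactions: int,
                          grams_co2e_per_transaction: float,
                          minutes_saved_per_transaction: float) -> float:
    """Return estimated grams of CO2e emitted per hour of staff effort saved."""
    total_co2e = monthly_transactions * grams_co2e_per_transaction
    hours_saved = monthly_transactions * minutes_saved_per_transaction / 60
    return total_co2e / hours_saved if hours_saved else float("inf")

# Example: 50,000 transactions a month, ~0.8 g CO2e each, 3 minutes saved each.
print(f"{carbon_per_hour_saved(50_000, 0.8, 3.0):.1f} g CO2e per hour saved")
```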
Minimise client-side overhead
Services should be built to have minimal client-side resource demand. This has multiple benefits: not only is energy usage reduced, but the hardware upgrade cycle for client devices is also extended. Where client-side processing is required, make use of caching techniques to avoid unnecessary traffic through energy-intensive edge network devices such as wireless access points, and where possible use common versions of frequently used client-side files. For example, common libraries such as jQuery provide their own CDN; when multiple products reference it, a single cached copy of the file can be reused. Where this isn't possible, consider sharing cached files between projects or organisations to minimise the storage needed client-side.
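One possible way to apply long-lived caching to shared client-side files is sketched below, assuming a Python Flask service (Flask is not mentioned in this guidance and is used purely for illustration; the route and directory are hypothetical). Versioned library files are served with a long Cache-Control lifetime so browsers reuse a local copy rather than re-downloading it over the edge network on every visit.

```python
# Minimal sketch, assuming a Flask application; the route and directory are
# hypothetical. Long-lived Cache-Control headers let browsers and intermediate
# caches reuse shared library files instead of re-fetching them.
from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route("/static/vendor/<path:filename>")
def vendor_asset(filename: str):
    response = send_from_directory("static/vendor", filename)
    # One year, immutable: versioned library files never change in place.
    response.headers["Cache-Control"] = "public, max-age=31536000, immutable"
    return response

if __name__ == "__main__":
    app.run()
```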
Maximise utilisation
Aim for a high level of utilisation; this can take a number of forms depending on the technology used. Where the product uses IaaS infrastructure, regularly review instance/machine utilisation and aim to keep it as high as practically possible (80% or above). This approach should be used in combination with auto-scaling; the ability for infrastructure to expand based on utilisation metrics is the fundamental approach to public cloud cost efficiency. To truly benefit from this approach, machine/instance startup time becomes an important metric: the higher the utilisation of running infrastructure, the shorter the startup time of new infrastructure needs to be to enable scaling as demand increases. If utilising containers, be aware there may be a need to scale underlying infrastructure before additional containers can be launched.
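As a sketch of how utilisation-driven auto-scaling can be configured, the example below assumes an AWS Auto Scaling group managed with boto3 (the guidance itself is vendor neutral, so the provider, group name and policy name are illustrative assumptions). A target-tracking policy adds instances as average CPU utilisation rises above the target and removes them as it falls.

```python
# Sketch only, assuming AWS and boto3; the group and policy names are
# hypothetical. A target-tracking policy keeps average CPU utilisation close
# to the chosen target, scaling out as demand grows and in as it falls.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-product-asg",       # hypothetical group name
    PolicyName="keep-cpu-near-80-percent",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 80.0,  # aim for the 80%+ utilisation discussed above
    },
)
```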
PaaS services usually come in one of two forms: either the consumer has control over the underlying infrastructure the service runs on (so instance sizing still needs to be considered), or the provider abstracts the infrastructure entirely and charges a per-transaction or per-unit-of-data cost. For the first class of services, the mechanisms to manage carbon are similar to IaaS services. In addition, for all types of PaaS services consider what data is being managed, how long it is being stored, and what backup or replication policy is in place. A service that is only required to support three nines availability (99.9%) may not need a multiple region/availability zone deployment.
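One way to put a retention policy into practice for stored data is sketched below, assuming AWS S3 and boto3 (the bucket name, prefix and retention periods are hypothetical). A lifecycle rule moves older objects to colder storage tiers and then expires them, so data is only kept, backed up and replicated for as long as the service actually needs it.

```python
# Sketch only, assuming AWS S3 and boto3; the bucket, prefix and retention
# periods are hypothetical and should reflect the service's real requirements.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-product-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "exports/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```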
FaaS or serverless deployments give the consumer far less control. The efficiency of the code deployed, and careful consideration of what is persisted, will have the biggest impact on sustainability. Equally, the time taken for functions to start, the latency of any operations they perform, and their capability to run in parallel become critical to the performance of a service. The cold start time of any FaaS solution will be relatively large (as infrastructure is being provisioned in the background); this startup time and the usage profile of the service need to be understood and regularly reviewed to determine how many copies of the functions to keep warm and ready for use.
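Where the usage profile justifies keeping some function copies warm, the sketch below shows one way this could be configured, assuming AWS Lambda and boto3 (the function name, alias and concurrency figure are hypothetical). The figure should be reviewed regularly against real demand, as over-provisioning warm capacity works against the efficiency goals above.

```python
# Sketch only, assuming AWS Lambda and boto3; the function name, alias and
# concurrency figure are hypothetical. Provisioned concurrency keeps a fixed
# number of execution environments warm to avoid cold starts.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_provisioned_concurrency_config(
    FunctionName="submit-referral",   # hypothetical function
    Qualifier="live",                 # alias or version to keep warm
    ProvisionedConcurrentExecutions=5,
)
```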
SaaS services are entirely the responsibility of the provider; however, it is reasonable for a customer to expect the provider to report on their sustainability footprint and to make decisions on the service with consideration to sustainability goals.
In addition to considering how to optimise the running product or service, care should be taken to ensure that only required infrastructure is running across the entire organisation. While the focus so far has been on the infrastructure supporting live service, most organisations will be running a number of instances of each product to support development, test, integration, and potentially business continuity. Substantial sustainability savings can be made by:
- stopping any environments which aren't being actively used. For most organisations this would likely mean anything outside of production only runs during office hours; ideally, environments should only run for the specific period they are being utilised
- sizing environments appropriately for the workload. Apart from performance test environments used for soak or capacity testing, environments will not need to handle production transaction volumes and so should be sized for their intended need
A challenge with a run-while-you-use policy is that it has the potential to tie up members of the product team in managing startup and shutdown, especially where there are a number of integrations with different products. Automation should be used to simplify this process and provide the capability for users of the test environments to trigger their own startup/shutdown. As teams become more mature it is also worth providing stubs and harnesses to allow dependent teams to test without the need for a full instance of the product.
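A minimal sketch of this kind of automation is shown below, assuming AWS EC2 and boto3, with non-production environments identified by a hypothetical "environment" tag. Run on a schedule (for example at the end of the working day, or triggered by the team once testing finishes), it stops anything outside production that is still running.

```python
# Sketch only, assuming AWS EC2 and boto3; the "environment" tag and its
# values are hypothetical. Stops any non-production instances still running.
import boto3

ec2 = boto3.client("ec2")

def stop_non_production_instances() -> None:
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:environment", "Values": ["dev", "test", "integration"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for page in pages
        for reservation in page["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)

if __name__ == "__main__":
    stop_non_production_instances()
```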
Equally, there can be a sustainability win from re-evaluating and potentially renegotiating SLAs (service level agreements). For example, in a world where no disruption to clients is allowed, capacity always has to be available to deal with failure. Smart use of infrastructure can minimise this idle overhead but it will always be present. Changing the SLA to tolerate a small disruption to clients (measured in seconds) can allow the solution to run with little to no spare capacity, with auto-scaling used to manage growth or failure; the speed of instance startup and the metrics used to manage auto-scaling then become critical to the service. This will however require discussion with stakeholders and may not be achievable for every service.
Maintainability
A fundamental requirement for strong maintainability is the use of automation to remove or reduce the level of team effort required to make changes. The less time a team has to dedicate to validating and deploying their enhancements, the more they can spend on adding value to the product.
Looking at maintainability purely from a sustainability perspective, the benefits come from being able to quickly make operating changes based on monitoring/feedback of utilisation and efficiency.
This leads to a series of recommendations
Constraints
Cloud service providers' best practice for cost management and sustainability is to migrate to newer, more energy-efficient or lower-carbon hardware and services as they become available. The rationale is that efficiency improves and cost per MB or per invocation gets lower with each new release. At present the full carbon cost of this is not captured, and as the carbon cost of silicon/hardware production is expected to be high, this appears to be a challenge to sustainability goals.
Cloud impact
Cloud providers and the model of service delivery bring a number of sustainability benefits including:
- an incentive to customers to only run what they require
- the ability to quantify the carbon used through the use of frameworks
- providing recommendations on where further savings can be made
The concept of shared responsibility provides clear demarcation between the responsibilities of the provider and the customer, and places a number of concerns such as energy and cooling firmly onto the provider. The scale of their operation and commercial incentives give them increased scope to design and operate their environments efficiently.
The mature financial framework provided by operators also aligns operational efficiency well with sustainability. In an environment where services are charged for by the hour, there are financial as well as sustainability gains to be had from utilising compute highly and switching off anything that is not actively providing value.