Cost Optimisation in AWS – Governance and Maturity

Introduction

In my previous blog, I highlighted changes you could make, both technical and to your ways of working, to reduce unnecessary costs in your AWS bill. If you haven’t read it already or need a refresher, the blog is available at https://nivleshc.wordpress.com/2022/06/17/cost-optimisation-in-aws-laying-the-foundations/.

However, unless appropriate governance is in place, there is a high probability that all that hard work will get reverted and wastage will creep back into your AWS environments.

In this blog, I will share insights into how you can implement such governance processes, and how you can mature your AWS workloads to further reduce your AWS bill.

Prerequisites

It is assumed that all the activities listed in the previous blog have already been implemented. For reference, they are listed below.

  • An AWS Organization with All Features has been set up for your business and all your separate AWS Accounts have been linked to it.
  • Mandatory tags have been socialised with all your users and are being used. The tags are being enforced using Service Control Policies.
  • Appropriate T-shirt sizes have been created for your workload profiles, each of which has been optimised using data from your monitoring system.
  • Appropriate workload environment schedules have been created for your non-production environments and are being actively used.
  • Your non-production environment deployment pipelines have been modified to do the following:
    • add mandatory tags to each resource that is provisioned
    • require the user to choose the appropriate T-shirt size and environment schedule (the defaults should be at the top of the selection list)
    • send alerts to a distribution group, with relevant people added to it, when non-default values are chosen for T-shirt size and environment schedule during provisioning of an environment
  • New pipelines have been created for the following:
    • to enable users to manually turn off and turn on their non-production environments
    • to enable users to manually backup the data in their environments
    • to enable users to manually terminate their non-production environment
  • User education artefacts have been created regarding the do’s and don’ts for non-production environments. These artefacts should clearly list the default settings that should always be used when provisioning environments, and the approval process required for non-default settings.
  • User education artefacts for non-production environments have been included in the user onboarding pack.

Governance

Let’s start with some governance processes that will help ensure your environments remain cost optimised, and that any deviations are quickly identified.

Create a budget and enable notifications for when thresholds are breached

After implementing all that was described in the previous blog, your AWS Accounts should be cost optimised. This is a good time to take note of your new AWS cost, as this will become the benchmark against which all future AWS costs must be measured. If any of your future bills deviate from this amount significantly, this is a signal that something unexpected has happened in your AWS Accounts.

The caveat to this reasoning is that, as your demand for AWS increases, the yardstick for comparing costs should be adjusted accordingly.

Wouldn’t it be great if there was some tool that would monitor your AWS cost and notify you if it went above a certain budget?

Well, you are in luck. To help you keep your costs in check, AWS provides a service called AWS Budgets. This service allows you to create a budget to monitor the usage in your AWS Accounts and to set thresholds that alert you when your costs exceed them, so that you can take appropriate action before the costs skyrocket. You can set these alerts for actual costs and/or for forecasted costs. Forecasted costs use your past usage pattern to predict your total cost for the whole month. They can provide an early warning that your usage is ballooning; however, if your usage pattern is expected to change in the next few days, the prediction might not be as accurate.

To create a budget for your AWS account:

  1. Open the AWS Cost Management console at https://console.aws.amazon.com/cost-management/home.
  2. In the navigation pane, choose Budgets.
  3. Choose Create budget.
  4. Follow the prompts to configure the budget details and alert thresholds.

When creating your budget, I highly recommend the following tips (a minimal scripted sketch that applies them follows the list).

  • create a budget for actual and forecasted costs.
  • set the monthly budget value to what you recorded after the cost optimisation tasks had been completed.
  • create alert thresholds for 75%, 85% and 95% of the budgeted amount.
  • add the email addresses of everyone who needs to be notified when the budget thresholds are breached. Keep in mind that there is a limit of 10 email addresses per alert. If you need to add more, create an Amazon Simple Notification Service (Amazon SNS) topic, subscribe all the email addresses to it and then specify that topic as the target of your budget alert notifications. You can also use AWS Chatbot as a notification target.
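
To make these tips repeatable, here is a minimal boto3 sketch that creates such a budget. It is a sketch only: the account ID, budget amount and Amazon SNS topic ARN are placeholder values, and it pairs the three actual-cost thresholds with a single forecasted-cost alert to stay within the small number of alerts a budget supports.

  import boto3

  budgets = boto3.client("budgets")

  # Placeholder values - substitute your own account ID, benchmark amount and SNS topic ARN.
  ACCOUNT_ID = "123456789012"
  MONTHLY_BUDGET_USD = "10000"
  SNS_TOPIC_ARN = "arn:aws:sns:ap-southeast-2:123456789012:budget-alerts"

  def alert(notification_type, threshold):
      # One notification per threshold; the SNS topic fans out to the distribution group.
      return {
          "Notification": {
              "NotificationType": notification_type,  # ACTUAL or FORECASTED
              "ComparisonOperator": "GREATER_THAN",
              "Threshold": threshold,  # percentage of the budgeted amount
              "ThresholdType": "PERCENTAGE",
          },
          "Subscribers": [{"SubscriptionType": "SNS", "Address": SNS_TOPIC_ARN}],
      }

  budgets.create_budget(
      AccountId=ACCOUNT_ID,
      Budget={
          "BudgetName": "monthly-benchmark-budget",
          "BudgetType": "COST",
          "TimeUnit": "MONTHLY",
          "BudgetLimit": {"Amount": MONTHLY_BUDGET_USD, "Unit": "USD"},
      },
      NotificationsWithSubscribers=[
          alert("ACTUAL", 75.0),
          alert("ACTUAL", 85.0),
          alert("ACTUAL", 95.0),
          alert("FORECASTED", 100.0),  # early warning when the month is forecast to overrun
      ],
  )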

AWS Budgets has evolved significantly over the years and now includes budget actions that can be automatically triggered when your thresholds are breached. This can be extremely useful in non-production accounts, where you want to tightly control costs, as budget actions can automatically address any overspending. An example can be found at https://aws.amazon.com/blogs/mt/manage-cost-overruns-part-1/, where an AWS Account that has breached its budget threshold is blocked from provisioning any further resources.
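
If you want to experiment with budget actions programmatically, below is a minimal boto3 sketch, with hypothetical policy and role ARNs, that automatically applies a restrictive IAM policy to a developer role once actual spend reaches the budgeted amount.

  import boto3

  budgets = boto3.client("budgets")

  # All ARNs and names below are hypothetical placeholders.
  budgets.create_budget_action(
      AccountId="123456789012",
      BudgetName="monthly-benchmark-budget",
      NotificationType="ACTUAL",
      ActionType="APPLY_IAM_POLICY",
      ActionThreshold={"ActionThresholdValue": 100.0, "ActionThresholdType": "PERCENTAGE"},
      Definition={
          "IamActionDefinition": {
              # A customer-managed policy that denies provisioning of new resources.
              "PolicyArn": "arn:aws:iam::123456789012:policy/DenyProvisioning",
              "Roles": ["developer-role"],
          }
      },
      # Role that AWS Budgets assumes to apply the policy on your behalf.
      ExecutionRoleArn="arn:aws:iam::123456789012:role/BudgetsActionRole",
      ApprovalModel="AUTOMATIC",
      Subscribers=[{"SubscriptionType": "SNS",
                    "Address": "arn:aws:sns:ap-southeast-2:123456789012:budget-alerts"}],
  )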

Restrict your AWS usage to only whitelisted Regions and sizes

Most organisations only operate in certain countries and, due to data sovereignty, have strict policies around transferring their data outside of those countries. Also, not all AWS services will be needed for their business requirements, and of those that are, only a subset of the available sizes may be required. For example, a business might not need access to AWS DeepRacer but might require Amazon EC2 instances, and for those, their need might be just for r5.large, r5.xlarge, r5.2xlarge and r5.4xlarge instances.

For such scenarios, businesses should only allow their users to consume AWS services in the approved AWS Regions, and within those Regions, only the services that have been approved. You can then further restrict those services; for example, in the above scenario, only allow Amazon EC2 instances of sizes r5.large, r5.xlarge, r5.2xlarge and r5.4xlarge.

You can use service control policies (SCPs) to create these guardrails. This will allow you to:

  • prevent usage of AWS Services outside of the approved AWS Regions
  • prevent usage of non-approved AWS Services
  • restrict AWS Services to only certain size profiles, for example only allowing r5.large, r5.xlarge, r5.2xlarge and r5.4xlarge Amazon EC2 instances.

The above will not only help ensure your business aligns with any data sovereignty policies, but also prevent users from provisioning large profiles of unapproved AWS Services, which could end up costing your business a lot of money.

SCPs are available when you create AWS Organizations with all features enabled. You can read more about them at https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html.
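
To illustrate, here is a minimal boto3 sketch, assuming the example above (Sydney as the only approved Region and the four approved r5 sizes), that creates and attaches such an SCP. The OU ID is a placeholder, and a production policy would usually exempt more global services from the Region condition.

  import json
  import boto3

  org = boto3.client("organizations")

  guardrail = {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "DenyOutsideApprovedRegions",
              "Effect": "Deny",
              # Exempt a few global services that are not Region-bound.
              "NotAction": ["iam:*", "organizations:*", "support:*"],
              "Resource": "*",
              "Condition": {"StringNotEquals": {"aws:RequestedRegion": ["ap-southeast-2"]}},
          },
          {
              "Sid": "DenyNonApprovedInstanceTypes",
              "Effect": "Deny",
              "Action": "ec2:RunInstances",
              "Resource": "arn:aws:ec2:*:*:instance/*",
              "Condition": {
                  "StringNotEquals": {
                      "ec2:InstanceType": ["r5.large", "r5.xlarge", "r5.2xlarge", "r5.4xlarge"]
                  }
              },
          },
      ],
  }

  policy = org.create_policy(
      Name="cost-guardrails",
      Description="Approved Regions and Amazon EC2 instance sizes only",
      Type="SERVICE_CONTROL_POLICY",
      Content=json.dumps(guardrail),
  )

  # Placeholder OU ID - attach to the organizational unit holding the member accounts.
  org.attach_policy(PolicyId=policy["Policy"]["PolicySummary"]["Id"], TargetId="ou-xxxx-xxxxxxxx")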

Predict costs for any new workloads before provisioning

AWS allows businesses to deploy their workloads into the Cloud quickly and easily. When new workloads are to be provisioned in your AWS Account, in addition to the architectural design approvals, you should also include a financial assessment and approval process. This ensures the new workload abides by the business’s financial constraints. If it doesn’t, some re-architecting might be required; if such modifications cannot be done, expectations can be adjusted so that there is no bill shock once the workload has been provisioned.

To predict your workload’s cost, you can use the AWS Pricing Calculator. It is a great tool that estimates your cost based on each service you will consume and how much of it you expect to use. It also lets you save your estimates and generates a URL that you can share with your approvers.

Continuous Cost Optimisation

Cost optimisation is not a destination but a journey; it is something that needs to be done continuously.

To help with this, AWS provides native tools that you can use to further optimise your costs. Some of these are listed below.

I would recommend that you evaluate the recommendations from these tools at least once a month and apply those that fit your workloads.

AWS Compute Optimizer

AWS Compute Optimizer uses machine learning to analyse your Amazon Elastic Compute Cloud (Amazon EC2) instance types, Amazon Elastic Block Store (Amazon EBS) volume configurations and AWS Lambda function memory sizes, and recommends optimal configurations to help reduce costs and increase workload performance.

It is a good practice to evaluate AWS Compute Optimizer recommendations periodically, to see if additional cost savings can be obtained from them.

AWS Compute Optimizer is not enabled by default; you will need to “opt in” to it. When you first opt in, it may take up to 12 hours to fully analyse the AWS resources in your account. More details are available at https://aws.amazon.com/compute-optimizer/.
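
If you prefer to opt in programmatically, below is a minimal boto3 sketch; the includeMemberAccounts flag assumes you are running it from your organisation’s management account.

  import boto3

  optimizer = boto3.client("compute-optimizer")

  # Opt in; from the management account you can include all member accounts.
  optimizer.update_enrollment_status(status="Active", includeMemberAccounts=True)

  # Once the (up to 12 hour) analysis completes, pull the EC2 recommendations.
  response = optimizer.get_ec2_instance_recommendations()
  for rec in response["instanceRecommendations"]:
      # finding is e.g. Overprovisioned, Underprovisioned or Optimized
      print(rec["instanceArn"], rec["finding"])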

AWS Cost Management – Rightsizing recommendations

AWS Cost Explorer now includes Amazon EC2 rightsizing recommendations, powered by AWS Compute Optimizer.

The rightsizing recommendations help identify cost-saving opportunities by downsizing or terminating instances in Amazon Elastic Compute Cloud (Amazon EC2). All your under-utilised Amazon EC2 instances across member accounts are listed in a single view, so you can immediately identify how much you can save.

Rightsizing recommendations are not enabled by default. Use the steps below to enable them.

To enable rightsizing recommendations:

  1. Open the AWS Cost Management console at https://console.aws.amazon.com/cost-management/home.
  2. In the navigation pane, choose Preferences.
  3. In the Recommendations section, choose Receive Amazon EC2 resource recommendations.
  4. Choose Save preferences.

More information about AWS Cost Management rightsizing recommendations is available at https://docs.aws.amazon.com/cost-management/latest/userguide/ce-rightsizing.html.
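
Once enabled, the same recommendations can also be retrieved programmatically through the Cost Explorer API; below is a minimal sketch.

  import boto3

  ce = boto3.client("ce")  # Cost Explorer

  response = ce.get_rightsizing_recommendation(Service="AmazonEC2")
  for rec in response["RightsizingRecommendations"]:
      current = rec["CurrentInstance"]
      # RightsizingType is either TERMINATE or MODIFY
      print(current["ResourceId"], rec["RightsizingType"])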

AWS Trusted Advisor

AWS Trusted Advisor is one of the most important tools in AWS. It provides recommendations across the following pillars:

  • Cost Optimization
  • Performance
  • Security
  • Fault Tolerance
  • Service Limits

Unfortunately, the AWS Basic Support and AWS Developer Support plans only include the core security checks and all service quota checks. To access all other checks, including the Cost Optimization checks, you need AWS Business Support or AWS Enterprise Support.

Here is a list of the cost optimisation checks that AWS Trusted Advisor can do for you:

https://aws.amazon.com/premiumsupport/knowledge-center/trusted-advisor-cost-optimization/.

You can configure AWS Trusted Advisor to send you weekly email notifications, which you can use to further optimise your AWS usage (for example, disassociate unused Elastic IP addresses or delete unused Amazon Elastic Block Store (EBS) volumes).

More information about AWS Trusted Advisor is available at https://aws.amazon.com/premiumsupport/technology/trusted-advisor/.
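
If you are on a Business or Enterprise Support plan, the same checks can also be pulled through the AWS Support API; below is a minimal boto3 sketch that lists the cost optimisation checks and their current status.

  import boto3

  # The AWS Support API requires a Business or Enterprise Support plan,
  # and its endpoint lives in us-east-1.
  support = boto3.client("support", region_name="us-east-1")

  checks = support.describe_trusted_advisor_checks(language="en")["checks"]
  for check in checks:
      if check["category"] != "cost_optimizing":
          continue
      result = support.describe_trusted_advisor_check_result(checkId=check["id"], language="en")
      # status is ok, warning, error or not_available
      print(check["name"], result["result"]["status"])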

Workload Maturity

Migrating your workloads to AWS is not the end of the journey; it is just the start. To further reduce your costs in AWS, you can look at re-architecting your workloads to make better use of all that is available in the cloud.

Revisit your Capacity Reservations

Let me paint you a picture here.

You have deployed your workloads in AWS in the Sydney region across two availability zones (AZs) ap-southeast-2a and ap-southeast-2b. Your workload requires that at least ten r5.8xlarge Amazon EC2 instances are running at any given time, otherwise its performance will be affected.

On a certain day, one of the AZs in Sydney (ap-southeast-2a) suffers an outage. Since your AWS deployment has been configured to be resilient to this, your autoscaling groups start provisioning Amazon EC2 instances in the other AZ (ap-southeast-2b). However, you soon notice that these new Amazon EC2 instances are failing to launch and you receive the following error.

“An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 4). We currently do not have sufficient capacity in the Availability Zone you requested.”

Due to this, your workload is unable to get the minimum number of r5.8xlarge Amazon EC2 instances and performance degrades to the point that your workload suffers an outage!

How did this happen, you ask yourself? You had followed all the best practices and ensured your workloads were provisioned across two AZs.

The issue wasn’t with your configuration but with the capacity of the AZ. An AZ, as you might be aware, is one or more physical AWS datacenters where your workloads are provisioned, so there is a physical limit on the number of host servers these datacenters can house. To ensure resources are not wasted, AWS uses complex, demand-based algorithms to allocate a certain number of each Amazon EC2 instance type. When the AZ outage occurred, all AWS customers that had configured their workloads for resiliency started provisioning Amazon EC2 instances in ap-southeast-2b. If your instance type was a popular choice that other customers were also using, this would quickly deplete its supply and cause the insufficient instance capacity error that you faced.

How do you get around this?

Well, the simplest way is to reserve capacity in each AZ, so that you are always guaranteed that number of a particular instance type in that AZ. For high availability, customers would normally purchase capacity reservations in each AZ corresponding to their minimum instance requirement. In the above example, you could have purchased a capacity reservation for ten r5.8xlarge instances in each of ap-southeast-2a and ap-southeast-2b. When ap-southeast-2a suffered its outage, all your Amazon EC2 instances would have provisioned successfully in ap-southeast-2b, since you had reserved that capacity. The knowledge base article at https://aws.amazon.com/premiumsupport/knowledge-center/ec2-insufficient-capacity-errors/ covers this in more detail.
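
For reference, here is a minimal boto3 sketch of creating such an On-Demand Capacity Reservation; you would repeat it for each AZ in which you need the guarantee.

  import boto3

  ec2 = boto3.client("ec2", region_name="ap-southeast-2")

  # Reserve the workload's minimum footprint in one AZ; repeat for ap-southeast-2b.
  ec2.create_capacity_reservation(
      InstanceType="r5.8xlarge",
      InstancePlatform="Linux/UNIX",
      AvailabilityZone="ap-southeast-2a",
      InstanceCount=10,
      InstanceMatchCriteria="open",  # matching instances automatically consume the reservation
  )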

However, from a cost perspective, you are doubling your spend to ensure 100% workload uptime. I am sure your finance team won’t be very happy.

Is there a better way you might ask? Well there just might be.

In the example above, I stated that your workload had a hard requirement for the r5.8xlarge Amazon EC2 instance type. If this is a popular choice among AWS customers in that particular AZ, or if there are simply not many of these instances available for you to use in that AZ, you will hit the insufficient capacity issue. However, if you were to use instance types that are available but not as popular, you might still be able to provision your workloads in that AZ.

It can be quite challenging to find all the different combinations of instance types that suit your workload without affecting performance. For example, would your workload behave the same if, instead of one r5.8xlarge, you used two r5.4xlarge instances? You might have to go through a few cycles of testing to find the combinations that work. However, the reward is well worth the effort: not only will your workload’s instance profile become more diverse, you will also halve your costs, as you won’t have to purchase reserved capacity in the other AZ.

Note: the above recommendation is well suited to non-production environments. For production environments, thorough testing needs to be done before diversifying your instance type fleet, in collaboration with the application vendor, to ensure the configuration is supported.
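
One way to put this diversification into practice is an Auto Scaling group with a mixed instances policy, sketched below. The launch template and subnet IDs are placeholders, and the override list should contain whatever combinations your testing validated.

  import boto3

  autoscaling = boto3.client("autoscaling", region_name="ap-southeast-2")

  autoscaling.create_auto_scaling_group(
      AutoScalingGroupName="diversified-workload-asg",
      MinSize=10,
      MaxSize=20,
      # Placeholder subnet IDs spanning both AZs.
      VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
      MixedInstancesPolicy={
          "LaunchTemplate": {
              "LaunchTemplateSpecification": {
                  "LaunchTemplateId": "lt-0123456789abcdef0",  # placeholder
                  "Version": "$Latest",
              },
              # The alternatives your testing validated; EC2 launches whichever has capacity.
              "Overrides": [
                  {"InstanceType": "r5.8xlarge"},
                  {"InstanceType": "r5b.8xlarge"},
                  {"InstanceType": "r4.8xlarge"},
              ],
          },
      },
  )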

Modernise your workloads by using cloud native resources

The driver for most organisations’ cloud migrations is to move off their on-premises datacenter. The quickest way into the cloud is a lift and shift (rehosting) approach, where the on-premises setup is replicated in the cloud and the application workloads are migrated across.

Simple as it may be, this is not the best approach. You will have migrated your workloads to the cloud, but you will have brought all the on-premises baggage with them. This is definitely not what you want; instead, you want to take advantage of the benefits of being in the cloud as much as possible.

For most things in AWS, there is an applicable managed service. For example, if your workload needs databases, instead of self-provisioning them on Amazon EC2 instances that you then have to administer and patch, you could use Amazon Relational Database Service (Amazon RDS). This is a managed service, which means AWS looks after the database platform, manages its availability and patches the underlying infrastructure. You are freed from the administration burden that comes with provisioning databases yourself, and can concentrate on consuming the database as a service.

When you modify your workloads to use AWS managed services, you remove the administration burden of those services from your team. This can result in higher productivity, as your team has more time to concentrate on other, more important tasks. It also means your team no longer requires those specialised administration skills, which could further reduce your team’s cost.

You can further modify your workloads to make use of AWS Serverless services, such as AWS Lambda, Amazon API Gateway, Amazon Aurora Serverless and Amazon DynamoDB. These can further reduce your cost, since you only get charged when the services are used. For example, if you have an Amazon EC2 instance that just runs Python scripts at certain times of the day, instead of paying for the server to be up 24×7, you can use AWS Lambda to run the scripts and get charged only for when they run. That is a lot of savings. Did you know that the popular cloud education company A Cloud Guru is fully hosted on serverless technologies? If you want to read more about it, this article covers it well: https://siliconangle.com/2017/08/15/a-cloud-guru-uses-lambda-and-api-gateway-to-build-serverless-company-awssummit/.

Also, with Serverless there are really no servers to manage, so the administration burden for these services is removed from your team entirely.
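
As a simple illustration of that pattern, below is a minimal Lambda handler for such a scheduled script; you would deploy it and trigger it on a schedule (for example with Amazon EventBridge) instead of keeping an Amazon EC2 instance running 24×7.

  import json

  def handler(event, context):
      # The work that used to run on an always-on EC2 instance goes here.
      # A schedule invokes this function at the required times, so you pay
      # only for the seconds the script actually runs.
      print("Scheduled run triggered by:", json.dumps(event))
      # ... do the actual work ...
      return {"status": "done"}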

Summary

Let’s recap the topics that were discussed in this blog.

To ensure all the work that you did for your AWS cost optimisation doesn’t get reverted, below are some governance policies you can implement:

  • create budgets with alerts that notify you when your AWS bill reaches 75%, 85% and 95% of your budgeted amount. You should have policies and procedures in place to bring costs down should any of these thresholds be breached.
  • restrict your AWS usage to only whitelisted services and Regions. This ensures users can only use approved services in approved AWS Regions; within an approved service, you can further restrict the available sizes, for example by only allowing certain Amazon EC2 instance types if there is no need for the others.

It is imperative to note that cost optimisation is not a destination but a journey: once you have optimised your costs, you shouldn’t stop there. AWS provides tools that continually offer recommendations on where else you can save. You should periodically evaluate these recommendations to see if they fit your workloads.

To get even more cost savings, you could mature your workloads. This might take some effort, however most of the time, the reward is worth the effort. Below are some workload maturity strategies you can adopt.

  • adopt a diverse range of instance types that your workloads can run on. This can enable you to reduce, or even remove, your capacity reservations across AZs, saving you a lot of money.
  • rearchitect your workloads to use AWS managed services. This will free your team from the administrative burden of those services and give them more time to innovate and do other interesting things.
  • rearchitect your workloads to make use of AWS Serverless technologies, for example AWS Lambda.

I hope you find the insights in this blog useful and they enable you to further cost optimise your AWS usage, but more importantly, allow you to create governance processes to keep your AWS environments cost optimised for months to come.

Till the next time, stay safe!