For years, the CloudFix team has managed and maintained more than 120 AWS-hosted SaaS products across hundreds of AWS accounts. Although this model follows established AWS best practices, the team's scope introduced operational challenges. The team needed a way to identify cost-saving opportunities across their applications without making architectural compromises or introducing service disruption.

The team responded to the challenge by developing CloudFix. CloudFix proved that it could meet their architectural requirements while cutting costs by 10-20%. For example, CloudFix helped with:

Here is the story of how CloudFix was built using AWS Systems Manager components and, specifically, how the Change Manager capability unlocked key functionality that other tools could not.

Automated Opportunity Finding

As the CloudFix team explored possible solutions, their first realization was that identifying cost-saving opportunities needed to be automated. Although converting all gp2 volumes to gp3 could save up to 20% in costs, they had thousands of these volumes spread across their accounts. Each volume needed an analysis of its past performance to determine whether conversion was the right path or whether provisioned IOPS would be necessary to optimize the workload. At this scale, the data gathering and opportunity identification were only feasible if automated.
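The per-volume analysis described above can be sketched as a small decision function. This is an illustrative sketch, not CloudFix's finder: the `VolumeUsage` type, the `recommend` function, and the `headroom` safety factor are hypothetical names, and the gp3 limits in the constants are the published values at the time of writing and should be checked against current AWS documentation. The sketch also ignores gp3's per-GiB IOPS ratio limits.

```python
import math
from dataclasses import dataclass

# gp3 limits as commonly documented; treat these as assumptions to verify.
GP3_BASELINE_IOPS = 3000
GP3_BASELINE_THROUGHPUT_MIBPS = 125
GP3_MAX_IOPS = 16000
GP3_MAX_THROUGHPUT_MIBPS = 1000


@dataclass
class VolumeUsage:
    """Hypothetical summary of one gp2 volume's observed usage."""
    volume_id: str
    size_gib: int
    peak_iops: float              # highest observed IOPS in the lookback window
    peak_throughput_mibps: float  # highest observed throughput in MiB/s


def recommend(usage: VolumeUsage, headroom: float = 1.2) -> dict:
    """Suggest gp3 settings, or flag the volume for manual review."""
    needed_iops = usage.peak_iops * headroom
    needed_tput = usage.peak_throughput_mibps * headroom

    # Demand beyond what gp3 can provide: a human should consider io1/io2.
    if needed_iops > GP3_MAX_IOPS or needed_tput > GP3_MAX_THROUGHPUT_MIBPS:
        return {"volume_id": usage.volume_id, "action": "review"}

    # Otherwise convert, provisioning above the baseline only when needed.
    return {
        "volume_id": usage.volume_id,
        "action": "convert_to_gp3",
        "iops": max(GP3_BASELINE_IOPS, math.ceil(needed_iops)),
        "throughput_mibps": max(GP3_BASELINE_THROUGHPUT_MIBPS, math.ceil(needed_tput)),
    }
```

A lightly used volume would come back with the gp3 baseline settings, while a volume whose observed peak exceeds the gp3 ceiling is routed to a person instead of being converted automatically.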

AWS Config turned out to be a great tool for gathering an inventory of all of the AWS resources across hundreds of accounts. The team enabled Config recording and created configuration snapshots in response to the configuration updates generated when resources were added, removed, or reconfigured. The snapshots included resource IDs and configuration metadata.

In addition to resource configuration metadata and IDs, usage metrics were necessary to generate quality analysis and recommendations. The team leveraged cross-account functionality in Amazon CloudWatch metrics to collect this data.
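As a rough illustration of how raw usage metrics become an input for analysis, the sketch below combines read and write operation counts into a peak IOPS figure. It assumes datapoints shaped like the `Datapoints` entries that CloudWatch `GetMetricStatistics` returns for the `VolumeReadOps` and `VolumeWriteOps` metrics with the `Sum` statistic. The function name `peak_iops` is hypothetical, and the cross-account collection itself is left out.

```python
def peak_iops(read_points, write_points, period_seconds):
    """Return the highest combined read+write IOPS across all periods.

    Each datapoint is a dict with a "Timestamp" and a "Sum" (operations
    counted during that period), as in a CloudWatch Datapoints list.
    """
    totals = {}
    for point in list(read_points) + list(write_points):
        # Add reads and writes that fall in the same period together.
        stamp = point["Timestamp"]
        totals[stamp] = totals.get(stamp, 0.0) + point["Sum"]

    if not totals:
        return 0.0

    # Operations per period divided by period length gives operations/second.
    return max(totals.values()) / period_seconds
```

Averaging over a period smooths out short bursts, so a shorter period gives a more conservative (higher) peak estimate at the cost of more datapoints.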

Once resource configs and metrics were all gathered and stored in a centralized data store, CloudFix ran finders against the data to create a list of recommended optimization actions.

Automated Opportunity Fixing

The natural next step after identifying cost-saving opportunities was to execute on the recommendations. It was also important to the operations team to automate this process, because even a simple change needs operational guardrails to make sure that it is implemented in a safe, secure, and repeatable manner.

For example, there are multiple considerations required when converting a gp2 volume to gp3. With each conversion, the team would create an EBS snapshot, create a snapshot lifecycle policy to make sure that the backup does not accrue unnecessary costs, initiate the volume conversion, and monitor the state of the change. Doing this manually for thousands of recommendations would be both error-prone and tedious.
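The sequence above can be sketched in Python against a boto3 EC2 client. This is not the CloudFix runbook, which is a Systems Manager Automation document; it is a minimal illustration of the same ordering of steps. The function name `convert_gp2_to_gp3` and the polling parameters are hypothetical, the `ec2` argument is assumed to be a client created with `boto3.client("ec2")`, and the snapshot lifecycle policy step is omitted.

```python
import time


def convert_gp2_to_gp3(ec2, volume_id, poll_seconds=15, max_polls=240):
    """Back up a volume, convert it to gp3, and wait for the change to finish."""
    # 1. Take a backup before touching the volume.
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"Backup before gp3 conversion of {volume_id}",
    )
    snapshot_id = snapshot["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # 2. Start the conversion. The volume stays attached and usable.
    ec2.modify_volume(VolumeId=volume_id, VolumeType="gp3")

    # 3. Monitor the modification until it completes or fails.
    for _ in range(max_polls):
        response = ec2.describe_volumes_modifications(VolumeIds=[volume_id])
        modifications = response["VolumesModifications"]
        state = modifications[0]["ModificationState"] if modifications else "unknown"
        if state == "completed":
            return {"volume_id": volume_id, "snapshot_id": snapshot_id, "state": state}
        if state == "failed":
            raise RuntimeError(f"Modification of {volume_id} failed")
        time.sleep(poll_seconds)

    raise TimeoutError(f"Modification of {volume_id} did not finish in time")
```

Passing the client in as an argument keeps the function easy to exercise with a stub, which matters when the same logic has to run safely across thousands of volumes.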

This realization led the team to develop fixers using custom AWS Systems Manager Automation runbooks. CloudFix published an example of their runbook for converting gp2 to gp3 here.

Recommendation Workflows

Even with automated finders and fixers, rolling out changes at scale turned out to be a significant operations workflow challenge. The following requirements had to be considered:

  • Account and resource owners needed to be informed of all changes.
  • Account and resource owners needed to be able to easily review the AWS Systems Manager document corresponding to the change.
  • For fixers identified as low-risk, such as converting an EBS volume from gp2 to gp3, operations staff should be able to deploy the change without waiting for approval.
  • For fixers identified as higher-risk and requiring review, the fixer should only be executed if explicitly approved. Any rejected changes must also be recorded for analysis.
  • For all identified fixers, the operations team needed to keep track of changes performed and monitor their impact.
  • Account and resource owners needed to be able to pull up all governance-related changes made to their resources.

All of these requirements needed to be met and had to work with tens of thousands of resource changes spread across hundreds of AWS accounts, each with different owners.
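To make the approval requirements concrete, here is a toy sketch of the routing rule they imply: low-risk changes run immediately, higher-risk changes wait for an explicit decision, and every outcome, including a rejection, is recorded. The `ChangeLog` class, the `route_change` function, and the risk labels are hypothetical and only illustrate the rule; they are not part of CloudFix or of any AWS API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChangeLog:
    """Hypothetical in-memory record of every routing decision."""
    entries: List[dict] = field(default_factory=list)

    def record(self, change_id: str, outcome: str) -> None:
        self.entries.append({"change_id": change_id, "outcome": outcome})


def route_change(change_id: str, risk: str, approved: Optional[bool], log: ChangeLog) -> str:
    """Decide what happens to a change and record the decision.

    approved is None while no decision has been made, otherwise True or False.
    """
    if risk == "low":
        outcome = "executed"            # no approval needed for low-risk fixers
    elif approved is None:
        outcome = "awaiting_approval"   # higher-risk changes wait for a decision
    elif approved:
        outcome = "executed"
    else:
        outcome = "rejected"            # kept in the log for later analysis
    log.record(change_id, outcome)
    return outcome
```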

AWS Systems Manager Change Manager to the Rescue

The Change Manager capability was added to AWS Systems Manager in December 2020. The feature set of Change Manager instantly transformed CloudFix into a scalable cost-savings tool by making the change process streamlined and more secure with the following flow:

 Architectural diagram. The workflow from left to right is: CloudFix’s automated finders identify cost-savings opportunities, which triggers an AWS Lambda function. The function generates a change request and logs details into an Amazon Aurora database. When the change request is proposed, approvers/account owners are notified with Amazon Simple Notification Service (SNS). The approver reviews the change request and the proposed CloudFix Fixer Document. If they approve, then the Fixer document is executed. If it is rejected, then it does not execute. Both workflows end with another SNS notification to the approvers followed by another Lambda function to handle rejections.

  • CloudFix creates change templates describing each new type of finder/fixer. Account owners have the opportunity to review and approve the templates before any new change request can be made in Change Manager.
  • For each new set of resources to be fixed, a change request is created. CloudFix itself only has permission to create a change request based on an approved change template.
  • A change request is automatically executed after the account owner or designated approver approves the change. Designated approvers can be AWS Identity and Access Management (IAM) users, roles, or AWS Single Sign-On (SSO) users or groups. The delegates only need permission to approve change requests, not permission to perform all of the operations executed during the request.
  • Changes that have no performance impact or risk can be auto-approved. The approver still receives notifications when changes are executed.
  • CloudFix lets changes be either auto-approved or rejected after a timeout.
  • Complex workflows with multiple stages of approval, multiple approvers, or a group of approvers are supported.
  • Approved changes can be set to execute at specific times.
  • A change request can be tracked and aggregated centrally for analytics and reporting.
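The change request step in this flow maps onto the Systems Manager `StartChangeRequestExecution` API. The sketch below shows one way the request could be assembled and submitted with boto3. The helper names, the change request naming scheme, and the `VolumeIds` runbook parameter are assumptions for illustration; the actual template and runbook names and parameters depend on the documents deployed in the account. `AutoApprove` only takes effect if the change template permits auto-approval.

```python
def build_change_request(template_name, runbook_name, volume_ids,
                         auto_approve=False, scheduled_time=None):
    """Assemble keyword arguments for ssm.start_change_request_execution."""
    request = {
        # The approved change template the request is based on.
        "DocumentName": template_name,
        "ChangeRequestName": f"gp2-to-gp3-{len(volume_ids)}-volumes",
        # The Automation runbook (the fixer) to run once approved.
        # "VolumeIds" is a hypothetical runbook parameter name.
        "Runbooks": [
            {
                "DocumentName": runbook_name,
                "Parameters": {"VolumeIds": list(volume_ids)},
            }
        ],
        "AutoApprove": auto_approve,
    }
    if scheduled_time is not None:
        # A datetime at which the approved change should start.
        request["ScheduledTime"] = scheduled_time
    return request


def submit_change_request(ssm, request):
    """Submit the request; ssm is assumed to be boto3.client("ssm")."""
    response = ssm.start_change_request_execution(**request)
    return response["AutomationExecutionId"]
```

Keeping the request construction separate from the API call makes it possible to log or review exactly what will be submitted before anything is sent to Change Manager.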

Conclusion

AWS Systems Manager Change Manager provided the features to transform the CloudFix product into a scalable, multi-account, cost-savings tool. It completed their infrastructure-as-code transition by delivering operational changes as pull requests. All of this was achievable by integrating native AWS services.

You can learn more about Change Manager and its available feature set by reviewing the AWS Systems Manager Change Manager User Guide.

CloudFix is available in the AWS Marketplace. With one selection, you can install the AWS CloudFormation templates needed to start getting change requests with cost-reducing recommendations.


This post was written in collaboration with Badri Varadarajan – Executive VP, Technical Product Management (DevFactory) and Ravi Duddukuru – Chief Product Officer (DevGraph)

About the author

Dan Hammel

Dan Hammel is a Training Delivery Manager in AWS Training and Certification. He is an AWS Authorized Instructor with a DevOps and Engineering background. Whenever he is not in the classroom or working with trainers, he is finding ways to automate manual processes and build his technical depth. Outside of AWS, he has a passion for cooking, enjoys a wide variety of music, and has been writing the first 17 pages of a novel for over a decade.