Enabling agile application development teams to self-serve and quickly provision the resources that they need while adhering to the organization’s governance and controls can be challenging. In this post, we’ll explore Expedia Group’s (EG) Cerebro platform, a Database as a Service (DBaaS) offering built on AWS technologies. By using this platform, EG is able to quickly deploy, manage, and support thousands of database clusters, all while enforcing governance, security controls, and best practices with a handful of engineers.
When the COVID-19 pandemic hit, the travel industry saw a significant business slowdown, and EG, the world’s largest travel platform, was no different. EG leadership took multiple initiatives to ensure that the company thrived even during the pandemic, including an effort to evolve the company toward a common platform strategy. This initiative has three goals: drive efficiency, improve the developer experience, and reduce platform costs by consolidating and decommissioning redundant tools. EG’s Cerebro platform was developed to accelerate persistent store platform consolidation while reducing the operational overhead of maintaining it. Today it supports the data layer for over 700 applications within the EG portfolio, spanning managed services as well as self-managed Amazon Elastic Compute Cloud (Amazon EC2) database instances.
Existing database challenges
EG has 200+ travel sites in 70+ countries that offer 5+ million properties and 500+ airlines. To support this, EG runs over 9,000 applications managed by 650 engineering teams across 400+ AWS accounts that utilize tens of database technologies. Each of these accounts has a different level of automation to provision and manage data infrastructure, and many of the teams were managing their own data infrastructure and operations. This distributed management made it impractical to implement standard database best practices, compliance requirements, observability, and resiliency across the data stores while accelerating product development. For example, when data infrastructure is stood up without tags, identifying the owners and the applications being supported can be a painful process. Without a clear association to application workloads via tagging, cost tracking becomes a series of spreadsheet calculations that make operational efficiency all but impossible.
Another challenge is that many of these teams are development teams with no embedded database specialists. As a result, performance issues were masked by scaling infrastructure, red flags went uncaught, and multiple incidents followed. In addition, effort was duplicated as every team tried to automate its own database management operations. Guided by the common platform strategy, EG started looking for a solution that scales across hundreds of accounts without adding complexity to managing database infrastructure, and that simultaneously empowers development teams with self-service tools to build, manage, and secure compliant and governed data infrastructure.
Cerebro is built using AWS Service Catalog. Before we dive deep into the architecture, there are a few AWS Service Catalog components to be familiar with:
Product – A product is an IT service, such as a database, Amazon Virtual Private Cloud (Amazon VPC), Amazon EC2 instance, n-tier environment, etc., that we want to make available for deployment on AWS.
Portfolio – A portfolio is a collection of products that contains configuration information.
Provisioned Product – A provisioned product is a stack. When an end user launches a product, the product instance that’s provisioned by AWS Service Catalog is a stack with the resources necessary to run the product.
Constraints – Constraints control the ways that you can deploy specific AWS resources for a product. You can use them to apply limits to products for governance or cost control. There are different types of AWS Service Catalog constraints: launch constraints, notification constraints, and template constraints.
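As a rough sketch of how these concepts map onto the Service Catalog API, the following shows the request payloads for creating a portfolio, a CloudFormation-backed product, and a launch constraint. All names and ARNs are illustrative, not EG's actual values; in practice each payload would be passed to boto3's `servicecatalog` client (`create_portfolio`, `create_product`, `create_constraint`).

```python
import json

def portfolio_request(name, provider):
    """Payload for create_portfolio: a portfolio groups related products."""
    return {"DisplayName": name, "ProviderName": provider}

def product_request(name, owner, template_url):
    """Payload for create_product: the product is backed by a CloudFormation template."""
    return {
        "Name": name,
        "Owner": owner,
        "ProductType": "CLOUD_FORMATION_TEMPLATE",
        "ProvisioningArtifactParameters": {
            "Info": {"LoadTemplateFromURL": template_url},
            "Type": "CLOUD_FORMATION_TEMPLATE",
        },
    }

def launch_constraint_request(portfolio_id, product_id, role_arn):
    """Payload for create_constraint with Type LAUNCH: every launch of this
    product assumes a fixed IAM role rather than the end user's permissions."""
    return {
        "PortfolioId": portfolio_id,
        "ProductId": product_id,
        "Type": "LAUNCH",
        "Parameters": json.dumps({"RoleArn": role_arn}),
    }
```

A launch constraint is what lets an organization grant developers narrow console permissions while the product itself provisions privileged resources under a centrally managed role.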
Hub and Spoke model
At a high level, as shown in the following figure, the Cerebro architecture is based on a hub-and-spoke model. A central AWS account (the hub account) is used to create and manage baseline products and portfolios and share them with multiple child AWS accounts (spoke accounts). Once portfolios are shared with spoke accounts, those accounts import the portfolios locally and can deploy products. This hub-and-spoke model lets Cerebro admins deploy and manage portfolios and products centrally in a single hub account, which in turn allows spoke accounts to deploy new products or update existing provisioned products to the latest versions.
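The share-then-import flow can be sketched as an ordered plan of Service Catalog API calls; the account IDs below are placeholders. The hub account calls `create_portfolio_share` once per spoke, and each spoke account then accepts (imports) the share locally.

```python
def share_plan(portfolio_id, spoke_account_ids):
    """Return ordered (account, api_call, payload) steps that share a hub
    portfolio with spoke accounts, as described in the hub-and-spoke model."""
    steps = []
    for account in spoke_account_ids:
        # Hub side: share the portfolio with this spoke account.
        steps.append(("hub", "create_portfolio_share",
                      {"PortfolioId": portfolio_id, "AccountId": account}))
        # Spoke side: import the shared portfolio locally.
        steps.append((account, "accept_portfolio_share",
                      {"PortfolioId": portfolio_id,
                       "PortfolioShareType": "IMPORTED"}))
    return steps
```

Because products live only in the hub, publishing a new product version there is enough for every spoke account to pick it up on its next launch or update.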
Cerebro architecture for self-managed databases
Cerebro uses multiple configuration files for various functions, such as onboarding new AWS accounts, onboarding new products, and updating provisioned products. On the top left of the following figure, you can see both the configuration files and the AWS CloudFormation templates that are committed to a Git repository. It helps to understand these files at a high level, as they're referred to later in this post.
Account-level configuration files: A metadata file generated for each account that contains account-specific information, such as monitoring, logging, and secret management solutions, Amazon VPC IDs, subnets, etc., which is used when provisioning and managing data stores within these accounts.
Onboarding templates: Contains CloudFormation templates to create AWS Identity and Access Management (IAM) roles to provision and manage products in the spoke accounts.
Product CloudFormation templates: Contains AWS Service Catalog product CloudFormation templates for both open-source data stores and AWS managed data stores.
Cerebro Agent: A custom in-house developed process that gets installed for self-managed databases that run on Amazon EC2. This agent manages some important tasks, such as health checks, backups, and maintenance jobs.
Cluster-level configuration: Contains the non-default database configuration for each cluster. The Cerebro Agent uses this information to configure the database with settings that are unique to the database engine type and use case.
Maintenance scripts: Contains scripts for routine maintenance tasks, such as performing rolling upgrades, ad-hoc jobs, configuration changes, rolling restarts, etc.
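To make the account-level configuration concrete, here is a hypothetical sketch of its shape together with a small schema check. The real schema is internal to EG; every field name below is invented for illustration.

```python
# Hypothetical account-level configuration; field names are illustrative only.
ACCOUNT_CONFIG = {
    "account_id": "111122223333",
    "vpc_id": "vpc-0example",
    "subnets": ["subnet-0a", "subnet-0b", "subnet-0c"],
    "monitoring": "datadog",
    "logging": "splunk",
    "secrets_backend": "example-secrets-manager",
}

# Fields the provisioning pipeline would need before it can place a data store.
REQUIRED_FIELDS = {"account_id", "vpc_id", "subnets",
                   "monitoring", "logging", "secrets_backend"}

def missing_fields(config):
    """Return any required fields absent from an account configuration."""
    return REQUIRED_FIELDS - config.keys()
```

Validating configs like this at check-in time (rather than at provision time) keeps a malformed account file from breaking a developer's product launch later.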
The Cerebro admins check these configuration files into the Git repository, and Jenkins jobs synchronize them with an Amazon Simple Storage Service (Amazon S3) bucket in the hub account. An AWS CodePipeline pipeline is triggered when new files are added to the S3 bucket. CodePipeline creates the products under the Cerebro portfolio in the hub account, shares the portfolio, and imports it in the spoke accounts.
These shared products are available to the developers in their accounts, and they can provision the products when they need them. When a product is provisioned, Cerebro creates the database cluster nodes and installs the Cerebro Agent on them. Once the Cerebro Agents become available, the nodes are combined into a database cluster using the cluster-level configuration templates.
The Cerebro Agent periodically performs health checks on the database nodes and streams the data into an Amazon Kinesis data stream within the hub account. AWS Lambda functions process these events from the stream and load them into an Amazon Aurora cluster for reporting. The Cerebro UI generates a centralized health reporting dashboard for the entire data infrastructure running in the organization: a single-pane view of data infrastructure health across all 400+ AWS accounts. Using this UI, the operations team can easily view a node's health status, perform joining or bootstrapping, and drill down into other cluster details for just-in-time troubleshooting.
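A minimal sketch of the Kinesis-to-Lambda leg of this pipeline follows. The event envelope is the standard shape Lambda receives from a Kinesis trigger (records arrive base64-encoded); the health-check fields and the Aurora insert are assumptions, so the write is left as a comment.

```python
import base64
import json

def handler(event, context=None):
    """Decode Cerebro health-check records from a Kinesis trigger event and
    return the rows destined for the Aurora reporting cluster."""
    rows = []
    for record in event.get("Records", []):
        # Kinesis record data arrives base64-encoded in the Lambda event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        rows.append((payload["cluster"], payload["node"], payload["status"]))
    # In the real pipeline these rows would be inserted into the Aurora
    # cluster that backs the Cerebro UI health dashboard.
    return rows
```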
Furthermore, the Cerebro Agent performs backups and maintenance activities on each data node and inserts operational status updates into the live Kinesis data stream. The Cerebro UI aggregates the stream in real time to track and remediate any maintenance jobs that fail to execute across the 400+ accounts. AWS Systems Manager is used for operational tasks such as database upgrades, ad-hoc jobs, and automated patching, and the operations teams can easily run these tasks from the Cerebro UI. This not only provides ease of use, but also makes all of the operations auditable and reusable.
For monitoring and alerting, multiple AWS services, such as Amazon CloudWatch and Amazon Simple Notification Service (Amazon SNS), are used, integrated with partner products such as Datadog and Splunk. The combination of services varies depending on the database manageability and database engine type. Alerts received via these monitoring and logging solutions are routed to PagerDuty, which sends an event to Amazon EventBridge. EventBridge then triggers an AWS Step Functions workflow to auto-remediate the alerts. We'll discuss these actions in detail in the Cerebro First Responder section.
Cerebro First Responder
First Responder was developed to automate incident resolution so that no manual intervention or effort is required to resolve incidents in the database space. It works as follows: upon receiving an alert, PagerDuty's EventBridge integration gathers the alert's data and invokes a Step Functions workflow. The workflow processes this data and calls a Lambda function that runs automated incident response runbooks to resolve the alert.
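The dispatch logic inside that Lambda function can be sketched as a table of runbooks keyed by alert category, with a fallback that escalates to the on-call engineer. The alert categories and runbook actions below are illustrative; EG's actual runbooks are internal.

```python
# Map of alert category -> runbook action (illustrative actions only).
RUNBOOKS = {
    "resource_contention": lambda a: f"throttle noisy sessions on {a['cluster']}",
    "connection_limit":    lambda a: f"recycle idle connections on {a['cluster']}",
    "process_termination": lambda a: f"restart database process on {a['node']}",
}

def first_responder(alert):
    """Run the runbook matching the alert type, or escalate when no
    automated remediation exists for this category."""
    runbook = RUNBOOKS.get(alert.get("type"))
    if runbook is None:
        return "escalate: page on-call engineer"
    return runbook(alert)
```

Keeping unknown categories on the escalation path is what makes this safe to run unattended: automation only handles the well-understood incident classes.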
After implementing First Responder, EG automatically resolved the most common incidents, such as resource contention, connection limits, and process termination, with standard runbooks; these incidents accounted for approximately 30% of the pager alerts. The result was faster incident resolution, improved data platform availability, and less team fatigue from repetitive work. Most importantly, the on-call engineers can get better sleep, as these incidents are resolved without waking them up at 2 AM.
Cerebro architecture for AWS managed databases
Similar to the Cerebro architecture for self-managed databases, AWS managed databases are also provisioned using AWS Service Catalog from the spoke accounts. However, because these are AWS managed databases, the Cerebro Agent doesn't run on the instances; instead, Lambda functions are used to populate the inventory. Furthermore, during provisioning via the CloudFormation templates, monitoring and alerting are enabled and all of the AWS managed database clusters are onboarded to First Responder.
Onboarding new AWS Accounts and new Products to Cerebro
For onboarding new spoke accounts, IAM roles are provisioned with policies that grant the Cerebro Agent access to stream health metrics from Amazon EC2 instances to Kinesis Data Streams, and to Amazon S3 for performing and storing backups. Additional IAM roles are created for AWS Systems Manager to retrieve patch files from Amazon S3 and run them on EC2 instances, and for CloudFormation to access S3 buckets to retrieve the account-level and cluster-level configuration during the onboarding process.
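A sketch of the agent role's policy document follows, built as a plain dictionary. The resource ARNs are placeholders, and the exact action list is an assumption based on the permissions the post describes (streaming to Kinesis, reading and writing backups in S3).

```python
def agent_policy(stream_arn, backup_bucket_arn):
    """Illustrative IAM policy document for the Cerebro Agent role created
    during spoke account onboarding. ARNs are placeholders."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Stream health metrics from EC2 instances to the hub account.
                "Effect": "Allow",
                "Action": ["kinesis:PutRecord", "kinesis:PutRecords"],
                "Resource": stream_arn,
            },
            {   # Perform and store database backups in the backup bucket.
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
                "Resource": [backup_bucket_arn, f"{backup_bucket_arn}/*"],
            },
        ],
    }
```

Because the policy is generated per account, each spoke's agent role can be scoped to that account's own stream and backup bucket rather than sharing broad permissions.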
In addition to the IAM roles, portfolios are shared and imported in the spoke accounts, and the AMIs are shared separately. Permissions are granted to approved users to provision the products. All of the steps to onboard an account are packaged as a product within the Cerebro portfolio, and this is the first step on any new account that's expected to host databases. As an ongoing activity, new self-managed databases are onboarded to the Cerebro platform by creating product CloudFormation templates and adding health check, backup, and maintenance intelligence to the Cerebro Agent. For AWS managed databases, only product CloudFormation templates are created.
Provisioning new Products and updating existing Products
Let's look at how an end user, typically a developer at EG, provisions or launches new products using the Cerebro platform. From the spoke account, a developer with the appropriate IAM permissions can use the AWS Service Catalog console to see the list of products that they're allowed to deploy, as well as the list of existing provisioned products. An end user can simply choose and deploy a product from the list made available by the Cerebro admins, provide the minimal required parameter inputs, such as product name, database type, and tags, and then launch the product.
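Under the hood, that console launch corresponds to a `servicecatalog.provision_product` call. The following sketch builds its payload from the minimal inputs; the IDs, parameter names, and tag keys are illustrative, not Cerebro's actual values.

```python
def provision_request(product_id, artifact_id, name, params, tags):
    """Illustrative payload for servicecatalog.provision_product.
    `artifact_id` selects which version (provisioning artifact) of the
    product to launch; `params` and `tags` are plain dicts."""
    return {
        "ProductId": product_id,
        "ProvisioningArtifactId": artifact_id,
        "ProvisionedProductName": name,
        "ProvisioningParameters": [{"Key": k, "Value": v}
                                   for k, v in params.items()],
        "Tags": [{"Key": k, "Value": v} for k, v in tags.items()],
    }
```

Note that the tags are supplied at launch time, which is how a platform like Cerebro can make ownership tagging mandatory rather than optional.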
This solution lets developers simply launch the product and focus on their tasks, with no need to worry about or spend time on configurations and other steps that don't add value to their development work: the underlying networking, resiliency requirements, setting up monitoring and alerting, self-healing and fast scaling of clusters, 24×7 accountability, automated backups, patching, maintenance, and so on. These are all automatically handled by the Cerebro platform once the products are deployed.
Since the manageability and governance are already built into the Cerebro platform, the time to deploy a database cluster has been reduced from a few days to just a few minutes. As an example, in the following diagram, if a developer wants to deploy a MongoDB cluster, then the entire architecture supporting the cluster is abstracted from them, and just the endpoints are provided once the cluster is provisioned.
Existing provisioned products are updated using the Update Provisioned Product option in AWS Service Catalog. This option lets an admin easily perform operations such as scaling the cluster nodes, changing instance types, and updating database versions. For example, when a database version is updated, AWS Systems Manager and Ansible (initiated via the Cerebro UI) roll out the new version to the cluster nodes in a rolling fashion. The provisioned product is then updated so that any future data infrastructure uses the newer version of the database engine. Similarly, instance type changes can be performed by replacing instances after updating the provisioned product with a new instance type.
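An instance-type change like the one described maps onto a `servicecatalog.update_provisioned_product` call. The sketch below builds its payload, sending only the changed parameters and marking the rest with `UsePreviousValue` (part of the API's `UpdateProvisioningParameter` shape); the IDs and parameter names are illustrative.

```python
def update_request(provisioned_product_id, artifact_id, changed, unchanged):
    """Illustrative payload for servicecatalog.update_provisioned_product.
    `changed` is a dict of parameters to modify; `unchanged` is a list of
    parameter keys to keep at their previous values."""
    params = [{"Key": k, "Value": v} for k, v in changed.items()]
    # UsePreviousValue avoids restating values that shouldn't change.
    params += [{"Key": k, "UsePreviousValue": True} for k in unchanged]
    return {
        "ProvisionedProductId": provisioned_product_id,
        # Selecting a newer artifact here is what pins future launches
        # to the upgraded database engine version.
        "ProvisioningArtifactId": artifact_id,
        "ProvisioningParameters": params,
    }
```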
Current deployment Profile
As of this writing, Cerebro enables the launching of 15 products with 55 versions, and the platform supports 1,000+ provisioned products with 6,000+ nodes serving 700+ applications across different organizations within EG. Looking at the future roadmap, given the company's initiative to evolve toward a common database platform, more products are in line to be onboarded to Cerebro, and the platform is expected to grow 4x over the next two years while still being maintained by a two-pizza team.
Advantages of using the Cerebro platform
By implementing Cerebro, EG's Database Engineering team gained control of and visibility into thousands of database clusters across hundreds of AWS accounts, all while improving its posture by following best practices, enforcing governance, and meeting compliance requirements. Additional benefits include improved uptime from Cerebro's self-healing capability, better application tracking through mandatory tag requirements, reduced infrastructure provisioning time, and a unified solution for monitoring, patching, and other maintenance across hundreds of AWS accounts.
Enterprises that run large-scale cloud services across their internal organizations often find it challenging to implement an agile methodology: teams run the risk of working in silos that make it difficult to control costs and enforce best practices. In this post, we shared how EG built a centralized database platform, Cerebro, using AWS native services. Cerebro helped Expedia administrators standardize the deployment of new databases and simultaneously gain more visibility into provisioned products across accounts while enforcing best practices, governance, and compliance.
If you’d like to learn more about Cerebro’s building blocks, then check out AWS Service Catalog, CloudFormation, and this blog on building a Hub and Spoke model with AWS Service Catalog.
Dilip Kolasani is a principal database engineer at Expedia based out of Austin. He has over 12 years of experience in managing commercial & open-source databases. His core interests include architecting highly scalable and resilient database systems. He holds numerous certifications in both Relational & NoSQL databases.
Chakravarthy Kotaru is a senior technology leader at Expedia Group based out of Austin, Texas. He has over 15 years of experience in the IT industry and is currently helping build the world's best travel data platform at Expedia Group. He has amassed extensive experience in database reliability engineering and platform and infrastructure services at various companies, including Walt Disney, State Farm, and Sears. He holds a master's degree in computer science along with over a dozen certifications in various database and cloud technologies.
Suresh Raavi is a Solutions Architect at AWS based out of Seattle. He is currently working with AWS strategic customers to help them craft highly scalable, flexible, and resilient cloud architectures on their digital transformation journey in the cloud. He has amassed extensive experience in cloud technologies, automation, and infrastructure at various companies, including Microsoft. He holds numerous AWS certifications and is a Dale Carnegie graduate in effective communications and human relations.
Dayo Ogunleke is a Solutions Architect in the Strategic Accounts organization at Amazon Web Services based out of Austin. He has a background in networking and end-user computing. He's passionate about helping customers build cloud-native solutions to support workloads.