Performing controlled chaos experiments on your Amazon Relational Database Service (RDS) database instances and validating the application behavior is essential to making sure that your application stack is resilient. How does the application behave when there is a database failover? Will the connection pooling solution or tools being used gracefully connect after a database failover is successful? Will there be a cascading failure if the database node gets rebooted for a few seconds? These are some of the fundamental questions that you should consider when evaluating the resiliency of your database stack. Chaos engineering is a way to effectively answer these questions.

Traditionally, database failure conditions, such as a failover or a node reboot, are often triggered using a script or 3rd party tools. However, at scale, these external dependencies often become a bottleneck and are hard to maintain and manage. Scripts and 3rd party tools can fail when called, whereas a web service is highly available. The scripts and 3rd party tools also tend to require elevated permissions to work, which is a management overhead and insecure from a least privilege access model perspective. This is where AWS Fault Injection Simulator (FIS) comes to the rescue.

AWS Fault Injection Simulator (AWS FIS) is a fully managed service for running fault injection experiments on AWS that makes it easier to improve an application’s performance, observability, and resiliency. Fault injection experiments are used in chaos engineering, which is the practice of stressing an application in testing or production environments by creating disruptive events, such as a sudden increase in CPU or memory consumption, database failover and observing how the system responds, and implementing improvements.

We can define the key phases of chaos engineering as identifying the steady state of the workload, defining a hypothesis, running the experiment, verifying the experiment results and making necessary improvements based on the experiment results. These phases will confirm that you are injecting failures in a controlled environment through well-planned experiments in order to build confidence in the workloads and tools we are using to withstand turbulent conditions.

This diagram explains the phases of chaos engineering. We start with identifying the steady state, defining a hypothesis, run the experiment, verify the experiment results and improve. This is a cycle.

Example—

  • Baseline: we have a managed database with a replica and automatic failover enabled.
  • Hypothesis: failure of a single database instance / replica may slow down a few requests but will not adversely affect our application.
  • Run experiment: trigger a DB failover.
  • Verify: confirm/dis-confirm the hypothesis by looking at KPIs for the application (e.g., via CloudWatch metric/alarm).

Methodology and Walkthrough

Let’s look at how you can configure AWS FIS to perform failure conditions for your RDS database instances. For this walkthrough, we’ll look at injecting a cluster failover for Amazon Aurora PostgreSQL. You can leverage an existing Aurora PostgreSQL cluster or you can launch a new cluster by following the steps in the Create an Aurora PostgreSQL DB Cluster documentation.

Step 1: Select the Aurora Cluster.

The Aurora PostgreSQL instance that we’ll use for this walkthrough is provisioned in us-east-1 (N. Virginia), and it’s a cluster with two instances. There is one writer instance and another reader instance (Aurora replica). The cluster is named chaostest, the writer instance is named chaostest-instance-1, and the reader is named chaostest-intance-1-us-east-1a.

Under RDS Databases Section, the cluster named chaostest is selected. Under the cluster there are two instances which is available. Chaostest-instance-1 is the writer instance and chaostest-instance-1-us-east-1a is the reader instance.

The goal is to simulate a failover for this Aurora PostgreSQL cluster so that the existing chaostest-intance-1-us-east-1a reader instance will switch roles and then be promoted as the writer, and the existing chaostest-instance-1 will become the reader.

Step 2: Navigate to the AWS FIS console.

We will now navigate to the AWS FIS console to create an experiment template. Select Create experiment template.

Under FIS console, Create experiment template needs to be selected.

Step 3: Complete the AWS FIS template pre-requisites.

Enter a Description, Name, and select the AWS IAM Role for the experiment template.

Under create experiment template section, Simulate Database Failover is entered for the Description field. DBFailover is entered for the Name field(Optional). FISWorkshopServiceRole is selected for the IAM role drop down field.

The IAM role selected above was pre-created. To use AWS FIS, you must create an IAM role that grants AWS FIS the permissions required so that the service can run experiments on your behalf. The role follows the least privileged model and includes permissions to act on your database clusters like trigger a failover. AWS FIS only uses the permissions that have been delegated explicitly for the role. To learn more about how to create an IAM role with the required permissions for AWS FIS, refer to the FIS documentation.

Step 4: Navigate to the Actions, Target, Stop Condition section of the template.

The next key section of AWS FIS is Action, Target, and Stop Condition.

Action, Target and Stop Conditions section is highlighted in the image.

Action—An action is an activity that AWS FIS performs on an AWS resource during an experiment. AWS FIS provides a set of pre-configured actions based on the AWS resource type. Each Action runs for a specified duration during an experiment, or until you stop the experiment. An action can run sequentially or in parallel.

For our experiment, the Action will be aws:rds:failover-db-cluster.

Target—A target is one or more AWS resources on which AWS FIS performs an action during an experiment. You can choose specific resources or select a group of resources based on specific criteria, such as tags or state.

For our experiment, the target will be the chaostest Aurora PostgreSQL cluster.

Stop Condition—AWS FIS provides the controls and guardrails that you need to run experiments safely on your AWS workloads. A stop condition is a mechanism to stop an experiment if it reaches a threshold that you define as an Amazon CloudWatch alarm. If a stop condition is triggered while the experiment is running, then AWS FIS stops the experiment.

For our experiment, we won’t be defining a stop condition. This is because this simple experiment contains only one action. Stop conditions are especially useful for experiments with a series of actions, to prevent them from continuing if something goes wrong.

Step 5: Configure Action.

Now, let’s configure the Action and Target for our experiment template. Under the Actions section, we will select Add action to get the New action window.

The action section displays Name,Description, Action Type and Start After fields. There is a Add action button that needs to be selected.

Enter a Name, a Description, and select Action type aws:rds:failover-db-cluster. Start after is an optional setting. This setting allows you to specify an action that should precede the one we are currently configuring.

Under Actions section, DBFailover is entered for the Name field. DB Failover Action is entered for the Description field. aws:rds:failover-db-cluster is entered for the Action Type field. Start after field is left blank and Clusters-Target-1 is selected for the Target field. The save button will save the info entered for the respective fields.

Step 6: Configure Target.

Note that a Target has been automatically created with the name Clusters-Target-1. Select Save to save the action.

Next, you will edit the Clusters-Target-1 target to select the target, i.e., the Aurora PostgreSQL cluster.

Under Targets section, the edit button for Clusters-Target-1 is highlighted.

Select Target method as Resource IDs, and select the chaostest cluster.  If you are interested to select a group of resources, then select Resource tags, filters and parameters option.

Under Edit Target section, Clusters-Target-1 is selected for the Name field. aws:rds:cluster is selected for the Resource type field. Action is set as DBFailover. Resource IDs is selected for the Target Method field. Chaostest cluster is selected for the Resource IDs drop down box. Save button is also available in this section to save the configuration.

Step 7: Create the experiment template to complete this stage.

We will wrap up the process by selecting the create experiment template.

Create experiment template option is highlighted. The user will click this button to proceed creating a template.

We will get a warning stating that a stop condition isn’t defined. We’ll enter create in the provided field to create the template.

After selecting the create experiment template option in the previous screen, the user is prompted to enter "create" in the field to proceed.

We will get a success message if the entries are correct and the template will be successfully created.

"You successfully created experiment template" success message is displayed in the screen and its highlighted in green color.

Step 8: Verify the Aurora Cluster.

Before we run the experiment, let’s double-check the chaostest Aurora Cluster to confirm which instance is the writer and which is the reader.

Under the RDS section, chaos cluster is listed. The user is confirming that chaostest-instance-1 is the writer and chaostest-instance-1-us-east-1a is the reader.

We confirmed that chaostest-instance-1 is the writer and chaostest-instance-1-us-east-1a is the reader.

Step 9: Run the AWS FIS experiment.

Now we’ll run the FIS experiment. Select Actions, and then select Start for the experiment template.

Under the experiment template section, the Simulate Database Failover template is selection. Under the actions section, option Start is selected to start the experiment. The other options under Actions section includes Update, Manage tags and Delete.

Select Start experiment and you’ll get another warning to confirm if you really want to start this experiment. Confirm by entering start say Start experiment.

Under the Start Experiement section, user is promoted to enter "start" in the field to start the experiement.

Step 10: Observe the various stages of the experiment.

The experiment will be in initiating, running and will eventually be in completed states.

The experiment is in initiating state.

The experiment is in Complete state.

Step 11: Verify the Aurora Cluster to confirm failover.

Now let’s look at the chaostest Aurora PostgreSQL cluster to check the state. Note that a failover was indeed triggered by FIS and chaostest-instance-1-us-east-1a is the newly promoted writer and chaostest-instance-1 is the reader now.

Under RDS Section, Chaostest cluster is shown. This time, the writer is chaostest-instance-1-us-east-1a.

Step 12: Verify the Aurora Cluster logs.

We can also confirm the failover action by looking at the Logs and events section of the Aurora Cluster.

Under the Recent Events section of the chaos-test cluster, the failover messages is displayed. One of the messages lists "Started cross AZ failover to DB instance:chaostest-instance-1-us-east-1a. This confirms that the experiment was successful.

Clean up

If you created a new Aurora PostgreSQL cluster for this walkthrough, then you can terminate the cluster to optimize the costs by following the steps in the Deleting an Aurora DB cluster documentation.

You can also delete the AWS FIS experiment template by following the steps in the Delete an experiment template documentation.

You can refer to the AWS FIS documentation to learn more about the service. If you want to know more about chaos engineering, check out the AWS re:Invent session Testing resiliency using chaos engineering and The Chaos Engineering Collection. Finally, check out the following GitHub repo for additional example experiments, and how you can work with AWS FIS using the AWS Cloud Development Kit (AWS CDK).

Conclusion

In this walkthrough, you learned how you can leverage AWS FIS to inject failures into your RDS Instances. To get started with AWS Fault Injection Service for Amazon RDS, refer to the service documentation.

Author:

Anup Sivadas

Anup Sivadas is a Principal Solutions Architect at Amazon Web Services and is based out of Arlington, Virginia. With 18 + years in technology, Anup enjoys working with AWS customers and helps them craft highly scalable, performing, resilient, secure, sustainable and cost-effective cloud architectures. Outside work, Anup’s passion is to travel and explore the nature with his family.