Amazon CloudWatch Alarms is natively integrated with Amazon CloudWatch metrics. Many AWS services send metrics to CloudWatch, and AWS also offers several ways to emit your applications’ metrics as custom metrics. CloudWatch Alarms lets you monitor a metric and detect when it crosses a static threshold or falls outside an anomaly detection band. It also lets you monitor the combined result of multiple alarms through composite alarms. CloudWatch Alarms then automatically initiates actions when an alarm’s state changes between OK, ALARM, and INSUFFICIENT_DATA.
The most commonly used alarm action notifies the relevant people or triggers downstream automation by sending a message to an Amazon Simple Notification Service (SNS) topic. CloudWatch Alarms is designed to invoke alarm actions only when a state change happens. The one exception is Auto Scaling actions, which keep being invoked periodically while the alarm remains in the state configured for the action.
There are scenarios where repeated notifications on certain critical alarms are useful, so that the responsible team is alerted and can act promptly. In this post, I will show you how to use Amazon EventBridge, AWS Step Functions, and AWS Lambda to enable repeated alarm notification on selected CloudWatch alarms. I will also discuss other customization use cases that can be built on alarm state change events with the same solution model.
Since 2019, Amazon EventBridge has integrated with Amazon CloudWatch so that when a CloudWatch alarm’s state changes, a corresponding CloudWatch alarm state change event is sent to the EventBridge service. You can create an EventBridge rule with a customized rule pattern to capture
- all alarms’ state change events,
- alarms’ transitions to particular states,
- or state change events of alarms whose names share a certain prefix.
When an event matches, the rule invokes downstream automation to process the alarm’s state change event. This solution uses an AWS Step Functions state machine to orchestrate the repeated alarm notification workflow.
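The three kinds of rule pattern listed above can be sketched as plain dictionaries that serialize to EventBridge event patterns. This is a minimal illustration, not the pattern shipped with the solution; the helper names and the `prod-` prefix are hypothetical.

```python
import json

# Base pattern: every CloudWatch alarm state change event.
ALL_STATE_CHANGES = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
}


def transitions_to(state):
    """Pattern for alarms transitioning to one state (OK, ALARM, INSUFFICIENT_DATA)."""
    return {**ALL_STATE_CHANGES, "detail": {"state": {"value": [state]}}}


def names_with_prefix(prefix):
    """Pattern for state changes of alarms whose name starts with `prefix`."""
    return {**ALL_STATE_CHANGES, "detail": {"alarmName": [{"prefix": prefix}]}}


if __name__ == "__main__":
    # Print the pattern as the JSON document an EventBridge rule expects.
    print(json.dumps(transitions_to("ALARM"), indent=2))
    print(json.dumps(names_with_prefix("prod-"), indent=2))  # hypothetical prefix
```

Keeping the base pattern separate makes it easy to narrow a rule later without retyping the source and detail-type filters.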
In this solution, you enable repeated alarm notification by applying a specific tag to the CloudWatch alarm resources. Within the state machine, a Lambda function queries the tags of the triggered alarm and continues processing only when the specific tag <key:value> is present. This approach also lets you create a centralized view of all alarms with repeated notification enabled by creating a tag-based resource group. The resource group is an optional part of this solution.
This solution is deployed as an AWS Cloud Development Kit (CDK) application that deploys the resources highlighted within the blue rectangle in the following diagram to your AWS account. These resources are:
- An EventBridge rule to capture all of the alarms’ state change events.
- A Lambda function to check the alarm’s tag, describe the alarm’s current state, and send notifications to existing SNS alarm actions on the alarm.
- A Step Functions state machine with a Wait state, a Task state that invokes the previously mentioned Lambda function, and a Choice state.
- Two AWS Identity and Access Management (IAM) roles, one for EventBridge to invoke the state machine and one for Lambda to perform the required actions.
- (Optional) A tag-based resource group including all of the CloudWatch alarms with the feature enablement tag.
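To make the state machine in the list above more concrete, here is a rough sketch of a Wait, Task, and Choice loop written as an Amazon States Language definition built in Python. The state names follow the figure labels used in this post; the `$.currentState` field, the `Done` state, and the function name are assumptions for illustration, and the sketch assumes the Lambda function always returns that field.

```python
import json


def build_state_machine(lambda_arn, wait_seconds=300):
    """Return an Amazon States Language definition for the Wait -> Task -> Choice loop."""
    return {
        "StartAt": "Wait X Seconds",
        "States": {
            "Wait X Seconds": {
                "Type": "Wait",
                "Seconds": wait_seconds,  # time between two notifications
                "Next": "Check alarm tag and status",
            },
            "Check alarm tag and status": {
                "Type": "Task",
                "Resource": lambda_arn,  # the Lambda function invoked as a task
                "Next": "Is alarm still in ALARM state?",
            },
            "Is alarm still in ALARM state?": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.currentState",  # assumed Lambda output field
                        "StringEquals": "ALARM",
                        "Next": "Wait X Seconds",  # loop for another notification
                    }
                ],
                "Default": "Done",
            },
            "Done": {"Type": "Succeed"},  # hypothetical terminal state
        },
    }


if __name__ == "__main__":
    # Placeholder ARN; substitute the ARN of your own function.
    arn = "arn:aws:lambda:us-east-1:123456789012:function:example"
    print(json.dumps(build_state_machine(arn), indent=2))
```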
This solution works as follows:
- A CloudWatch alarm is triggered and goes into the ALARM state.
- The alarm sends the first notification to its associated SNS alarm actions.
- The CloudWatch Alarms service sends an alarm state change event, which triggers the EventBridge rule. The rule pattern used, shown as follows, captures all alarms’ transitions to the ALARM state.
- When an event matches, the EventBridge rule invokes the Step Functions state machine target.
- Once the state machine starts executing, it first enters a Wait state (“Wait X Seconds” as shown in the following figure). The wait period can be configured in the CDK application and is passed to the state machine definition.
- Next, the state machine enters the Lambda invocation task (“Check alarm tag and status” in the figure below). The Lambda function:
  - Checks whether the alarm has the specific tag key and value (e.g., RepeatedAlarm:true). If not, the function exits.
  - Checks the alarm’s current state by calling the DescribeAlarms API with the alarm name.
  - Publishes the alarm’s status returned by the DescribeAlarms API call to all of the SNS topics configured as alarm actions on the alarm.
  - Returns the alarm’s current state, together with the original event, to the state machine.
- The Choice state (“Is alarm still in ALARM state?” in the figure below) checks the alarm state returned by the Lambda function. If the state is ‘ALARM’, the workflow goes back to the Wait state; otherwise, the execution ends.
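The Lambda task in the workflow above could look roughly like the following sketch. It is not the function shipped with the solution: the function name, the `currentState` field, the `NOT_TAGGED` and `DELETED` markers, and the subject text are hypothetical, and publishing only while the alarm is still in ALARM is an assumption. The `cloudwatch` and `sns` arguments are expected to behave like boto3 clients, which keeps the logic testable without AWS access.

```python
import json


def check_and_notify(event, cloudwatch, sns, tag_key="RepeatedAlarm", tag_value="true"):
    """Sketch of the "Check alarm tag and status" task."""
    alarm_arn = event["resources"][0]
    alarm_name = event["detail"]["alarmName"]

    # 1. Exit early unless the alarm carries the enabling tag.
    tags = cloudwatch.list_tags_for_resource(ResourceARN=alarm_arn)["Tags"]
    if not any(t["Key"] == tag_key and t["Value"] == tag_value for t in tags):
        return {"event": event, "currentState": "NOT_TAGGED"}  # hypothetical marker

    # 2. Read the alarm's current state (metric and composite alarms).
    found = cloudwatch.describe_alarms(
        AlarmNames=[alarm_name], AlarmTypes=["MetricAlarm", "CompositeAlarm"]
    )
    alarms = found.get("MetricAlarms", []) + found.get("CompositeAlarms", [])
    if not alarms:
        return {"event": event, "currentState": "DELETED"}  # hypothetical marker
    alarm = alarms[0]

    # 3. Re-publish to every SNS topic configured as an alarm action.
    #    Assumption: notify only while the alarm is still in ALARM.
    if alarm["StateValue"] == "ALARM":
        for action_arn in alarm.get("AlarmActions", []):
            if action_arn.startswith("arn:aws:sns:"):  # skip non-SNS actions
                sns.publish(
                    TopicArn=action_arn,
                    Subject=f"Repeated ALARM: {alarm_name}"[:100],  # SNS subject limit
                    Message=json.dumps(alarm, default=str),
                )

    # 4. Hand the state and the original event back to the state machine.
    return {"event": event, "currentState": alarm["StateValue"]}
```

In a real Lambda handler, the two clients would typically be created once with `boto3.client("cloudwatch")` and `boto3.client("sns")` and passed in.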
The repeated notification for an alarm within the workflow above stops when:
- The alarm transitions to a non-ALARM state.
- The alarm is deleted.
- The specific tag is removed from the alarm.
Now, let’s deploy the solution and see how it works. You need the following prerequisites:
- AWS Account with AWS Command Line Interface (CLI) access
- Node.js 10.13 or later
- AWS CDK
- Docker service (in running state when performing the following steps)
Step 1: Deploy solution using AWS CDK
Before you can deploy a CDK application, make sure that you have the AWS CDK CLI installed and your AWS account bootstrapped, as described here. Then, run the following command from your terminal to download the solution code and deploy it:
- RepeatedNotificationPeriod: The time in seconds between two consecutive notifications from an alarm. The default is set to 300 in the CDK code.
- TagForRepeatedNotification: The tag used to enable repeated notification on an alarm. It must be in a key:value pair. The default for this parameter is RepeatedAlarm:true.
- RequireResourceGroup: Whether or not to create a tag-based resource group to monitor all of the CloudWatch Alarms with repeated notification enabled. Allowed values: true/false.
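As an illustration of how these three parameters constrain each other, the sketch below validates them and splits the tag into its key and value. It is only a model of the rules stated above, not code from the CDK application; `RepeatSettings` and `parse_settings` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatSettings:
    period_seconds: int
    tag_key: str
    tag_value: str
    require_resource_group: bool


def parse_settings(period, tag, require_group):
    """Validate the three deployment parameters given as strings."""
    seconds = int(period)
    if seconds <= 0:
        raise ValueError("RepeatedNotificationPeriod must be a positive number of seconds")

    # The tag must be a key:value pair; split at the first colon.
    key, sep, value = tag.partition(":")
    if not sep or not key or not value:
        raise ValueError("TagForRepeatedNotification must look like key:value")

    if require_group not in ("true", "false"):
        raise ValueError("RequireResourceGroup must be true or false")

    return RepeatSettings(seconds, key, value, require_group == "true")


if __name__ == "__main__":
    # Defaults mentioned in the text: 300 seconds and RepeatedAlarm:true.
    print(parse_settings("300", "RepeatedAlarm:true", "true"))
```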
Step 2: Wait for the deployment to finish
Because this is a new deployment, you will see a summary of the IAM resources to be created in the target account. These IAM resources are used by the components in the solution. No changes are made to any existing IAM resources in your account. Review the changes and enter “y” to continue the deployment.
You will then see the progress of the deployment in your terminal. Wait for it to finish. You can also follow the progress in the AWS CloudFormation console.
Step 3: Test the solution
Once the deployment completes, you can test the solution on an alarm by applying the tag that you used.
- Find a test alarm that is currently in the ALARM state and has SNS alarm actions associated with it.
- Apply the tag on the selected alarm with the following AWS CLI command:
- Manually set the alarm state to OK by using the set-alarm-state CLI command:
- Wait for the next alarm evaluation. For a standard alarm, it will re-evaluate within one minute and transition to its actual state.
- Verify that you receive the ALARM notification every five minutes. The repeated notification will have a subject similar to the following:
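For the tagging and state-reset steps in the list above, a programmatic equivalent can be sketched in Python. The two helpers below are hypothetical wrappers around the CloudWatch `TagResource` and `SetAlarmState` operations; `cloudwatch` is assumed to be a boto3-style client, and the state reason text is a placeholder.

```python
def enable_repeated_notification(cloudwatch, alarm_arn,
                                 tag_key="RepeatedAlarm", tag_value="true"):
    """Apply the enabling tag to an alarm (TagResource)."""
    cloudwatch.tag_resource(
        ResourceARN=alarm_arn,
        Tags=[{"Key": tag_key, "Value": tag_value}],
    )


def reset_alarm_to_ok(cloudwatch, alarm_name):
    """Temporarily force the alarm to OK (SetAlarmState).

    The alarm returns to its actual state at the next evaluation, which
    produces a fresh transition to ALARM for the rule to capture.
    """
    cloudwatch.set_alarm_state(
        AlarmName=alarm_name,
        StateValue="OK",
        StateReason="Testing repeated alarm notification",  # placeholder text
    )
```

With boto3 installed and credentials configured, the client would be created with `boto3.client("cloudwatch")` and passed to both helpers.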
Step 4: (Optional) View the alarms in the resource group
AWS Resource Groups lets you search and group AWS resources based on tags. In this step, I will show you how to use it to get a centralized view of all alarms with repeated notification enabled.
- Go to the Resource Groups & Tag Editor console.
- If you selected “true” for RequireResourceGroup when deploying the CDK code, then you will see a tag-based resource group named “repeatedAlarmsGroup”.
- You can now view all of the alarms with repeated notification enabled.
Step 5: Disable repeated notification and untag the alarm
Run the following CLI command to untag the CloudWatch alarm. You should see the alarm disappear from the resource group created in the previous step as well:
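A programmatic equivalent of the untag step can be sketched in Python as follows. The helper name is hypothetical, `cloudwatch` is assumed to be a boto3-style client, and the call maps to the CloudWatch `UntagResource` operation.

```python
def disable_repeated_notification(cloudwatch, alarm_arn, tag_key="RepeatedAlarm"):
    """Remove the enabling tag so the workflow stops at its next tag check."""
    cloudwatch.untag_resource(ResourceARN=alarm_arn, TagKeys=[tag_key])
```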
In April 2021, we added support for cross-region event routing in EventBridge. With this feature, you only need to deploy this solution in one supported destination region to process the repeated notification workflow for alarms in any commercial AWS Region. The supported destination regions are listed here. The solution is shown in the following diagram:
This framework lets you centralize alarm state change events from any commercial region in a single supported region. This significantly reduces the operational overhead of resource management and troubleshooting.
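One way to picture the cross-region setup is a small rule in each source region whose only target is the event bus in the central region. The sketch below builds the request parameters for the EventBridge `PutRule` and `PutTargets` operations; the function name and the target ID are hypothetical, and the ARNs passed in are placeholders you would replace with your own.

```python
import json


def forwarding_rule_requests(rule_name, destination_bus_arn, role_arn):
    """Build PutRule and PutTargets parameters for a source-region rule."""
    pattern = {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"state": {"value": ["ALARM"]}},
    }
    put_rule = {
        "Name": rule_name,
        "EventPattern": json.dumps(pattern),
        "State": "ENABLED",
    }
    put_targets = {
        "Rule": rule_name,
        "Targets": [
            {
                "Id": "central-bus",  # hypothetical target ID
                "Arn": destination_bus_arn,  # event bus in the central region
                "RoleArn": role_arn,  # role EventBridge assumes to put events
            }
        ],
    }
    return put_rule, put_targets
```

With a boto3 EventBridge client for the source region, the two dictionaries would be passed as `events.put_rule(**put_rule)` and `events.put_targets(**put_targets)`.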
You can also use Amazon EventBridge to capture alarm state change events and orchestrate downstream workflows that perform more advanced alarm processing tasks, using the various targets that EventBridge supports. For example, you can enrich, format, or pretty-print the alarm message, or execute playbooks with a Lambda function target or an SSM automation.
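As a small example of the formatting use case, the sketch below turns an alarm state change event into a short subject and a readable message body, which a Lambda function target could then publish. The function name and the layout of the message are hypothetical; the fields it reads are the ones carried in the alarm state change event.

```python
def format_alarm_message(event):
    """Return (subject, body) for a CloudWatch alarm state change event."""
    detail = event["detail"]
    state = detail["state"]
    previous = detail.get("previousState", {}).get("value", "UNKNOWN")

    subject = f'{state["value"]}: "{detail["alarmName"]}" in {event["region"]}'
    lines = [
        f"Alarm:      {detail['alarmName']}",
        f"Account:    {event['account']}",
        f"Region:     {event['region']}",
        f"Transition: {previous} -> {state['value']}",
        f"Reason:     {state.get('reason', 'n/a')}",
        f"Time:       {event['time']}",
    ]
    # SNS subjects are limited to 100 characters.
    return subject[:100], "\n".join(lines)
```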
To avoid additional infrastructure costs from the examples described in this post, make sure to delete all of the resources that you created. You can clean up the resources by running the following command:
In addition, the Lambda function created in this solution writes to a CloudWatch Logs log group with the prefix “/aws/lambda/RepeatedCloudWatchAlarm”. Make sure to delete the log group to avoid CloudWatch Logs storage charges.
In this post, I’ve provided a solution that enables repeated notifications on CloudWatch alarms by using the alarm’s state change event with Amazon EventBridge and AWS Step Functions. With this solution, you are less likely to miss mission-critical alarms, and you can improve your incident response time. The same framework can also be extended to handle more advanced alarm processing tasks.