Amazon CloudWatch Alarms is natively integrated with Amazon CloudWatch metrics. Many AWS services send metrics to CloudWatch, and AWS also offers several ways to emit your applications’ metrics as custom metrics. CloudWatch Alarms lets you monitor a metric and detect when it crosses a static threshold or falls outside an anomaly detection band. It also lets you monitor the combined result of multiple alarms. CloudWatch Alarms then automatically initiates actions when an alarm’s state changes between OK, ALARM, and INSUFFICIENT_DATA.

The most commonly used alarm action is to notify a person of interest or trigger downstream automation by sending a message to an Amazon Simple Notification Service (SNS) topic. CloudWatch Alarms is designed to invoke alarm actions only when a state change happens. The one exception is Auto Scaling actions, which keep being invoked periodically while the alarm remains in the state that was configured for the action.

There are scenarios where you may find it useful to have repeated notifications on certain critical alarms so that the corresponding team is alerted to take action promptly. In this post, I will show you how to use Amazon EventBridge, AWS Step Functions, and AWS Lambda to enable repeated alarm notification on selected CloudWatch alarms. I will also discuss other customization use cases that can be built on alarm state change events using the same solution model.

Overview

Since 2019, Amazon EventBridge has integrated with Amazon CloudWatch so that when a CloudWatch alarm’s state changes, a corresponding CloudWatch alarm state change event is sent to the EventBridge service. You can create an EventBridge rule with a customized rule pattern to capture

  • all of the alarms’ state change events,
  • the alarms’ transitions to particular states,
  • and state change events of alarms whose names share a certain prefix.

When an event matches, the rule invokes downstream automation to process the alarm’s state change event. This solution uses an AWS Step Functions state machine to orchestrate the repeated alarm notification workflow.
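The three kinds of rule pattern listed above can be sketched as a small builder. This is an illustration rather than code from the solution: the function name `alarm_state_change_pattern` and the `critical-` prefix are hypothetical, while the `source`, `detail-type`, `detail.state.value`, and `detail.alarmName` fields follow the documented shape of the CloudWatch alarm state change event.

```python
import json


def alarm_state_change_pattern(states=None, name_prefix=None):
    """Build an EventBridge event pattern for CloudWatch alarm state changes.

    states:      optional list of target states, e.g. ["ALARM"]
    name_prefix: optional alarm-name prefix, e.g. "critical-"
    """
    pattern = {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
    }
    detail = {}
    if states:
        # Match only transitions into one of the given states.
        detail["state"] = {"value": list(states)}
    if name_prefix:
        # EventBridge prefix matching on the alarm name.
        detail["alarmName"] = [{"prefix": name_prefix}]
    if detail:
        pattern["detail"] = detail
    return pattern


all_changes = alarm_state_change_pattern()
to_alarm = alarm_state_change_pattern(states=["ALARM"])
critical_only = alarm_state_change_pattern(name_prefix="critical-")

print(json.dumps(to_alarm, indent=2))
```

The JSON printed for `to_alarm` can be pasted into the event pattern editor of an EventBridge rule.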

In this solution, we enable repeated alarm notification by applying a specific tag to the CloudWatch alarm. Within the state machine, a Lambda function queries the tags of the triggered alarm and continues processing only when the specific tag (key:value) is present. Moreover, this approach lets you create a centralized view of all of the alarms with repeated alarm notification enabled by creating a tag-based resource group. The resource group is included as an optional part of this solution.

Solution Architecture

This solution is deployed as an AWS Cloud Development Kit (CDK) application that deploys the resources highlighted within the blue rectangle in the following diagram to your AWS account. These resources are:

  • An EventBridge rule to capture all of the alarms’ state change events.
  • A Lambda function to check the alarm’s tag, describe the alarm’s current state, and send notifications to existing SNS alarm actions on the alarm.
  • A Step Functions state machine with a Wait state, the previously mentioned Lambda function as a task, and a Choice state.
  • Two AWS Identity and Access Management (IAM) roles: one for EventBridge to invoke the state machine, and one for Lambda to perform the required actions.
  • (Optional) A tag-based resource group including all of the CloudWatch alarms with the feature enablement tag.

Architecture is explained further in the post.

This solution works as follows:

  1. A CloudWatch alarm is triggered and goes into the ALARM state.
  2. The CloudWatch alarm sends the first alarm notification to the associated SNS alarm actions.
  3. The CloudWatch Alarms service sends an alarm state change event, which triggers the EventBridge rule. The rule pattern, shown as follows, captures every alarm’s transition to the ALARM state.

EventBridge rule pattern used to capture CloudWatch alarms’ transition to ALARM state event
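The pattern figure is not reproduced in this text. As a sketch, a pattern matching the description above (all alarms, transitions to ALARM only) would look like `ALARM_PATTERN` below. The sample event is abbreviated and uses made-up account and alarm names, and `matches` is a deliberately simplified, hypothetical stand-in for EventBridge's matching logic that handles only exact-value matching.

```python
# Pattern: any CloudWatch alarm whose new state is ALARM.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

# Abbreviated example event (placeholder account ID and alarm name).
SAMPLE_EVENT = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "region": "us-east-1",
    "resources": ["arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-alarm"],
    "detail": {
        "alarmName": "example-alarm",
        "previousState": {"value": "OK"},
        "state": {"value": "ALARM"},
    },
}


def matches(pattern, event):
    """Simplified matcher: every pattern key must exist in the event,
    nested objects are matched recursively, and a leaf list means
    'the event value must be one of these'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


print(matches(ALARM_PATTERN, SAMPLE_EVENT))  # prints True
```

An event for the same alarm returning to OK would not match, so only transitions into ALARM start the workflow.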

  4. On a matching event, the EventBridge rule invokes the Step Functions state machine target.
  5. Once the state machine starts execution, it first enters a Wait state (“Wait X Seconds” as shown in the following figure). The wait period can be configured in the CDK application and is passed to the state machine definition.
  6. Then, the state machine enters the Lambda invocation task (“Check alarm tag and status” in the figure below). The Lambda function:
    1. Checks if the alarm has the specific tag key and value (e.g., RepeatedAlarm:true). If not, the function exits.
    2. Checks the alarm’s current state by calling the DescribeAlarms API with the alarm name.
    3. Publishes the alarm’s status returned from the DescribeAlarms API call to all of the SNS topics configured as alarm actions on the alarm.
    4. Returns the alarm’s current state together with the original received event to the state machine.
  7. The Choice state (“Is alarm still in ALARM state?” in the figure below) checks the alarm state returned by the Lambda function. If the state is ‘ALARM’, the workflow goes back to the Wait state; otherwise, the state machine’s execution ends.
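The Lambda step in the workflow above can be sketched roughly as follows. This is not the function shipped in the repository: the name `check_alarm`, the `currState` output field, and the decision to publish only while the alarm is still in ALARM are assumptions made for illustration. The clients are passed in as arguments (in Lambda they would be `boto3.client("cloudwatch")` and `boto3.client("sns")`), and error handling, for example for an alarm deleted between invocations, is left out.

```python
import json


def check_alarm(event, cloudwatch, sns, tag_key="RepeatedAlarm", tag_value="true"):
    """Re-notify an alarm's SNS actions if it is tagged and still in ALARM.

    Returns the original event plus a 'currState' field (hypothetical name)
    that a Choice state can inspect.
    """
    alarm_arn = event["resources"][0]
    alarm_name = event["detail"]["alarmName"]
    result = dict(event, currState="UNKNOWN")

    # 1. Exit early unless the opt-in tag is present on the alarm.
    tags = cloudwatch.list_tags_for_resource(ResourceARN=alarm_arn).get("Tags", [])
    if not any(t["Key"] == tag_key and t["Value"] == tag_value for t in tags):
        return result

    # 2. Look up the alarm's current state (metric or composite alarm).
    response = cloudwatch.describe_alarms(
        AlarmNames=[alarm_name],
        AlarmTypes=["MetricAlarm", "CompositeAlarm"],
    )
    alarms = response.get("MetricAlarms", []) + response.get("CompositeAlarms", [])
    if not alarms:
        return result  # alarm no longer exists
    alarm = alarms[0]
    result["currState"] = alarm["StateValue"]

    # 3. Publish to every SNS topic configured as an alarm action.
    if alarm["StateValue"] == "ALARM":
        subject = f"ALARM: {alarm_name} remains in ALARM state in {event['region']}"
        for action_arn in alarm.get("AlarmActions", []):
            if action_arn.split(":")[2] == "sns":
                sns.publish(
                    TopicArn=action_arn,
                    Subject=subject[:100],  # keep within the SNS subject limit
                    Message=json.dumps(alarm, default=str),
                )

    # 4. Hand the state and the original event back to the state machine.
    return result
```

Returning the event with the extra field, rather than a new object, means the same input shape flows back into the function on each loop iteration.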

Figure explained previously.
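For readers without the figure, the Wait, Task, and Choice loop can be expressed as an Amazon States Language definition. The sketch below builds one as a Python dictionary. The state names come from the figure labels quoted above, while `build_state_machine_definition`, the `$.currState` path, and the `Done` state are hypothetical. The real definition is generated by the CDK application and may differ.

```python
import json


def build_state_machine_definition(lambda_arn, wait_seconds=300):
    """Return an Amazon States Language definition for the notification loop."""
    return {
        "Comment": "Repeat alarm notification while the alarm stays in ALARM",
        "StartAt": "Wait X Seconds",
        "States": {
            "Wait X Seconds": {
                "Type": "Wait",
                "Seconds": wait_seconds,
                "Next": "Check alarm tag and status",
            },
            "Check alarm tag and status": {
                "Type": "Task",
                "Resource": lambda_arn,
                "Next": "Is alarm still in ALARM state?",
            },
            "Is alarm still in ALARM state?": {
                "Type": "Choice",
                "Choices": [
                    {
                        # Assumed output field of the Lambda task.
                        "Variable": "$.currState",
                        "StringEquals": "ALARM",
                        "Next": "Wait X Seconds",
                    }
                ],
                "Default": "Done",
            },
            "Done": {"Type": "Succeed"},
        },
    }


definition = build_state_machine_definition(
    "arn:aws:lambda:us-east-1:123456789012:function:example",  # placeholder ARN
    wait_seconds=300,
)
print(json.dumps(definition, indent=2))
```

Because the Wait state comes first, the first repeated notification arrives one wait period after the alarm's own initial notification.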

The repeated notification for an alarm within the workflow above stops when:

  • The alarm transitions to a non-ALARM state.
  • The alarm is deleted.
  • The specific tag is removed from the alarm.

Procedures

Now, let’s deploy the solution and see how it works.

Prerequisites

Step 1: Deploy solution using AWS CDK

Before you can deploy a CDK application, make sure that you have the AWS CDK CLI installed and your AWS account bootstrapped, as described here. Then, run the following commands from your terminal to download the solution code and deploy it:

git clone https://github.com/aws-samples/amazon-cloudwatch-alarms-repeated-notification-cdk.git
cd amazon-cloudwatch-alarms-repeated-notification-cdk
npm install
npm run build
cdk bootstrap #Required for first time CDK deployment
cdk deploy --parameters RepeatedNotificationPeriod=300 --parameters TagForRepeatedNotification=RepeatedAlarm:true --parameters RequireResourceGroup=false

With the “cdk deploy” command, you can also configure the following parameters:

  • RepeatedNotificationPeriod: The time in seconds between two consecutive notifications from an alarm. The default is set to 300 in the CDK code.
  • TagForRepeatedNotification: The tag used to enable repeated notification on an alarm. It must be in a key:value pair. The default for this parameter is RepeatedAlarm:true.
  • RequireResourceGroup: Whether or not to create a tag-based resource group to monitor all of the CloudWatch Alarms with repeated notification enabled. Allowed values: true/false.
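Because TagForRepeatedNotification must be a key:value pair, a malformed value is worth catching before deployment. The helper below is a hypothetical pre-flight check, not part of the CDK application, and it assumes the key and value are split at the first colon.

```python
def parse_tag_parameter(value):
    """Split a 'key:value' string into (key, value), rejecting malformed input."""
    key, separator, tag_value = value.partition(":")
    if not separator or not key or not tag_value:
        raise ValueError(f"expected key:value, got {value!r}")
    return key, tag_value


key, tag_value = parse_tag_parameter("RepeatedAlarm:true")
print(key, tag_value)  # prints: RepeatedAlarm true
```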

Step 2: Wait for the deployment to finish

Because this is a new deployment, you will see a summary of IAM resources created in the target account. These IAM resources are used by the components in the solution. No change is performed to any existing IAM resources in your account. You can review the change and accept by entering “y” to continue the deployment.

Before the solution is actually deployed to the account, the CDK CLI shows a summary of the IAM resources to be created and asks for your confirmation.

Then, you will see the progress of the deployment in your terminal. Wait for it to finish. You can also follow the progress of the deployment in the AWS CloudFormation console.

Step 3: Test the solution

Once the deployment completes, you can test the solution on an alarm by applying the tag that you used.

  • Find a test alarm that is currently in the ALARM state and has SNS alarm actions associated with it.
  • Apply the tag on the selected alarm with the following AWS CLI command:
aws cloudwatch tag-resource --resource-arn arn:aws:cloudwatch:<region>:<account_id>:alarm:<alarm_name> --tags Key=RepeatedAlarm,Value=true

  • Manually set the alarm state to OK by using the set-alarm-state CLI command:
aws cloudwatch set-alarm-state --alarm-name <alarm_name> --state-value OK --state-reason "test"

  • Wait for the next alarm evaluation. For a standard alarm, it will re-evaluate within one minute and transition to its actual state.
  • Verify that you receive the ALARM notification every five minutes (the RepeatedNotificationPeriod of 300 seconds). The repeated notification will have a subject similar to the following:

Repeated alarm notification has a subject of “ALARM: <alarm-name> remains in ALARM state in <region>”

Step 4: View all of the alarms that have repeated notification enabled

AWS Resource Groups lets you search and group AWS resources based on tag. In this post, I will show you how to use this to have a centralized view of all of the alarms with repeated notification enabled.

  • Go to the Resource Groups & Tag Editor console.

In the AWS console, search for the service called “Resource Groups & Tag Editor”

  • If you set RequireResourceGroup to “true” when deploying the CDK code, then you will see a tag-based resource group named “repeatedAlarmsGroup”.

Under Saved resource groups, you should see a tag-based resource group named “repeatedAlarmsGroup”

  • You can now view all of the alarms with repeated notification enabled.

You can see the details of the “repeatedAlarmsGroup” and a list of the CloudWatch alarms in this region that have the repeated notification tag
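If you deployed with RequireResourceGroup=false and want the group later, a tag-based group of this kind is defined by a resource query. The sketch below builds that query. The function name is hypothetical, and it assumes the default tag RepeatedAlarm:true. The query type `TAG_FILTERS_1_0` and the resource type `AWS::CloudWatch::Alarm` are the standard Resource Groups identifiers.

```python
import json


def repeated_alarms_group_query(tag_key="RepeatedAlarm", tag_value="true"):
    """Build a Resource Groups query selecting CloudWatch alarms with the tag."""
    query = {
        "ResourceTypeFilters": ["AWS::CloudWatch::Alarm"],
        "TagFilters": [{"Key": tag_key, "Values": [tag_value]}],
    }
    # The Resource Groups API expects the inner query as a JSON string.
    return {"Type": "TAG_FILTERS_1_0", "Query": json.dumps(query)}


resource_query = repeated_alarms_group_query()
print(resource_query["Query"])
```

With boto3 available, the result could be passed as `ResourceQuery` to `boto3.client("resource-groups").create_group(Name="repeatedAlarmsGroup", ResourceQuery=resource_query)`.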

Step 5: Disable repeated notification and untag the alarm

Run the following CLI command to untag the CloudWatch alarm. You should see the alarm disappear from the resource group created in the previous step as well:

aws cloudwatch untag-resource --resource-arn arn:aws:cloudwatch:<region>:<account_id>:alarm:<alarm_name> --tag-keys RepeatedAlarm
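The tag and untag commands from Steps 3 and 5 can also be scripted. The sketch below mirrors them with the CloudWatch `tag_resource` and `untag_resource` calls. The helper names are hypothetical, and the client is passed in (for example `boto3.client("cloudwatch")`) so that the functions themselves carry no AWS dependency.

```python
def alarm_arn(region, account_id, alarm_name, partition="aws"):
    """Build the ARN of a CloudWatch alarm."""
    return f"arn:{partition}:cloudwatch:{region}:{account_id}:alarm:{alarm_name}"


def set_repeated_notification(cloudwatch, arn, enabled,
                              tag_key="RepeatedAlarm", tag_value="true"):
    """Add or remove the opt-in tag on the alarm identified by arn."""
    if enabled:
        cloudwatch.tag_resource(
            ResourceARN=arn,
            Tags=[{"Key": tag_key, "Value": tag_value}],
        )
    else:
        cloudwatch.untag_resource(ResourceARN=arn, TagKeys=[tag_key])


# Placeholder account ID and alarm name.
print(alarm_arn("us-east-1", "123456789012", "example-alarm"))
```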

Further reading

In April 2021, we added support for cross-region event routing in EventBridge. With this feature, you only need to deploy this solution in one of the supported destination regions to process the repeated notification workflow for alarms in any commercial AWS Region. You can choose to deploy this solution in any one of the supported destination regions listed here. The solution is shown in the following diagram:

“CloudWatch Alarm State Change” events can be aggregated into a single supported destination region by using a cross-region event bus, and then processed by the solution in the destination region

This framework lets you centralize alarm state change events from any commercial region to a single supported region. This significantly reduces operation overhead related to resource management and troubleshooting.

You can also use Amazon EventBridge to capture alarm state change events and orchestrate downstream workflows to perform more advanced alarm processing tasks, utilizing the various targets supported by Amazon EventBridge. For example, you can enrich, format, or pretty-print the alarm message, or execute playbooks with a Lambda function target or an SSM automation.
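As an illustration of the formatting use case, a Lambda function target could turn the raw state change event into a short, readable message before forwarding it. The function below is a hypothetical sketch. It reads only fields that appear in the alarm state change event (`detail.alarmName`, `detail.state`, `detail.previousState`, `region`) and falls back to placeholders when an optional field is missing.

```python
def format_alarm_message(event):
    """Render a CloudWatch alarm state change event as readable text."""
    detail = event["detail"]
    state = detail["state"]
    previous = detail.get("previousState", {}).get("value", "UNKNOWN")
    lines = [
        f"Alarm:  {detail['alarmName']}",
        f"Region: {event.get('region', 'unknown')}",
        f"State:  {previous} -> {state['value']}",
        f"Reason: {state.get('reason', 'n/a')}",
        f"Time:   {state.get('timestamp', event.get('time', 'n/a'))}",
    ]
    return "\n".join(lines)


# Abbreviated example event with placeholder values.
example_event = {
    "region": "us-east-1",
    "detail": {
        "alarmName": "example-alarm",
        "previousState": {"value": "OK"},
        "state": {"value": "ALARM", "reason": "Threshold Crossed"},
    },
}
print(format_alarm_message(example_event))
```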

Cleanup

To avoid additional infrastructure costs from the examples described in this post, make sure to delete all of the resources created. You can clean up the resources by running the following commands:

cd amazon-cloudwatch-alarms-repeated-notification-cdk
cdk destroy

In addition, the Lambda function created in this solution logs to a CloudWatch Logs log group with the prefix “/aws/lambda/RepeatedCloudWatchAlarm”. Make sure to delete the log group to avoid CloudWatch Logs storage charges.
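The log group cleanup can be scripted as well. The sketch below lists log groups by the prefix mentioned above and deletes each one, using the CloudWatch Logs `describe_log_groups` paginator and `delete_log_group`. The function name is hypothetical, and the client is passed in (for example `boto3.client("logs")`). Deletion is irreversible, so check the returned names against what you expect.

```python
def delete_log_groups(logs, prefix="/aws/lambda/RepeatedCloudWatchAlarm"):
    """Delete every log group whose name starts with prefix; return their names."""
    deleted = []
    paginator = logs.get_paginator("describe_log_groups")
    for page in paginator.paginate(logGroupNamePrefix=prefix):
        for group in page.get("logGroups", []):
            name = group["logGroupName"]
            logs.delete_log_group(logGroupName=name)
            deleted.append(name)
    return deleted
```

A typical call would be `delete_log_groups(boto3.client("logs"))`, run in the same region as the deployment.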

Conclusion

In this post, I’ve provided you with a solution that enables repeated notifications on CloudWatch alarms by using the alarm’s state change event via Amazon EventBridge and AWS Step Functions. With this solution, you are less likely to miss mission-critical alarms, and you can improve your response time to an incident. The same framework can also be extended to handle more advanced alarm processing tasks.

About the authors

Sarah Luo

Sarah Luo is an AWS Premium Support Engineer based in Sydney, Australia. Sarah is a subject matter expert in Amazon CloudWatch, and loves helping customers use AWS monitoring services according to best practices. She enjoys using technology to create solutions that address customer needs.

Jie Dong

Jie Dong is an AWS Cloud Architect based in Sydney, Australia. Jie is passionate about automation, and loves to develop solutions that help customers improve productivity. He specializes in event-driven systems and serverless frameworks. In his own time, Jie loves building his smart home and exploring new smart home gadgets.

Nimita Shrivastava

Nimita is a Sr. Cloud Support Engineer at AWS. She helps customers solve complex networking problems, and provides tailored solutions and best practices for various unique scenarios. Nimita holds a Masters in Telecommunication Systems Management from Northeastern University in Boston, specializing in Computer Networking.