Customers have valuable metrics emitted to their logs. Examples include web server response times, slow queries, purchases by partners, custom application metrics, and cache hits or misses. This data has unrealized potential value for increasing observability. Consumed by Amazon CloudWatch Logs and extracted using metric filters, customers can translate this data into actual CloudWatch metrics, then create alarms and dashboards, as well as leverage anomaly detection.

In this post, we will create and ingest custom logs into CloudWatch Logs, and showcase:

  1. How to use CloudWatch Logs Insights to query your log data
  2. How to build a metric filter to turn your log data into numerical metrics
  3. How to visualize the CloudWatch metrics in a timeseries dashboard

Sample Solution

Here we launched a sample WordPress site using an Amazon Machine Image (AMI) from the AWS Marketplace. This AMI includes a NGINX server, where we can customize the access logs, and send them to CloudWatch Logs. While the process for creating a new running EC2 instance from an AWS Marketplace image is outside of the scope for our post, we will outline the procedure once you have one created. The procedures are as follows:

  1. SSH into the EC2 instance. The default user name will be “bitnami”. On Mac or Linux the log in command will be similar to this:
    ssh -i cert.pem [email protected]

  2. Now modify the NGINX configuration file to generate custom logs for every user’s hit to the application. These commands will take you to the correct file:
    cd ~/stack/nginx/
    vim ./conf/nginx.conf

  3. A sample of the modifications are shown as follows. These logs utilize information such as status of request, size of the response body, time required for the server to respond, and many more.
    http {
    …
    log_format trace '$remote_addr - $remote_user [$time_local] "$request" ' '$status $body_bytes_sent ' '$request_time ' '$upstream_response_time ' '$upstream_connect_time ' '$upstream_header_time '; access_log /opt/bitnami/nginx/logs/wpress_access.log trace; …
    }

  4. Reload the NGINX server to apply configuration changes.
    sudo ./sbin/nginx -s reload

  5. Install the CloudWatch agent. Note that, although this server itself is built on top of an EC2 instance, the approach works equally well if you are operating a server on-premises.
    cd ~
    wget https://s3.amazonaws.com/amazoncloudwatch-agent/debian/amd64/latest/amazon-cloudwatch-agent.deb
    sudo dpkg -i -E ./amazon-cloudwatch-agent.deb

  6. Create a new CloudWatch agent configuration with the built-in wizard. When prompted to monitor any log files, select yes and enter the location of custom access log file.
    sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

  7. Start the CloudWatch agent. The CloudWatch agent is closely monitoring any changes made to the access log file, aggregating the changes, and sending them to the CloudWatch service on AWS to prepare for further analysis. The CloudWatch agent is truncate-aware and watches logs in real-time for additions.
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json

Architecture diagram showing the flow of data from the CloudWatch agent into CloudWatch Logs

Note that this procedure is taken from a specific Linux-based AMI. Similar steps can be taken on other Linux distributions, or even with Microsoft Windows Server. These steps are a guide for this specific AMI.

Querying with CloudWatch Logs Insights

CloudWatch Logs Insights lets you interactively query and analyze your log data. The query can be applied to single or multiple log groups of your choice, and within an absolute or relative timeframe of your interest. This is especially helpful if you would like to identify potential root causes when an incident occurs.

A sample CloudWatch Logs Insights query in the console

fields @timestamp, @message | parse @message "* - * [*] \"*\" * * " as ip, user, time, request, statusCode, size | filter statusCode like /4\d\d/ | sort @timestamp desc | limit 10 | display request, statusCode

You can display the queried results in a timeseries graph, which shows the number of requests performed at each point in time. You can also expand the displayed requests to see which fields available, and the value for each field.

CloudWatch Logs Insights is shown here with a timeseries graph of results

Creating the Metric Filter

Metric filters can be created on top of CloudWatch Logs log groups, and are the way to convert the log data into measurable, actionable CloudWatch metrics. You can define patterns for the incoming log data, and filter according to the terms that you define. As in the following example, we define the pattern for WordPress access logs, and filter on 400 level status code.

[host, , user, timestamp, request, statusCode=4*, size, request_time, response_time, connect_time, header_time]

Note that the metric filter is different from a log insights query, where the experience is interactive and provides immediate search results for the user to investigate. No automatic action can be invoked from an insights query. Metric filters, on the other hand, will generate metric data in the form of a time series. This lets you create alarms that integrate into your ITSM processes, execute AWS Lambda functions, or even create anomaly detection models.

Aside from defining the filter pattern, another important aspect is the value for your metric. For example, if you would like to count the occurrences of logs, you can put “1” into the metric value field. Alternatively, a variable input like “$request_time” is also supported. Extracting values from your log data is an especially powerful way of gaining observability into your workload’s health and operations.

Here you see an example of a metric filter being created from within the CloudWatch console.

The console experience of creating a metric filter is just one approach, and can be replaced with the AWS CLI or AWS CloudFormation

Note that the metric filter is not retroactive, meaning that when you create the metric filter, it will not generate metrics from historical log data. Only events that happened after the creation of the metric filter will be converted to CloudWatch metrics. Therefore, be mindful of what can be utilized within your log data, and create your metric filters early!

Exploring the Time Series Dashboard

After creating your metric filter and leaving the site to run for a period of time, you can visualize how the metric value changes over time within the CloudWatch dashboard. In the following image, the “average” aggregation function is used on metric, over a period of one minute. Here, you can see the WordPress site takes on average 0.026 s to respond to customers’ requests, but, at 18:30pm and 18:38pm, there’s a jump and dip on response time, both indicating anomalies.

Visualizing spikes in application performance with the CloudWatch console is shown in this image

From another client error graph, as shown in the following image, we can see the total number of requests, generated within one minute, for different error codes. Here, we are using the “sum” function aggregate metrics. At both time frames with unusual average request times, there are number of faulted requests generated. These requests, since faulted, can be crafted to stall the application, or, if it is requesting a non-existent file, be immediately rejected by the server.

A visual example of 400-type response codes returned from our NGINX server

To further investigate into underlying faults, we can utilize CloudWatch Logs Insights and select the corresponding time periods, as in the following image. From this graph, we can select the corresponding time period, identify the requests made, and the malicious IP that is trying to attack the WordPress site. From there, it is possible to block this address by creating a “Deny” rule with a Network ACL (or other approaches, such as firewall appliances or AWS Network Firewall).

Example of searching through CloudWatch Logs Insights for 400-type response codes, complete with the offending IP address

Next Steps

Creating metric filters to convert CloudWatch Logs into quantifiable metrics can further increase observability into your environment.

You can start by analyzing log data that is on-premises, or in AWS, but has not been utilized, such as application logs, server logs, cache responses. If the logs are on-premises, then you can ingest these logs into CloudWatch with the CloudWatch agent. If the logs are present within CloudWatch logs, then you can analyze the metrics that are helpful for your business. For example, if over 10% of customer requests are failing with a server-side error (i.e., status code 5xx).

With the use case identified, you can create metric filters for the identified scenarios. This opens the door not only for visual, dashboard analysis like we show in this post, but also for further automatic action, such as creating CloudWatch alarms coupling with SNS topics to alert the administrator.

Happy observing!

About the authors

Rich McDonough

Rich McDonough is a Sr. WW CloudOps Specialist Solutions Architect for AWS based in Toronto. His primary focus is Cloud Operations, helping customers scale their use of AWS safely and securely, and guiding customers in their adoption of observability practices and services. Before joining AWS in 2018, he specialized in helping migrate customers into the cloud.

Xiaoxue Xu

Xiaoxue Xu is a Solutions Architect for AWS based in Toronto. She primarily works with Financial Services customers to help secure their workload and design scalable solutions on the AWS Cloud.