Application reliability is critical. Service interruptions result in a negative customer experience, reducing customer trust and business value. One best practice that we have learned at Amazon is to have a standard mechanism for post-incident analysis. This lets us analyze a system after an incident in order to avoid recurrences in the future. These incidents also help us learn more about how our systems and processes work. That knowledge often leads to actions that help in other incident scenarios, not just the prevention of a specific recurrence. The mechanism is called the Correction of Error (COE) process. Although post-event analysis is part of the COE process, it is different from a postmortem, because the focus is on corrective actions, not just documenting failures. This post explains why you should implement the COE mechanism after an incident, and describes its components to help you get started.
Why should you do COE?
The COE process consists of a post-event analysis. It is imperative that the negative impact caused by the event be mitigated before the COE process begins. This lets you:
- Deep dive into the sequence of events leading up to the event
- Find the root cause of the problem and identify remediation actions
- Analyze the impact of the incident to the business and our customers
- Identify and track action items that prevent the problem from recurring
What a COE is not
It is not a process for finding whom to blame for the problem: The purpose of a COE is to facilitate maximum visibility into the areas that are most in need of improvement. Creating a culture that rewards people for surfacing problems will foster greater visibility into the areas that need improvement. Human tendencies lead to repeating actions that are rewarded and limiting actions that are penalized.
It is not a process for punishing employees after the occurrence of a bad event: The purpose of the COE process is to make sure of continuous improvement during the lifecycle of an application. Oftentimes the person with the most knowledge of an event is the one who has the most at stake through its outcome. To incentivize the most complete understanding of the course of events, we must create a culture that rewards full disclosure of events and lets the person closest to the bad outcome be part of the solution rather than part of the problem.
What are the components of a COE?
Summary
Even though this section appears first in the document, it should be written last. It provides the context for the entire event. Include details on who was impacted, when, where, and how. Include how long it took to discover the problem, and summarize both how you mitigated it and how you plan to prevent recurrences. Don't try to fit every detail here; instead, provide basic information about the incident. The summary should stand alone, without requiring the reader to reference other sections. Write it as if it were going to travel as an email update to your company's main stakeholder (such as the CEO).
If you use the AWS Systems Manager Incident Manager COE template, then you can add any section not included in the tool template here.
Impact
Investigate and document, in a quantifiable manner, how big the impact was for both the customers and the business. To do that, it is important to gather detailed information about the number of impacted customers, as well as how they were impacted. An outline describing the impact to the customer looks like the following:
- How many customers were impacted?
- Description of the impact on the customer or the business
- How severe was the impact? Some examples include:
- Were customers overcharged?
- Social Media statistics (sentiment)
- A description of the impact in terms of monetary loss estimates or actual data
- Were there second-order impacts to our customers?
- How are the customers impacted in their business or day-to-day activities due to the failure?
- Reputation: Did we delay a customer transaction, or did we lose the customer entirely?
- What non-functional requirements were impacted?
- Security (Confidentiality, Integrity, and Availability), Reliability, Performance Efficiency, Cost Optimization, and Operations Impact
- What were the consequences of the incident?
- Did this incident cause other incidents to occur?
- Did this incident have consequences outside of the operation of the workload? For example:
- Lost opportunity, e.g., did not win a competitive contract, lost X amount of dollars due to not being able to sell
- Unable to meet an SLA, contractual requirement, or regulatory requirement resulting in penalties
- Customer sentiment tracking
- Loss of reputation
- Loss of customer trust
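As a rough illustration, the outline above could be captured as a structured record so that impact figures stay quantifiable and comparable across COEs. This is a minimal sketch; all field names and figures are hypothetical:

```python
# Minimal sketch of a quantified impact record (all figures hypothetical).
impact = {
    "customers_impacted": 1200,          # count of affected customers
    "description": "Checkout requests failed with HTTP 500 errors",
    "overcharged_customers": 0,
    "estimated_monetary_loss_usd": 45000,
    "second_order_impacts": ["delayed order fulfillment for downstream partners"],
    "nonfunctional_requirements": ["Reliability", "Performance Efficiency"],
    "caused_other_incidents": False,
    "sla_breached": True,                # contractual penalty may apply
}

# A quick severity heuristic derived from the record (illustrative only).
severity = (
    "high"
    if impact["sla_breached"] or impact["estimated_monetary_loss_usd"] > 10000
    else "low"
)
```

Keeping impact data in a consistent shape like this makes it easier to aggregate across COEs and to spot trends over time.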
Timeline
The timeline describes all of the events that occurred during the incident, from the first event related to the incident until the moment the system returned to normal operations. The timeline gives a clear picture of everything that happened, and of all the events and information that were available during the incident. The quality of the output of the correction of error exercise depends on the quality of the data used to build it.
We recommend writing the timeline as a bullet list that walks through what happened and when. It is important to compile a detailed timeline, because this helps the author and the reviewers understand how the incident was managed. When establishing the timeline, it is important to take the following into account:
- Represent the sequence of events in chronological order.
- Be consistent with the time zone of the sequence of events.
- The timeline should focus on the event including the start and end time, not just the team’s perception of the event.
- The timeline should start with the first trigger that led to the problem (e.g., a bad deployment), not just when your team got notified.
- Make sure that any gaps in your timeline of more than 10-15 minutes are clearly explained in later sections of the COE.
- Each entry in the timeline should clearly follow from the previous one. The timeline tells a story to others about the incident.
- When documenting times, be sure to include a time zone, and make sure that you’re using it correctly (e.g., PDT vs. PST). Better yet, either use UTC or omit the middle letter of the time zone (e.g., “PT”).
- Use supporting data as much as possible. Link to external items, images, or artifacts (dashboards, or any other source of information) where possible.
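To illustrate the guidelines above, here is a minimal sketch of a timeline kept as structured data, with all timestamps in UTC and a simple check that flags unexplained gaps longer than 15 minutes. The events and evidence names are hypothetical:

```python
from datetime import datetime, timezone

# Minimal sketch of a COE timeline: chronological entries, all in UTC,
# each linking to supporting evidence (all entries are illustrative).
timeline = [
    {"time": datetime(2023, 5, 1, 14, 2, tzinfo=timezone.utc),
     "event": "Deployment of build 1432 begins (first trigger)",
     "evidence": "deploy-pipeline log"},
    {"time": datetime(2023, 5, 1, 14, 9, tzinfo=timezone.utc),
     "event": "Error-rate alarm fires; on-call engineer paged",
     "evidence": "monitoring dashboard snapshot"},
    {"time": datetime(2023, 5, 1, 14, 41, tzinfo=timezone.utc),
     "event": "Rollback completes; error rate returns to baseline",
     "evidence": "deploy-pipeline log"},
]

# Sanity check: entries must be in chronological order.
assert all(a["time"] <= b["time"] for a, b in zip(timeline, timeline[1:]))

# Flag gaps longer than 15 minutes so they can be explained in the COE.
gaps = [(a["event"], b["event"])
        for a, b in zip(timeline, timeline[1:])
        if (b["time"] - a["time"]).total_seconds() > 15 * 60]
```

Using `timezone.utc` on every entry sidesteps the PDT-versus-PST ambiguity entirely, which is the same reason the guideline above prefers UTC.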
Metrics
This section contains the metrics that define the impact, how you determined the problem, and how you are set up to monitor the event. If these metrics are absent, that is a red flag, and likely a good action item to include in the later sections.
Incident questions
Once you have a timeline, start asking questions to analyze the event and identify key aspects of the incident, such as the following:
- When did you learn there was customer impact?
- How did you learn there was customer impact?
- Reported by the customer
- Identified by monitoring and/or alerting
- How can we cut the time-to-detect in half?
- What was the underlying cause of the customer impact?
- Was an internal activity (such as a maintenance window, or MW) happening during the incident?
- Was the MW properly announced?
- Did the service owner know about the MW?
- Is there a task in the backlog to remove the dependency?
- Why wasn’t it implemented?
- How can we cut the time-to-diagnose in half?
- When did customer impact return to pre-incident levels?
- How does the system owner know that the system is properly restored?
- How did you determine where and how to mitigate the problem?
- How can we cut the time-to-mitigate in half?
The AWS Systems Manager Incident Manager COE template contains some of these questions. If you use the template, then you can add additional questions in the Comments section of each key aspect.
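Several of the questions above (time-to-detect, time-to-mitigate, and the goal of cutting each in half) can be made concrete by deriving the metrics directly from timeline timestamps. A minimal sketch with hypothetical times:

```python
from datetime import datetime, timezone

# Illustrative sketch: deriving incident metrics from timeline
# timestamps (all times are hypothetical and in UTC).
impact_start = datetime(2023, 5, 1, 14, 2, tzinfo=timezone.utc)   # bad deployment
detected = datetime(2023, 5, 1, 14, 9, tzinfo=timezone.utc)       # alarm fired
mitigated = datetime(2023, 5, 1, 14, 41, tzinfo=timezone.utc)     # rollback done

time_to_detect = (detected - impact_start).total_seconds() / 60    # minutes
time_to_mitigate = (mitigated - impact_start).total_seconds() / 60

# "How can we cut these in half?" turns each metric into a concrete target.
target_ttd = time_to_detect / 2
target_ttm = time_to_mitigate / 2
```

Computing the metrics from the timeline, rather than from memory, keeps the answers anchored to the data gathered in the previous sections.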
Root cause
To prevent future occurrences of the event, we must completely understand not only what happened, but why it happened, as well as why the underlying causes occurred. Only once we dig down multiple levels do we have enough information to truly prevent the event from occurring again. The mechanism we use at Amazon to determine the root cause is called "The 5 Whys."
The 5 Whys
The five whys technique is useful because it provides a consistent approach to understanding causes and helps overcome assumptions that we might have. The process is easy to learn and use, either as an individual or with a whole team. The five whys should be applied in a blame-free way, focused on finding the "why" rather than blaming "who". If you use the AWS Systems Manager Incident Manager COE template, then the five whys section is included as part of the Incident questions under the Prevention section.
Use the five whys technique to make sure that you have identified the actual causes of a problem. You may need to ask more than five whys to find the causes, and you should consider whether a cause could have been prevented. For example, if you see "human error" as a root cause in a root cause analysis (RCA), it may indicate a lack of checks or fail-safe mechanisms. Therefore, you should always ask why the human error was possible.
- Identify the problem.
- (First Why) Why did this problem happen?
- (Second Why) Ask why of the answer to the First Why documented in step 2.
- (Third Why) Ask why of the answer to the Second Why documented in step 3.
- (Fourth Why) Ask why of the answer to the Third Why documented in step 4.
- (Fifth Why) Ask why of the answer to the Fourth Why documented in step 5.
- Continue this process until you’re confident that you have uncovered the complete chain of events that allowed this event to occur, regardless of how many Whys it takes.
- Create a plan to remediate each of the answers to the Whys. This will remediate the symptoms of the event, as well as the chain of events that allowed the event to take place.
- The problem is X.
- #1 Why did X happen?
- X happened because this action took place.
- #2 Why did that action take place?
- That action took place because there was a typo.
- #3 Why was a typo allowed?
- The typo was allowed because there was no validation.
- #4 Why was there no validation of the code?
- The current process does not validate code.
- #5 Why doesn’t the current process validate code?
- The shell used to input the code does not have that feature.
After each Why is answered, consider the following questions to determine whether the process must continue. If necessary, repeat the process using the reason as the problem. Stop when you are confident that you have found the root causes.
- Is the reason for the Why the underlying root cause?
- Could the reason have been detected before it happened?
- Is the reason human error? If so, why was it possible?
In the example above, there are five takeaways that must be addressed:
- The effect of X Action must be remediated
- If the command must be re-invoked, then the typo needs to be avoided.
- Code validation must be implemented.
- The current process should be amended to ensure the validation of code.
- Any shell used for code must validate code.
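One way to make sure every answered Why produces a takeaway is to record the chain as data, pairing each Why with its remediation. A minimal sketch based on the example above (the wording is illustrative):

```python
# Minimal sketch: recording each "Why" alongside its answer and its
# remediation, so every link in the causal chain produces an action.
five_whys = [
    ("Why did X happen?",
     "An action took place", "Remediate the effect of the action"),
    ("Why did that action take place?",
     "The command contained a typo", "Avoid the typo if the command is re-invoked"),
    ("Why was a typo allowed?",
     "There was no validation", "Implement code validation"),
    ("Why was there no validation of the code?",
     "The current process does not validate code", "Amend the process to require validation"),
    ("Why doesn't the current process validate code?",
     "The shell lacks a validation feature", "Require a shell that validates code"),
]

# One remediation per answered Why: the chain, not just the symptom, is fixed.
remediations = [remediation for _, _, remediation in five_whys]
```

Pairing every Why with a remediation up front makes it harder to stop at the symptom, and the remediations feed directly into the action items section.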
Action items
The action items are the main result of the COE process. The goal is to identify actionable activities that improve either the prevention, diagnosis, or resolution of the root cause of the same problem in the future. Each action item must state its priority, the person responsible, and a due date for when it will be finished. Good action items are specific and achievable within a specific timeline. For example, an internal SLA might be 14 days, and COEs have on average 6-8 action items.
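An action item can be kept as a small record that enforces these fields. This is a minimal sketch; the field names and the 14-day SLA check are illustrative, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Minimal sketch of an action item record; the 14-day SLA mirrors the
# internal example above (all field names and values are illustrative).
@dataclass
class ActionItem:
    description: str
    priority: str      # e.g., "high", "medium", "low"
    owner: str         # the single person responsible
    due: date

    def within_sla(self, opened: date, sla_days: int = 14) -> bool:
        """Check that the due date falls within the SLA window."""
        return self.due <= opened + timedelta(days=sla_days)

item = ActionItem(
    description="Add input validation to the deployment shell",
    priority="high",
    owner="jane.doe",
    due=date(2023, 5, 12),
)
```

Making the owner and due date required fields, rather than optional notes, is what keeps action items from quietly going untracked.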
Related items
In this section, you can reference other COEs or documentation relevant to the current one. This gives context about related assets, and is especially useful when an incident is part of a series of events.
Conclusion
In this post, we discussed the importance of having a correction of error process and its main components. To start implementing your own process, we recommend using this post as a reference, together with the AWS Systems Manager Incident Manager COE template. The best way to learn and improve is through practice, so we encourage you to start preparing and to work on your first incident.