Saying that Amazon’s servers handle a lot of traffic is an understatement. Publicly available information estimates something like 180 million unique visitors per month in the US. Amazon reported over seven million orders on Black Friday 2017. The amount of traffic we process is just staggering, and the complexity of the underlying infrastructure is mind-boggling.
When errors occur, and they do, distressingly often, we follow a four-step process:
- Identification. Figure out what the heck is going on. Understand it well enough so that you can move on to:
- Mitigation. Stabilize the service so that the error no longer occurs. This might take the form of increasing capacity, rolling back a recent change, making an emergency patch, throttling a noisy client service, etc.
- Correction. Modify code or processes to prevent that specific error from occurring again.
- Understanding. Study the event more deeply to understand how our processes allowed that error to occur, and how we can improve our processes to prevent similar errors in the future. The output from this step is a Correction of Errors document, or COE.
The first three steps are pretty standard fare: things you would expect any team that is running mission-critical services to do. The last, though, is surprisingly rare in the industry. Some companies say that they do postmortem investigations and such, but often it’s just lip service. More ominously, many times that investigation is an exercise in assigning blame for the error as a prerequisite to disciplinary action. Nobody wants to make a mistake because somebody will end up being blamed, and possibly fired. Error investigation becomes an exercise in CYA.
The COE process at Amazon is different. The purpose really is to dive deep so that we fully understand how we made the error, and how we can prevent making the same type of error in the future. The guidelines for writing a COE specifically discourage identifying individuals by name. The format is well specified. It seeks to answer the following questions:
- What happened?
- How did it affect customers?
- How did you identify the error?
- How did you fix the error, once identified?
- Why did the error occur?
- What did you learn from this incident?
- What will you do to prevent this from happening again?
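For a team that wants to track such documents in its own tooling, the seven questions could be sketched as a simple record. This is only an illustration: the class name, the field names, and the `unanswered` helper are all hypothetical, and none of it reflects Amazon's actual COE template or tools.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class CorrectionOfErrors:
    """Hypothetical record with one field per question in the list above."""
    what_happened: str = ""
    customer_impact: str = ""
    how_identified: str = ""
    how_fixed: str = ""
    why_it_occurred: str = ""
    lessons_learned: str = ""
    prevention_plan: str = ""

    def unanswered(self) -> List[str]:
        """Return the names of the questions that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Made-up draft, used only to show the completeness check.
draft = CorrectionOfErrors(
    what_happened="Checkout latency spiked after a deploy.",
    how_fixed="Rolled back the change.",
)
print(draft.unanswered())
```

Note that the record has no field for the names of the people involved, which matches the guideline against identifying individuals.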
These are hard questions to answer. The “why” section is especially difficult because you have to keep asking “why” until you get to the root cause of the problem. It’s much harder than it seems at first glance.
This process of investigation to find the root cause has spawned a new verb phrase: “to root cause.” As in “I need to root cause that problem.” That drives me bonkers. I refuse to use that particular jargon, and I’ve been known to express my displeasure at its use. I fear, though, that I’m fighting a battle that’s already been lost.
COE reviews can be uncomfortable because the reviewers are very sharp and adept at asking probing questions. A review can be especially uncomfortable if the reviewers detect that the investigation wasn’t thorough, or that essential information was omitted. Trying to hide mistakes or deflect responsibility can get one in trouble. On the other hand, a thorough investigation followed by a COE that shows a deep understanding of the error and meaningful action items to prevent similar occurrences can reflect very positively on the person or people involved.
Writing a COE is, fundamentally, an exercise in taking ownership of one’s mistakes, learning from them, and passing that knowledge on to others.
Amazon understands that errors happen. One of our Leadership Principles is “Bias for Action,” which advocates calculated risk taking. Nevertheless, errors can be costly. The COE process is an attempt to gain something positive from those occurrences so that we can avoid having to pay for them again. The COE document is an opportunity for the people involved to demonstrate other Leadership Principles: ownership, their ability to dive deep and understand things at a fundamental level, and their insistence on the highest standards.
In a very real sense, making a mistake can be a positive thing for the company and for your career. I’m not saying that you should strive to make mistakes (after all, another Leadership Principle is that leaders “Are right, a lot”), but responding positively to the occasional mistake can be a Good Thing.