I’m fascinated by failure and what we can learn from it. Failure becomes inevitable when we're constantly enhancing our services at scale and speed. At OVO we know that we can improve our customer experience and reduce the number of outages we have by continually investing in building a strong post-mortem culture.
Before I get started, let's define what a post-mortem is. I’ll explore what’s different at OVO in the next few sections, but to get things started, here's Google’s definition from their SRE book: ‘a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring’.
This blog aims to explore OVO’s post-mortem culture further, and provide some insight into the way we learn from our mistakes to continually make our services even better for our customers.
Our approach to post-mortems
Our approach to post-mortems can be broken down into six principles. Each individual component of the process appears relatively simple, but as a whole the process requires significant time and effort to drive the right outcome for our customers. It’s important to appreciate that you can’t build this culture overnight - it’s something you need to invest in and improve over time.
We’ve based our principles on those shared by post-mortem pioneers PagerDuty and Google, and applied them to our environment:
- Invest time into finding out what happened
- Focus on process and the situation that the failure occurred in
- Write down our findings
- Take actions to improve, focusing on reliability
- Share our learnings as widely as possible
- Make sure it doesn’t happen again
We have a standard way of capturing output from our post-mortems, and we record and track all actions using JIRA. Our post-mortem template is available here, so take a look and use it if you're looking for a template to get started with. We originally took Google's template and we've built on it from there. I'm always open to hearing how we can improve this, so if you have any ideas or suggestions, please leave a comment there!
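As a rough illustration only - the exact headings below are my sketch of the sort of structure such a template tends to have (ours builds on Google's public one; these section names are not a fixed standard) - a minimal post-mortem record might look like:

```markdown
# Post-mortem: <incident title>

## Summary
One or two sentences: what happened, and the customer impact.

## Impact
Who was affected, for how long, and how badly.

## Timeline
- 09:14 - Alert fired
- 09:20 - Incident declared
- 09:45 - Mitigation in place

## Root cause(s)
The combination of circumstances that led to the failure.

## Detection and resolution
How we found out, and what we did to mitigate and resolve.

## Action items
- [ ] JIRA-123: <improvement> (owner, due date)

## Lessons learned
What went well, what didn't, and where we got lucky.
```

The key point is that every action item is tracked somewhere with an owner, so the follow-up work doesn't evaporate once the meeting ends.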
We want teams to be flexible with their triggers for running a post-mortem (by triggers, I mean the events that warrant running one). We use post-mortems to highlight areas for improvement - our engineering teams fully own their services, and we trust them to make the right call on improving reliability. However, we do set the expectation that post-mortems are scheduled as part of the incident closure process, are held as close to the resolution of the incident as possible, and are held for all major service outages (for example, one of our websites being unavailable).
Keep it consequenceless
Our post-mortems focus on what matters most - customer impact - and everything that contributes to it.
To create the right environment, one where we can explore every contributor to failure, we ensure that post-mortems are consequenceless. 'Blameless' is the term used by the majority of organisations, but we have a slightly different perspective. We always encourage teams to focus predominantly on technical and process failure, but where necessary, they should take a consequenceless look at the role a person or group of people played. This only works because we always assume that people have acted with the best intentions, whether in incident response or initial service design.
This approach means that we understand where we can improve every part of the process - whether that's people, process or technology - and that we do so in a safe space with no repercussions.
What we've learned so far
You can't mandate when teams should run a post-mortem
At first we tried to force teams to run post-mortems regardless of the nature of the failure. That didn’t take into consideration whether they’d quickly recovered from it and already taken action to improve reliability. We found that teams who were forced to participate were less engaged with the process, and actions taken were not always implemented. The process is there as a tool to help teams learn from failures, big or small - it’s important to let teams recognise where they feel a post-mortem is necessary, and support them through the process. By promoting team ownership, we get the most out of our post-mortems, improve the reliability of the services we're building, and improve our customer experience.
Sharing the output from a post-mortem is crucial
One of our post-mortem principles is to make sure that the same failure never happens again. You can't ensure this if you're not sharing the output from your post-mortems far and wide for others to read and learn from. Sharing helps other teams consider potential future failures, whether they're designing new services or running existing ones. We share post-mortems with our engineering community largely via Slack, and they're then added to a central repository for future reference; soon we'll make the post-mortem output visible to the wider organisation and our customers. Transparency is one of our core values as an organisation, and building a strong post-mortem culture supports this.
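To make the sharing step concrete, here's a hedged sketch of how a team could automate posting a post-mortem summary to a Slack channel via an incoming webhook. This is not our actual tooling - the webhook URL, incident title and document link below are all hypothetical placeholders - just an illustration of how little glue code the pattern needs:

```python
import json
from urllib import request


def build_summary_message(title: str, impact: str, doc_url: str) -> dict:
    """Format a post-mortem summary as a Slack incoming-webhook payload."""
    return {
        "text": f"New post-mortem: {title}",
        "blocks": [
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*{title}*\nImpact: {impact}\n"
                            f"<{doc_url}|Read the full post-mortem>",
                },
            },
        ],
    }


def share(webhook_url: str, message: dict) -> None:
    """POST the payload to a Slack incoming webhook (URL is a placeholder)."""
    req = request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)


if __name__ == "__main__":
    # Hypothetical incident details, for illustration only.
    msg = build_summary_message(
        "Checkout outage",
        "Website unavailable for 12 minutes",
        "https://example.com/post-mortems/123",
    )
    print(json.dumps(msg, indent=2))
    # share("https://hooks.slack.com/services/...", msg)  # placeholder URL
```

The same payload-building function could also append the record to whatever central repository the team uses, keeping the Slack message and the archive in sync.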
Bringing all of this to life is really difficult!
We have a small team who, amongst a number of other responsibilities, help champion our post-mortem culture and make sure that the output is made visible. The post-mortem process, by its nature, needs dedicated time and effort to be successful. It's important, therefore, that post-mortems are run well, that we deliver the actions and improvements we agreed to, and that we iteratively improve the process.
Hopefully this article has given you some insight into how we're doing things at OVO and has got you thinking about how failure is sometimes a good thing. I'm really interested in hearing from anyone who has experiences/lessons to share about building a post-mortem culture, so please get in touch either by leaving a comment below, or through Twitter @OVOTechTeam / @lukebriscoe.