OVO Tech Blog

Wizards, Goblins and Service Outages

Introduction

Andy Hall


coaching technology tooling

Posted by Andy Hall on .

The service management coaches have just finished running the first round of Major Incident simulations, beginning with our Production Support team.

In this post, I'll give a brief overview of our approach to these sessions and what Major Incident Management coaching has to do with Dungeons & Dragons.

Responding to a Major Incident

OVO has a methodology for effectively managing critical service disruptions: it exists as a best-practice model for running a Major Incident response team, prioritising swift recovery and good communication.

We have the challenge of taking the general methodology and making it real for our autonomous teams, helping each to form a service-specific best practice.

With our day-to-day focus on building great technology, it can be easy to forget about the non-technical competencies that produce a consummate Major Incident response. How do you foster the development of these skills before you need them?

Hawks are attacking your data centre!

Using some principles cribbed from D&D and other pen and paper role-playing games, we run a simulation of a Major Incident scenario.

Each session consists of the following participants...

  • A small group of Players take up roles within the Major Incident response team.
  • A Dungeon Master runs the simulation. More on this role later.
  • An Observer records feedback on the simulation as it transpires.

In our first scenario we chose to focus on coordination, communication and decision-making skills. The technical aspects of the simulation are therefore abstracted and fed to the Players via a remote Tech Lead.

The scenario unfolds with a series of information injects that fall into the following categories...

  • Events indicate when an attribute of the Major Incident itself has changed.
  • Technical Reports contain information from the Tech Lead on the state of the technical investigation and recovery effort.
  • Interruptions represent a non-technical event that may help or hinder the Players.
  • What Now? injects provide a pause, allowing the Players to make decisions and take actions. They also give the Dungeon Master control over the pace of the simulation and a chance to confirm that key actions and events have taken place. This is key to maintaining the coherence of the simulation.
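
For anyone who wants to script a scenario rather than run it from index cards, the inject categories above could be modelled roughly as follows. This is a minimal sketch, not the tooling we use: the names Category, Inject and split_into_beats are hypothetical, and the scenario content is invented for illustration. It groups an ordered list of injects into "beats", each ending at a What Now? pause, which is where the Dungeon Master would stop and let the Players act.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The four inject categories described in the post."""
    EVENT = "Event"
    TECHNICAL_REPORT = "Technical Report"
    INTERRUPTION = "Interruption"
    WHAT_NOW = "What Now?"


@dataclass(frozen=True)
class Inject:
    """One piece of information fed to the Players (hypothetical model)."""
    category: Category
    text: str


def split_into_beats(injects):
    """Group injects into beats, each ending at a What Now? pause."""
    beats, current = [], []
    for inject in injects:
        current.append(inject)
        if inject.category is Category.WHAT_NOW:
            beats.append(current)
            current = []
    if current:  # trailing injects with no closing pause
        beats.append(current)
    return beats


# Invented example content, for illustration only.
SCENARIO = [
    Inject(Category.EVENT, "Hawks are attacking your data centre!"),
    Inject(Category.TECHNICAL_REPORT, "Tech Lead: cause still under investigation."),
    Inject(Category.WHAT_NOW, "What do you do first?"),
    Inject(Category.INTERRUPTION, "A senior stakeholder asks for an update."),
    Inject(Category.EVENT, "The disruption now affects a second service."),
    Inject(Category.WHAT_NOW, "What now?"),
]

beats = split_into_beats(SCENARIO)
```

Treating each What Now? as a beat boundary makes the pacing explicit: the Dungeon Master reads out one beat, waits for the Players to decide and act, then moves on to the next.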

The simulation runs for around thirty minutes. We follow it up with a retrospective, during which we encourage the Players to talk to each other about how they felt during the simulation. We use this opportunity to work in feedback from the Observer.

The role of Dungeon Master

As the Dungeon Master, your job is to guide the Players through the simulation and ensure they get the greatest possible opportunity to learn from it.

Introduce the cell (the simulated world in which the simulation takes place), so that the Players understand how things will work. Be aware that some people can feel under pressure if they view the simulation as an assessment. Describe who is in the room and what their purpose is.

Just as in a real Major Incident, the Players will rarely have all the facts at their disposal. The scenario is constructed to allow more information to emerge over time. The beginning of the scenario can be the most difficult as a result, especially as some Players can feel uncomfortable with low levels of detail.

Disclosing just enough detail

Too much detail can make things too easy for the Players, or open the scenario up to an immersion-breaking level of scrutiny.

Too little detail can result in Player frustration and threaten the coherence of the simulation, especially if the Players lack a key piece of information upon which downstream events rely.

Your aim is to provide the Players with the right level of information for them to fully participate, whilst making sure their full participation is required. Try to keep things moving and understand which pieces of information are critical to disclose before moving on.

Non Player Characters

Non Player Characters (NPCs) can be used to fill roles within the Major Incident response team that are not taken by a Player, or to represent other actors required by the scenario. You can use them as an in-scenario tool for helping or hindering the Players.

If the scenario requires a specific piece of information to be discovered, or an action or decision on the part of the Players to progress, an NPC can be a more natural way of making this happen.

If your Players are doing too well, introducing an NPC can be a way of shaking things up or increasing the challenge.

More on Incident Management

Below are some resources that have informed OVO's Major Incident response best practice.

Google's SRE guide has a great section on managing incidents (text).

DevOps TV showcases PagerDuty's guide to being an Incident Commander (video).

Kepner-Tregoe has a well-rounded approach to Incident mapping for postmortems (video) that's worth a watch.
