A couple of weeks ago I went to SRECon17 in Dublin to learn more about what SRE is, why companies do it, and to talk to some of the people who have been leading the SRE charge over the last couple of years.
Site Reliability Engineering (SRE) was created back in 2003 by Google. If you’re completely new to the concept, you can get a fairly good idea about what SRE is by reading this interview, and if you’re really into it, I’d recommend reading the SRE book (it's completely free to read online). I took pages and pages of notes at the three-day conference, and I learned a lot from talking to SREs and SRE managers about their journey so far. In this post I’ll try to explain some of the key things I found during the week. I’ll also include a number of links to the talks I found most interesting.
One of the things I picked up early on is that SRE is a concept based on company values. Narayan Desai spoke about the ‘Care and Feeding of SRE’ and how, in his experience, SRE at Google is built on three core values that everyone buys into as part of their tech culture. I’m massively paraphrasing here, but watch the talk for yourself and you’ll get a better idea:
1. Reliability is paramount. If your services aren’t available, then features aren’t important.
2. Make precise promises. Be good at making and keeping promises - if you say you’re going to hit a certain service level objective, make sure you hit it, and make sure you measure it accurately.
3. Assume the best intentions. I'm pretty sure everyone has services that make you wonder what was going through the mind of the person who developed them - always assume the services were built with the best intentions, run consequence-free postmortems, and accept that things will go wrong.
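To make the second value a bit more concrete, here is a minimal Python sketch of measuring a service against a service level objective. It is my own illustration, not something from the talk: the function names, the request counts and the 99.9% target are all assumptions, and it treats an SLO as a simple ratio of good requests to total requests.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (hypothetical helper)."""
    if total_requests == 0:
        # No traffic means nothing failed, so report full availability.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int,
                           slo_target: float) -> float:
    """Share of the error budget still unspent, where 1.0 means untouched.

    slo_target is the promised success ratio, e.g. 0.999 (an assumed value).
    A negative result means the promise has been broken.
    """
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # Either no traffic or a 100% target: any failure spends everything.
        return 1.0 if actual_failures == 0 else 0.0
    return 1 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Made-up numbers: 10,000 requests, 5 failures, 99.9% target.
    print(availability(9_995, 10_000))
    print(error_budget_remaining(9_995, 10_000, 0.999))
```

The point of the sketch is only that a precise promise needs a precise measurement: once the target and the counts are written down, "are we keeping our promise?" becomes a number rather than an opinion.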
Two other talks I really enjoyed and took a lot from were ‘Gamifying Reliability Excellence’ by Danny Lawrence from LinkedIn, and ‘Incident Management & ChatOps’ by Daniella Niyonkuru from Shopify. I'll write a follow-up post on those soon.
The standout talk of the conference for me was a monitoring 101 delivered by Theo Schlossnagle from Circonus. Everything he spoke about was so relevant to what we’re looking at here at OVO with our Production Operations team, so I recommend you check out what he had to say. Theo had 10 key rules for monitoring a production service which he explores in his talk, but here are my key takeaways:
Monitor outside the tech stack - i.e. monitor things that are important to the business, not just to your tech services. Business information matters!
Alerts require documentation - this is pretty simple, right? We're not talking old-school runbooks, but something simple contained within the alert that explains what it is, links to how to fix it, and says who to escalate to if needed.
Then this rule - my favourite - something is better than nothing. Theo said in the talk, ‘don’t let perfect be the enemy of good’, which I really liked - start somewhere and iterate!
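As an illustration of the ‘alerts require documentation’ takeaway, here is a minimal Python sketch of an alert that carries its own documentation. This is my own sketch rather than anything Theo showed: the `Alert` class, its field names, the `missing_docs` helper and the example URL and team name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """An alert that carries its own documentation (hypothetical structure)."""
    name: str
    summary: str      # what the alert is and why it matters
    fix_url: str      # link to how to fix it
    escalate_to: str  # who to escalate to if needed

    def render(self) -> str:
        """Format the alert so the documentation travels with the message."""
        return (
            f"[{self.name}] {self.summary}\n"
            f"How to fix: {self.fix_url}\n"
            f"Escalate to: {self.escalate_to}"
        )


def missing_docs(alert: Alert) -> list[str]:
    """Return the documentation fields that were left blank."""
    required = ("summary", "fix_url", "escalate_to")
    return [field for field in required if not getattr(alert, field).strip()]


if __name__ == "__main__":
    # Every value below is made up for illustration.
    alert = Alert(
        name="checkout-error-rate",
        summary="Checkout errors are above the agreed threshold.",
        fix_url="https://wiki.example.com/checkout-errors",
        escalate_to="payments on-call",
    )
    print(alert.render())
    print(missing_docs(alert))
```

A check like `missing_docs` could run wherever alerts are defined, so an alert with no explanation, fix link or escalation contact gets caught before it ever pages anyone.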
We recently created a Production Operations team at OVO that is building something we're calling our Production Operations Centre (think NOC, but service-focused). The purpose of the POC is to give us a really quick insight into how reliably our services are performing, aligned to our key customer journeys, and to trigger smart alerts when action is required. A lot of the data we need is already being used at a lower level by our full-stack engineering teams, and our objective is to aggregate this into one place that paints a picture of overall OVO service health at any moment. Theo's talk at SRECon gave me a different perspective on monitoring and alerting, and I took a lot of inspiration and good advice from SREs and SRE managers who have gone through all of this before.
Check back soon for an update on how we're getting on...!