OVO Tech Blog

Our journey navigating the technosphere

Tom Verran
Author

Since joining OVO in 2017 I've worked on the Propositions Team, bringing you OVO Beyond.

How we use the TICK Stack

Monitoring is one of the many things teams at OVO Energy are free to choose for themselves: we have teams using CloudWatch, DataDog and New Relic (among others!), but on the Propositions Team we've gone for the TICK Stack, an open source collection of services by InfluxData, written in Go.

It took me a little while to fully understand the role each component plays in a typical setup, so I thought I'd go through them in turn and briefly explain how we use each one.

Telegraf

Telegraf is an agent that runs on or alongside your app servers to collect metrics. It supports a number of protocols - we're using its CloudWatch plugin to relay CloudWatch metrics and its StatsD protocol support to collect metrics from Kong, an API gateway.
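To give a feel for the StatsD side, here's a minimal sketch of what a client sends: a counter in the plain StatsD line format, fired over UDP. The host, the port (8125 is the conventional StatsD port) and the metric name are assumptions for illustration, not our actual configuration or the names Kong emits.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in the StatsD line format: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> int:
    """Fire-and-forget a counter over UDP; returns the number of bytes sent.

    host and port are assumed values for a local StatsD listener.
    """
    payload = format_counter(name, value)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # "example.request.count" is a hypothetical metric name.
    send_counter("example.request.count")
```

Because UDP is connectionless, the sender doesn't wait on (or even notice) the collector, which is a large part of why StatsD is a comfortable fit for something on the request path like an API gateway.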

You can run Telegraf in a Docker container, but if you do you'll need to share various host directories (such as /proc and /sys) with the container if you want system metrics for the underlying machine running Docker, as opposed to the Docker container itself. I followed this blog post by Jacob Tomlinson to do this.

We run one instance of Telegraf per EC2 instance running our apps, plus one extra instance that is solely concerned with relaying CloudWatch metrics which don't correspond to any one instance.

InfluxDB

InfluxDB is the time series database that stores the metrics. This has been an almost entirely set-up-and-forget system for us because our data retention requirements are very simple (we keep all our metrics for a month). If you're used to Graphite's automatic downsampling of old data, you'll find you need to set up continuous queries to do this yourself, which seems a pity.
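For the curious, the sketch below builds the two kinds of InfluxQL statement involved: a retention policy (which is all we need for "keep a month") and a continuous query that would downsample into a longer-lived policy. The database, policy, measurement and field names are all made up for illustration, and the helper functions are my own, not part of any InfluxDB client library.

```python
def retention_policy(db: str, name: str, duration: str,
                     default: bool = False) -> str:
    """Build a CREATE RETENTION POLICY statement (single-node replication)."""
    stmt = (f'CREATE RETENTION POLICY "{name}" ON "{db}" '
            f'DURATION {duration} REPLICATION 1')
    return stmt + (" DEFAULT" if default else "")


def downsample_cq(db: str, measurement: str, field: str,
                  target_rp: str, interval: str) -> str:
    """Build a continuous query that writes per-interval means of one field
    into the same measurement name under another retention policy."""
    return (
        f'CREATE CONTINUOUS QUERY "cq_{measurement}_{interval}" ON "{db}" BEGIN '
        f'SELECT mean("{field}") AS "{field}" '
        f'INTO "{target_rp}"."{measurement}" '
        f'FROM "{measurement}" GROUP BY time({interval}), * END'
    )


if __name__ == "__main__":
    # Hypothetical names: a "telegraf" database and a "cpu" measurement.
    print(retention_policy("telegraf", "one_month", "30d", default=True))
    print(retention_policy("telegraf", "one_year", "52w"))
    print(downsample_cq("telegraf", "cpu", "usage_idle", "one_year", "1h"))
```

The `GROUP BY time(...), *` keeps every tag on the downsampled series; without the `*` the tags would be collapsed into a single series per interval.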

Kapacitor

Kapacitor is described as a "stream processing engine" but we use it exclusively for alerting. It polls InfluxDB and then runs the resulting stream of metrics through a series of processing steps to transform the data & trigger alarms, which we post to Slack.

These processing steps are defined using a custom language known as TICKscript, which looks like this (adapted from the docs):

stream
    // 'requests' is a placeholder measurement name
    |from()
        .measurement('requests')
    |eval(lambda: "errors" / "total")
        .as('error_percent')
    |alert()
        .warn(lambda: "error_percent" > 10)
        .message('error rate is {{ index .Fields "error_percent" }}')
        .slack()
        .stateChangesOnly()

TICKscripts consist of a series of nodes which transform data, chained together with the pipe operator. In addition to nodes, there's support for a subset of Go's string & maths functions within lambda expressions, details of which can be found here.
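If it helps to see the alerting half of that script in more familiar terms, here's a rough Python simulation of a warn threshold combined with the script's last property, which limits notifications to changes of state. It's a sketch of the semantics as I understand them (including the assumption that an alert starts out in the OK state), not Kapacitor's actual implementation.

```python
from typing import Iterable, Iterator, Tuple


def alert_events(error_percents: Iterable[float],
                 warn_above: float = 10.0) -> Iterator[Tuple[str, float]]:
    """Yield (level, value) only when the alert level changes."""
    previous = "OK"  # assumption: the alert starts in the OK state
    for value in error_percents:
        level = "WARNING" if value > warn_above else "OK"
        if level != previous:
            # A change of state is the only thing worth posting to Slack.
            yield level, value
        previous = level


if __name__ == "__main__":
    for level, value in alert_events([2, 12, 15, 3, 4, 11]):
        print(f"{level}: error rate is {value}")
```

Of the six points fed in, only three produce a notification: the first breach, the recovery, and the second breach. The repeat at 15 is swallowed, which is what keeps a noisy alarm from flooding the channel.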

There's a not insignificant learning curve to writing TICKscripts. I was tripped up by some of the functions available within lambda expressions having the same names as nodes which operate on data streams (e.g. the min function & min node); it's important to remember that they're essentially unrelated!
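The distinction is easier to see outside TICKscript. In the Python analogy below, the first function behaves like a min function inside a lambda (assuming it takes two values, as Go's math.Min does) and is evaluated once per point, while the second behaves like a min node and reduces a whole window of points to one. Both function names are my own inventions.

```python
from typing import List, Sequence


def clamp_each_point(values: Sequence[float], ceiling: float) -> List[float]:
    """Like min() inside a lambda: evaluated once per point, point in, point out."""
    return [min(v, ceiling) for v in values]


def smallest_in_window(values: Sequence[float]) -> float:
    """Like a min node: reduces a whole window of points to the smallest one."""
    return min(values)


if __name__ == "__main__":
    window = [3.0, 12.0, 7.0]
    print(clamp_each_point(window, 10.0))  # still three points
    print(smallest_in_window(window))      # a single value
```

Same name, different shape: one maps over points and the other aggregates them.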

We use Kapacitor's support for TICKscript templates to create reusable alarms, which we can then stamp out for each new service as we add it to our collection. This ability to easily parameterise and reuse alarms is a bonus relative to CloudWatch, where alarm reuse isn't common because of the inelegant machinery you need to grapple with (like nested stacks or template includes).
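Stamping out an alarm mostly means supplying a small file of variables per service. The sketch below generates such a file in the typed JSON shape Kapacitor templates expect (each variable has a "type" and a "value"). The variable names, service names and thresholds are hypothetical and would need to match the var declarations in whatever template you've defined.

```python
import json
from typing import Dict


def template_vars(measurement: str, warn_threshold: float) -> Dict[str, dict]:
    """Build the variables for one instance of a (hypothetical) alert template."""
    return {
        "measurement": {"type": "string", "value": measurement},
        "warn_threshold": {"type": "float", "value": warn_threshold},
    }


def render_vars(measurement: str, warn_threshold: float) -> str:
    """Serialise the variables as JSON, ready to be written to a vars file."""
    return json.dumps(template_vars(measurement, warn_threshold),
                      indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical services and thresholds.
    for service, threshold in {"service_a": 10.0, "service_b": 5.0}.items():
        print(f"# {service}_vars.json")
        print(render_vars(service, threshold))
```

A file like this is what the kapacitor CLI takes via its -vars option when defining a task from a template, so adding a new service comes down to one more entry in the dictionary.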

Chronograf

Chronograf is a web app for creating dashboards based on InfluxDB data. It appears to be the least mature part of the TICK Stack, but it is under very active development and has improved enormously since we first adopted it. We'd originally intended to use Grafana as our dashboard but have found Chronograf good enough, and its integrations with Kapacitor and Telegraf are bonuses.

Unlike Grafana, Chronograf lacks the ability to easily manage its configuration in code but we're hoping to have a go at solving this during one of the Open Source Friday events we have periodically at OVO Energy.

Relative to other monitoring solutions

I'll probably always lean towards hosted monitoring solutions over self-hosted ones, to avoid existential questions over who monitors the monitoring, but the TICK Stack has definitely been the best self-hosted monitoring experience I've had, being simple to set up and almost entirely trouble free.

If you're exclusively using AWS and your alerting needs are very simple then CloudWatch remains the pragmatic choice, but if you find yourself frequently frustrated by its clunkiness or missing data points I'd definitely recommend scheduling in some time to try out the TICK Stack.

DataDog, for me, continues to offer the most polished experience, but at a very steep price of $15/host/month I'd find it hard to justify when running large numbers of hosts.
