OVO Tech Blog

Hosting your own TICK Stack

Introduction

Erica Giordo


Posted by Erica Giordo

On top of our product objectives, here at OVO we also have tech ones: in the last half-year, the Cross-Sell team worked to migrate all our services from EC2 and Elastic Beanstalk to AWS ECS Fargate. This has allowed us to push Docker images to ECR (the AWS Docker container registry) and run our containers much more easily, without having to care about managing the underlying servers and scaling.

Among the services that we had to migrate was our monitoring stack, precariously hosted on a single EC2 instance. At OVO we have a strong culture of monitoring system performance, and in our team we decided to go for the TICK Stack. If you don't know what the TICK Stack is, my teammate Tom gave a great explanation in his blog post, which is a good introduction to what we'll be going through here: a step-by-step guide to setting up your own "in-house" TICK Stack.

[Image: homer_monitoring]

ECS: Fargate or EC2?

As you may already know, in ECS it is possible to choose between 2 launch types, Fargate or EC2. While with EC2 you remain responsible for managing the servers underlying the ECS cluster, Fargate is an entirely managed platform to run applications.

When we were migrating our monitoring stack there was no option for persistent storage in Fargate: this was a deal breaker, given that one of the main TICK components is InfluxDB, a time-series database. To keep Fargate as an option we looked into hosted InfluxDB services, but unfortunately Kapacitor requires full admin access to InfluxDB, which limited our choices. So we opted for an ECS-EC2 task, which lets us use the facilities provided by EC2 to store our data.

ECR + ECS

The TICK Stack is made of 4 components: InfluxDB, Kapacitor, Telegraf and Chronograf, and each of them has its own configuration file. To run everything in ECS we created 4 Docker images, one for each component, and pushed them to ECR. We did this so that we could bake custom configuration options (e.g. environment, S3 buckets) into each image at build time.
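As a rough sketch of that build-and-push step, the loop below tags one image per component with its ECR URI and pushes it. The account ID, region, tag and the one-directory-per-component layout are assumptions for illustration, not our actual setup, and the script presumes Docker is already logged in to the registry.

```shell
#!/bin/sh
# Sketch: build one image per TICK component and push it to ECR.
# ACCOUNT_ID, REGION, TAG and the ./<component> directories are hypothetical.
set -e

ACCOUNT_ID="${ACCOUNT_ID:-123456789012}"
REGION="${REGION:-eu-west-1}"
TAG="${TAG:-latest}"

# ECR image URIs follow <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
image_uri() {
  echo "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$1:$TAG"
}

# Build and push every component; DOCKER=echo turns this into a dry run.
build_and_push_all() {
  for component in influxdb kapacitor telegraf chronograf; do
    uri="$(image_uri "$component")"
    ${DOCKER:-docker} build -t "$uri" "./$component"
    ${DOCKER:-docker} push "$uri"
  done
}
```

Calling build_and_push_all with DOCKER=echo prints the eight docker commands without running them, which is handy for checking the URIs before a real push.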

We then created 2 main tasks to be run in our ECS service, one for InfluxDB and one that groups together Kapacitor, Chronograf and Telegraf. This makes sure that InfluxDB is the first component to be up and running, so that the rest of the system can be healthy; it also means we avoid redeploying InfluxDB each time we update one of the other components of the stack.

Here you can find the configuration files for the tasks. As you can see there are some template vars in the files: this is because we use Terraform to build our infrastructure.

But (you could ask yourself) the task IPs are dynamic, so how can Kapacitor or Chronograf talk to InfluxDB? As you can see in the ecs-params files, there are 2 Service Discovery entries, which provide friendly DNS names as an alternative to task IPs. The private_dns_namespace makes InfluxDB discoverable to the other task under a URL of your own choice instead of an anonymous IP (for us it's influxdb.monitoring), while the public_dns_namespace for the other task allows us to access Chronograf from a public URL (but restricted to our VPC). Be aware that it'll take some time for the public DNS name to get attached to the task IP (around 5 mins for us), but during that time Chronograf is still accessible via its IP.
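To check from inside the VPC that the private name resolves and InfluxDB answers, something like the following sketch can help. It assumes InfluxDB is listening on its default HTTP port 8086, where the /ping endpoint answers with HTTP 204; the function names, retry count and delay are made up for the example.

```shell
#!/bin/sh
# Sketch: wait until InfluxDB answers on its service-discovery name.
# influxdb.monitoring is the name used in this post; port 8086 is an assumption
# (InfluxDB's default HTTP port).
INFLUX_URL="${INFLUX_URL:-http://influxdb.monitoring:8086}"

# Print the HTTP status code returned by the /ping endpoint.
ping_status() {
  ${CURL:-curl} -s -o /dev/null -w '%{http_code}' --max-time 5 "$INFLUX_URL/ping" || true
}

# Retry up to $1 times, sleeping $2 seconds between attempts.
# Anything other than HTTP 204 counts as "not ready yet".
wait_for_influxdb() {
  attempts="$1"
  delay="$2"
  while [ "$attempts" -gt 0 ]; do
    [ "$(ping_status)" = "204" ] && return 0
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}
```

For example, wait_for_influxdb 30 10 would poll for up to five minutes before giving up, which is roughly the DNS delay we saw.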

ENI and EC2 type

At this point we have to make an important choice: which type of EC2 instance shall we use? We need 3 elastic network interfaces (ENIs): one for the instance itself, one for the InfluxDB task and one for the Kapacitor, Telegraf and Chronograf task. Each EC2 instance type offers a specific number of ENIs - you can find the complete list here. Remember also to choose one with enough memory for InfluxDB (in our case we're on a generous r5 instance).

IMPORTANT: be careful when choosing an r5a instance. At the time of writing it sadly doesn't offer the 3 ENIs mentioned in the documentation!
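The ENI arithmetic can be checked before committing to an instance type. The sketch below asks the AWS CLI for the maximum number of network interfaces of a type and compares it with what the stack needs (one ENI per task plus one for the instance, as described above). The function names are made up for the example, and given the r5a caveat, the reported figure is best treated as a starting point rather than a guarantee.

```shell
#!/bin/sh
# Sketch: check that an instance type offers enough ENIs for the stack.

# Ask EC2 for the maximum number of network interfaces of instance type $1.
max_enis() {
  ${AWS:-aws} ec2 describe-instance-types \
    --instance-types "$1" \
    --query 'InstanceTypes[0].NetworkInfo.MaximumNetworkInterfaces' \
    --output text
}

# Succeed if instance type $1 can host $2 tasks (one ENI each, plus one for the instance).
fits_tasks() {
  needed=$(( $2 + 1 ))
  available="$(max_enis "$1")"
  echo "$1: $available ENIs reported, $needed needed"
  [ "$available" -ge "$needed" ]
}
```

With our two tasks, the call would be fits_tasks followed by the instance type and the number 2.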

Deployment

To deploy our services we use CircleCI. The build is split into the following steps:

  • creating an image for each component and pushing it to ECR
  • deploying InfluxDB in UAT (and eventually in PROD)
  • deploying Kapacitor, Telegraf and Chronograf in UAT (and eventually in PROD)

[Image: CircleCI workflow (Screen-Shot-2019-03-14-at-09.23.25)]

It's not really clear from the image above, but deploying InfluxDB requires its Docker image to have been published to ECR first; the same goes for the other images before deploying Kapacitor, Telegraf and Chronograf.

The command to run each task is really easy:

ecs-cli compose --project-name influxdb service up \
        --cluster monitoring \
        --create-log-groups \
        --timeout 10 \
        --deployment-min-healthy-percent 0 \
        --enable-service-discovery

and

ecs-cli compose --project-name tick service up \
        --cluster monitoring \
        --create-log-groups \
        --timeout 10 \
        --deployment-min-healthy-percent 0 \
        --enable-service-discovery
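Outside CircleCI, the same ordering can be sketched as a small wrapper that deploys InfluxDB first and only moves on to the second task if that succeeds. The function names and the ECS_CLI override are assumptions added for the example; the flags are the ones shown above.

```shell
#!/bin/sh
# Sketch: deploy InfluxDB first, then the Kapacitor/Telegraf/Chronograf task.
# ECS_CLI=echo prints the commands instead of running them.

# Bring up the ECS service for the compose project named $1.
deploy() {
  ${ECS_CLI:-ecs-cli} compose --project-name "$1" service up \
    --cluster monitoring \
    --create-log-groups \
    --timeout 10 \
    --deployment-min-healthy-percent 0 \
    --enable-service-discovery
}

# The && stops the chain if the InfluxDB deployment fails,
# so the second task is never deployed without its database.
deploy_all() {
  deploy influxdb && deploy tick
}
```

Running deploy_all with ECS_CLI=echo is a cheap way to review the exact commands before letting them loose on the cluster.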

Et voilà! Job done! 🏆

It is also important to mention that if any vulnerability is raised in any of the 4 Docker images, we can easily bump up the version of the component to the patched one, recreate the Docker image and redeploy the affected task. 🗝

Volatile tasks: EBS

That's great, but we said we didn't choose Fargate because its tasks have no persistent storage... so how are we approaching that?!

Well, being on ECS-EC2 allows us to expose a volume - as you can see here - and persist it, in case of an inexplicable EC2 instance death, with AWS EBS (Elastic Block Store), which Amazon describes as "persistent block storage volumes for use with Amazon EC2 instances". We built the entire infrastructure with Terraform, so if you're using it like us, you can find the EBS creation here. The only thing to be aware of is that, if the instance dies and a new one is brought up by autoscaling, the EBS volume has to be reattached (but not formatted, as it is when attached for the first time!).

Here's our script:

#!/bin/bash
# This script is templated by Terraform: every ${...} expression below
# is a Terraform interpolation, not a shell variable.
set -e
yum install -y jq aws-cli

# Wait until the volume has been released by the previous instance
echo 'Waiting for volume to be detached..'
aws ec2 wait --region ${var.aws-region} volume-available --volume-ids ${aws_ebs_volume.infludxdb.id}
echo 'Volume has been detached.'

# Attach the volume to this instance and wait for the attachment to complete
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 --region ${var.aws-region} attach-volume --device /dev/xvdg --instance-id "$INSTANCE_ID" --volume-id ${aws_ebs_volume.infludxdb.id}
aws ec2 wait --region ${var.aws-region} volume-in-use --volume-ids ${aws_ebs_volume.infludxdb.id}
echo 'Volume has been attached.'
sleep 3

# Resolve the real device behind /dev/xvdg
VOLUME_PATH=$(realpath /dev/xvdg)
echo "Volume path: $VOLUME_PATH"
# if /dev/xvdg has already been formatted with ext4, don't do it again
if ! file -s "$VOLUME_PATH" | grep -q ext4; then
  echo 'Formatting..'
  yes | mkfs -t ext4 "$VOLUME_PATH"
fi

# Mount the volume at /data/influxdb
mkdir -p /data/influxdb
echo "$VOLUME_PATH /data/influxdb ext4 defaults,nofail 0 2" >> /etc/fstab
mount -a

# Register the instance with the ECS cluster
echo ECS_CLUSTER=${aws_ecs_cluster.ecs-cluster.name} >> /etc/ecs/ecs.config
service docker restart
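The "reattach but don't reformat" rule is the part that protects the data, so it can be worth isolating. Here is a minimal sketch of that decision as a function that inspects the description printed by file -s; the function name and the sample strings in the comments are illustrative only.

```shell
#!/bin/sh
# Sketch: decide from the output of `file -s <device>` whether to format.

# Succeed (exit 0) only when the description in $1 does not mention ext4.
needs_format() {
  case "$1" in
    *ext4*) return 1 ;;  # already formatted, e.g. "... ext4 filesystem data": keep the data
    *)      return 0 ;;  # e.g. "/dev/xvdg: data": fresh volume, safe to format
  esac
}
```

In the script above, the check would then read as: if needs_format applied to the output of file -s on the volume path succeeds, run mkfs; otherwise skip straight to mounting.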

Summing it up

The journey to hosting our own TICK Stack might seem easy from reading this blog post, but we experienced lots of failures along the way. At the time of writing, our monitoring has been up and running on ECS for 3 months and reports metrics for 8 different services. If you still intend to go down this path, hopefully this post will save you some pain.

And remember, as Xzibit suggests, you can always achieve more... even though we didn't go that far :)

[Image: xzibit]

A big thank you to my colleagues Chris Birchall and Jon Potter for reviewing this article, and a really special one to my teammate Tom Verran!

Erica Giordo
