OVO Tech Blog

Catching security vulnerabilities in the build pipeline


Posted by Matthew Daley.

At OVO Energy we take cybersecurity very seriously. We are keen to ensure that all of our services and infrastructure are as resistant as possible to attack. One area we’ve recently decided to improve is security for Docker containers. In this article I’ll describe a mechanism for checking containers against the public list of known vulnerabilities, and how we integrated it into our continuous delivery pipeline.

We run lots of our services in the cloud, either in the Google Cloud Platform (GCP) or in Amazon Web Services (AWS). In my team, we use Kubernetes, building all of our applications as Docker containers. We keep these Docker containers in the GCP Container Registry. In Container Registry, Google has recently added an alpha feature called Vulnerability Scanning.

Once this is turned on, any images uploaded to the registry are automatically scanned for known security vulnerabilities and exposures, and as new vulnerabilities are discovered, existing containers are re-checked to see if they are affected. This is a really useful feature! It allows us to get early warning of critical security flaws so that we can update our containers and avoid the risk. However, this checking isn’t instant: it takes a while to run, and by that time our containers will already have been deployed to production. We wanted to go one step further and build this checking into our build pipeline, so that we get even earlier warning of any problems, before anything is deployed.

Remember the Heartbleed vulnerability first made public in April 2014? This was one of the worst security issues ever found in Linux systems, based on a flaw in OpenSSL, a widely used implementation of the SSL/TLS encryption that secures website traffic. It left over 600,000 websites vulnerable to remote attackers. We don’t want problems like this creeping into our systems!

With vulnerability checking built into our pipeline we can stop these things dead before anything is deployed to a test or production environment! Admittedly, this only applies to known vulnerabilities but, as soon as a new high-importance flaw is added to the CVE system, we will be made aware of it when our pipeline stops deploying.

The CVE System

At this point, it’s worth explaining where Google gets this list of vulnerabilities from. All of this information is collated in the Common Vulnerabilities and Exposures (CVE) system. The system is operated by the MITRE Corporation, under the auspices of the US Department of Homeland Security. It is the worldwide, authoritative source of public CVE information and it consolidates information from many CNAs (CVE Numbering Authorities).

Two examples of high risk level CVEs are below. Don’t worry about understanding the technical wording; I just wanted to point out how complex and opaque they can seem whilst, at the same time, being very important to deal with:

  • CVE-2017-8804, high risk level, glibc, The xdr_bytes and xdr_string functions in the GNU C Library (aka glibc or libc6) 2.25 mishandle failures of buffer deserialization, which allows remote attackers to cause a denial of service (virtual memory allocation, or memory consumption if an overcommit setting is not used) via a crafted UDP packet to port 111, a related issue to CVE-2017-8779.
  • CVE-2016-2779, high risk level, util-linux, runuser in util-linux allows local users to escape to the parent session via a crafted TIOCSTI ioctl call, which pushes characters to the terminal's input buffer.

So, without deep experience as a security researcher (or hacker perhaps?) it’s hard to understand what all of these vulnerabilities mean. They delve deep into the innards of operating systems, into realms that are beyond what I, as a mere application developer, can easily understand! As a responsible technology business, we certainly need to take them seriously and do our utmost to build systems that avoid as many issues as possible. Fortunately OVO Energy has some highly skilled, resident security experts to help us when we really need to understand the impact of a flaw.

After-the-fact vulnerability checking

So, back to Google’s Container Registry. We’ve found the alpha vulnerability scanning capability to be very useful. For example, when we first turned it on, we were shocked by the number of high and critical risk level vulnerabilities in our containers. We quickly updated to new base images and, according to Google, all of the issues went away. Or did they? Google now reports that we have no vulnerabilities. However, based on what I discuss next, I think they have accepted some known issues and decided to give affected images a clean bill of health. This is just supposition on my part, but it does tie in with the evidence below.

The key issue for us is that we want to have containers checked early in our CI/CD process, just after a container image has been constructed, not after it has been deployed. With the Google approach, because there is a substantial delay in scanning, the vulnerabilities don’t appear until well after a container has been deployed into production.

We decided to investigate whether we could build CVE checking into our build process.

Vulnerability checking in the pipeline

After much googling I discovered CoreOS and the Clair program. Clair is an open-source project that allows the static analysis of Docker (and other) containers against the CVE list of known vulnerabilities. For more details you can visit the home page of the project.

A standard way to set this up is to integrate it “directly into a container registry such that the registry is responsible for interacting with Clair on behalf of the user.” Hmm… I suppose it is conceivable that this is what Google has done?

I wanted to improve on this by making vulnerability checking work directly in our CircleCI-based CI/CD pipeline. I discovered that three components were necessary to make this work:

  • The Clair database, run as a Docker container (arminc/clair-db:latest). The image is rebuilt on Travis every day to incorporate the newest security information.
  • An application server container that provides the interface to the database and that, I imagine, plays a core part in the scanning process (arminc/clair-local-scan:v2.0.1).
  • A command-line tool called clair-scanner which can be used to invoke a scan against a Docker container image.

For more details of these components see clair-local-scan on GitHub. Many thanks go to arminc for making this possible!

Here’s how to integrate this into the CircleCI pipeline. Below is the YAML pipeline configuration for one of our microservices, with only the important bits shown:

version: 2
jobs:
  …
  <unit and acceptance testing, static code analysis>
  ...

  build_image:
    docker:
      - image: docker:17.11.0-ce
    steps:
      …
      <build the image and put into repository>
      …
      - store_artifacts:
          path: /pushed_images.txt

  validate_image:
    docker:
      - image: docker:17.11.0-ce
      - image: <gcp container repository>/clair-scanner:latest
        auth:
           <our auth>
    steps:
      - setup_remote_docker:
          version: 17.11.0-ce
      - attach_workspace:
          at: /workspace
      - checkout
      - run:
          name: Check for docker vulnerabilities
          command: |
            docker run -p 5432:5432 -d --name db arminc/clair-db:latest
            docker run -p 6060:6060 --link db:postgres -d --name clair arminc/clair-local-scan:v2.0.1
            echo $<our key> | docker login -u _json_key --password-stdin https://eu.gcr.io
            docker run -v /var/run/docker.sock:/var/run/docker.sock -d --name clair-scanner <gcp container repository>/clair-scanner:latest tail -f /dev/null
            clair_ip=`docker exec -it clair hostname -i | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'`
            scanner_ip=`docker exec -it clair-scanner hostname -i | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'`
            img=`cat /workspace/pushed_images.txt`
            echo "img = $img"
            echo "clair_ip = $clair_ip"
            echo "scanner_ip = $scanner_ip"
            docker pull $img
            docker cp .circleci/whitelist.yml clair-scanner:/whitelist.yml 
            docker exec -it clair-scanner clair-scanner --ip ${scanner_ip} --clair=http://${clair_ip}:6060 -t High -w /whitelist.yml $img
  …
  <the rest of the pipeline stages>
  …

workflows:
  version: 2
  <service-name>:
    jobs:
      - unit_test
      - lint
      - build_image:
          requires:
            - unit_test
            - lint
      - validate_image:
          requires: 
            - build_image
      …
      <the rest of the pipeline>
      …

The build_image stage of the process creates a Docker image and pushes it to the container repository. It also writes the full details of the image into the file pushed_images.txt.
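As a sketch of that hand-off, the two stages only need to agree on the file’s location; the registry path, tag and workspace directory below are illustrative assumptions, not our real values:

```shell
# Hypothetical sketch of passing the image name between pipeline stages.
# The registry path, tag and workspace directory are assumptions for illustration.
IMAGE="eu.gcr.io/example-project/example-service:abc123"

# In build_image, after `docker push $IMAGE`:
mkdir -p /tmp/workspace
echo "$IMAGE" > /tmp/workspace/pushed_images.txt

# In validate_image, after attach_workspace has restored the file:
img=$(cat /tmp/workspace/pushed_images.txt)
echo "img = $img"
```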

The validate_image stage of the process works as follows:

  • The latest clair-db is obtained and run. This is the database that contains the up-to-date list of CVE issues.
  • The clair-local-scan container is run, pointed at the running clair-db.
  • Docker login is used to log into our Google container registry.
  • Docker run is used to run the custom-built clair-scanner container. Two points to note: (1) the -v option makes sure that the container can access the Docker daemon already running on the host machine; this allows clair-scanner to download the Docker image that is to be analysed. (2) To keep the container running, the trick of making it tail -f /dev/null has been used. Cheeky!
  • The scanning process needs certain IP addresses on its command line, so these are obtained next from the relevant containers. Getting this to work caused no end of trouble; hence the hacky-looking use of grep to extract the IP address. You might say that this is exactly what hostname -i gives you anyway, but there were strange issues with null characters, CRs and LFs in the variable that broke things later, until I put this in place.
  • The name of the image to be checked is obtained from the pushed_images.txt file from the previous build container pipeline stage.
  • The image is pulled down to the host.
  • A whitelist of allowed violations is pushed to the clair-scanner container (more on this later).
  • Finally, the clair-scanner command is run. In this case it has been configured to use the whitelist and only to worry about violations that are of High or greater severity (Critical and DefCon1 (!) are the higher levels).
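The grep cleanup mentioned above can be seen in isolation. Here the raw string simulates what docker exec -it tends to hand back (the -t flag allocates a TTY, so the output carries a trailing carriage return):

```shell
# Simulated output of `docker exec -it clair hostname -i`: the TTY from -t
# appends CR/LF, and stray bytes like these broke later string handling.
raw=$'172.17.0.3\r\n'

# The regex keeps only the dotted-quad IPv4 address, discarding the debris.
clair_ip=$(printf '%s' "$raw" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+')
echo "clair_ip = $clair_ip"
```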

The clair-scanner image is one we created ourselves. It takes the Golang command-line tool clair-scanner and builds it into a very simple Docker image using a Dockerfile and a small script. You can see the definition of the image on GitHub at clair-scanner-docker.
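A minimal Dockerfile for such an image might look like the sketch below; this is a guess at the general shape, not the contents of our actual repository, and it assumes the clair-scanner binary has already been built into the Docker build context:

```dockerfile
# Hypothetical sketch only - see clair-scanner-docker on GitHub for the real definition.
FROM alpine:3.7
RUN apk add --no-cache ca-certificates
# Assumes the clair-scanner binary was built/downloaded into the build context
COPY clair-scanner /usr/local/bin/clair-scanner
RUN chmod +x /usr/local/bin/clair-scanner
```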

The Result

When code is checked in and the CI/CD process runs, the result of running the validate_image stage will be something like this:

(Screenshot: clair-scanner output from the validate_image stage, 30 March 2018.)

This shows a list of all the vulnerabilities found in the container, starting at the most severe level and going down to the lowest, Negligible. In this case note that the third item, of Medium severity, has been marked as Approved. In fact all Medium and lower level vulnerabilities are marked in this way because we are only interested in failing the build for High or greater level problems.

But why are the High level problems marked as Approved too? If you look up the details of these two issues, you will find that one doesn’t have a solution yet and one has a solution that hasn’t been incorporated into the latest Debian release. So, we can’t update our images yet to remove the problems. In our case, we have decided that we can live with them. I suspect that this might also be the case for the vulnerability scanning process used by Google, although they don’t seem to tell you about this in their current UI.

The way we have marked these two vulnerabilities as OK is through whitelist.yml, a file that sits in the .circleci build directory alongside the config.yml that defines the complete CI/CD process. It is passed to clair-scanner during the scanning process. Here are its contents:

generalwhitelist:
  CVE-2017-8804: glibc
  CVE-2016-2779: util-linux
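To my understanding of the tool’s README, clair-scanner’s whitelist format also allows per-image sections alongside the general one; in this sketch the image name and second CVE are purely illustrative placeholders:

```yaml
generalwhitelist:        # ignored for every image scanned
  CVE-2017-8804: glibc
  CVE-2016-2779: util-linux
images:                  # hypothetical per-image entries
  alpine:
    CVE-2017-0000: example-package
```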

As things are configured, our validate_image stage passes and, all being well with the other stages, the build will be deployed to production.

Suppose a new violation of High or greater level appears in a new build? The new violation will appear in the violations list and, as clair-scanner returns a non-zero exit status, the build will fail. We’ll then have to update our application container to remove the violation, or decide that we can accept it for the moment.
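The failure mechanism is nothing more than Unix exit codes; in the sketch below a mock function stands in for clair-scanner to show how the CI step interprets the result:

```shell
# mock_scan stands in for `clair-scanner ...`; CI fails the step on non-zero exit.
mock_scan() { return "${1:-0}"; }

if mock_scan 0; then
  echo "clean image: pipeline continues"
fi

if ! mock_scan 1; then
  echo "unapproved vulnerability found: build fails" >&2
fi
```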

So, when a high level vulnerability appears it will not get into our live systems without being scrutinised. We will no longer be in the dark!

What we have done here is just a start and I’m sure it could be improved on. However, we think this will be very useful in our quest to make our applications more and more secure over time.
