OVO Tech Blog

Our journey navigating the technosphere

How we test our infrastructure-as-code at OVO

Abstract

As a centralised production engineering team, we write a LOT of infrastructure-as-code (IaC). At OVO, there is a wide variety of projects, from purely frontend to full stack to data science, which make use of multiple cloud providers and PaaS tooling. Given that our team vision has always been to build reusable, generic, battle-tested modules that can support these projects and be used across the organisation in a standardised and reliable manner, you can start to see how we end up writing a lot of IaC.

Due to the volume of IaC our team generates, it was extremely important to us to follow solid, robust development practices and to ensure that all this IaC is well tested and does exactly what it was intended to do, with no bugs, before we push it out to our end users. Our language of choice for all this IaC is Terraform, given its developer-friendly, cloud-agnostic syntax and easy-to-follow development workflow.

This blog will talk through the journey we took to robustly test all our Terraform, outline what we have accomplished, and hopefully provide some inspiration for what you can do in your own team!

Aim

Initially we wanted to validate our Terraform via some basic static checks, provided by Terraform itself as well as other open source tooling, to check the maintainability and security of our code. The choices here were fairly easy to pick out.

Besides doing all the static checks that are available, we also wanted to validate the functionality of our Terraform modules and ensure they fulfil their intended purpose via a robust integration testing strategy.

Additionally, when performing our integration tests it was important to test whether our modules work as expected in combination with each other. This was important so that we can build our reusable building blocks and then test that when we put those blocks together, we get the desired results.

As you can see, we set ourselves a hard target but one that we knew we could solve provided we chose the right tooling and implemented the right strategy!

Static Checks

Before we dive into the intricacies of our integration testing solution, let’s briefly discuss the static checks we carry out in our pipeline before we even get to the integration testing.

Static Terraform Checks

This stage carries out the standard validation commands provided by Terraform itself:

Terraform Validate

This is the quickest and cheapest way to check whether the configuration is valid. We can very quickly find out whether we have defined the right inputs, outputs, resource and module dependencies and ensure the code is in a valid state before it is checked against a live environment.

Terraform Fmt

This will allow us to ensure our Terraform IaC is written in the standard style, with the right spacing, indentation etc. to ensure the best readability and maintainability.

It’s become part of our standard workflow to run fmt and validate locally first before pushing any changes through the pipeline.

Checkov SAST

Checkov is an open source command line utility (approved by the OVO Security Engineering team) which we use to make sure our IaC has all the correct security configurations and settings. If, for example, an AWS S3 bucket created in one of our modules is missing encryption or is exposed publicly, Checkov will flag it. To run this locally or in our pipeline, we simply need to install the binary and run the CLI as follows:

checkov -d my_module_dir --quiet --output cli --framework terraform --download-external-modules false

If a Checkov finding is flagged, then we will go back and correct this, but if it is intended in terms of functionality then we can easily log an exception in the code itself to remind us why that security configuration was not put in place. As an example, in the following AWS load balancer we want to disable deletion protection, hence we have logged an exception for the appropriate finding ID:

resource "aws_lb" "network" {
  #checkov:skip=CKV_AWS_150: Needs to be deletable so deletion protection is disabled
  ...
}

Tflint

Tflint is a pluggable linter that we use to enforce custom static rules, e.g. the presence of certain tags on all our resources. There are also out-of-the-box rules available for various platforms such as AWS, which provide us with information about configuration errors and deprecated syntax and enforce various best practices. It’s an upgrade on the standard terraform validate, but still a cheap, quick and static way to ensure our IaC is up to scratch.

To run tflint just install the binary and, as an example, run the following commands:

tflint --init --config=terraform/.tflint.hcl
cd terraform/modules/my_module
terraform get
tflint --config=../../../terraform/.tflint.hcl

Integration Testing

InSpec

Our first investigated tool for infrastructure integration testing was InSpec. It was fairly well known in the infrastructure testing space, and a few members of our team had dabbled with it before. The workflow was fairly simple: run a terraform apply, execute the InSpec profiles against the provisioned infrastructure, and then run a terraform destroy to tear everything down.

As an example of what our InSpec profiles would look like, we can look at the example below which is validating our module for an AWS S3 bucket:

control "my_bucket" do
  # params is parsed from the terraform output JSON after the apply has run
  BUCKET_NAME = params['my_bucket_id']['value']
  VERSIONING = params['my_bucket_versioning']['value']

  only_if { BUCKET_NAME != "" }
  impact 1.0
  title "My Bucket private and versioning check"
  desc 'Check to see if bucket is private and versioning disabled.'
  describe aws_s3_bucket(BUCKET_NAME) do
    it { should exist }
    it { should have_default_encryption_enabled }
    its('region') { should eq 'eu-west-1' }

    it { should_not have_versioning_enabled }
    it { should_not have_access_logging_enabled }
    it { should_not be_public }
    its('bucket_acl.count') { should eq 1 }
  end
end

The params value is read from the terraform output after the apply has run successfully, and then, as you can see above, we do a number of checks to ensure the S3 bucket has been provisioned correctly with the right configuration.

InSpec worked fairly well for us initially, until we ran into some significant limitations, most notably the lack of resource coverage for some of the providers and PaaS tooling we use (such as Cloudflare), which left a number of our modules untestable.

Since the above limitations were quite significant, we had to pivot and explore alternatives!

Terratest

The option we eventually settled on was Terratest, which we had considered initially but had put on the back-burner due to the entry barrier of needing a fairly decent knowledge of Go in order to start writing tests. However, we took this challenge on head first and began writing tests for a couple of our modules which were untestable with InSpec.

The workflow of Terratest was fairly similar to InSpec’s: apply the Terraform, check the provisioned infrastructure against our expectations, and then destroy it.

The improvement in developer experience here comes from the fact that, as a developer, all of the above is accomplished by simply running go test; the library handles the plan, apply and destroy as directed by the test code.

A fairly simple test for our S3 bucket module would then look like the following:

package test

import (
  "testing"

  "github.com/gruntwork-io/terratest/modules/terraform"
  teststructure "github.com/gruntwork-io/terratest/modules/test-structure"
  "github.com/stretchr/testify/assert"
)

func TestBasicUsageExample(t *testing.T) {

  // Copy the example usage of our Terraform module to a temp directory
  dst := teststructure.CopyTerraformFolderToTemp(t, "..", "examples/basic-usage")

  // Point Terraform at the temp directory
  terraformOptions := &terraform.Options{
    TerraformDir: dst,
  }

  // Add a deferred terraform destroy to run at the end of the test
  defer terraform.Destroy(t, terraformOptions)

  // Run a terraform apply pointing at our temp directory
  terraform.InitAndApply(t, terraformOptions)

  // Check the output of our terraform apply
  output := terraform.Output(t, terraformOptions, "bucket_id")
  // Check to ensure the output matches a value that we are expecting
  assert.Equal(t, "my-test-bucket", output)
}

As you can see, the test is fairly simple in what it does, with the apply, destroy and assert accomplishing what we had done with InSpec. Because the Terraform commands are wrapped as statements in the Terratest library itself, we don’t need to run those commands via the CLI as separate stages in our pipeline, making our pipeline cleaner. Note the defer statement in front of the terraform destroy, which essentially means that it will wait till the end of the test to run that particular command.
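To make that defer behaviour concrete, here is a toy, stdlib-only Go sketch (not from our codebase) where a slice of recorded events stands in for the apply, assert and destroy stages:

```go
package main

import "fmt"

// runTest mimics the shape of a Terratest test: the destroy step is
// registered with defer up front, but only runs once the function returns.
func runTest() (events []string) {
	defer func() { events = append(events, "destroy") }() // runs last
	events = append(events, "apply")
	events = append(events, "assert")
	return events
}

func main() {
	fmt.Println(runTest()) // prints [apply assert destroy]
}
```

This is why the deferred destroy is safe to declare immediately after creating the options: it tears the infrastructure down at the end of the test regardless of what happens in between.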

In order to run the test via the Go CLI all you need to do is run the following command:

go test -v -timeout 30m

The other major advantage of Terratest is that, given it’s just a Go library, we can couple these tests with additional Go libraries for various cloud providers and PaaS tooling (such as Cloudflare), which solves the key limitations of InSpec listed above. For example, the Go AWS SDK and the Cloudflare Go SDK are both extremely comprehensive, as they are maintained by the vendors themselves.

As an example, here is a test that checks for a Cloudflare rate limiting firewall rule with a combination of the Terratest and Cloudflare libraries:

package test

// Import the various libraries for Terratest and Cloudflare
import (
  "context"
  "testing"

  cloudflare "github.com/cloudflare/cloudflare-go"
  "github.com/gruntwork-io/terratest/modules/terraform"
  teststructure "github.com/gruntwork-io/terratest/modules/test-structure"
  "github.com/stretchr/testify/assert"
)

// Directory where our Cloudflare module is kept
const rateLimitingModuleName string = "modules/uri_rate_limit"

func TestCloudflareRateLimiting(t *testing.T) {
  dst := teststructure.CopyTerraformFolderToTemp(t, "..", rateLimitingModuleName)

  // Placeholder value for a Cloudflare auth token
  // Appropriate logic should be inserted to fetch this token securely
  cloudflareAuthToken := "CLOUDFLARE_AUTH_TOKEN"

  // Set various Terraform options including module inputs
  // and environment variables that can be consumed by Terraform
  opts := &terraform.Options{
    TerraformDir: dst,
    Vars: map[string]interface{}{
      "zone_id":                                 "CLOUDFLARE_ZONE_ID",
      "rate_limiting_threshold":                 60,
      "rate_limiting_period":                    60,
    },
    EnvVars: map[string]string{
      "CLOUDFLARE_API_TOKEN": cloudflareAuthToken,
    },
  }

  defer teststructure.RunTestStage(t, "destroy_terraform", func() {
    terraform.Destroy(t, opts)
  })

  // Run a Terraform apply idempotently
  // i.e. running an immediate plan after the apply should show no changes
  teststructure.RunTestStage(t, "apply_terraform", func() {
    terraform.Init(t, opts)
    terraform.ApplyAndIdempotent(t, opts)
  })

  teststructure.RunTestStage(t, "rateLimiting", func() {
    // Get the ID of the firewall rule created as an output
    rateLimitingRuleID := terraform.Output(t, opts, "rate_limiting_rule_id")

    // Initialise a new Cloudflare client with the fetched API token
    api, err := cloudflare.NewWithAPIToken(cloudflareAuthToken)
    if err != nil {
      t.Fatal(err)
    }
    ctx := context.Background()

    // Get the firewall rule as an object using the outputted ID
    // (using the same placeholder zone ID that was passed to the module above)
    rule, err := api.RateLimit(ctx, "CLOUDFLARE_ZONE_ID", rateLimitingRuleID)
    if err != nil {
      t.Fatal(err)
    }

    // Check the attributes of the firewall rule match what we expect
    assert.Equal(t, 60, rule.Threshold)
    assert.Equal(t, 60, rule.Period)
  })
}

As you can see above, the test applies the module, fetches the provisioned rate limiting rule directly via the Cloudflare API, and then asserts that its attributes match the inputs we supplied.

Terratest was looking like an absolute win: all this functionality, with the added benefit of being able to wrap in our own custom helper functions and libraries.

Beautiful!

Testing our Examples

If you look closely at the above code snippets that make use of Terratest, you will notice we are actually testing our modules in two different ways. Whereas the second example copies the module source into a temporary directory and passes the input parameters inline during the test, the first makes use of some example Terraform that we wrote, which consumes our built modules:

dst := teststructure.CopyTerraformFolderToTemp(t, "..", "examples/basic-usage")

Since the examples in most cases will be the point where our users will begin consuming our modules, by simply copy-pasting them and replacing the input variables, it’s quite important to ensure that we test these examples. The example code already has a link back to our module source, and once we copy it to a temporary directory, we call Terratest as usual to execute terraform plan and ensure we get back a success.
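As an illustration, a hypothetical examples/basic-usage for the S3 module might look something like this (the module name, source path and variables here are made up for illustration):

```terraform
# examples/basic-usage/main.tf
module "my_bucket" {
  # Link back to the module source at the repository root
  source = "../.."

  bucket_name = "my-test-bucket"
}

# Output consumed by the Terratest assertions
output "bucket_id" {
  value = module.my_bucket.bucket_id
}
```

Because this is exactly what a consumer would copy-paste, running the test suite against it gives us confidence that the documented entry point works, not just the module internals.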

Multiple Version Testing

If you harken back to the aim we set out at the start of this article, one of our requirements was to be able to test whether our modules work for the different versions set out in the module constraints, i.e. the version of Terraform itself, provider versions etc. This is where we really start to get funky with our Go skills.

Let’s examine the following version constraints for a module that makes use of the AWS terraform provider:

terraform {
  required_version = "~> 1.3.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.46.0"
    }
  }
}

There are a number of different versions of Terraform and the AWS provider for which our module should work. To be explicit, the constraints above allow the following versions:

Terraform: any 1.3.x release (the ~> 1.3.0 constraint permits patch releases only)

AWS Provider: any release from 4.46.0 onwards

But how would we know whether our module will work for each of these versions without running at least a terraform plan with each of these versions explicitly set? The answer lies with Terratest and some clever custom functions that we've put together. Let’s break it down with the case where we test for multiple versions of the AWS provider, where we do the following:

Step 1

Get the list of all available versions of the AWS provider by querying the Hashicorp releases API

Step 2

Examine the version constraint in the required_providers block for aws making use of the github.com/hashicorp/hcl/v2 library to parse the correct .tf HCL file and read in the content we want

Step 3

Compare the outputs of the above 2 steps and get the list of versions from step 1 which fit the constraint specified in step 2, which will give us a narrowed down list of the versions we need to test for according to our constraint (as seen above for the AWS provider)
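Under the hood, step 3 boils down to semantic version comparison. Here is a deliberately simplified, stdlib-only Go sketch that filters available versions against a single ">= x.y.z" constraint; our real helpers handle the full constraint grammar that Terraform supports, so treat the function names and logic here as illustrative only:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse splits a semver string like "4.46.0" into its integer parts.
func parse(v string) []int {
	parts := strings.Split(v, ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, _ := strconv.Atoi(p)
		nums[i] = n
	}
	return nums
}

// compare returns -1, 0 or 1 when a is lower than, equal to or higher than b.
func compare(a, b []int) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			if a[i] < b[i] {
				return -1
			}
			return 1
		}
	}
	return 0
}

// matchingVersions returns the available versions that satisfy a single
// ">= x.y.z" constraint — a simplified stand-in for step 3.
func matchingVersions(constraint string, available []string) []string {
	min := parse(strings.TrimSpace(strings.TrimPrefix(constraint, ">=")))
	var out []string
	for _, v := range available {
		if compare(parse(v), min) >= 0 {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	available := []string{"4.45.0", "4.46.0", "4.46.1", "4.47.0"}
	fmt.Println(matchingVersions(">= 4.46.0", available))
	// prints [4.46.0 4.46.1 4.47.0]
}
```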

Step 4

Loop through each of these versions, copying our module into a new temporary directory for each version we want to test. In each folder, dynamically replace the version constraint with the exact version under test, rather than any range, so it looks something like this:

terraform {
  required_version = "..."

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      # The below line is what is done by Step 4 to specify an explicit version
      version = "4.46.0"
    }
  }
}

Step 5

Execute a terraform plan in each directory; if they all succeed, you know your module works for every version allowed by your constraint!

Putting it all together

Steps 1 to 4 have all been implemented via custom helper functions that the team has written, and we call each of these whenever we want to test a module against a particular dependency and its version constraints. An example of a test that executes all the above looks like the following:

package test

import (
  "os"
  "testing"

  "github.com/gruntwork-io/terratest/modules/terraform"
  teststructure "github.com/gruntwork-io/terratest/modules/test-structure"
  helpers "github.com/ovotech/team-cppe/shared-resources/libs/test-helpers"
)

func TestAwsProviderVersions(t *testing.T) {
  constraint := helpers.GetProviderConstraint(t, "../../my_module_folder", "aws")
  available := helpers.GetAvailableVersions(t, "terraform-provider-aws")
  testVers := helpers.GetMatchingVersions(t, constraint, available)

  for _, version := range testVers {
    version := version
    t.Run(version, func(t *testing.T) {
      t.Parallel()

      tempDir := teststructure.CopyTerraformFolderToTemp(t, "..", "examples/basic-usage")
      helpers.UpdateProviderVersion(t, tempDir, "aws", version, "hashicorp/aws")
      terraform.InitAndPlan(t, &terraform.Options{
        TerraformDir: tempDir,
      })
    })
  }
}

As you can see, the first three lines of the test make use of custom helper functions to execute steps 1-3 described in the process above and get back the explicit list of versions we want to test. Once we have that, it’s just a case of looping through the versions to execute the remaining steps, with the provider version updated in each temporary folder, followed by a terraform plan pointing at the temporary directory. The helper functions used in these tests can all be found in the open source repository for your own reference.

Never again will we put out a module that doesn’t work for the versions we claim it works for. Terratest and Go for the win!

Closing Thoughts

As you can see from the above, our IaC testing strategy and methodology has evolved significantly since we began, becoming robust and all-encompassing. The result is that we have published over 20 reusable modules over the last few months and have had fewer than 5 bugs reported back to us! Success? I think so.

Having said all the above, there are, as always, further improvements we can make, and our testing strategy will continue to evolve.

We hope you have found this blog interesting and useful, and that it has inspired you to dive deeper into your own infrastructure testing strategy and make it a bit more rigorous and robust. To get you started, we have open sourced this repository, which contains code samples demonstrating each of the testing steps discussed above, along with appropriate instructions and the helper resources / libraries we’ve written to enable this.

Happy testing!
