OVO Tech Blog

Our journey navigating the technosphere

Mike Brooks



Complexity in Infrastructure as Code


As a high-level concept, Infrastructure as Code (IaC) is a well-established practice that is helping to improve the consistency and quality of infrastructure deployments.

However, IaC, just like all other code, can be overly complex, use poor conventions, and build up tech debt. As with all code, these flaws make applications much more difficult to maintain and make bugs more likely.

One of the most common areas where complexity creeps into IaC is the handling of multiple deployments of the same infrastructure definitions. How to handle multiple deployments or environments is an early design decision in most IaC applications, and choosing the wrong approach can cause a great deal of additional complexity and design issues later on.

This write-up focuses on Terraform, but the principles apply to other IaC languages.

Problems with one common approach

The most common strategy I see is something like:

├── README.md
├── environments
│   ├── dev
│   │   ├── backend.tf
│   │   └── main.tf
│   ├── production
│   │   ├── backend.tf
│   │   └── main.tf
│   └── staging
│       ├── backend.tf
│       └── main.tf
└── modules
    ├── serviceA
    │   └── foo.tf
    └── serviceB
        └── bar.tf

The theory here is good: by isolating the key logic into modules you can reduce complexity and duplication. The problem is the multiple, duplicate copies of main.tf.

In most cases development begins with each main.tf containing just calls to modules, and each version is identical except for any environmental properties or intentional differences.

There are a couple of problems with this approach, though. Firstly, moving changes between environments relies on what I call version control by copy and paste: when a new change is ready to be released, it must be copied from the dev file to the staging file and then up to the production file. This opens you up to all sorts of mistakes, and it also makes automated releases tricky.

More importantly, a .tf file is a code file in Terraform, so each environment has its own code file. Code files contain logic and bugs, and even if most of the logic lives in modules there is still scope for every environment to have different logic. That means every push to the next environment runs code that has not been tested. It's equivalent to having a different main(args[]) method in your code for every environment.

A simpler, cleaner approach

Instead, different environments should be specified entirely with variables, not code, whether through a properties file, environment variables or some other mechanism.

I'm not suggesting there are no logical differences between environments. When you use multiple environments as part of your testing and development, you should strive to keep them as similar as possible. However, in all but the simplest deployments there will be differences. For example, there might be different security, cost or performance concerns.

Logical differences should be achieved with conditional logic rather than completely separate code files. This means that when you do have logical differences in your environments it is obvious to anybody reading the code what paths are possible and what trigger is in play.

An alternative approach I would recommend is:

├── README.md
├── main.tf
├── backend.tf
├── modules
│   ├── serviceA
│   │   └── foo.tf
│   └── serviceB
│       └── bar.tf
└── properties
    ├── dev.tfvars
    ├── production.tfvars
    └── staging.tfvars

In this case there is only one code definition for all environments; the only differences between environments are the .tfvars property files. In Terraform, .tfvars is the extension used for flat property files that do not contain any resources or logic. An example of one of these .tfvars files looks like:

region = "eu-west-1"
network_cidr = ""
log_endpoint = "prd.log.example.com"
multi_az = true

This also means if you want to create new environments all you have to do is produce a new set of properties. Going one step further, in many cases you can automate the creation of these properties and have an unlimited number of deployments.
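Each value in a .tfvars file is normally backed by a variable declaration in the shared code. A minimal sketch of what those declarations might look like for the example properties above, assuming they live in a variables.tf file next to main.tf (the file name and the descriptions are illustrative and not part of the original layout):

```hcl
# variables.tf (hypothetical file name)
# One declaration per key in the example .tfvars file.

variable "region" {
  description = "Region to deploy into, e.g. eu-west-1"
}

variable "network_cidr" {
  description = "CIDR block for this environment's network"
}

variable "log_endpoint" {
  description = "Host name that logs are shipped to"
}

variable "multi_az" {
  description = "Whether to run across multiple availability zones"
}
```

None of the declarations sets a default, so a run that omits the property file should prompt for the values or fail rather than silently deploy with the wrong settings.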

This structure does require a few changes in how you use Terraform. Some of the common ones are as follows.

How to run terraform actions on different environments

Terraform allows you to specify variable files when running actions like plan or apply, so your command might look something like:

 # To run against staging
 terraform apply \
   -var-file=properties/staging.tfvars

 # To run against production
 terraform apply \
   -var-file=properties/production.tfvars

It is now almost trivial to update different environments without having to switch directory or copy and paste code changes. This also means it's much easier to apply changes to different environments as part of a deployment pipeline.
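Because the only per-environment input is the property file, the command can be wrapped in a small helper for use in a pipeline. A minimal sketch, assuming the properties/ layout shown earlier; the function names var_file_for and tf_env are hypothetical and not part of Terraform:

```shell
# Hypothetical helper: map an environment name to its property file.
# Unknown names are rejected so a typo cannot target the wrong environment.
var_file_for() {
  case "$1" in
    dev|staging|production)
      printf 'properties/%s.tfvars\n' "$1"
      ;;
    *)
      echo "unknown environment: $1" >&2
      return 1
      ;;
  esac
}

# Hypothetical wrapper: run a Terraform action (plan, apply, ...) against
# one environment using the shared code and that environment's properties.
tf_env() {
  local action="$1" env="$2" var_file
  var_file="$(var_file_for "$env")" || return 1
  terraform "$action" -var-file="$var_file"
}
```

With this in place, tf_env plan staging and tf_env apply production run the same code and differ only in the property file they load.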

How to have conditional logic

Terraform allows you to use variables to toggle resources or change their behaviour. However, as with all code, you should think about every conditional you introduce, because each one is a potential source of bugs.

In particular, conditions should not be used for version control, e.g. to disable a resource because it's not ready for production. That's what Git is for.

What conditional logic can be good for is long term, planned, architectural differences. For example in your dev environment you might choose to run in a single AZ, toggling off additional NAT gateways:

# Always created, in every environment
resource "aws_nat_gateway" "gw-a" {
 allocation_id = "${aws_eip.nat-a.id}"
 subnet_id     = "${aws_subnet.public-a.id}"
}

# Only created when is_multiaz is enabled (a count of 1 rather than 0)
resource "aws_nat_gateway" "gw-b" {
 count         = "${var.is_multiaz}"
 allocation_id = "${aws_eip.nat-b.id}"
 subnet_id     = "${aws_subnet.public-b.id}"
}

The important part here is the line count = "${var.is_multiaz}". count is an argument that can be set on almost any Terraform resource and allows you to create multiple copies of a single resource. In this case, setting the value to 0 indicates that the resource is disabled, and setting it to 1 enables it.
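The count line above uses the older string interpolation syntax. Newer Terraform releases (0.12 onwards) generally expect count to be a number and do not convert a boolean for you, so the same toggle is usually written as a conditional expression. A minimal sketch, assuming is_multiaz is declared as a bool and that the aws_eip.nat-b and aws_subnet.public-b resources referenced in the example are defined elsewhere in the configuration:

```hcl
# Assumed declaration; the original does not show how is_multiaz is defined.
variable "is_multiaz" {
  type        = bool
  description = "Create the second-AZ resources when true"
}

resource "aws_nat_gateway" "gw-b" {
  # 1 copy when multi-AZ is on, 0 copies (resource disabled) when it is off
  count         = var.is_multiaz ? 1 : 0
  allocation_id = aws_eip.nat-b.id
  subnet_id     = aws_subnet.public-b.id
}
```

The conditional keeps both possible paths visible in one place, which is the point of using a toggle instead of a separate code file per environment.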

Or to turn off the multi-az feature of an RDS database:

resource "aws_db_instance" "default" {
 allocated_storage    = 10
 storage_type         = "gp2"
 engine               = "mysql"
 engine_version       = "5.7"
 instance_class       = "db.t2.micro"
 name                 = "mydb"
 username             = "foo"
 password             = "foobarbaz"
 parameter_group_name = "default.mysql5.7"
 multi_az             = "${var.is_multiaz}" # Only multi-AZ in prod
}

In this case the resource parameter multi_az takes a boolean (or a 0/1 integer) to control behaviour.

How to work with remote backends

Due to how Terraform loads its backend configuration, there is a limitation that prevents you from using variables and .tfvars files during the init stage. From Terraform's documentation:

Only one backend may be specified and the configuration may not contain interpolations. Terraform will validate this.

-- https://www.terraform.io/docs/backends/config.html

This leads many people to create one backend file for each of their environments, but that is not necessary. You can still use a single file to represent different backends. This is how:

The backend.tf file:

terraform {
 backend "s3" {
   key    = "path/to/my/key"
 }
}

Then, to load the backend use:

export ENV_NAME=production

terraform init \
 -reconfigure \
 -backend-config="bucket=company-name-${ENV_NAME}-terraform-state" \
 -backend-config="profile=${ENV_NAME}"

The combined result of the backend file and runtime inputs is the equivalent of a static backend.tf that looks like:

terraform {
 backend "s3" {
   key      = "path/to/my/key"
   bucket   = "company-name-production-terraform-state"
   profile  = "production"
 }
}

Also, note the use of the -reconfigure flag on the init command. This prevents state leaking between different environments when running against multiple backends in the same session.
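The init step can be wrapped in the same way as the other commands, so that the bucket and profile are always derived from a single environment name. A minimal sketch, assuming the bucket naming scheme and the per-environment profile shown in the combined backend above; the function names state_bucket_for and init_env are hypothetical:

```shell
# Hypothetical helper: derive the state bucket name for an environment,
# following the company-name-<env>-terraform-state pattern used above.
state_bucket_for() {
  printf 'company-name-%s-terraform-state\n' "$1"
}

# Hypothetical wrapper: initialise the backend for one environment.
# -reconfigure makes Terraform ignore any backend it initialised earlier,
# so switching environments in the same session starts from a clean slate.
init_env() {
  local env="$1"
  terraform init \
    -reconfigure \
    -backend-config="bucket=$(state_bucket_for "$env")" \
    -backend-config="profile=${env}"
}
```

Running init_env production before any plan or apply against production keeps the backend selection and the property file selection driven by the same name.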

What about terraform workspaces?

Terraform has the concept of workspaces, which could be a way to manage multiple environments for some teams. However, there are a couple of issues that rule them out for general cases. For example, the documentation says:

In particular, organizations commonly want to create a strong separation between multiple deployments of the same infrastructure serving different development stages (e.g. staging vs. production) or different internal teams. In this case, the backend used for each deployment often belongs to that deployment, with different credentials and access controls. Named workspaces are not a suitable isolation mechanism for this scenario.

-- https://www.terraform.io/docs/state/workspaces.html


This post has presented a view on structuring Terraform for multiple environments, as this is a particularly error-prone area. There are no doubt other approaches that work, and there are definitely other complexity problems that people encounter in IaC.

The key takeaway, then, is that IaC is code, and you should treat it that way. There are countless blogs, books and tutorials on writing clean code. Many, if not most, of the principles apply just as much to infrastructure definitions as they do to applications. Using these principles will help make your infrastructure a lot more reliable, extendable, and usable.

I'm a member of the Production Engineering team here at OVO. You can read more about what we do and how we work, and check our vacancies page.

