Introduction
In the computing world, blue green is an established deployment model, usually associated with software applications. The idea is that you run two versions of the code side by side and switch traffic to the new version without downtime, leaving the old version in place in case a rollback is required. Applied to cloud infrastructure, the model also exercises the process of creating entire stacks or individual resources, uncovering issues that would otherwise only be discovered at the worst possible moment. The ephemeral nature of blue green resources brings some nice side benefits too, such as limiting the age of service account keys.
Blue green is different to A/B testing and canary releases, both of which tend to be used to switch a certain percentage of users to, for example, a new UI component or app version. Both old and new versions are in use at the same time, and metrics are monitored to determine whether the new version can be deemed a success. The intention with blue green, on the other hand, is for all traffic to be directed to the new version, with the old version left in place should a rollback be required.
Surprisingly, the term isn't derived from the photosynthesising bacteria of the same name. Instead, blue and green appear to have emerged as the de facto colours for the two side-by-side versions (mentioned above), chosen to avoid any perceived hierarchy between them, as might be the case with "A" appearing better than "B", for example. Netflix have even rebranded the model in their own colours as "red black", so perhaps you can too?
Infrastructure-as-Code (IaC) tools often provide the means to achieve blue green even without code duplication, as I'll cover later, but this doesn’t imply blue green operations will just work straight out of the box.
In this blog post I'll be covering several phases of a journey I recently embarked on, taking existing IaC (Terraform in this case) to a blue green model.
Phase 1: Determine The Scope
For some major cloud providers, it's possible to Terraform the entire account/project (even the account/project itself), and therefore possible to fit everything into a blue green model. Not all resources are a great fit for blue green, though: monitoring resources, for example, or datastores that carry a risk of split-brain.
A pattern we've adopted at OVO is to split IaC into permanent and ephemeral resources, the latter containing the resources we can fit into blue green. Being aware of which resources fit into blue green, or at least being flexible and accepting that resources may need moving out of blue green at a later stage, is crucial to determining the scope. Ephemeral resources can use attributes of permanent resources where required, via a terraform_remote_state data source.
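As a minimal sketch, the ephemeral IaC might read an output exposed by the permanent IaC like this (the bucket, prefix, output name and cluster resource here are all hypothetical):

data "terraform_remote_state" "permanent" {
  backend = "gcs"
  config = {
    bucket = "my-terraform-state"  # bucket holding the permanent IaC's state
    prefix = "permanent"
  }
}

# Use an output from the permanent stack, e.g. a shared VPC network name
resource "google_container_cluster" "cluster" {
  name    = "prod-cluster"
  network = data.terraform_remote_state.permanent.outputs.network_name
  # ...
}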
The number of resources falling into blue green scope could be a low proportion of the overall estate. For example, a single Kubernetes cluster. Adopting blue green in this example can still be worthwhile, providing a clean and simple way of creating virtually identical clusters.
Phase 2: Preparing IaC
Terraform workspaces are the key to enabling blue green. You'll already be using one (called "default") even if you've never heard of them before.
In short, workspaces allow you to operate multiple Terraform states within the same backend, making it very easy to switch between workspaces/states on the command line before running operations like plan, apply and destroy. Creating a new stack from a new workspace at this point will most likely result in errors, though, if resources in different workspaces don't have unique names.
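For reference, listing and switching workspaces on the command line looks like this (the workspace name is just an example):

$ terraform workspace list          # shows all workspaces, including "default"
$ terraform workspace select prod-blue
$ terraform plan                    # runs against the prod-blue state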
Terraform can interpolate the current workspace in resource attributes. For example, setting this:
name = "prod-cluster-${terraform.workspace}"
..in Terraform would result in:
name = "prod-cluster-green"
..in the green workspace. It's this mechanism that enables the separation of blue and green resources using the same IaC code.
Cloud providers can handle some resources in interesting ways when it comes to deletion. They may prevent certain resources from being deleted at all, or may hang onto the names of resources despite them (supposedly) having been destroyed. These scenarios need workarounds for blue green.
A good example of this is the GCP KMS keyring, which can't be destroyed in GCP at all. If a keyring is present in a Terraform state and a terraform destroy command is issued, Terraform removes the keyring from the state file but deliberately doesn't attempt any deletion in GCP. The next attempt to create the keyring resource with a terraform apply will then return an error, as a keyring with that name already exists.
A common workaround to this error is to use the "random" Terraform provider. The provider can generate a random string, which can then be injected into resource attributes to make names unique. This does, however, lead to dangling resources building up over time. The other approach is to move the resources in question out of the blue green model and into the permanent IaC.
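As a rough sketch of the random provider workaround, applied to the keyring example above (the resource and keyring names are illustrative):

resource "random_id" "keyring_suffix" {
  byte_length = 4
}

resource "google_kms_key_ring" "keyring" {
  name     = "my-keyring-${random_id.keyring_suffix.hex}"
  location = "europe-west1"
}

Each newly created stack generates a fresh suffix, so the keyring name never collides with one that was previously removed from state.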
If resources are to be moved out of ephemeral and into permanent IaC, several options are available:
- Resource code can be moved, and those changes applied in their respective repos (this will lead to resource recreation).
- Resources can be removed from one state manually by issuing terraform state rm <resource_name>, then imported into the other with terraform import <terraform_id> <resource_id> (resource code will still need removing/adding in the respective repos, and ultimately Terraform will report a no-op when running a terraform plan in each repo).
- Resources can be moved directly between the two states with a single command, terraform state mv -state-out=other.tfstate <resource_name> <resource_name> (again, resources will need removing/adding in the two sets of Terraform code).
Note: when moving resources between states, regardless of whether it's done directly or via the state rm and import commands, make sure other people don't run any terraform apply commands until after the Terraform code changes have been merged to the main branch. If they do, they'll destroy the resource(s).
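For example, moving the keyring from the ephemeral state into the permanent state using the second option might look something like this (the directories, resource addresses and import ID are all illustrative; check the provider docs for the exact ID format):

$ cd ephemeral
$ terraform state rm google_kms_key_ring.keyring
$ cd ../permanent
$ terraform import google_kms_key_ring.keyring projects/my-project/locations/europe-west1/keyRings/my-keyring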
Phase 3: GitOps
GitOps is a paradigm in which Kubernetes resources are defined declaratively in a Git repository and continuously applied to the cluster. While not essential, I chose to implement GitOps (ArgoCD in this case) for the value it adds in automatically deploying resources into a new Kubernetes cluster. Once ArgoCD has started up in Kubernetes, it applies the resources defined in the GitOps repo. This is in stark contrast to alternative methods such as applying via kubectl, which can still be scripted, but the process is likely to be either disjointed from the creation of the cluster or very complex to automate.
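One way to keep ArgoCD's installation tied to the cluster creation itself is to install it from the same Terraform via the Helm provider. A minimal sketch, assuming the official argo-helm chart (chart values and versions will need tailoring):

resource "helm_release" "argocd" {
  name             = "argocd"
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  namespace        = "argocd"
  create_namespace = true
}

Once ArgoCD is running, it pulls the application definitions from the GitOps repo, so a brand new cluster converges on the same workloads without any manual kubectl steps.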
Phase 4: Create A New Workspace
Creating a new workspace with Terraform is straightforward. For example:
$ terraform workspace new prod-green
When this runs, Terraform will create a new workspace-specific state file in the remote backend, and switch the user over to the workspace in question. The command only needs to be run once for each workspace being used. The state will initially be empty, ready for all those lovely resources to be created:
$ terraform apply
This should result in Terraform reporting Plan: x to add, 0 to change, 0 to destroy, since no resources exist yet.
It makes sense to be aware of application workloads before creating the new stack. When resources are created, duplicate workloads will be running simultaneously. This could be fine if using asynchronous messaging services or running APIs, but it's worth checking the impact of the duplication. Cronjobs should also be checked where applicable to ensure they're idempotent.
Phase 5: Switching DNS
DNS switching can be performed so requests reach APIs in the new stack. Note: this will only apply to stateless APIs; switching stateful APIs is more involved, requiring additional load balancing and session draining.
One solution for DNS switching is to keep the DNS Terraform in its own module/directory, controlled entirely separately with its own backend and state. When switching between blue and green, only DNS records are involved; no other resources can get in the way.
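For example, the DNS directory might have its own backend configuration, completely independent of the ephemeral resource state (the bucket and prefix are hypothetical):

terraform {
  backend "gcs" {
    bucket = "my-terraform-state"
    prefix = "dns"
  }
}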
The DNS Terraform can be workspaced into blue green. However, this scenario is peculiar in that both workspaces need to be able to manipulate the same DNS record, which can be achieved by running terraform import in both workspaces.
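For example (the resource address is illustrative, and <record_id> is whatever import ID the provider expects for the record):

$ terraform workspace select prod-blue
$ terraform import google_dns_record_set.api <record_id>
$ terraform workspace select prod-green
$ terraform import google_dns_record_set.api <record_id>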
Tip: with a single resource present in multiple Terraform states, there's a risk that an accidental terraform destroy in one workspace deletes the DNS record (especially if the DNS Terraform directory sits alongside other Terraform directories). To mitigate this risk, a prevent_destroy lifecycle can be placed on the resource, seeing as the DNS record should never be destroyed in either workspace.
Static IP addresses can be reserved in the cloud provider in the ephemeral resource Terraform. Seeing as the DNS Terraform itself is workspaced, it can retrieve the IP address according to the workspace currently in use (e.g. if workspace = green, retrieve the green static IP address).
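Putting the lifecycle guard and the workspace-keyed IP lookup together, a rough sketch of the DNS Terraform in GCP might look like this (the record, zone and address names are all illustrative):

# Look up the static IP reserved by the ephemeral Terraform for the current workspace,
# e.g. "api-prod-green" when the prod-green workspace is selected
data "google_compute_address" "api" {
  name = "api-${terraform.workspace}"
}

resource "google_dns_record_set" "api" {
  name         = "api.example.com."
  managed_zone = "example-zone"
  type         = "A"
  ttl          = 300
  rrdatas      = [data.google_compute_address.api.address]

  lifecycle {
    # the record lives in both workspaces' states, so block accidental destroys
    prevent_destroy = true
  }
}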
Switching over to green, for example, is as simple as selecting the new workspace and running apply:
$ cd dns
$ terraform workspace select prod-green
$ terraform apply
..and that's DNS switched over! As soon as the DNS change has propagated, requests should reach the API in the new stack. If there's an issue, the switch can be reversed with the same commands, selecting the prod-blue workspace instead.
If checks exist for the new (green) stack, and they pass, the old (blue) stack can be destroyed:
$ cd ../terraform
$ terraform workspace select prod-blue
$ terraform destroy
At this point, Terraform will proceed to destroy all resources in the old (blue) stack. Despite these resources no longer being required, it's still important to ensure the destroy happens cleanly. If the destroy process fails, resources may be left in a state from which a clean recreation isn't possible, and that recreation may be needed quickly in an emergency.
Once the destroy operation has completed, the workspace itself will remain intact, just with an empty state, ready for the stack to be created again.
Conclusion
I've covered the various phases I experienced in retrofitting existing IaC into a blue green model, from choosing which resources to fit into the model, through to ways of interacting with Terraform in order to achieve a working solution.
Terraform workspaces can be a scary beast if you've not had the pleasure of working with them before, and having to convert a large amount of code that maps to real-life resources can be a daunting prospect. Bear in mind that even before telling Terraform to create a new workspace, you'll already be operating in one, called "default". Separate any resources that clearly don't fit into the blue green model (e.g. databases), then workspace all the remaining things. In the worst case, you'll discover some resources need moving outside the model.
"Perfect is the enemy of good" is an aphorism often used in computing. I recommend approaching blue green in the same way. When executed well and fully automated, it can be a very powerful tool, but most of the benefit comes from being able to run terraform apply and terraform destroy cleanly on two workspaces, even if the CI processes pulling the strings are manual.
Whether you've automated blue green or are operating it manually, it'll give you the opportunity to roll back changes very quickly, and to check that you (or the cloud provider) don't have any bugs in the resource creation process. This gives you time to iron those issues out, rather than encountering them during a major incident. Finally, it'll allow you to treat your entire stacks (with some exceptions, e.g. databases) as cattle rather than pets. Need to recover from a huge data centre meltdown? Blue green is your friend!
Disclaimer: I would love to be able to take credit for establishing blue green infrastructure at OVO, but that credit should go to other brilliant OVO engineers who've built the model up to what it is today.