We created a model for automatically delivering infrastructure changes with robust security practices, and used it to build a secure Terraform CI/CD solution for AWS at OVO.
Almost everything under the hood at OVO Energy runs on the Kaluza platform, and developers across dozens of teams in OVO work to keep the platform running smoothly. We ship new features and improvements regularly, with many of them contributing to our Plan Zero mission, which helps all our customers on their way towards zero carbon energy.
As we build and scale the Kaluza platform to support our growth, security is one of our most important considerations. With this being a cloud-native platform built on AWS and GCP, we roll out infrastructure changes regularly and iteratively. To keep this process well-managed, we implement the practice of Infrastructure as Code across OVO; and for the vast majority of our platform, this means quite a bit of Terraform.
Terraform is a powerful tool for declaring the various managed cloud services we use across the platform. By managing the vast majority of our infrastructure as Terraform configurations, our infrastructure changes can be automated through a Continuous Integration and Continuous Delivery (CI/CD) pipeline. This not only serves as an important improvement to the quality-of-life of software engineers working on the Kaluza platform, but also provides consistency and visibility of infrastructure changes happening across our platform for other interested parties, such as security engineers and site-reliability engineers (SREs).
However, as with many good practices in the land of DevOps, automated change delivery can be a double-edged sword, and we have to make sure the delivery pipelines always work in accordance with robust security practices. This is why during the last year, the Kaluza Security Engineering team at OVO undertook a review of our CI/CD pipelines.
Based on the key security challenges we identified during this review, we created a model for CI/CD pipelines delivering infrastructure changes which implements robust security practices; and using this model, we developed a secure CI/CD solution for Terraform with AWS Developer Tools, which is now used widely across our platform. In this blog post we will share our experience along the way.
What we found
Following the culture of Technical Autonomy at OVO, teams working on the Kaluza platform can choose to use their preferred CI/CD solution for their Infrastructure as Code changes, with guidance and support from Security Engineers and Production Engineers. Most Kaluza teams have chosen popular hosted CI/CD platforms to deliver their changes.
These hosted solutions offer a set of very convenient features, such as access controls integrated with source code repositories, reusable modules for common actions which can be shared between pipelines, and integrated secret management for containers running the CI/CD jobs. However, after reviewing their security features, we realised that significant risks remain in the areas of network security, access management, and change control when we use these solutions to deliver Infrastructure as Code changes.
Working with strict network access controls
For most CI/CD pipelines delivering infrastructure changes within the Kaluza platform, their containers run on platforms managed by an external provider. We realised that none of these popular providers uses a fixed list of egress IP address ranges for traffic leaving their CI/CD containers. Instead, the addresses are either completely ephemeral or are pooled within very broad IP ranges.
Furthermore, on our platform’s end, we cannot verify whether ingress traffic from a CI/CD platform’s IP ranges came from a container running one of our jobs, or from a container launched by another customer using the same managed platform for their CI/CD workloads.
As illustrated in the diagram below, these factors present us with two challenges from a security perspective:
Firstly, the majority of changes to our AWS and GCP environments are applied by Terraform calling the cloud environment APIs using its service credentials. By default, these API requests can be made from anywhere on the internet, and it is not always practical to apply conditions in IAM policies to restrict what source IP ranges can call these APIs. However, for service credentials with the most sensitive permissions, being able to refuse API requests from unexpected source IP ranges would be a useful defense in depth measure.
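As a sketch of this defense in depth measure, an IAM policy can refuse API requests from unexpected source IP ranges. The policy name, role reference, and CIDR range below are illustrative placeholders, and note that `aws:SourceIp` does not match traffic arriving through VPC endpoints:

```hcl
# Hypothetical sketch: deny all API calls from outside trusted egress ranges.
data "aws_iam_policy_document" "restrict_source_ip" {
  statement {
    effect    = "Deny"
    actions   = ["*"]
    resources = ["*"]

    condition {
      test     = "NotIpAddress"
      variable = "aws:SourceIp"
      values   = ["203.0.113.0/24"] # replace with your trusted egress ranges
    }
  }
}

resource "aws_iam_role_policy" "deny_unexpected_ips" {
  name   = "deny-unexpected-source-ips"
  role   = aws_iam_role.terraform.id # placeholder role
  policy = data.aws_iam_policy_document.restrict_source_ip.json
}
```

Because the statement is a blanket Deny with a condition, it layers on top of whatever Allow policies the role already holds, which is what makes it useful as a defense in depth control for the most sensitive service credentials.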
Secondly, some components in our platform are declared through Terraform, but are not managed through AWS or GCP APIs. Examples of this type include managed Kubernetes control planes and database endpoints. Following the defense in depth principle, both of these are powerful administrative interfaces, which require additional network access control beyond their native authentication layers.
To provide a sufficient level of security for managing these interfaces from a CI/CD pipeline, we have to establish SSH tunnels to bastion hosts as part of these CI/CD jobs, which not only complicates the CI/CD setup, but also creates the headache of having to manage the extra SSH credentials and to keep the bastion hosts themselves secure.
Managing privileged credentials for cloud environments
In order to make changes through AWS and GCP APIs on our behalf, many CI/CD job containers need access to privileged service credentials assigned from these environments, and these service credentials often need to be stored within the CI/CD platform’s chosen secret management solution and made accessible to containers.
To reduce the risk of storing these privileged credentials, engineers at OVO do a very good job at both minimising access to secrets stored within our CI/CD environments, and having them rotated regularly.
However, the ideal scenario would be one where no credentials are stored at all: the cloud environment would simply establish a short-lived session with the CI/CD container, lasting only as long as the job itself. This is not yet available from the existing CI/CD solutions for our platform.
Enforcing the two-person rule
With our Technical Autonomy culture, each team within the Kaluza platform takes ownership over the part of the platform they are responsible for. When a team needs to make changes to their infrastructure, an engineer within the team will write up a pull request (PR) for the underlying Infrastructure as Code, which is automatically checked by the CI stage. The PR is then reviewed by another engineer within the team, before being merged and applied automatically by the CD stage; as shown in the diagram below.
By applying branch protection rules on a repository for Infrastructure as Code and using a CI/CD pipeline to deliver the corresponding changes, we can implement and enforce a two-person rule: any production changes must be initiated jointly by at least two engineers in each team.
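The two-person rule described above can itself be declared as code. A minimal sketch using the third-party GitHub provider for Terraform is shown below; the repository name and branch pattern are placeholders:

```hcl
# Hypothetical branch protection enforcing the two-person rule:
# the PR author plus at least one approving reviewer.
resource "github_branch_protection" "production" {
  repository_id = "infrastructure-live" # placeholder repository
  pattern       = "main"

  required_pull_request_reviews {
    required_approving_review_count = 1
  }

  # Apply the same rules to repository administrators.
  enforce_admins = true
}
```

Managing branch protection through Terraform also means that weakening the rule is itself a reviewable infrastructure change.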
However, in many CI/CD solutions, an attacker can actually circumvent the two-person rule by taking advantage of how jobs are specified in Git:
- Most solutions feature a pipeline or job specification file;
- This specification is hosted within the Git repository itself;
- The container runtime configured in the job specification generally can access all secrets associated with the repository (or the group of users managing it), whether non-production or production;
- CI/CD solutions generally execute the version of the specification file in the triggering branch, and this design detail often cannot be configured;
- The job specification can therefore be modified and executed by pushing to an unprotected feature branch, without being subject to the two-person rule;
- This means that when using a typical CI/CD solution for delivering infrastructure changes, an attacker can achieve privilege escalation using production credentials, without actually being able to merge changes to a protected production branch.
Modelling an ideal CI/CD solution
Having studied the security challenges of CI/CD platforms and solutions we used, we charted out the requirements for a CI/CD solution which overcomes these challenges. These requirements can be summarised as follows:
Native support for private networking:
- CI/CD job containers launched by the solution should have the ability to connect to interfaces within our internal networks through private networking.
- An external platform simply using fixed egress IP ranges would not be sufficient, unless these IP ranges are dedicated to our containers.
If static credentials must be stored, they need to be protected:
- For CI/CD containers running outside our cloud environments, storing privileged cloud environment credentials can be unavoidable.
- We need to ensure that manual access to these credentials is scoped to the bare minimum and audited in detail, and that the credentials are rotated regularly and automatically.
Modifications to the behaviour of the CI/CD pipeline for production should require review:
- If a CI/CD solution stores the pipeline or job specification within the Git repository it pulls changes from, we should be able to instruct the pipeline to always execute the specification stored in a protected production branch, changes to which require another person to review and approve.
Building a secure solution
As security engineers, we went looking for a CI/CD solution which satisfies as many of our security requirements discussed above as possible. Such a solution may offer fewer features and less usability than what’s provided by our existing CI/CD platforms, and hence will not replace our existing platforms in shipping application changes. If it is used as a separate CI/CD pipeline for delivering just infrastructure changes however, it would be much easier for us to find the right trade-off point between features and security.
After searching through a range of solutions available on the market, we arrived at AWS Developer Tools, which includes the two managed services our new CI/CD pipeline for Infrastructure as Code will be primarily based on: AWS CodeBuild and AWS CodePipeline.
These two services are similar to the concepts of “jobs” and “workflows” on most existing CI/CD platforms. However, we chose to base our new solution on these services, because they have the necessary features to solve our three security challenges in turn:
Native private networking:
- Containers launched by CodeBuild can run within a designated subnet in our AWS VPC.
- A dedicated Security Group can be attached automatically to each container running CI/CD jobs, whose traffic is then accepted only by the internal network resources we want to manage through Terraform.
Short-lived credentials through IAM roles:
- In common with many other AWS services, CodeBuild containers can run with a dedicated IAM role. The container runtime will be assigned short-lived credentials to assume this role.
- This means that there are no static IAM credentials which we will have to manage or rotate.
- Containers at different stages of the workflow can use different roles, each with least privilege. For example, as a security fail-safe, there is no need for the container running terraform plan to have write access to AWS.
- Permissions assigned to the role can also go beyond IAM policies: as an IAM principal, the role can be written directly into fine-grained resource policies on AWS.
Pinned job specification files:
- CodeBuild allows us to statically define the CI/CD pipeline specification through the AWS API, using Terraform. The existing specification file can only be modified by applying changes through the CI/CD pipeline itself (outside of established break-glass processes), which in production requires review under the two-person rule.
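The three properties above come together in a single CodeBuild project definition. The sketch below is illustrative only; the project name, role, subnet, security group, and buildspec path are all placeholders:

```hcl
# Hypothetical CodeBuild project combining private networking,
# a least-privilege IAM role, and a pinned job specification.
resource "aws_codebuild_project" "terraform_plan" {
  name         = "terraform-plan"
  service_role = aws_iam_role.plan_read_only.arn # read-only role for planning

  environment {
    compute_type = "BUILD_GENERAL1_SMALL"
    image        = "aws/codebuild/standard:5.0"
    type         = "LINUX_CONTAINER"
  }

  # The buildspec is pinned through the AWS API rather than read
  # from the triggering branch, closing the privilege-escalation path.
  source {
    type      = "NO_SOURCE"
    buildspec = file("${path.module}/buildspecs/plan.yml")
  }

  # Run inside our VPC with a dedicated Security Group.
  vpc_config {
    vpc_id             = aws_vpc.main.id
    subnets            = [aws_subnet.private.id]
    security_group_ids = [aws_security_group.cicd.id]
  }

  artifacts {
    type = "NO_ARTIFACTS"
  }
}
```

Because the buildspec is supplied via the API, changing the job's behaviour requires a change to this Terraform configuration, which itself flows through the reviewed pipeline.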
Design of the pipeline
The design of the CI/CD pipeline for Infrastructure as Code is shown in the diagram below, which is itself created and managed via Terraform in each of the AWS environments. After the initial bootstrap process, the pipeline in each environment can control and deliver changes to itself. This practically makes the pipeline self-hosting (changes to the pipeline will take effect at its next execution).
Setups within production and non-production environments are separate but identical, with production CI/CD pipelines only running on merges to protected production branches. The integrity of the S3 buckets storing intermediate artifacts is reinforced through denying bucket policies.
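A denying bucket policy of the kind mentioned above can be sketched as follows; the bucket and role references are placeholders:

```hcl
# Hypothetical policy: refuse writes to the artifact bucket from any
# principal other than the pipeline's own role.
data "aws_iam_policy_document" "artifact_bucket" {
  statement {
    sid       = "DenyWritesOutsidePipeline"
    effect    = "Deny"
    actions   = ["s3:PutObject", "s3:DeleteObject"]
    resources = ["${aws_s3_bucket.artifacts.arn}/*"]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "aws:PrincipalArn"
      values   = [aws_iam_role.pipeline.arn]
    }
  }
}

resource "aws_s3_bucket_policy" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id
  policy = data.aws_iam_policy_document.artifact_bucket.json
}
```

An explicit Deny in a bucket policy overrides any Allow granted elsewhere, so even a compromised principal with broad S3 permissions cannot tamper with staged artifacts.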
We will explain each step of the CodePipeline below:
tf_source: Due to a current limitation on which GitHub branches can trigger CodePipeline releases, we have set up this standalone CodeBuild job outside of the CodePipeline. Its sole responsibility is to receive PR change notifications from GitHub via webhooks, and then stage the correct revision of the repository in the protected S3 Source Bucket, for use in the rest of the pipeline.
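The webhook wiring for a job like tf_source can be declared in Terraform as well. This fragment is a hedged sketch; the event filters and branch pattern are assumptions for illustration:

```hcl
# Hypothetical webhook: trigger the tf_source job only when a PR
# targeting the protected branch is merged.
resource "aws_codebuild_webhook" "tf_source" {
  project_name = aws_codebuild_project.tf_source.name

  filter_group {
    filter {
      type    = "EVENT"
      pattern = "PULL_REQUEST_MERGED"
    }
    filter {
      type    = "BASE_REF"
      pattern = "refs/heads/main" # placeholder protected branch
    }
  }
}
```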
source: This stage responds to changes of the staged revision in the source bucket, and runs the pipeline on changes. We use Terraform state locking to ensure ordered delivery of Infrastructure as Code changes that are triggered concurrently.
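The state locking referred to above is a standard Terraform backend feature. A minimal sketch, with placeholder bucket, key, and table names:

```hcl
# Hypothetical S3 backend with DynamoDB state locking: concurrent
# pipeline runs queue on the lock, giving ordered delivery of changes.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"    # placeholder bucket
    key            = "platform/terraform.tfstate" # placeholder key
    region         = "eu-west-1"
    dynamodb_table = "terraform-state-lock"       # lock table
    encrypt        = true
  }
}
```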
terraform_plan: Running with read-only privileges, this stage plans Terraform changes based on configurations in the staged revision. It will produce a deterministic plan which will be later applied following human approval. If it fails, a Slack notification will be sent through a custom integration running on AWS Lambda so that engineers can investigate.
For production environments, this stage also runs as a standalone task on every commit to open PRs, which provides a preview on what changes will be made in production if the PR is later merged and applied.
review_terraform_plan: Once a plan succeeds, it enters a manual approval stage in the pipeline, which automatically pauses the pipeline run until the author of the PR manually reviews the changes logged in the terraform_plan stage. If every change in the plan is expected, the author then “approves” the change within AWS CodePipeline.
With non-production and production CI/CD pipelines operating separately in their respective AWS environments in our design, this approval step does not concern change control. It is simply a halt for us to manually review infrastructure changes that will be executed.
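Within the pipeline definition, this halt is just a Manual Approval action. The fragment below is illustrative, with other stages omitted and names chosen as placeholders:

```hcl
# Hypothetical pipeline fragment showing the manual approval stage.
resource "aws_codepipeline" "terraform" {
  name     = "terraform-pipeline"
  role_arn = aws_iam_role.pipeline.arn

  artifact_store {
    type     = "S3"
    location = aws_s3_bucket.artifacts.bucket
  }

  # ... source and terraform_plan stages omitted for brevity ...

  stage {
    name = "review_terraform_plan"
    action {
      name     = "ManualApproval"
      category = "Approval"
      owner    = "AWS"
      provider = "Manual"
      version  = "1"
    }
  }
}
```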
terraform_apply: Once the author has approved the change in the previous step, the planned Terraform changes will be applied by this stage deterministically into the AWS environment. If the plan fails to apply for any reason, a similar error notification will be sent to Slack as during the terraform_plan step. If everything works as expected, a notification of success will be sent to Slack.
Our next steps
While our new CI/CD solution for Infrastructure as Code using AWS Developer Tools will only work for Kaluza teams using AWS, the increasing feature parity between the AWS and GCP platforms means that an equivalent CI/CD solution for GCP meeting all three security requirements we set out in this blog is quickly becoming feasible. We hope to build such a solution in the near future.
While this solution is now widely used for delivering infrastructure changes within Kaluza, it does not mean that it will become the only solution. We are still committed to the Technical Autonomy of teams choosing the CI/CD tool they work with the best. The security engineering team continues to work with all teams, and helps them make use of new security features provided by new or existing solutions, all to keep the Kaluza platform secure.