OVO is always looking to empower our engineers, helping them to innovate and adopt new technologies that let us solve customers’ problems. However, as technologies mature and best practices evolve, early decisions have started to show their age. In this post I’m going to give a high-level view of the problems we found and how we solved them; future posts will dive deeper into how we built some of these tools.
Like a lot of companies that started early with AWS, we built up a massive monolithic single account containing a spider’s web of linked dependencies:
- Our production and non-prod websites and the services that drive them lived there
- Experimenting and learning projects lived there too
- Big shared network VPCs lived there: occasionally a maze of VPNs would unexpectedly lead to the public internet
- DNS lived there
- Most people needed admin to get anything done
- It was difficult to understand our costs or know who was responsible for resources, and easy to lose track of unused ones
- It had many different data stores: it was hard to tell what was important data and what was being backed up.
With all of this complexity, it was easy to break things when making changes to resources. People tapped into other teams’ services with the best intentions, creating surprising dependencies to uncover later. It grew quickly into a burden of technical debt that needed a lot of work to repay.
As a business we gained a lot from this experience: early cloud adoption meant comfort with new cloud paradigms, and it enabled us to move fast and learn to leverage the potential of AWS and GCP along with other cloud platforms and services. It also helped us learn what not to do, and how to iterate to prevent creating more technical debt.
We’ve done this by splitting out our AWS usage into our current multi-account model. We’ve fixed the legacy problems of one large account, and built tools to let developers move even faster without accruing the kind of debt we generated from using a single account.
We now have:
- A multi account structure with individual accounts for teams, environments or services.
- A dashboard monitoring all the accounts for security risks and best practice
- Automated account vending
- A detailed cloud cost explorer
- and more secure, productive developers
A multi account structure
There are a lot of benefits to breaking into multiple accounts. Clear and obvious separation at AWS API level is probably the most direct benefit. One of the core tenets of our AWS platform is that we don’t want to restrict developers: allowing them to be administrators of their own accounts means they aren’t blocked when well-intentioned security policies turn out to be too tight.
Running only one service or product per account reduces the liability of using admin roles, and stops a temporary DynamoDB solution being quietly built into a dependency for someone else's service unless you explicitly create a role for them. It also means a limited blast radius for when things do go wrong - through error or external malice. It’s much better to lose a single service or team account than to have the whole business’s resources impacted if things do go drastically wrong.
Being able to allow for different environments has been another important change: How often do people use the Administrator Access role “just to test quickly before sorting out permissions properly”? A separate non-prod account for the team or product is the place to build safely before deploying in production. Having resources divided as a minimum into subsets of team and environment gives us a strong platform we can be confident in.
We’ve also been able to split out high-value accounts like Security, and services that require cross-account trust like the account vending process. This allows them to be protected appropriately by allowing only very limited access, or no access at all. In a similar vein, human access to prod environments can be removed entirely, with deployment allowed only through pipelines.
The same benefits apply to networking - explicit peering to different accounts is required rather than just sharing a large VPC, protecting team resources and giving clarity on what each network connection is reliant on. We’ve leveraged Transit Gateway too to allow all our accounts to use one VPN if they need to hit on-prem data as Dan discussed in his cloud networking blog post, and provide teams with a Terraform module as a default VPC if they want to use it. Here’s a deep dive by Chris into what it looks like.
Finally, using AWS Organizations enables the use of Service Control Policy - similar to Windows Group Policy, it lets us set rules across groups of accounts based on organizational units. We use it to protect security resources created during new account setup, suspend all API access to disabled accounts, and disable access to regions we don’t use to prevent accidental or malicious resources being created.
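As a rough illustration of the region rule, here is a minimal sketch that builds a Service Control Policy document denying requests outside an allow-list of regions. The allowed regions and the list of exempted global services are assumptions made for the example, not our actual configuration, and a real policy would likely need a longer exemption list.

```python
import json

# Hypothetical allow-list: the regions a business actually uses.
ALLOWED_REGIONS = ["eu-west-1", "eu-west-2"]

# Global services are not tied to a requested region, so a region lock
# normally exempts them. This is an illustrative subset only.
GLOBAL_SERVICE_ACTIONS = [
    "cloudfront:*",
    "iam:*",
    "organizations:*",
    "route53:*",
    "sts:*",
    "support:*",
]


def build_region_lock_policy(allowed_regions, exempt_actions):
    """Return an SCP document that denies calls made to any other region."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnusedRegions",
                "Effect": "Deny",
                # NotAction: the deny applies to everything except these actions.
                "NotAction": sorted(exempt_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(allowed_regions)
                    }
                },
            }
        ],
    }


policy = build_region_lock_policy(ALLOWED_REGIONS, GLOBAL_SERVICE_ACTIONS)
print(json.dumps(policy, indent=2))
```

Attached to an organizational unit, a policy of this shape applies to every account underneath it, including that account's administrators.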
A security dashboard
Moving from one to many accounts had the potential to just spread the problems we had in one account and give issues more places to hide: monitoring all the accounts for security risks and best practice was a clear priority.
We started with the CIS benchmark criteria, monitoring every account in the organization through an “Infosec” read-only role to see how the teams were getting on. This quickly exposed a problem - not every team had this role in their accounts! We began to add OVO practice expectations alongside the CIS benchmarks, such as having an Infosec role in the account. Over time this evolved with our expectations of the teams, adding more requirements that were monitored.
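To make the idea concrete, here is a minimal sketch of running a set of checks against every account. It assumes each account has already been summarised into a simple snapshot through the read-only role; the `AccountSnapshot` fields, check names and account names are hypothetical and are not how our dashboard is actually built.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class AccountSnapshot:
    """Hypothetical summary of one account, gathered via a read-only role."""
    name: str
    iam_roles: FrozenSet[str]
    root_mfa_enabled: bool


def has_infosec_role(account: AccountSnapshot) -> bool:
    # House rule from the post: every account must contain the Infosec role.
    return "Infosec" in account.iam_roles


def root_has_mfa(account: AccountSnapshot) -> bool:
    # A CIS-style check: the root user should have MFA enabled.
    return account.root_mfa_enabled


# Benchmark checks and house rules live side by side in one registry.
CHECKS: Dict[str, Callable[[AccountSnapshot], bool]] = {
    "infosec-role-present": has_infosec_role,
    "root-mfa-enabled": root_has_mfa,
}


def run_checks(accounts: List[AccountSnapshot]) -> Dict[str, Dict[str, bool]]:
    """Map each account name to a pass/fail result for every check."""
    return {
        account.name: {check: fn(account) for check, fn in CHECKS.items()}
        for account in accounts
    }


accounts = [
    AccountSnapshot("billing-prod", frozenset({"Infosec", "Deployer"}), True),
    AccountSnapshot("spike-sandbox", frozenset({"Admin"}), False),
]
results = run_checks(accounts)
for account_name, outcome in results.items():
    print(account_name, outcome)
```

Adding a new expectation is then a matter of writing one more function and registering it, which matches how the list of monitored requirements grew over time.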
One way that the Security team drove engagement was gamifying the process: passing a test is worth 1 point, failing is worth -1. Using these scores in a league table of product teams definitely motivated security fixes and increased engagement with security awareness and maintenance. However, starting each new account with red crosses across the board and a lot of work to do wasn’t encouraging teams to leverage the benefits of multi-accounts, which leads us to...
Automated account vending
Creating accounts manually is arduous - especially when they have to meet the guidelines outlined above. Giving teams fresh, unconfigured accounts was dumping a large amount of work in their laps before they had even started developing and was a disincentive for working in separate accounts where possible.
We’re now at the point where our Tech Support team can create an account from just a user’s email and requested account name, and deliver a fully compliant account in minutes. Removing the burden of configuring a list of AWS services to secure an account really helps drive the multi-account approach among product teams. Accounts on demand mean teams can now try spikes for new ideas and technologies totally separately from their mature services, collaborate easily on cross-team projects, or delete production entirely to really be confident in their infrastructure as code.
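The first step of vending, creating the account from an email and a name, can be sketched roughly as below. This is not our actual tooling: it assumes a client shaped like the boto3 Organizations client (`create_account` and `describe_create_account_status`), and it uses a small in-memory stand-in so the example runs without AWS. The baseline configuration that makes the account compliant would follow this step and is not shown.

```python
import time


def vend_account(org_client, email, account_name,
                 poll_seconds=5.0, max_polls=60, sleep=time.sleep):
    """Request a new member account and wait until creation finishes."""
    status = org_client.create_account(
        Email=email, AccountName=account_name
    )["CreateAccountStatus"]

    polls = 0
    while True:
        if status["State"] == "SUCCEEDED":
            return status["AccountId"]
        if status["State"] == "FAILED":
            reason = status.get("FailureReason", "unknown")
            raise RuntimeError(f"Account creation failed: {reason}")
        if polls >= max_polls:
            raise TimeoutError("Account creation did not finish in time")
        # Creation is asynchronous, so wait and ask for the status again.
        sleep(poll_seconds)
        status = org_client.describe_create_account_status(
            CreateAccountRequestId=status["Id"]
        )["CreateAccountStatus"]
        polls += 1


class FakeOrganizations:
    """In-memory stand-in for the Organizations client, for illustration only."""

    def __init__(self, polls_until_ready=2):
        self._remaining = polls_until_ready

    def create_account(self, Email, AccountName):
        return {"CreateAccountStatus": {"Id": "car-example", "State": "IN_PROGRESS"}}

    def describe_create_account_status(self, CreateAccountRequestId):
        self._remaining -= 1
        if self._remaining > 0:
            return {"CreateAccountStatus": {
                "Id": CreateAccountRequestId, "State": "IN_PROGRESS"}}
        return {"CreateAccountStatus": {
            "Id": CreateAccountRequestId,
            "State": "SUCCEEDED",
            "AccountId": "123456789012",
        }}


new_account_id = vend_account(
    FakeOrganizations(), "dev@example.com", "team-x-nonprod",
    sleep=lambda seconds: None,  # skip real waiting in the example
)
print(new_account_id)
```

Passing the client in as a parameter keeps the function easy to test and means the same code could be pointed at a real Organizations client.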
A detailed cloud cost explorer
With spend flattened across many accounts, being able to search and group it became important. We built a cloud cost explorer to help AWS users, product managers and Finance understand what and who was generating cost in AWS and how our bill broke down, by aggregating spend per team and AWS service. Spend is much easier to understand - and see! We can filter and visualize by team, account, service and environment. It also adds depth in security: a case of unauthorised access was caught quickly because of this monitoring and its added granularity.
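The grouping itself is simple once every line of the bill is tagged with its team, account, environment and service. A minimal sketch follows, using made-up line items rather than real billing data; the field names are assumptions for the example.

```python
from collections import defaultdict
from decimal import Decimal

# Made-up billing line items; a real bill would have many more fields.
LINE_ITEMS = [
    {"team": "billing", "account": "billing-prod", "environment": "prod",
     "service": "AmazonEC2", "cost": Decimal("120.50")},
    {"team": "billing", "account": "billing-nonprod", "environment": "nonprod",
     "service": "AmazonEC2", "cost": Decimal("30.25")},
    {"team": "billing", "account": "billing-prod", "environment": "prod",
     "service": "AmazonS3", "cost": Decimal("12.00")},
    {"team": "onboarding", "account": "onboarding-prod", "environment": "prod",
     "service": "AmazonDynamoDB", "cost": Decimal("45.10")},
]


def group_spend(line_items, *dimensions):
    """Sum cost for every combination of the requested dimensions."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at zero
    for item in line_items:
        key = tuple(item[dimension] for dimension in dimensions)
        totals[key] += item["cost"]
    return dict(totals)


spend_by_team = group_spend(LINE_ITEMS, "team")
spend_by_team_and_service = group_spend(LINE_ITEMS, "team", "service")

for key, total in sorted(spend_by_team_and_service.items()):
    print(key, total)
```

Using `Decimal` rather than floats avoids rounding drift when many small charges are added together.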
and more secure, productive engineers
These tools are designed to increase our security, visibility and ownership for developers as defaults, allowing them to focus on development and adding value to the business.
All of these improvements in isolation are small and iterative, but combine to create a layered approach that prevents the difficulties that creep in over time. Sometimes these difficulties are trade-offs when you want to use a new service: for example, to set up continuous deployment through a hosted CI/CD tool, an external service needs enough rights to effectively have admin access to an account.
Even as a one-off for your service or team, in a single large account decisions like this become a multiplicative problem of many services and many users: every change increases the risk to your account by a factor of what’s come before, and any external service like our CI provider being compromised would impact all our services.
With a multi-account model, a product or service can live in its own account with no other services - a highly limited blast radius that just isn’t possible without a dedicated account. With automated account creation, we can always create a new isolated environment every time this need comes up, and with our teams writing infrastructure as code it’s little burden to move.
Is a multi-account model right for me?
So - when should you be thinking about a multi account model? If you’re working in a business with more than one developer, I’d say right now. There’s no large cost overhead from AWS, and with some planning it should save a lot more time and heartache than it costs in admin. If you’re in the single monolith phase, the earlier you start breaking it down the smaller the technical debt is. Thinking about how you can isolate key resources as early as possible will save a lot of work down the road. Making small steps and moving the engineering culture towards this approach takes time, but brings long term rewards. Our custom tools have been built and evolved over our journey of fixing earlier problems and give us a lot more flexibility and visibility of what we want in a light, cheap footprint.
As we’ve developed these tools and progressed in breaking down our accounts, AWS have released their version of a baseline multi-account setup, automating account creation in Control Tower and providing a dashboard in Security Hub. We did trial a predecessor of Control Tower (Landing Zone) and felt it would be a step backwards in terms of what we’d built, but as a one-size-fits-all solution it implements most of the ideas discussed. It’s difficult to customise some elements, and full reliance on AWS Config rules ups the price too. It’s a bit bloated, using a full stack of all-AWS products to achieve what it sets out to do, but it does set up a best practice multi-account structure with automated account creation. As you can’t pick and choose features from it easily, it trades speed of implementation against cost... which brings us nicely back to the start - sometimes moving quickly puts a lot of work in the backlog, leaving big monolithic messes to unpick in the future!