This is a question that every team should be asking themselves. There is no correct answer, but you should understand why it is an important question and how the answer affects your team.
If you have implemented your environments using Infrastructure as code (IaC), there is normally one main reason why you haven't deleted your environment: you do not trust your IaC enough to bring everything back up correctly after a complete loss.
Infrastructure as code is an investment
Implementing IaC is a slow, long-winded process. There are tools that will help you write the code, and the more experience you have with it the faster you will be. But writing code to define your infrastructure will almost always be significantly slower than clicking around creating resources in a web console. This means you're putting extra effort into something now in the hope of future reward. Fundamentally this makes IaC an investment, often a major one.
Why you should make this investment can be a tricky question and is outside the scope of this post. The important thing here is that whatever your reasons, once you've started using IaC you should make sure that it justifies the cost and that you maximise returns, as you would with any other investment.
For IaC to be of any value it has to represent your infrastructure, accurately and reliably. However, experience has shown that there are many reasons why what's in your live infrastructure can drift from your static IaC definition in sometimes very subtle ways.
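To make the idea of drift concrete, here is a minimal sketch in Python. It assumes you can express both the declared definition and the live state of a resource as plain dictionaries; the `find_drift` helper and the attribute names are hypothetical and not part of any IaC tool.

```python
MISSING = "<absent>"  # marker for an attribute that exists on one side only


def find_drift(declared, live):
    """Return {attribute: (declared_value, live_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | live.keys()):
        want = declared.get(key, MISSING)
        have = live.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: someone opened a debug port by hand in the console.
declared = {"instance_type": "t3.micro", "port": 443}
live = {"instance_type": "t3.micro", "port": 443, "debug_port": 9229}

print(find_drift(declared, live))
```

The subtle part is that nothing in the code has changed, so a code review would never spot the extra port. Only a comparison against the live environment reveals it.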
How does drift develop in IaC?
Consider a simple IaC project that creates an EC2 instance.
Let's say it's been running for about half an hour. As a developer you would have no concern about tearing this environment down and rebuilding it. You might even do it at the end of every day to save on costs.
You probably have questions about how any state is preserved, or you might be reluctant to repeat any manual steps, e.g. installing the application onto the instance. But by practicing the regular delete and create when your infrastructure resources are new, you address these issues while they are simple and easy to understand. There is no risk and you can experiment with the best options. As you've just written the code, its purpose and dependencies are fresh in your mind. You can make sure that any state is backed up to a safe, separate place and you can automate any manual steps. In short, you make a robust IaC definition that you trust.
If you keep following this practice, regularly tearing down and rebuilding your entire system, then you get a quick feedback loop for issues in the accuracy of your IaC while they're easy to fix. If you're working on a shared environment you might go as far as creating nightly or weekly jobs that tear down all resources at the end of a day and build them again the next. This has the extra benefit of cost efficiency, as you're not paying for servers when your developers aren't using them. As your infrastructure grows you can be sure that all of its complexity is understood and that it's accurately recorded in your IaC.
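As a rough illustration of that feedback loop, the sketch below simulates a delete and recreate against an in-memory stand-in for an environment. `FakeEnvironment` and `recreate_and_check` are hypothetical names invented for this example. A real job would call your IaC tool instead, but the shape of the check is the same: destroy, create from code only, then see what is missing.

```python
class FakeEnvironment:
    """In-memory stand-in for a real environment (hypothetical, for illustration)."""

    def __init__(self, definition):
        self.definition = set(definition)  # what the IaC declares
        self.resources = set()             # what actually exists right now

    def create(self):
        # Build the environment from the definition only.
        self.resources = set(self.definition)

    def destroy(self):
        self.resources = set()


def recreate_and_check(env, required):
    """Tear down, rebuild from code, and return the required resources that are missing."""
    env.destroy()
    env.create()
    return sorted(set(required) - env.resources)


env = FakeEnvironment(definition={"instance"})
env.create()
env.resources.add("firewall-rule")  # manual change that never made it into the IaC

missing = recreate_and_check(env, required={"instance", "firewall-rule"})
print(missing)
```

Run on day one, a check like this flags the manual firewall rule while you still remember adding it. Run for the first time six months later, the same gap is far harder to explain.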
However, there is another path that often gets taken. At the end of your day you decide it's too much effort to delete the development environment and recreate it the next day, you're in the middle of working on some configuration that hasn't been added to IaC yet, or you're not certain how to deal with an intermittent deployment bug. These should be major warning signs: your code is no longer an accurate representation of your infrastructure. A day or two of delay is not necessarily a huge problem, but what often happens is that these problems never get addressed. Some other urgent bug comes up and you move on, promising to come back to it, and suddenly you've lost the ability to run your code in its entirety. Your IaC has just lost a lot of its value.
Imagine the equivalent in an application: a developer on your team emails you to say that "with my latest commit it's no longer possible to restart the application because we'll permanently lose customer data". I hope there are no teams out there that would accept this situation. IaC is no different. If you cannot delete your infrastructure and easily create it from scratch, it's broken.
It gets worse; a single issue like this will breed further issues. As a developer writing a new resource, you are no longer able to test a full recreate so you are unable to test if adding your latest resource has broken something else. About all you can be sure of is that applying your update works okay for the existing environments. This mentality spreads and the reality of your infrastructure drifts further and further from your code.
The fact that update operations often still mostly work even when create is broken exacerbates the problem. It lowers the priority of fixing the bugs in create and gives you false confidence that your IaC is still useful. After all, you're using it every day to apply automated updates!
Every time you layer another new change on top of an existing environment you're risking more drift. Often bugs are non-deterministic, which means that not only do your environments drift from the code, they also drift from each other. You've just created some snowflake environments.
As the number of issues goes up, the cost of creating an environment will go up as well, and with this the value of your existing environments will go up. Worst case, they end up as precious resources that are fiercely guarded by their owners. All this serves to increase the pressure when you inevitably need to create or restore an environment using an operation that you know is broken and you haven't tried for weeks or months.
By this time the developers have become more and more separated from the original, underlying problems. The cost to fix the problems goes up and up and the IaC becomes less and less trusted. Creating an environment becomes a huge manual task wrapped in documentation to coerce your code into a live environment. All that investment in IaC in the first place has been wasted.
Essentially the more difficult it is to create an environment, the less frequently you do it, which in turn makes it even more difficult. The whole process becomes a vicious cycle.
Often at this point there's talk of starting again and rewriting in a better tool or style. You'll read the blurb of a new IaC tool or service that will promise to fix all of your manual steps and make your life easy. Some tools, like Terraform, even promise to help you identify drift. The important part has been missed though: it's how you work that is important, not which tool you use.
Running your latest change against the infrastructure and verifying the output is a good test of your code, but as with application code it's important that you run the whole test suite, making sure you haven't introduced regressions elsewhere. Deleting, recreating and verifying your whole infrastructure acts as a full suite of regression tests.
There's a whole other conversation about how exactly you verify your infrastructure. There's a growing number of tools that allow you to run tests and checks on newly created environments, and these can certainly help. Keep it simple to begin with though; complex, automated verification is not required. The most cost and time effective verification you can do will be through deleting and recreating an environment that's actually in use.
Our environments haven't been deleted in months, is it too late?
Although the cost and effort to perform this activity will go up the longer you leave it, it's never too late to restore your IaC's accuracy. Though it's impossible to prescribe a universal method, as a rough guide some steps you might follow are:
- Identify a small section of your infrastructure with relatively few dependencies or moving parts,
- Review your code that defines this section,
- Delete the resources and attempt to recreate them using only your IaC,
- Deal with any issues and correct them in your IaC; don't just document workarounds,
- Once you're happy, protect your work. Schedule a nightly job to repeat the delete and recreate that you've just fixed,
- Expand the scope of your work to include the next section of infrastructure and repeat.
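The steps above can be sketched as a simple loop. This is only an illustration under stated assumptions: `sections` is a hypothetical map from each section of infrastructure to the sections it depends on, and `recreate` is a hypothetical callable standing in for "delete the section and rebuild it using only your IaC", returning `True` on success.

```python
def restore_trust(sections, recreate):
    """Work through sections from fewest dependencies to most.

    Stops at the first section that fails to recreate, so the problem is
    fixed in the IaC before the scope is expanded any further.
    """
    trusted = []
    order = sorted(sections, key=lambda name: (len(sections[name]), name))
    for name in order:
        if not recreate(name):
            break
        trusted.append(name)
    return trusted


# Hypothetical environment: the app section still relies on manual steps.
sections = {
    "app": {"network", "database"},
    "database": {"network"},
    "network": set(),
}

trusted = restore_trust(sections, recreate=lambda name: name != "app")
print(trusted)
```

Each name that ends up in `trusted` is a candidate for the nightly job, so the ground you have recovered stays recovered while you work on the next section.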
For large environments this will no doubt be a big task that you may never finish but even a partial improvement is not wasted. Every time you fix something or prove that your IaC works for another piece you increase your confidence in your tooling and ability to recover from partial outages. You also identify common mistakes that might be present elsewhere and will learn how to avoid them in new code.
What about downtime in production?
Fully deleting and recreating your production infrastructure is no doubt a radical step. No matter how good your IaC is, this sort of action will take time: for your instances to start, your database to restore from backup and your caches to fill. Essentially, it's a quick way to annoy your customers.
There are alternatives to a full delete and recreate though. Over the last few years software architecture has transitioned from monolith to modular, particularly with microservice and event-driven architectures. There's no reason your IaC should not mirror this, allowing you to tear down small subsets of your infrastructure for a small amount of time and recreate them. This also starts to touch on the world of chaos engineering and has the added benefit of helping you to understand the dependencies and resilience of your architecture.
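Before tearing down a subset, it helps to know what else will notice. The sketch below is one possible way to work that out, assuming you can describe your modules as a map from each component to the components it depends on. The `affected_by` helper and the component names are made up for illustration.

```python
def affected_by(target, depends_on):
    """Return every component that depends on target, directly or transitively."""
    affected = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for name, deps in depends_on.items():
            if current in deps and name not in affected:
                affected.add(name)
                frontier.append(name)
    return sorted(affected)


# Hypothetical dependency map for a small event-driven system.
depends_on = {
    "queue": set(),
    "billing-service": {"queue"},
    "invoice-emailer": {"billing-service"},
    "search-service": set(),
}

print(affected_by("queue", depends_on))
```

A component with nothing depending on it, like `search-service` here, is a sensible first candidate for a short delete and recreate in production.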
This is not to say that you should never try to recreate your whole production environment. As discussed in my last post, Complexity in Infrastructure as Code, there will always be differences between your environments, so you should make sure you test every case, not just the easy ones.
There's definitely no simple, universal strategy here, but the effort involved in planning and assessing this is a useful exercise in itself. Planning to delete your production environment is also something that can be tied into DR planning. For example, you could investigate creating a whole new production replica in a different cloud region, account or even service, then fail over before deleting the original.
Summary
Here in the OVO production engineering team we're firm believers in IaC and the benefits it brings to the quality and reliability of infrastructure. But as with all technology you can't just add it once and expect it to keep delivering value. You need to understand and utilise the benefits it brings you, continually assessing to make sure you're getting the best value you can from it.
Infrastructure as code is a big investment, and it is important that you do not waste it. Deleting and recreating your environments is a very important part of testing that your IaC stays accurate and useful and that your investment is being protected.
I'm a member of the Production Engineering team here at OVO. You can read more about what we do and how we work, and check our vacancies page.