Speed is the ultimate weapon. At Flex, we need to move fast and not break things, because we deliver critical services to our customers - home heating and electric vehicle charging. Our opponent here is climate change and we know we need to win.
In the Assets team we are a mix of data scientists and engineers working on the backend of the platform that controls these devices while delivering energy flexibility services.
We want to share our experience of how using shadow deployments has increased our velocity and enabled us to iterate quickly in a highly complex data environment.
Emma-Ashley - Engineer: Follow me back to the halcyon days of December 2019 when we were all five years younger. We had a mountain to climb in the Assets team, with major changes to make to our device optimisation pipeline.
Releases of critical services in this Kafka-based pipeline were nerve-racking, requiring all hands on deck to ensure no customer disruption. Sure, we have dev environments with a series of mock batteries to simulate data from the devices, but it is almost impossible to make these truly reflect production traffic.
The mountain came in the form of a Kafka schema change that would affect our entire pipeline. We needed a way to make this change while ensuring no disruption to customers. We also needed it quickly.
What was the solution?
We would make each service publish to two topics - the existing 'live' topic serving customers and a 'shadow' topic that would include the change. We could then deploy a 'shadow' version of the next service in the pipeline that would read from the shadow topic.
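Schematically, the dual publish might look like the sketch below. The environment variable names (`LIVE_TOPIC`, `SHADOW_TOPIC`) and the kafka-python client are illustrative assumptions, not our exact service code:

```python
import os


def output_topics() -> list[str]:
    """Topics a service publishes to: always the live topic, plus a
    shadow topic whenever one is configured via environment variable.
    (Variable names here are hypothetical, for illustration.)"""
    topics = [os.environ["LIVE_TOPIC"]]
    shadow = os.environ.get("SHADOW_TOPIC")
    if shadow:
        topics.append(shadow)
    return topics


def publish(producer, payload: bytes) -> None:
    """Send the same payload to every configured topic, so the shadow
    version of the downstream service sees exactly the live traffic."""
    for topic in output_topics():
        producer.send(topic, payload)


if __name__ == "__main__":
    # Requires a running broker; kafka-python's KafkaProducer is one option.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=os.environ.get("KAFKA_BROKERS", "localhost:9092")
    )
    publish(producer, b'{"device_id": "ev-123", "state_of_charge": 0.8}')
```

When no shadow topic is configured, the service behaves exactly as before, which keeps the change low-risk for services not under test.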
This lets us find issues that might arise from the change, using real data but without jeopardising customer comfort.
Although it required some initial investment, our shadow deployment paid for itself immediately, catching multiple errors that - had we released - would have caused major system outages, lots of unhappy customers and time spent on rollbacks and post-mortems. Armed with shadow deployment, we were able to deploy and test bug fixes quickly and safely away from the fire of a live incident, releasing to production only when we were confident that our iteration was solid.
How did we do this?
Tomasz - Engineer: When I was a child riding in the car, I often imagined that I was the one driving, not my parents. Sure, there was an adult behind the wheel, but I could make a decision and see whether theirs was the same. I had no impact on the car, yet I could test whether I still remembered the right path to my grandparents. This is our shadow deployment: the original service is still driving, while the same service with changes runs alongside as a shadow.
Let’s see how we deploy it. Our topics, both for consuming and producing, come from environment variables; the same applies to the consumer group. This means we can deploy a service with a different configuration. Kafka uses a consumer group to deliver all messages to consumers working together, distributing the workload between them. With a different consumer group, our shadow deployment reads every message without taking any away from the production service.
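The shadow consumer can therefore run the same code with nothing but different environment variables. A minimal sketch, assuming kafka-python and hypothetical variable names (`INPUT_TOPIC`, `CONSUMER_GROUP`):

```python
import os


def consumer_settings() -> dict:
    """Consumer configuration drawn entirely from the environment, so the
    same image can run as either the live service or its shadow.
    (Variable names are illustrative, not our production config.)"""
    return {
        "topic": os.environ["INPUT_TOPIC"],
        # A distinct group id gives the shadow its own copy of the stream;
        # Kafka tracks offsets per group, so the live service loses nothing.
        "group_id": os.environ["CONSUMER_GROUP"],
    }


if __name__ == "__main__":
    # Requires a running broker.
    from kafka import KafkaConsumer

    cfg = consumer_settings()
    consumer = KafkaConsumer(
        cfg["topic"],
        group_id=cfg["group_id"],
        bootstrap_servers=os.environ.get("KAFKA_BROKERS", "localhost:9092"),
    )
    for message in consumer:
        ...  # run the changed optimisation logic on real traffic
```

Setting `CONSUMER_GROUP=optimiser-shadow` instead of the production group is all it takes to switch a deployment into shadow mode.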
We are using Helm and Kubernetes, which means we define a template for our deployment; the only differences are the release name and environment variables. That means once it’s configured, we don’t have to worry about our shadow deployment diverging from production.
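In practice this boils down to an alternative values file passed to the same chart. The release name, topic names, and keys below are hypothetical, just to show the shape of the override:

```yaml
# values-shadow.yaml -- illustrative overrides for a shadow release
# installed with something like: helm install optimiser-shadow ./chart -f values-shadow.yaml
env:
  INPUT_TOPIC: device-telemetry        # same input as production
  OUTPUT_TOPIC: optimisation-shadow    # writes to the shadow topic
  CONSUMER_GROUP: optimiser-shadow     # own group => full copy of the stream
```

Because both releases render from one template, any change to the chart applies to production and shadow alike.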
All good so far, but what if we have an urgent bug to fix in production while a shadow is released? Not a problem. Our main branch keeps the production code; when we deploy a shadow, we do it from the shadow branch. We merge the PR only once we have proved everything is working.
To build and deploy our changes we use CircleCI, which supports filtering on branch names, so the shadow deployment flow is active only on the shadow branch. This lets us deploy the shadow to dev and production environments. Once we are happy with it, a final step removes the deployment from Kubernetes; it doesn’t make sense to use resources once we have deployed the latest version as the production service.
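The branch filter is a standard CircleCI workflow feature. A minimal fragment (job names are made up for illustration):

```yaml
# .circleci/config.yml (excerpt) -- illustrative job names
workflows:
  shadow-deploy:
    jobs:
      - deploy-shadow:
          filters:
            branches:
              only: shadow   # shadow flow runs only on the shadow branch
```

The production workflow carries the mirror-image filter on main, so the two release paths never overlap.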
How did this speed us up?
One year later and our shadow deployments have accelerated us more than we imagined.
We found they are especially valuable for mitigating risk in data science endeavours.
Ioanna - Data scientist: Working on user-facing data science products is exciting and always extremely insightful. Our models interact with customer data and make informed decisions for the benefit of the customer. Shadow deployments are a way to test our models against real-time data, including all the anomalies that can bring, and to evaluate their performance without risking customer comfort. They help us make data-driven decisions on when and how a model should be released to production.
Sukhil - Data Scientist: Shadow deployments also help us implement services that learn from users’ past behaviour and forecast into the future, such as predicting the state of charge of an EV when it is first plugged in. This helps us plan and optimise how much energy users need. Running real data through such a service without it altering our decision-making lets us monitor model accuracy before release. This is invaluable for an experimental machine learning pipeline: it greatly reduces the risk an inaccurate model poses to our system, allowing quicker further iteration.
Is it all perfect?
Tomasz: While a shadow deployment allows us to test changes safely, there are still things to consider.
Each shadow deployment adds load on the Kafka broker; it’s important we delete it as soon as we have validated our changes.
Shadow deployments cannot influence incoming data, so we cannot use them to test algorithms whose decisions would change that data.
In the future we would like to introduce A/B testing to our services to control a fraction of our devices with new code. A/B testing and shadow deployments focus on two different aspects, so it will not replace the shadows, but give us another tool in our toolbox.
We don’t use shadow deployments for every change. For simple changes it is unnecessary. We use it only to mitigate large risks.
Shadow deployment may sound like it slows our releases a little in exchange for safety. However, the time spent braking for the corner is repaid in fewer bugs to fix and faster feedback, accelerating us out the other side. If our services did not affect customers’ mobility and heat, we probably wouldn’t have to focus so much on making our deployments safe.
Start your engines - Some closing thoughts from our product manager.
Flexibility is going to play a critical role in the transition to zero carbon energy. As more and more households switch to electric heating and electric vehicles, we will increasingly rely on flexibility services to help balance the grid and take advantage of renewable energy. Through delivering domestic flexibility services, we also have an opportunity to reduce the cost of ownership of electric devices, which can help to further accelerate progress towards net zero.
At Kaluza we are building the intelligent energy platform of the future, and we need to move fast every day. The landscape of zero carbon energy is highly complex and emergent. As more and more renewables and electric devices come online, and new flexibility markets open up, it is critical that we are able to evolve and iterate on our platform at pace, and without disrupting customers who rely on us for their electric heat and transportation.
Shadow deployments enable us to rapidly test changes to our platform using live data and with zero impact on our customers. This capability empowers us to test ideas quickly and to gather meaningful data that we can use to drive the development of our platform. We have a huge mountain to climb in order to decarbonise our lives, and with every single shadow deployment, we are ultimately trying to get one step closer to a zero carbon future.