We recently completed a sizable piece of work to roll out a brand new service, the PSR (Priority Service Register) service. This post will go into how the service was designed, rolled out, and lessons learned from the process.
This post is written by and targeted at developers. It is helpful but not essential to have some knowledge of the following:
- Cloud service providers (AWS, GCP...)
- Version control
Firstly, a bit of background.
What is the Priority Service Register?
First things first, what is the Priority Service Register, and why do OVO need a service to support it?
The Priority Service Register is a register for vulnerable customers, so that they can be supported in vulnerable situations. For example, a person with electrical medical equipment may need knowledge of planned outages.
All energy suppliers are required to provide this service. You can find out more about the Priority Service Register here.
Why did we need a new service?
We already provide a PSR service to our customers. However, we are in the process of migrating our members to a new platform that allows us to better serve them. New platforms are built iteratively, and so to enable the migration of all customers to the new platform, we would need to provide the PSR on this new platform.
The existing service could be used to provision customers on the PSR, but a longer term solution was required. Furthermore, because the existing service was tied into our CRM platform, we wanted to create the service in a more robust and scalable environment.
Architecting a microservice
These were the high-level requirements for the service:
- Provide a means for customers to be registered to the PSR
- Provide a means for arbitrary teams to access customer PSR data
- Support existing data sharing processes, which export customer data to third-parties, to enable service providers outside of OVO to support the customer, e.g. local councils
- Enable our data teams to report on customers on the PSR
From these requirements, we structured the following high-level architecture:
- Provide a POST API for adding people to the PSR
- Provide a GET API for retrieving customer PSR information
- Publish events when a customer is added to the PSR, or if their existing PSR entry changes (behind the scenes, this would fulfill the requirement to enable reporting, as events are automatically consumed and stored in Google BigQuery)
- Run a scheduled cron job weekly to export customer data to industry
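As a rough illustration of the first two points, a POST and GET API for PSR entries could be sketched in Go as below. The `Entry` fields, the `/psr` path, the `customerId` query parameter, and the in-memory store are all assumptions made for the example; the real schema and storage are not shown in this post.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// Entry is a hypothetical PSR record; field names are assumed.
type Entry struct {
	CustomerID string   `json:"customerId"`
	Codes      []string `json:"codes"` // priority service codes
}

// Store is an in-memory stand-in for the real datastore.
type Store struct {
	mu      sync.RWMutex
	entries map[string]Entry
}

func NewStore() *Store { return &Store{entries: make(map[string]Entry)} }

// Handler serves POST /psr (create or update) and
// GET /psr?customerId=... (retrieve).
func (s *Store) Handler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/psr", func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodPost:
			var e Entry
			if err := json.NewDecoder(r.Body).Decode(&e); err != nil || e.CustomerID == "" {
				http.Error(w, "invalid PSR entry", http.StatusBadRequest)
				return
			}
			s.mu.Lock()
			s.entries[e.CustomerID] = e
			s.mu.Unlock()
			w.WriteHeader(http.StatusCreated)
		case http.MethodGet:
			s.mu.RLock()
			e, ok := s.entries[r.URL.Query().Get("customerId")]
			s.mu.RUnlock()
			if !ok {
				http.NotFound(w, r)
				return
			}
			w.Header().Set("Content-Type", "application/json")
			json.NewEncoder(w).Encode(e)
		default:
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		}
	})
	return mux
}

// do sends an in-process request and returns the status code and body.
func do(h http.Handler, method, target, body string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, target, strings.NewReader(body)))
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	h := NewStore().Handler()
	code, _ := do(h, http.MethodPost, "/psr", `{"customerId":"c-1","codes":["example-code"]}`)
	fmt.Println("POST:", code)
	code, body := do(h, http.MethodGet, "/psr?customerId=c-1", "")
	fmt.Println("GET:", code, body)
}
```

Keeping the handler behind `http.Handler` means it can be exercised in-process, as `main` does here, without binding a port.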
One of the key objectives for the developers was to make this as maintainable as possible, whilst also allowing this to be packaged as a service that could be deployed by anyone.
On maintainability, we decided to develop the PSR service in monorepo format, whilst breaking the separate components into their own microservices. A monorepo is a single repository containing multiple related parts. For the PSR service this makes perfect sense, as it allows us to package the service as one entity, without adding unnecessary complexity and coupling in the separate parts of said service.
The individual components were:
psr-api, which handles creation and retrieval of PSR data
psr-data-sink-2, which ingests messages from Kafka to supply customer information in the creation of industry exports. These were split by "domain", from our two customer platforms
psr-reads-export, which submits data to industry on which customers require the regular meter reading service
psr-codes-export, which submits data on which priority services a customer needs in general
psr-migration, which handles cases where a customer is migrated between our two billing platforms, or back again, to ensure their PSR data is mapped correctly
Most of these components shared data, for example the schema for a PSR. Rather than define this everywhere and keep the copies in sync across the microservices, it made sense to reference it in one place. We have a core Golang library that we also reference, and found that with Golang's "early-access" module support, constantly having to keep this up to date caused many issues.
The other benefit of this approach was in the CI pipeline. At one point, we had 5 or 6 almost identical CircleCI configurations amongst the microservices, as well as separate-yet-similar Helm, Terraform, and Docker configurations. Combining these all into one yielded much less duplicated boilerplate, and an ease in rolling out any improvements or patches to these components. Given the individual parts wouldn't provide much value without one another, it made sense to deploy them "as one".
On deployment by anyone, we wanted our architecting process to be centered around the principle that any other team could pick up, understand, and deploy the service with little hassle. For this, we would use a combination of Terraform and Helm to automatically provision the service for whoever wanted to pick it up - more details on these later.
As for platform, we chose Google Cloud Platform, and specifically its Kubernetes solution. This would allow us to create Dockerised images that we could deploy, the added benefit being that by creating Docker containers as the output of our build process, we could package our services in a manner that makes them transferable should we ever need to change to a different platform.
The final key decision would be which language to create the microservice in. Our one hard requirement was that it had to be supported on GCP. We decided on Golang, as it was a language we'd had some experience with, was well supported by Google, and gave us everything we needed.
Infrastructure as code
One of the first steps we take when spinning up a new service is to define the infrastructure, and "scaffolding" around the service we're building. If you are creating a service that could be spun up by any team without them needing to know step-by-step setups for each tiny component, it makes most sense to define your infrastructure as code, or IaC.
You can imagine IaC as you defining what you want, then giving it to some provider to do all the legwork for you. A good IaC provider will handle creation from scratch, incremental upgrades, and teardown of your services.
IaC can go beyond just defining infrastructure for individual services, and could be used to define all of the infrastructure for your organisation. Infrastructure as complex as that of Facebook or Amazon.com could in principle be defined entirely using IaC - and large parts of such systems typically are.
IaC provides a number of other benefits:
- The service can be replicated with an exact configuration any time it is deployed
- The provisioning code can be version controlled with (or separate to) the main service repository
- The service is documented and understandable by anyone who can understand the configuration files - that said, it should by no means be your only source of documentation!
- Should your deployment break or become lost or destroyed, it can be rebuilt quickly
For us, that came in two parts.
Helm for Kubernetes
Our target system, Kubernetes, has many working parts, the scope of which is well beyond this post. Consider that you need at least a cluster and some nodes. Consider further that a complex application with REST endpoints, communication with an event bus, and cron jobs running what are essentially lambdas brings a significant number of moving parts of its own.
We used Helm to define the infrastructure we wanted to set up in Kubernetes. Helm allows you to define things such as containers, pods, and cron schedules. Coupled with this are your configuration values, which can be defined separately for production and development environments. And because a few of our microservices were very similar (for example we had two event consumers), once you've written one set of Helm files, it's simple to port them for reuse elsewhere.
Helm allows you to automatically tag components with the current release. This allows us to publish images with specific tags. We tied this into our CI pipeline, so that we can automatically point Kubernetes to the latest released image. This isn't best practice, as it continuously re-tags different images with the same label, in this case "LATEST". Ideally you'd tag each image with its Git commit hash, or some other unique identifier, to allow rollback and auditing. Lessons learned!
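To make the tagging point concrete, here is a minimal Go sketch of a helper that builds an image reference from a Git commit hash and refuses mutable tags. The function name and the repository path are made up for the example, and this is not how our pipeline was actually wired.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// shaPattern matches an abbreviated or full Git commit hash
// (7 to 40 lowercase hex characters).
var shaPattern = regexp.MustCompile(`^[0-9a-f]{7,40}$`)

// ImageRef builds an image reference tagged with a Git commit hash.
// Mutable tags such as "latest" are rejected because they do not
// match the hash pattern.
func ImageRef(repository, commit string) (string, error) {
	if repository == "" {
		return "", errors.New("repository must not be empty")
	}
	if !shaPattern.MatchString(commit) {
		return "", fmt.Errorf("tag %q is not a Git commit hash", commit)
	}
	return repository + ":" + commit, nil
}

func main() {
	// The repository path below is a hypothetical example.
	ref, err := ImageRef("gcr.io/example-project/psr-api", "3f2a9c1")
	fmt.Println(ref, err)

	_, err = ImageRef("gcr.io/example-project/psr-api", "latest")
	fmt.Println(err)
}
```

Because every build then produces a distinct reference, rolling back is a matter of pointing the deployment at an earlier hash.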
As an example, this is what the Helm charts looked like for one of the microservices, which defined the REST API endpoints for retrieving or creating PSRs:
configmap.yaml - defines the dynamic values we want to configure in other Helm files, e.g. environment name
deployment.yaml - defines the high-level deployment, including the Docker image the container would use, as well as health and liveness checks
horizontal_pod_autoscaler.yaml - defines how pods should autoscale to handle extra demand
schema_registry.yaml - defines a secret, allowing access to our schema registry for events
service.yaml - defines how the service should be exposed to the world: ports and protocols
service_account.yaml - defines a service account that can interact with GCP
Once values are passed in, it's really as simple as building our Docker image, pushing it to Google's container registry, referencing it in Helm, then running a couple of Helm commands to deploy it. We take this one step further by defining these steps in CircleCI, which also allows us to plug into our version control and release processes.
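The health and liveness checks referenced in deployment.yaml need something to call on the service side. A minimal Go sketch of such endpoints follows; the `/healthz` and `/readyz` paths are a common convention assumed here, and the `Health` type is hypothetical rather than taken from the service.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Health exposes liveness and readiness endpoints.
type Health struct {
	ready int32 // 1 when ready; accessed atomically
}

// SetReady records whether dependencies (datastore, event bus, ...)
// are currently usable.
func (h *Health) SetReady(ok bool) {
	var v int32
	if ok {
		v = 1
	}
	atomic.StoreInt32(&h.ready, v)
}

func (h *Health) Handler() http.Handler {
	mux := http.NewServeMux()
	// Liveness: the process is up and able to serve HTTP.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: only report OK once dependencies are available.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if atomic.LoadInt32(&h.ready) == 0 {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// Probe issues an in-process GET and returns the status code.
func Probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := &Health{}
	handler := h.Handler()
	fmt.Println(Probe(handler, "/healthz"), Probe(handler, "/readyz")) // 200 503
	h.SetReady(true)
	fmt.Println(Probe(handler, "/readyz")) // 200
}
```

Separating the two lets Kubernetes restart a container that has hung while merely withholding traffic from one that is still starting up.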
The infrastructure for the application is defined. But we can take this further.
Helm allows us to define and create our infrastructure specifically in Kubernetes. Terraform allows us to define and create our infrastructure almost anywhere. In fact, you can even Terraform your Helm release, as we did in this service.
Terraform can go as far as defining the architecture and services for an entire organisation. Imagine the steps it would take to terraform a planet - stabilising the atmosphere, creating a balanced ecosystem, creating water, plant-life, and so on; Terraform itself has the power to do that for your cloud infrastructure.
As I'm raving perhaps a bit too hard about it, disclaimer - other IaC tools do exist. We just happen to like Terraform.
This is where having the application logic defined in a monorepo comes in handy. At the base of our monorepo, we defined a single Terraform folder, containing Terraform files for the parts we cared about. To demonstrate its usefulness, here's a rundown of the files we defined, and their purpose:
variables.tf - defines configurable variables, whose values are set by per-environment .tfvars files (such as nonprod.tfvars) for the production and UAT environments
google_dns.tf - defines our desired DNS configuration, including the DNS mapping for our service
google_kms.tf - defines our key management settings
google_monitoring.tf - defines how we want logging and monitoring set up
google_storage_bucket.tf - defines the storage buckets we need
helm_release.tf - defines our Helm release and deployment
main.tf - ties everything together, defining high-level concepts such as our cloud platform provider, in this instance GCP
pagerduty.tf - defines how things should be alerted to team members on call, to respond to urgent issues with the service
With all this defined, when terraform plan is run, it will show exactly what Terraform plans to create or modify (or destroy, though we don't run this). Running terraform apply has Terraform carry out the plan, automatically provisioning your requested services. Terraform maintains the current state of the system, allowing it to act appropriately, only performing the required actions.
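The idea of comparing recorded state with the desired configuration, and acting only on the difference, can be sketched in a few lines of Go. This is a toy model to illustrate the concept, not how Terraform is implemented; the `Action` type, the `Plan` function, and the resource names are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Action describes one step of a plan.
type Action struct {
	Op       string // "create", "update" or "delete"
	Resource string
}

// Plan compares recorded state with the desired configuration and
// returns only the changes needed. Both maps go from a resource name
// to a string describing its configuration.
func Plan(current, desired map[string]string) []Action {
	var actions []Action
	for name, want := range desired {
		have, exists := current[name]
		switch {
		case !exists:
			actions = append(actions, Action{Op: "create", Resource: name})
		case have != want:
			actions = append(actions, Action{Op: "update", Resource: name})
		}
	}
	for name := range current {
		if _, wanted := desired[name]; !wanted {
			actions = append(actions, Action{Op: "delete", Resource: name})
		}
	}
	// Sort by resource name, since map iteration order is not stable.
	sort.Slice(actions, func(i, j int) bool {
		return actions[i].Resource < actions[j].Resource
	})
	return actions
}

func main() {
	// Resource names and settings are invented for the example.
	current := map[string]string{
		"bucket.exports": "location=EU",
		"dns.psr":        "ttl=300",
	}
	desired := map[string]string{
		"bucket.exports": "location=EU",
		"dns.psr":        "ttl=60",
		"kms.keyring":    "rotation=90d",
	}
	for _, a := range Plan(current, desired) {
		fmt.Println(a.Op, a.Resource)
	}
	// Prints:
	// update dns.psr
	// create kms.keyring
}
```

The unchanged bucket produces no action at all, which is the property that makes repeated runs safe.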
Terraform is not a magic solution for spinning up services instantly, but it is very powerful. One thing it can't do is write your services for you.
Golang is a relatively new programming language to our team, and yet has been quickly adopted as our language of choice.
Golang promotes a function-first, procedural approach rather than classical object orientation, though "objects" are supported through structs and methods. Typically you will end up with some hybrid of the two, passing simple struct pointers around to mutate state. One of the downsides of this hybrid approach is that no two libraries seem to do things the same way. Some expect you to pass pointers to empty slices (Go's growable array type), while some will require a slice with the correct length pre-allocated. We found this while working with one particular library, which caused a lot of stress, particularly as the functions were not well documented.
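The two slice conventions are easy to mix up, so here is a small self-contained Go example of each. The function names are made up for illustration and do not come from the library mentioned above.

```go
package main

import "fmt"

// AppendInto follows the "pointer to a slice" style: the callee grows
// the caller's slice, so an empty (even nil) slice is a valid argument.
func AppendInto(dst *[]int, n int) {
	for i := 0; i < n; i++ {
		*dst = append(*dst, i*i)
	}
}

// FillInto follows the "pre-sized slice" style: the callee only writes
// to existing elements and reports how many it filled. A nil or empty
// slice receives nothing, which is easy to miss when docs are thin.
func FillInto(dst []int) int {
	for i := range dst {
		dst[i] = i * i
	}
	return len(dst)
}

func main() {
	var grown []int
	AppendInto(&grown, 3)
	fmt.Println(grown) // [0 1 4]

	var empty []int
	fmt.Println(FillInto(empty), empty) // 0 []

	sized := make([]int, 3)
	fmt.Println(FillInto(sized), sized) // 3 [0 1 4]
}
```

Passing an empty slice to a function written in the second style fails silently rather than loudly, which is why the mismatch can take a while to track down.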
Yes, it takes a while to get used to these nuances. But once you're there, it's really simple to spin things up quickly. Go does things its way, and it forces you to do it too. You can't argue that this doesn't create consistency in code, as well as enforcing solid design principles, but we've had a few cases of "that's a stupid way of doing things" uttered on the team (toned down from what we actually said).
As a team, we'd actually built up a "core library" of common functionality we could reuse across services. The benefits of this approach are files with good test coverage, a significant reduction in boilerplate, and a standardised way of approaching problems. With this core, we were able to use a standard set of tools for accessing the likes of Datastore and Apache Kafka. As a side-effect, team members were able to work on separate parts of the service, port code into our core library where necessary, then share knowledge with other developers. This meant we only ever needed to solve a problem once, and didn't need to refactor the same boilerplate logic across multiple subrepos whenever a minor issue was solved or performance improvement made.
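A sketch of what one small piece of such a core library might look like: a `Publisher` interface that service code depends on, with an in-memory implementation for tests. Everything here is hypothetical, including the interface, the `PSRChanged` event shape, the topic name, and the use of JSON; a real implementation would wrap a Kafka client and the schema registry, which are left out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Publisher is a hypothetical core-library interface. Services depend
// on this rather than on a specific messaging client.
type Publisher interface {
	Publish(topic string, payload []byte) error
}

// MemoryPublisher records messages in memory, for tests and local runs.
type MemoryPublisher struct {
	Sent map[string][][]byte
}

func NewMemoryPublisher() *MemoryPublisher {
	return &MemoryPublisher{Sent: make(map[string][][]byte)}
}

func (m *MemoryPublisher) Publish(topic string, payload []byte) error {
	m.Sent[topic] = append(m.Sent[topic], payload)
	return nil
}

// PSRChanged is an assumed event shape, encoded as JSON for simplicity.
type PSRChanged struct {
	CustomerID string   `json:"customerId"`
	Codes      []string `json:"codes"`
}

// NotifyChange validates, serialises, and publishes the event.
func NotifyChange(p Publisher, topic string, ev PSRChanged) error {
	if ev.CustomerID == "" {
		return errors.New("event is missing a customer ID")
	}
	b, err := json.Marshal(ev)
	if err != nil {
		return fmt.Errorf("encoding event: %w", err)
	}
	return p.Publish(topic, b)
}

func main() {
	pub := NewMemoryPublisher()
	ev := PSRChanged{CustomerID: "c-1", Codes: []string{"example-code"}}
	err := NotifyChange(pub, "psr-changed", ev)
	fmt.Println(err, len(pub.Sent["psr-changed"])) // <nil> 1
}
```

Solving the problem once behind an interface like this is what lets each microservice reuse the same tested code path instead of carrying its own copy.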
Go has mature libraries for most things Google, including GCP, so we were mostly covered and able to use Google-developed packages for the majority of tasks. We also used open-source libraries from the likes of Uber and LinkedIn. OVO itself runs regular open-source days to help contribute to these projects, and the usefulness and power of the open-source community is something we endeavor to be a part of.
Go isn't the finished package. It is missing a few features - for example, its module support is limited and considered beta right now, and generics are a long-touted missing feature. If you were thinking of giving Go a... go, I can recommend A Tour Of Go, which takes you through all the essentials, whilst also providing a playground to mess around in as you learn.
Lessons learnt and "do's and don'ts"
The project in total lasted one quarter, and was a great success for our team, having come from relative noobs in the IaC and Golang realms. And it's worth noting off the bat - we could have done this any way we wanted. One of the most fantastic things about working at OVO is the freedom to do things the way you want. We weren't forced to use a particular language, and we as a team were responsible for ensuring the product we delivered was quality.
With that said, here are a few lessons learnt from the project:
- Do - Design the system at the start. This was a big project, and developers at OVO aren't "specialists", so there isn't a technical architect to lead on this. With a complex system, it's important early on to visualise all the parts you will need, especially dependencies and shared resources.
- Do - Pick a language the team is comfortable with. Anyone on the team could have vetoed Golang early on if they didn't like it, or didn't feel confident enough using it. None of us were experts, but we committed to using Golang from the off, having done enough research and having enough experience in the team to be confident in it being able to do what we needed.
- Do - Read the manual. One of the reasons for choosing the technologies that we did was the quality of the documentation. Nothing is perfect, but Terraform, Helm, and Golang are all mature and well-documented enough for an inexperienced team to get up-and-running with. There were some Golang modules we binned off for being poorly explained and confusing, so if something is poorly documented, chances are it's not very good!
- Do - Maintain a solid structure in your monorepo. Honestly, our CircleCI file is long, and the Dockerfile for each microservice is almost identical. Tech debt happens, and we'll look at ways to improve this. Despite this, the structure we had, both in terms of file structure and the structure of our Terraform and Helm configurations, was well defined and easy to maintain. Each sub-repo looked similar, which made them easy to browse, and reduced build errors by providing a consistent "template".
- Don't - Treat IaC as a silver bullet. IaC is powerful, and it's well worth knowing about and learning how to use. But it can't do everything, and it won't suddenly cut your project delivery time in half. Done wrong, it can cause a lot of hassle, which goes back to "read the manual".
- Don't - Put everything in separate repos if they're related or coupled. For a service such as this, it made sense to put everything in one monorepo. But one thing I didn't mention (deliberately or otherwise), is that the decision to use a monorepo came near the middle of the project, once we'd suffered the pain of creating Terraform configurations and CI pipelines for the 6 independent microservices. You can save a lot of time by bundling the boilerplate into one place.
- Don't - Neglect documentation. IaC is NOT a replacement for documentation. It should supplement existing, human-readable documentation. What it does document are the guts of the service, the bits most users aren't going to care about unless they need to maintain it. Even then, don't assume the structure you've created is easy to understand to someone completely new to the project. We had the philosophy of "what if we had to hand this over to another team to manage?" from the start, which helped drive our development processes.
This has been a review of one particular monorepo project at OVO, and has hopefully provided some insight and starting blocks for those looking to better architect their cloud solutions.
We are constantly learning and trying new technologies at OVO. The lessons learned and retrospective above are from a team of developers who started out with little to no knowledge of creating a project of this scale and complexity with the tools we wanted to use. As such, this is by no means a "definitive guide", but hopefully helps avoid at least one or two gotchas.
As I've stated throughout, none of the solutions provided above are silver bullets. The team sat down and refined the requirements of the service early on to determine what we would need to build. The decision to create a monorepo came after experiencing pain with multiple separate repositories. The key takeaway is to always use the right tools for the job; it may sound obvious, but it's surprising how often people stick to something out of familiarity or loyalty.
All in all, the tools we used above are highly recommended, and resulted in a service which is live today, delivered on time, and is robust and trustworthy. Here's to the next one!
Gopher image by Renee French, licensed under the Creative Commons Attribution 3.0 license.