OVO Tech Blog

Eating the data elephant

Introduction

Katie Russell

Data Platforms, Data Science and Analytics @ OVO Energy


data analytics gcp kafka

Posted by Katie Russell on .

OVO has acquired SSE Retail Services, expanding our customer base from 1.5m to 5m. As we consolidate onto single platforms, we face multiple technical challenges. We will closely safeguard the customer experience and the security of our customers’ data. We’ll also need to move large volumes of data, which is a challenge in its own right and the topic of this post.

Here is OVO’s data migration backstory and our plan for success with SSE.

Three years ago, we were scheduled to migrate thousands of OVO customers to our new in-house energy management platform within months. Customer data would then be split across source systems. Without intervention, this would have broken most business decisioning and compliance processes, or left dozens, if not hundreds, of people combining data by hand.


To give you an idea of scale, back then we had around 10 teams running ~40 live services, and the data in our data centres was growing by 6 TB annually.


To solve this ‘split data problem’, we could either merge migrated customers' data “backwards” into our existing reporting data warehouse (let’s call this OldDB), or merge data for customers not yet migrated “forwards” with the datasets in Google BigQuery (NewDB).


Data for our new in-house energy management platform is published to Kafka. We use Kafka Connect to ingest this in real-time, and make it available more broadly via Google’s BigQuery.

MIS (our OldDB) is OVO’s existing data warehouse, an on-prem Microsoft SQL Server that ingests data from OVO’s existing billing systems and agent platform, Gentrack and Salesforce.


We decided to merge forwards. This brought the added benefits of the cloud and of BigQuery’s large-scale SQL engine, and let us create combined views across all our customer data. These views would enable reporting and ensure that analysts didn’t have to combine data by hand.
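To make the idea of a combined view concrete, here is a minimal sketch. It uses Python's built-in sqlite3 purely as a stand-in for BigQuery, and the table names, columns and rows are all hypothetical rather than our actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical: customers still on the legacy systems, mirrored forward.
    CREATE TABLE legacy_customers (account_id TEXT, name TEXT, tariff TEXT);
    -- Hypothetical: customers already migrated to the new platform.
    CREATE TABLE platform_customers (account_id TEXT, name TEXT, tariff TEXT);

    INSERT INTO legacy_customers VALUES ('A1', 'Ada', 'Fixed'), ('A2', 'Bob', 'Variable');
    INSERT INTO platform_customers VALUES ('A3', 'Cy', 'Fixed');

    -- One view over both sources, tagged with where each row came from.
    CREATE VIEW all_customers AS
        SELECT account_id, name, tariff, 'legacy' AS source_system FROM legacy_customers
        UNION ALL
        SELECT account_id, name, tariff, 'platform' AS source_system FROM platform_customers;
""")


def customers_by_tariff(connection):
    """Count customers per tariff across both source systems."""
    rows = connection.execute(
        "SELECT tariff, COUNT(*) FROM all_customers GROUP BY tariff ORDER BY tariff"
    ).fetchall()
    return dict(rows)


tariff_counts = customers_by_tariff(conn)
```

An analyst querying `all_customers` never needs to know which system a customer currently lives in, which is the point of merging forwards.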

There have been a lot of learnings since then; we’ll focus on two here. As above, OldDB and NewDB refer to the old and new data stores respectively.

Learning One: the split data problem isn’t confined to decisioning and compliance, crucial though these are. It applies to pretty much every function that makes a retail business work, such as marketing, operations and customer care.

With hindsight, we should have known this. A data-driven business like OVO first demands high-fidelity reporting. Then it seeks opportunities to operationalise outcomes from that reported data. In OVO, we had already found that whenever our OldDB failed to complete a daily refresh, several notable business processes were impacted. It was inevitable that services would have been built off OldDB, because it was the only place to get good, clean, joined data. It just wasn’t designed for these dependencies.

In taking the decision to merge the customer data into NewDB, we evolved, but didn’t solve, this problem. We’d created clean, joined data for our entire customer base, regardless of source system, underpinning reporting and analytics, but we couldn’t prevent costly integrations being built on NewDB to support operational processes. Various parts of the NewDB hadn’t been designed to support such integrations, and there was also a technical concern about a lack of replay-ability of data in NewDB.

So it leads to a question: where is the right place to integrate data across systems?

One option is to integrate directly, i.e. system to system. But many integrations need data from multiple sources, meaning that joining logic has to be implemented. If this is done multiple times for multiple integrations, we have a consistency problem and duplicated effort. It also doesn’t scale:

The point-to-point data pipelines problem. Each additional system may need an integration with every system already in place, so the number of integrations grows much faster than the number of systems!
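The scaling argument can be checked with some simple arithmetic. The sketch below is a deliberate simplification: it assumes the worst case where every system feeds every other system directly, and it counts a single connection per system for the shared pipeline. The function names are illustrative only.

```python
def point_to_point_links(systems: int) -> int:
    """Directed integrations if every system feeds every other system directly."""
    return systems * (systems - 1)


def pipeline_links(systems: int) -> int:
    """Integrations if each system connects once to a shared pipeline."""
    return systems


# (number of systems, point-to-point integrations, pipeline integrations)
growth = [(n, point_to_point_links(n), pipeline_links(n)) for n in (2, 5, 10, 40)]
```

With the ~40 live services mentioned earlier, the worst case is 1,560 direct integrations against 40 connections to a shared pipeline. Real estates are rarely fully connected, but the gap still widens with every system added.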

Our current proposal is that a central real-time data pipeline (we chose Kafka) forms the backbone of our data platform, which we strive to make the first choice for integrations, operations, reporting and analytics. We establish a common data model in Kafka and publish to it. We thus avoid the ‘point-to-point integrations’ problem as the number and type of our systems grow. As new source systems map into the data model, existing consumer integrations benefit with little additional effort. And we avoid building a data warehouse monster, OldDB 2.0.
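As a rough sketch of why a common data model helps consumers, consider the example below. The record shape, the source field names and the mapper functions are all hypothetical, and the step of actually publishing to Kafka is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerRecord:
    """Hypothetical common-model record that every consumer reads."""
    account_id: str
    full_name: str
    source_system: str


def from_billing(row: dict) -> CustomerRecord:
    # Assumed field names for a legacy billing export.
    return CustomerRecord(str(row["AcctNo"]), row["CustName"].strip(), "billing")


def from_crm(row: dict) -> CustomerRecord:
    # Assumed field names for a CRM export.
    name = f'{row["first_name"]} {row["last_name"]}'
    return CustomerRecord(row["account"], name, "crm")


# Adding a new source system means adding one mapper here.
MAPPERS = {"billing": from_billing, "crm": from_crm}


def to_common_model(source: str, row: dict) -> CustomerRecord:
    """Map a source-specific row into the common model before publishing."""
    return MAPPERS[source](row)


def greeting(record: CustomerRecord) -> str:
    """A consumer written once against the common model."""
    return f"Hello {record.full_name} ({record.account_id})"
```

The consumer `greeting` never changes when a new source is onboarded; only a new entry in `MAPPERS` is needed.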


Source systems with a monolithic architecture can also be supported by this data platform: we can replicate their data using change data capture, i.e. standard database replication technology.
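The core idea of change data capture can be sketched in a few lines. The event format below (`op`, `key`, `row`) is invented for illustration; real replication tools each define their own formats, and this is not a description of our actual setup.

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def replay(events: list) -> dict:
    """Rebuild the replica from scratch by replaying the change log in order."""
    replica = {}
    for event in events:
        apply_change(replica, event)
    return replica


# Hypothetical change log captured from a source table.
change_log = [
    {"op": "insert", "key": "A1", "row": {"tariff": "Fixed"}},
    {"op": "insert", "key": "A2", "row": {"tariff": "Variable"}},
    {"op": "update", "key": "A1", "row": {"tariff": "Variable"}},
    {"op": "delete", "key": "A2"},
]
replica = replay(change_log)
```

Because the replica is derived entirely from the ordered log, it can be rebuilt at any time, which is the kind of replay-ability that was a concern with NewDB.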



On to Learning Two: if you mirror a legacy database across, it will linger for a long time because the impetus to do something about it is gone. Worse, so will the dependencies.

Effectively decommissioning OldDB, along with its many associated dependencies and regulatory and legal requirements, is a project in itself, even when no live customers depend on that system.

For SSE, there is no plan to map the target state data back to the incumbent databases. We wouldn’t know where to start.

We will take only the SSE customer data that is needed for specific use cases, and map that forward, to the common data model, in the central data platform. Any long term dependencies on that data will link to data from the common model, not the mirrored data, meaning that dependencies are insulated from the migration of customer data itself.

For example, our in-house energy management platform needs a base set of data. Further data needs (e.g. more history, different data fields) arise from other use cases, e.g. regulatory reporting, marketing, operations and agent service.

We are tackling each need independently and in priority order but with reuse of the data already replicated across, or already in the platform, for other use cases. Like how to eat an elephant - bite by bite.

Thanks to Lisa P for the elephant photo from the book Look! There's Elmer by David McKee

This is all a work in progress, and the things we are thinking about actively now include:

  • How to manage the long tail of other dependencies on the legacy system?
  • Defining a way to track progress (beyond the macro progress, of use cases enabled)
  • Appropriate testing, monitoring and alerts
  • Generalising technology components to increase efficiency gains
  • How to track time spent on each data use case / data item
  • How to avoid data backchannels, a compelling but dangerous solution to the split data problem

Getting this right benefits more than just SSE's customers and a smooth migration path for them - it benefits any further customer bases we migrate onto our in-house developed energy management platform.

If you’d like to join this effort, we are hiring. We are also interested in any comments you have (there is a comment form below).

Thanks to: Robert Mackenzie, whose various comments have been reformed into many of these words, and James Hendry, who is pivotal to keeping these plates spinning with us.

Finally, kudos to all OVO data peoples who have had to put up with this:

You are stars.

Other articles we like

https://martinfowler.com/articles/data-monolith-to-mesh.html

https://medium.com/avalia-systems-blog/how-to-eat-a-monolithic-elephant-e87570e8603f
