Data is important. It can be one of a company's most important assets. The massive amount of data Facebook holds on its users is what makes it so successful at targeted marketing. One of the main reasons Google has been able to dominate the search engine market is precisely that it has the most search data. In general, "a new type of network effect takes hold: whoever has the most customers accumulates the most data, learns the best models, wins the most new customers, and so on in a virtuous circle (or a vicious one, if you’re the competition)", as Pedro Domingos puts it very eloquently in his book.
Data may also give you insights and information that seem unrelated to the context or reason you originally collected it for. Foot traffic data, for example, may allow you to predict iPhone sales.
Building the technology and infrastructure to gather that data, make data-driven decisions and extract actionable insights, though, is far from straightforward. The challenges are many: How do you consolidate the data? Who owns the data? How do you transform your data easily and in a timely manner?
This blog post briefly describes how we handle data at OVO: our infrastructure and strategy, as well as the challenges we have faced and the solutions we have come up with.
Photo of a smart meter. Smart meters can send readings at a rate of up to one reading every 10 seconds
Gathering the data
The first step in every Big Data system is actually gathering the data into a Data Warehouse, or a Big Data Store (we will use these terms interchangeably). Getting all the data in one place can be a very challenging task. Approaches range from extracting data from various sources and dumping it into the Big Data Store every night as a batch job, to reacting to and storing real-time events with an AWS Lambda. In our case we have gone beyond that: we have automated data ingestion. We accomplished it with a combination of technologies: Apache Kafka, Avro messages with a schema registry, Kafka Connect, and BigQuery as our Big Data Store.
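To make this concrete, here is a minimal sketch of what publishing an Avro message to Kafka can look like using the confluent-kafka Python client. The topic, the schema and the addresses below are illustrative, not one of our actual contracts.

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Avro schema for the event; the record and field names are purely illustrative.
value_schema = avro.loads("""
{
  "type": "record",
  "name": "MeterReadingTaken",
  "fields": [
    {"name": "meterId",    "type": "string"},
    {"name": "readingKwh", "type": "double"},
    {"name": "takenAt",    "type": "long"}
  ]
}
""")

producer = AvroProducer(
    {
        "bootstrap.servers": "kafka:9092",                     # assumed broker address
        "schema.registry.url": "http://schema-registry:8081",  # assumed registry address
    },
    default_value_schema=value_schema,
)

# Publish the event; the client registers the schema with the schema registry
# (under the topic's value subject) the first time it is used.
producer.produce(
    topic="meter-readings",
    value={"meterId": "m-123", "readingKwh": 0.42, "takenAt": 1510000000},
)
producer.flush()
```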
Apache Kafka is the backbone of our architecture. Every team sends messages to Kafka exposing the events that occur, in an event-sourcing manner. Every message has a schema, which is stored in the schema registry. Then, for every topic (i.e. a place in Kafka where messages get published) we create a "Connector" instance. The Connector uses the schema to create a table in BigQuery and starts storing all the messages there. By creating this list of contracts we have automated the process and kept ownership of the data with its producers.
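To illustrate the automation (rather than our exact setup), the step from "topic" to "BigQuery table" essentially boils down to registering a sink connector through the Kafka Connect REST API. The connector class shown is the open-source BigQuery sink connector, and the configuration keys, hostnames and dataset names below are illustrative and vary with the connector version.

```python
import requests

connector = {
    "name": "bigquery-sink-meter-readings",  # illustrative connector name
    "config": {
        # Open-source Kafka Connect BigQuery sink connector (originally by WePay)
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "topics": "meter-readings",
        "project": "our-gcp-project",     # assumed GCP project id
        "defaultDataset": "raw_events",   # key name differs between connector versions
        "autoCreateTables": "true",       # derive the BigQuery table from the Avro schema
        # Deserialise Avro payloads using the schema registry
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

# Register the connector with the Kafka Connect cluster
resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
resp.raise_for_status()
```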
Ownership of data is something that is often neglected. It is as important as having distinct ownership of your microservices. The law of entropy applies to structured data too: if no owner takes care of its consistency, data has a natural tendency to lose its structure and become more "random". In our case we addressed this by holding the original data producers accountable for the quality of the data they produce. This, however, does not solve everything. What happens with data that is a consolidation of data from different producers? What happens when you need to join multiple tables from multiple different sources? For this reason we created a Data Engineering team. The Data Engineers (in case you are interested, we still have some vacancies available) take care of the quality of that data, which our BI analysts and Data Scientists can then use without worry.
ETL in the Big Data era
Transforming Big Data takes effort, precisely because of the volume of data that needs to be processed. One can argue that writing a Spark job to run a transformation on the data is not that difficult. It is, however, undoubtedly more complex than running an SQL script locally.
The exact purpose of OVO's Platform team (of which I am a member) is to build the technology that makes ETL jobs on top of Big Data as easy as writing an SQL or Python script. BI analysts shouldn't have to care about how their scripts get scheduled to run at a specific time, how recurrence is handled, or how their tasks are scaled and run in parallel. For this reason we created (and are still actively developing) an app that uses Airflow and BigQuery.
Airflow is an incubating Apache project for executing data pipelines: tasks that form a DAG (directed acyclic graph), a term that is central to Airflow's vocabulary. Airflow supports task scheduling out of the box and also allows dependencies between tasks (i.e. a task can use the data created by a previous task). BI analysts enter their SQL scripts and set the schedule in a web app, which then uses Airflow and the BigQuery API to schedule and execute the scripts accordingly.
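To give a flavour of what ends up running, here is a sketch of the kind of DAG this produces, using Airflow's BigQuery operator. The table names, the query and the schedule are illustrative, and the operator's import path and parameter names vary between Airflow versions.

```python
from datetime import datetime, timedelta

from airflow import DAG
# Contrib operator from the Airflow 1.x line; newer versions ship it elsewhere.
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    "owner": "bi",
    "start_date": datetime(2017, 11, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Run the analyst's SQL every night at 03:00.
dag = DAG(
    "daily_consumption_report",
    default_args=default_args,
    schedule_interval="0 3 * * *",
)

aggregate = BigQueryOperator(
    task_id="aggregate_daily_consumption",
    sql="""
        SELECT meter_id, DATE(taken_at) AS day, SUM(reading_kwh) AS kwh
        FROM `our-gcp-project.raw_events.meter_readings`  -- illustrative table
        GROUP BY meter_id, day
    """,
    use_legacy_sql=False,
    destination_dataset_table="our-gcp-project.reporting.daily_consumption",
    write_disposition="WRITE_TRUNCATE",
    dag=dag,
)
```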
BigQuery handles the scaling of the SQL scripts under the hood, so that is not something we have to worry about. For Python scripts, Airflow (which is itself a Python project) supports Celery, so we run a cluster of Celery executors that handles scaling the Python workloads.
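For completeness, a Python task in the same (hypothetical) DAG would look roughly like the snippet below; with Airflow configured to use the CeleryExecutor, tasks like this one are picked up by whichever Celery worker has spare capacity.

```python
from airflow.operators.python_operator import PythonOperator

def score_customers(**context):
    # Illustrative body: e.g. pull an extract, run a model, write results back.
    pass

score = PythonOperator(
    task_id="score_customers",
    python_callable=score_customers,
    provide_context=True,
    dag=dag,  # the DAG object from the previous sketch
)
```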
Overview of the architecture of project "Athena", as we call it internally
Challenges
The biggest challenge is definitely security and privacy. At OVO we take security very seriously, and with GDPR taking effect in approximately six months, security and privacy are more relevant than ever. Moreover, the impact of a data breach grows dramatically when all your data is consolidated in one place.
We use techniques like masking and stripping personally identifiable information (PII), as well as applying strict authorization rules.
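As a minimal illustration of one such technique (not our actual implementation), a PII field can be replaced with a keyed hash before it ever lands in the warehouse, so analysts can still join on the value without being able to read it. Field names and the key source below are purely illustrative.

```python
import hashlib
import hmac
import os

# In practice the key would come from a secret store, not an env var default.
PSEUDONYMISATION_KEY = os.environ.get("PII_HASH_KEY", "change-me").encode()

def mask_pii(record: dict, fields=("email", "phone_number")) -> dict:
    """Return a copy of the record with the given PII fields pseudonymised."""
    masked = dict(record)
    for field in fields:
        if field in masked and masked[field] is not None:
            masked[field] = hmac.new(
                PSEUDONYMISATION_KEY,
                str(masked[field]).encode(),
                hashlib.sha256,
            ).hexdigest()
    return masked

# Example:
# mask_pii({"customer_id": 42, "email": "jane@example.com"})
# -> {"customer_id": 42, "email": "<64-character hex digest>"}
```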
Mitigating security and privacy issues is not something you can do once and forget. It requires constant effort and a continual reassessment of your processes and practices.
Future Work
BigQuery is amazing, but it has limitations: it is not a real-time data store. Since OVO is a real-time company that offers an up-to-the-minute breakdown of your energy consumption, our intention is to perform real-time analytics as well. We plan to use a separate real-time data store and leverage technologies like Dataflow to perform streaming operations.
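As a rough sketch of that direction (not a settled design), a streaming pipeline written with the Apache Beam Python SDK and run on Dataflow could aggregate readings per minute as they arrive. The subscription, table and field names below are all illustrative.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadReadings" >> beam.io.ReadFromPubSub(
            subscription="projects/our-gcp-project/subscriptions/meter-readings")
        | "Parse" >> beam.Map(json.loads)
        | "Window" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute windows
        | "KeyByMeter" >> beam.Map(lambda r: (r["meter_id"], r["reading_kwh"]))
        | "SumPerMeter" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: {"meter_id": kv[0], "kwh": kv[1]})
        | "Write" >> beam.io.WriteToBigQuery(
            "our-gcp-project:realtime.consumption_per_minute",
            schema="meter_id:STRING,kwh:FLOAT")
    )
```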
Interesting things are going to happen, so stay tuned!
This blog post is a continuation of a talk I gave some time ago at the ScalaXBytes meetup, revisiting some of its concepts and expanding on others.