OVO Tech Blog

SQL for event streaming microservices


Monty West

Software Engineer

Posted by Monty West.

This blog is a very one-sided affair, so here is a disclaimer: databases are tools, and each tool has a use case. Use the database that makes sense for your requirements and your team.

However, consider a SQL database when you next come to develop your event streaming microservice pipeline.

SQL has its weaknesses. However, there is occasionally a propensity to overstate them and prematurely optimise by reaching for more modern technologies, such as NoSQL. You can still use SQL databases in a modern tech stack, and this blog hopes to show that.

We'll first look at SQL's strong points and why it's not yet outdated. We'll then see some tips from my team at OVO on how to get the best out of your SQL database. Lastly, we'll explore a real-world example where these tips were used with great success.

SQL is good

SQL's usage endures and for good reason:

  • Knowledge and community - SQL has been around for 44 years, and with that comes a wealth of shared knowledge. An active community and in-depth documentation help developers get the most out of their databases.

  • Product maturity - With SQL's age comes maturity. Even some of the newer SQL databases, such as MySQL and PostgreSQL, are very feature-rich and mature products.

  • Experience ubiquity - According to the most recent Stack Overflow Developer Survey, over 55% of developers have experience with SQL. It is therefore likely that the majority of your team, or anyone who joins it, has some experience with SQL, lowering the barrier to developing with it.

  • Usage simplicity - SQL is widely known as an easy language to learn, so even if there are members of your team with no experience, getting everyone up to speed can be a quick and painless process.

  • Language support - Due to its age and ubiquity there are several plugins, libraries and frameworks, for a host of different languages, which allow you to connect your database to your code.

  • Type safety - If you are using a strongly typed language and/or utilising schemas in your event streams (e.g. Apache Avro), then being able to extend that to your database maintains type safety throughout your technology stack.

  • Visibility of data - With other database solutions it can be difficult to get a view of your data outside of its specific use case. Having every column queryable can be very useful for debugging or reporting purposes.

SQL is 'modern'

Here are some reasons why, despite its age, SQL can still hold its own and remains 'modern':

  • It's open source - PostgreSQL and MySQL are both open-source projects and have become increasingly popular in the last 20 years.

  • It's in the cloud - Several cloud providers offer managed SQL solutions (for example, Amazon Aurora and Google Cloud SQL), with all the features we've come to expect: high performance, security, automatic scaling, logs/metrics, automatic backups etc.

  • It can store a lot of data - Big data seems to be one of the main reasons developers look elsewhere for their database solutions. However, modern SQL databases are performant for most companies' use cases.

With these points in mind, my team at OVO have recently been adhering to the following rule:

Use SQL until we can prove it won't meet requirements.

A year later, we haven't found a single use case where SQL wasn't good enough. During this time we have come up with a few guidelines for implementing performant SQL databases in event-streaming microservices:

Our Tips

  1. One database per microservice -
    This may seem counterintuitive to some, especially for cases where multiple services need access to the same data. However, in an age where storage is 'cheap' and speed is king, the negatives are largely mitigated. Even though this won't have a direct impact on the performance of your services, it is essential in order to employ the following tips.

  2. Store only what you need -
    Slim down your data to only what your service requires. Avoid adding columns to your schema just because they are present in the incoming data, or because they may be useful later. Have an automated process that cleans up rows that are no longer needed.

  3. Avoid long-lived transactions -
    A common task is to fetch data from your database, combine it with other data and then perform an update. To mitigate race conditions and maintain data integrity this is often done in a transaction. It is very common for this transaction to become the bottleneck of your application. Special consideration is required to build services that don't rely on such transactions. However, with microservices, where each service (ideally) has a single concern, this is much easier to achieve.

  4. Tailor your tables for how you use them -
    Don't just follow the schema of your incoming data. Build your tables for how they are being queried. Don't generate IDs where a multi-column primary key will do. Don't save a timestamp where you only want a date. Empower your database to do work for you.

In practice

So let's look at a real-world example where we can employ these tips.

The team is in charge of ingesting all smart meter reads. We receive a million reads a day, which flow through 30+ Scala microservices, connected with Kafka. All reads are stored in a PostgreSQL database, hosted in Google Cloud SQL. Our role is to process these reads and produce projections, estimations and, most importantly for this blog, usage. Let's define usage as "one meter read minus a previous read from that meter".

Recently, we received a requirement to produce usage using reads exactly 365 days apart. We must ensure that this usage is from a single meter and a single customer (meters can change customers and vice versa). Furthermore, for each customer, we should not produce usage that uses older reads than usage we have already produced.


We have a new concern, so let's create a new service to handle it. Reads are going to come in one by one, so in order to combine them we'll need to store them. Let's follow Tip 1 and create a new database. At this point this seems wasteful: we already have a database that contains all the data we could need. Hopefully, the benefits of this decision will become clear as we build and improve our solution.


Let's define our table schemas, keeping Tip 2 in mind. Here's our incoming read event:

{
  "event_id" : "123456",
  "meter_id" : "meter1",
  "customer_id" : "customerA",
  "reading_time" : "2018-12-08T00:00:00.000",
  "reading" : 1000.0,
  "created_at" : "2018-12-08T02:00:00.000",
  "region_id" : "SW"
}

Let's follow Tip 2 immediately and only store what we need. event_id may seem useful, but we don't need it to calculate usage, so we're not going to include it in our schema.

Table "public.meter_readings"
    Column    |         Type          
 meter_id     | character varying(14) 
 customer_id  | character varying(10) 
 reading_time | timestamp                  
 reading      | numeric(16,8)         

Without event_id we need a primary key. For a given timestamp we'll only receive one reading for a meter, so the following seems sensible:

"meter_readings_pkey" PRIMARY KEY, btree (meter_id, reading_time)
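Put together, the table and key above can be sketched as follows. This is only a sketch: sqlite stands in for PostgreSQL, so `character varying` becomes TEXT and the timestamp is stored as ISO-8601 text, but the composite primary key behaves the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sqlite standing in for PostgreSQL: the composite primary key is the point here.
conn.execute("""
CREATE TABLE meter_readings (
  meter_id     TEXT NOT NULL,
  customer_id  TEXT NOT NULL,
  reading_time TEXT NOT NULL,  -- ISO-8601 text; Postgres would use timestamp
  reading      REAL NOT NULL,
  PRIMARY KEY (meter_id, reading_time)
)""")

# With (meter_id, reading_time) as the key, a re-delivered event can be
# upserted rather than creating a duplicate row.
row = ("meter1", "customerA", "2018-12-08T00:00:00.000", 1000.0)
upsert = ("INSERT INTO meter_readings VALUES (?, ?, ?, ?) "
          "ON CONFLICT(meter_id, reading_time) DO UPDATE SET reading = excluded.reading")
conn.execute(upsert, row)
conn.execute(upsert, row)  # same event delivered twice
print(conn.execute("SELECT COUNT(*) FROM meter_readings").fetchone()[0])  # 1, not 2
```

The conflict target of the upsert is exactly this primary key, which is one reason not having a surrogate event_id column costs us nothing.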

Let's also look at the outgoing usage event:

{
  "meter_id" : "meter1",
  "customer_id" : "customerA",
  "from_reading" : 1000.0,
  "from_reading_date" : "2017-12-08",
  "to_reading" : 2000.0,
  "to_reading_date" : "2018-12-08"
}
To ensure we don't publish events using older reads we'll have to store this somewhere as well. Considering Tip 2 again, we actually only need two pieces of information: customer_id and to_reading_date.

Table "public.usage_effective_froms"
    Column       |         Type          
 customer_id     | character varying(10) 
 to_reading_date | date                  

Furthermore, given that we're only interested in the latest read date we've used, let's use customer_id as our primary key.

"usage_effective_froms_pkey" PRIMARY KEY, btree (customer_id)
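With customer_id as the primary key, recording a newly published to_reading_date is a single-row upsert. A sketch (sqlite standing in for PostgreSQL; the WHERE guard on the DO UPDATE clause is one possible way to make sure an older date never overwrites a newer one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE usage_effective_froms (
  customer_id     TEXT PRIMARY KEY,
  to_reading_date TEXT NOT NULL   -- a date, stored as 'YYYY-MM-DD' text in sqlite
)""")

def record_published(customer_id, to_reading_date):
    # Upsert on the primary key; the guard means an older date is a no-op.
    conn.execute("""
        INSERT INTO usage_effective_froms VALUES (?, ?)
        ON CONFLICT(customer_id) DO UPDATE SET to_reading_date = excluded.to_reading_date
        WHERE excluded.to_reading_date > to_reading_date
    """, (customer_id, to_reading_date))

record_published("customerA", "2018-12-08")
record_published("customerA", "2018-11-01")   # older: ignored by the guard
row = conn.execute("SELECT to_reading_date FROM usage_effective_froms "
                   "WHERE customer_id = 'customerA'").fetchone()
print(row[0])  # 2018-12-08
```

Because the dates are ISO-8601, a plain string comparison in the guard orders them correctly.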

We've successfully slimmed down our data; less data to move means it can move faster. Reusing our existing database would have made this very difficult, as that database may have extra concerns and may need to store more data.

Service flow

Next let's come up with a flow for this service. For each incoming read, we perform the following:

  1. Upsert the read into our table.
  2. Find a read from 365 days ago with matching meter_id and customer_id.
  3. Calculate the usage.
  4. Fetch the previously published to_reading_date for the read's customer_id; continue only if this read's reading_time is on or after that date (otherwise skip to step 7).
  5. Upsert customer_id and to_reading_date into the usage table.
  6. Publish the usage event.
  7. Acknowledge the read event.

Keen flow architects will notice that this flow is susceptible to race conditions and may be tempted to wrap it in a single transaction. However, our race condition only exists between reads with the same customer_id, so locking the whole table down in a transaction is overkill. Instead, let's tailor our service with this in mind and design it to process only one read at a time for a given customer_id (by partitioning the stream, for example).

The need for transactional rollbacks/retries on failure is also mitigated, as the event won't be acknowledged and will eventually be reprocessed.
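One way to get that per-customer serialisation is to partition the stream by customer_id, so every read for a given customer lands on the same partition and is handled by one consumer at a time. A sketch of the key-to-partition mapping (partition_for is a hypothetical helper; Kafka's default partitioner does the equivalent with a murmur2 hash of the record key):

```python
import zlib

def partition_for(customer_id: str, num_partitions: int) -> int:
    # A stable hash of the key, modulo the partition count. (Python's built-in
    # hash() is salted per process, so crc32 gives a deterministic mapping.)
    return zlib.crc32(customer_id.encode("utf-8")) % num_partitions

# All reads for customerA map to the same partition, so they are processed
# serially by whichever consumer owns it; no table-wide transaction is needed.
p = partition_for("customerA", 8)
```

Two reads for the same customer can then never be in flight at once, which removes the race condition without any database-level locking.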


The team did all of the above and deployed the service to our test environment, only to find that it wasn't running fast enough for our needs. We found that the query for Step 2 was the bottleneck. This is what we had written:

SELECT reading
FROM meter_readings
WHERE meter_id = read.meter_id
  AND customer_id = read.customer_id
  AND reading_time::date = (read.reading_time - interval '365 days')::date;

Indexing is a common way to speed up queries, so that is what we tried next:

CREATE INDEX meter_readings_query_index
ON meter_readings (meter_id, customer_id, (reading_time::date));

We measured the effect of this index and, disappointingly, found only a negligible speed increase. This is because, as well as serving this query, the table receives an insert for every incoming event; the cost of maintaining the index was negating the time gained on the query.

The problem here is that we didn't design the table for its use case, instead we designed it based on the data schema.

Firstly, even though we can receive multiple reads a day, our requirements say we only need to use one read from each day. We are unnecessarily storing a timestamp where a date would do, which would also simplify our query.

Secondly, we didn't include customer_id in our primary key because our domain knowledge tells us it doesn't affect uniqueness. However, this means that we can't use the full primary key in our query and had to define an index.

So we redefined our schema, another action that would not have been possible if we were using this database elsewhere in our pipeline:

Table "public.meter_readings"
    Column    |         Type          
 meter_id     | character varying(14) 
 customer_id  | character varying(10) 
 reading_date | date                  
 reading      | numeric(16,8) 

"meter_readings_pkey" PRIMARY KEY, btree (meter_id, customer_id, reading_date)

And rewrote our query:

SELECT reading
FROM meter_readings
WHERE meter_id = read.meter_id
  AND customer_id = read.customer_id
  AND reading_date = (read.reading_time - interval '365 days')::date;

Now we are directly querying on the primary key. We can remove the index and the penalty for maintaining it on inserting, whilst also keeping query speed.

Deploying these changes saw a 12-fold increase in performance, easily enough for our use case.

Not done yet

If left unchecked, this table would contain over a billion rows with only three years of data. Moreover, we could already see degrading performance with just over a year's worth of reads. So our last action was to ensure we only store the rows the service could use. A daily process deletes any read that is over a year older than the corresponding to_reading_date in the usage table:

DELETE FROM meter_readings AS mr
USING usage_effective_froms AS uef
WHERE mr.customer_id = uef.customer_id
  AND mr.reading_date < (uef.to_reading_date - interval '365 days');
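The cleanup can be sketched and checked in miniature (sqlite standing in for PostgreSQL; sqlite has no DELETE ... USING, so a correlated subquery plays that role, and the rows are hypothetical sample values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meter_readings (meter_id TEXT, customer_id TEXT, reading_date TEXT, reading REAL,
                             PRIMARY KEY (meter_id, customer_id, reading_date));
CREATE TABLE usage_effective_froms (customer_id TEXT PRIMARY KEY, to_reading_date TEXT);
INSERT INTO usage_effective_froms VALUES ('customerA', '2018-12-08');
INSERT INTO meter_readings VALUES
  ('meter1', 'customerA', '2017-12-07', 900.0),   -- more than a year before 2018-12-08: deleted
  ('meter1', 'customerA', '2017-12-08', 1000.0),  -- exactly a year before: kept
  ('meter1', 'customerA', '2018-12-08', 2000.0);  -- kept
""")

# Equivalent of the DELETE ... USING above: drop any read more than a year
# older than the customer's latest published to_reading_date.
conn.execute("""
DELETE FROM meter_readings
WHERE reading_date < (SELECT date(uef.to_reading_date, '-365 days')
                      FROM usage_effective_froms AS uef
                      WHERE uef.customer_id = meter_readings.customer_id)
""")
remaining = conn.execute(
    "SELECT reading_date FROM meter_readings ORDER BY reading_date").fetchall()
```

Customers with no row in usage_effective_froms are untouched, because the subquery yields NULL and the comparison fails, mirroring the inner-join semantics of USING.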

Ultimately, by following our tips, we have successfully met our requirements in a performant and future-proof manner.
