In this post I'll attempt to introduce two concepts:
- using bulkheads to make micro services more reliable
- and making time a first class citizen in a distributed system to improve consistency
First, some definitions:
- Bulkhead: a cache of data that allows services to be less reliant on one another. Changes are published (push) rather than requested (pull).
- Consistent View: queries usually join two or more data sets. Every part of the query needs to be based on the same instant to provide a consistent view of the world.
- Idempotent: applying the same operation repeatedly does not affect the result.
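To make the idempotency definition concrete, here is a minimal sketch (the function and field names are illustrative): setting a value is idempotent, while incrementing a counter is not.

```python
def set_status(record: dict, status: str) -> dict:
    """Setting a value is idempotent: repeats have no further effect."""
    return {**record, "status": status}

def increment_views(record: dict) -> dict:
    """Incrementing is NOT idempotent: each repeat changes the result."""
    return {**record, "views": record.get("views", 0) + 1}

once = set_status({"id": 1}, "shipped")
twice = set_status(once, "shipped")
assert once == twice  # applying the same operation again changed nothing
```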
Now let's discuss some different systems:
Small is beautiful: A single database
Simple. Everything in one place. Queries return a consistent view, but you need to plan ahead and fetch all the data upfront. As a system grows it can be a challenge to scale, both for handling all the reads and writes and for coordinating many developers making changes at once.
Divide and conquer: Request Based Services
The solution to the scaling issue is to break the system up into logical units so they can be changed at their own pace and be deployed with the needed capacity. The next challenge that arises is reliability: if one system has an issue then all the systems that depend on it are also affected.
Don't call us, we'll call you: Bulkhead Based Services
The solution to the reliability issue is to use bulkheads. Services publish a log of 'outside' data for each other so that they can still make queries even if the other service is not available.
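As a minimal sketch of that idea (the event shapes and names are illustrative, not a real API): a consumer replays the log another service has published into its own local cache, the bulkhead, and can then answer queries without a network call, even if the producer is down.

```python
# The log of events another service has published for us.
published_log = [
    {"entity": "order-1", "status": "placed"},
    {"entity": "order-1", "status": "shipped"},
]

# Our local bulkhead: a cache built by replaying the log.
bulkhead: dict = {}
for event in published_log:
    bulkhead[event["entity"]] = event  # latest published state wins

# Queries now run locally; the producing service being down doesn't matter.
assert bulkhead["order-1"]["status"] == "shipped"
```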
Doctors, Accountants, Detectives and Sea captains all realised long ago that writing events in a log format is a good way to capture facts. It provides the consumer flexibility to decide how to arrange those facts to suit their own needs.
We've solved scaling and reliability but no longer have a simple holistic system. We have to think more about consistency when querying across bulkheads.
Time Aware Systems
'Now' is a relative concept for people or computers in different places. Sharing information is not instantaneous and communication breakdowns are unavoidable. The next challenge for systems where correctness matters is to restore the Consistent View.
When you query a single database it may lock the tables being queried in order to give a consistent answer. The variable that is invisibly managed by a single database is 'WHEN?'. By considering time as a first class citizen in a distributed system we can query our bulkheads "as of" an instant to know we have a consistent view or to be aware that we cannot answer the question correctly with the information we currently hold.
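A minimal sketch of an "as of" query, assuming each bulkhead stores timestamped versions (the data and helper are illustrative): because both lookups use the same instant, the joined answer is consistent, and a missing result tells us we cannot yet answer correctly.

```python
def as_of(versions: list, instant: int):
    """Return the latest version at or before `instant`, or None."""
    candidates = [v for v in versions if v["t"] <= instant]
    return max(candidates, key=lambda v: v["t"], default=None)

# Two bulkheads, each holding timestamped versions of its entities.
orders = [{"t": 1, "total": 10}, {"t": 5, "total": 30}]
customers = [{"t": 2, "name": "Ada"}, {"t": 7, "name": "Ada Lovelace"}]

# Querying both "as of" instant 4 gives a consistent cross-bulkhead view.
assert as_of(orders, 4)["total"] == 10
assert as_of(customers, 4)["name"] == "Ada"

# Instant 0 predates our data: we know we can't answer, rather than lying.
assert as_of(orders, 0) is None
```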
Another advantage is that the bulkheads become time machines: they can provide a historical view, a view of changes since a point in time, and a full history of changes to an entity in the system. This means you can ask any question you dream up in the future.
Your Data as a crime scene
The truth is, in any reasonably complex system there will be inconsistencies and omissions. Retroactively publishing updates and corrections is an issue for consumers who only have a one-dimensional view of time (now). Even systems that don't attempt to record history have to cope with out-of-order and repeat events if publishers are to maintain 'at least once' semantics.
Managing repeat events
The goal of bulkheads is to decouple systems. It is counterproductive to have to coordinate producers and consumers when dealing with duplicate events, some of which should be applied and some ignored. The naive approach is to track and ignore any event seen before. An elegant solution is to design the data model so that 'ignorable' events have no effect when applied, i.e. make them idempotent.
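One way to sketch that design choice (the fact shape is illustrative): if the data model is a set of facts, re-applying a duplicate event is naturally a no-op, so no dedup bookkeeping or producer/consumer coordination is needed.

```python
facts: set = set()

def apply_event(fact: tuple) -> None:
    """Adding an element that is already present changes nothing."""
    facts.add(fact)

apply_event(("order-1", "status", "shipped"))
apply_event(("order-1", "status", "shipped"))  # duplicate delivery
assert len(facts) == 1  # the repeat had no effect: idempotent by design
```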
An aside - Entity State Transition
We can sidestep many issues by publishing the full state of an entity on every change. It's a simple solution: the version with the latest timestamp is the winner. It's possible to save all the versions to build a historical view, and easy to remove the entity altogether, e.g. for GDPR.
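A minimal sketch of "latest timestamp wins" full-state publishing (names and timestamps are illustrative): keeping every version gives a history, a late-arriving older version loses, and deleting the key removes the entity entirely.

```python
versions: dict = {}

def receive(entity_id: str, state: dict) -> None:
    """Store every received version, in whatever order it arrives."""
    versions.setdefault(entity_id, []).append(state)

def current(entity_id: str) -> dict:
    """The version with the latest timestamp is the winner."""
    return max(versions[entity_id], key=lambda s: s["ts"])

receive("user-1", {"ts": 2, "email": "new@example.com"})
receive("user-1", {"ts": 1, "email": "old@example.com"})  # arrives late
assert current("user-1")["email"] == "new@example.com"

del versions["user-1"]  # removing the entity altogether, e.g. for GDPR
```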
However, there are some downsides. Fixing the history is impossible. Sharing the whole entity state can be inefficient, as entity models tend to keep growing with time. This adds cognitive load and means bulkheads probably contain much more data than is actually used. This mirrors a trend in the request-based world: REST responses tend to bloat over time, and as a result various solutions (e.g. GraphQL) have appeared to allow querying for specific attributes on an entity rather than getting everything anyone has ever needed.
Entity, (Attribute, Value), Time
In other words: Who?, What?, When?
If we model our data as a collection of facts in this form we get the following properties:
- 'when?' queries provide a consistent view of the world across bulkheads
- 'what?' queries mean bulkheads can cache only relevant data
- Duplicate, out-of-order events are idempotent
- Corrective, out-of-order events allow history to be fixed without coordination
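The properties above can be sketched with a toy fact store (the entities, attributes and helper are illustrative): facts are (entity, attribute, value, time) tuples, so duplicates are no-ops, a late correction just adds a fact with an earlier timestamp, and 'as of' queries give a consistent view.

```python
facts: set = set()

def assert_fact(e, a, v, t) -> None:
    """Who?, What? (attribute, value), When? -- duplicates are no-ops."""
    facts.add((e, a, v, t))

def value_as_of(e, a, instant):
    """Latest value of attribute `a` on entity `e` at or before `instant`."""
    matches = [(t, v) for (e2, a2, v, t) in facts
               if e2 == e and a2 == a and t <= instant]
    return max(matches)[1] if matches else None

assert_fact("order-1", "status", "placed", 1)
assert_fact("order-1", "status", "placed", 1)   # duplicate: idempotent
assert_fact("order-1", "status", "shipped", 5)
assert_fact("order-1", "status", "packed", 3)   # late correction to history

assert len(facts) == 3
assert value_as_of("order-1", "status", 4) == "packed"
assert value_as_of("order-1", "status", 9) == "shipped"
```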
I hope my ramblings haven't overwhelmed you and perhaps have illuminated what all the hype around Kafka is actually about (publishing data for bulkheads). I find distributed, time-aware databases fascinating. Most developers use one every single day even if they don't realise it (can you guess what it is?), and I think perhaps one day users may expect the same luxury too.
Two tools that allow you to bulkhead data in a time-aware fashion are Datomic and Crux. Datomic has some really interesting properties, especially if you require extreme correctness (as opposed to availability and scalability). It doesn't let you 'fix' history, but this is actually desirable in some contexts. Bi-temporal databases like Crux have a concept of 'valid time' that allows you to assert facts in the past. I haven't used Crux so can't really comment on it, but I would like to try it out in the future. I recommend checking them out. I'm sure there are more out there and I would be interested to hear about them.