Design exerciseDistributing (consistent) data at scale

time to read 4 min | 648 words

imageBefore I start discussing this topic, I want to talk a bit about the speed of light. That pesky limit basically means that there in an inherent lag in passing information between any two points in space. On your daily life, you can mostly ignore it. The human brain is far too slow to perceive it, and even if you are working with computers, you can usually ignore the speed of light for anything less than about 500 miles.

But the speed of light is merely the hard upper limit of our ability to send information from one location to another. In practice, the lag time between any two computers connected to a network is much higher. In fact, if you are a gamer, you are very well acquainted with that fact.

Let’s assume that we have a need to do the following:

  • Hold a mutable state of some kind.
  • Modify it in a consistent manner.
  • Distribute it to many locations, where many is at least hundreds and potentially tens of thousands of locations.

Let’s break this up a bit to its component parts. A mutable state basically means that we can modify it over time, and that these modifications shows up in all locations. The speed of light (and network lag) ensures that we can’t have this happen instantaneously. And, of course, the killer requirement here is that we need to do this in a consistent manner.

I find it really hard to talk about abstract problems concretely so let’s use an example. We have a set of tax rules that determine how certain products should be taxed. The ruleset is big, changes on a regular basis and it is quite important to get things right. Oh, and we also need to do push those rules to tens of thousands of locations that would independently compute taxes for purchases.

In this case, the solution is quite easy. We have a single source of truth that all the locations can pull the tax ruleset from (directly or through mirrors). We’ll also ensure that tax rules are applied only at some future date from their creation, to allow time for all these locations to be updated. If a location hasn’t had an updated ruleset for over a certain amount of time, they can refuse to process any payments until they get an update. This example has really very little for us to do in term of design.

A better example would have multiple actors changing the data as we work with it. An actual example for when this is required can be when we have an ad provider that needs to run ads in many (physical locations). Each location is actually an IoT device that includes a computer, a screen and a camera. Each such device decides independently what ads should be shown (for example, if the camera see a baby stroller, they show ads for baby toys).

I actually had to search hard for an example that would work for this scenario but I think this is a good  one. We obviously have a lot of activity on the ad provider with many people registering and modifying ads settings. We need that portion of our business to be consistent (this is what we are charging people for, after all). However, given the probably high amount of ad devices spread all over, we can’t rely on them always being connected to a central server or even on them being connected at all.

And even if there is no connectivity and nothing new to show, we must show something. Even if we aren’t getting paid for showing the ad, not showing something or (actually worse) showing an error is something that we can’t tolerate.

Given all of these constraints, how do would you build such a system?

More posts in "Design exercise" series:

  1. (19 Dec 2018) Distributing (consistent) data at scale, answer
  2. (18 Dec 2018) Distributing (consistent) data at scale
  3. (26 Nov 2018) A generic network protocol