Distributed in memory cache / storage

Let us start by defining the difference between a cache and storage. A cache may decide to evict items whenever it feels like it, while storage will only do so at well-defined points.

As a simple example, this is legal for a cache:

PUT "foo", "item data"
result = GET "foo" 
assert result is null

The cache contract means that it "might" preserve values, or it might not; that is the cache's choice. A storage contract makes the above code illegal; it may never happen. Most cache solutions have a way to specify priorities, including a "do not evict" flag, which makes them into good storage solutions.

For now, we will assume the same API for both storage and cache, and assume that we have a flag to turn it this way or that. We need to figure out what contract we are working with in the first place. Distributed & in memory means that we have no practical memory limit and we do not care much about disk access times. Distributed does mean that we pay a remote call penalty, but in general it means that we don't have to wait for disk, as we would have to in many DB based solutions.

We will go over the API first. Note that I am using an abstract API (suspiciously similar to the Memcached one, I know), which most cache/storage solutions would support; a rough sketch of it in code follows the list.

  • PUT key, data, expiration
    Will fail if item is already in cache
  • GET key
    Will return the item and its version, or null if the item is not in the cache or has expired
  • DEL key
    Delete the key from the cache
  • UPDATE key, data, version
    Update the item if the version matches

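To make the contract concrete, here is a minimal sketch of what a client for this API might look like in Python. The class and method names are my own, not tied to any particular product, and the evictable flag stands in for the "do not evict" priority mentioned above.

from typing import Any, Optional, Tuple

class DistributedStoreClient:
    """Hypothetical client for the abstract API above."""

    def put(self, key: str, data: Any, expiration_secs: int,
            evictable: bool = True) -> bool:
        """Add a new item; returns False if the key already exists.
        evictable=False is the "do not evict" flag that turns the
        cache contract into a storage contract."""
        ...

    def get(self, key: str) -> Optional[Tuple[Any, int]]:
        """Return (data, version), or None if the key is missing,
        evicted, or expired."""
        ...

    def delete(self, key: str) -> None:
        """Remove the key, if it exists."""
        ...

    def update(self, key: str, data: Any, version: int) -> bool:
        """Optimistic concurrency: overwrite the item only if its
        current version matches the one we read; returns False on
        a version mismatch."""
        ...
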
We assume that the storage resides on multiple machines, which means that we have several interesting problems to deal with:

  1. We are, by definition, working in a threaded environment.
  2. The cache offers no way to lock items.
  3. The set of items we need to perform a single operation may reside on different machines.
  4. As a result, each operation is distinct from any other; there is no way to batch several operations into a single atomic operation. In other words, we don't have transactions.

Think about the implications of this for a moment, will you?

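Before we get to the problem, it is worth seeing what the versioned UPDATE does buy us. For a single key, we can get safe lock-free updates by reading, computing, and retrying on a version conflict. This is only a sketch, using the hypothetical client interface from above:

def give_raise(client, employee_id: str, amount: int):
    # Optimistic concurrency on a single key: no locks required.
    while True:
        result = client.get("salary#" + employee_id)
        if result is None:
            raise KeyError("salary#" + employee_id)
        salary, version = result
        new_salary = salary + amount
        # Succeeds only if nobody changed the item since we read it.
        if client.update("salary#" + employee_id, new_salary, version):
            return new_salary
        # Version mismatch: another client got there first, retry.

That works as long as the operation touches a single key. The moment it touches more than one, we are in trouble.
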
Let us say that I have the following piece of code:

employee, employee_version = GET "employee#"+id
salary, salary_version = GET "salary#"+id

# business logic

UPDATE "employee#"+id, employee, employee_version
UPDATE "salary#"+id, salary, salary_version

At any point we might get interleaving from another client, which will lead to data corruption. Even the safe update that we have is no good: assume that we fail to update the salary. At that point we have already updated the employee, so now we need to roll it back, but another client might have added their own changes in the meantime...

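To make the failure concrete, here is a sketch of what the compensation attempt looks like, again using the hypothetical client from above. The rollback is just another versioned update, subject to exactly the same races as the operation it is trying to undo:

def update_employee_and_salary(client, id, employee, employee_version,
                               salary, salary_version, original_employee):
    # Update the employee first; if this fails nothing has changed yet,
    # so giving up is safe.
    if not client.update("employee#" + id, employee, employee_version):
        return False
    # Now the salary. If this fails, the employee is already updated
    # and we have to try to undo it.
    if not client.update("salary#" + id, salary, salary_version):
        result = client.get("employee#" + id)
        if result is not None:
            _, current_version = result
            # Another client may have changed the employee in the
            # meantime; restoring the old data now would silently
            # wipe out their change.
            client.update("employee#" + id, original_employee, current_version)
        return False
    return True
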
To make things worse, the employee and salary might very well reside on different machines. Trying to build transactions around this is possible, but extremely expensive in terms of performance.

We need a better solution, one that builds on top of the existing API we have and gives us a way to handle this properly.

I will touch on that in a future post.