Let us start by defining the difference between cache and storage. A cache may decide to evict items whenever it feels like, a storage will only do so at well defined points.
As a simple example, this is legal for a cache:
PUT "foo", "item data" result = GET "foo" assert result is null
The cache contract means that it "might" preserve values, or it might not, it is the cache choice. A storage contract make the above code illegal. It may never happen. Most cache solutions have a way to specify priorities, including the "do not evict" flag, which makes them into good storage solutions.
For now, we will assume the same API for both storage and cache, and assume that we have flag to turn it this way or that. We need to figure out what is the contract we are working with in the first place. Distributed & in memory means that we have no practical memory limit and we do not care much for disk access times. Distributed does mean that we have remote calls penalty, but in general it means that we don't have to wait for disk, as we would have to in many DB based solutions.
We will go over the API first. Note that I am using an abstract API (suspiciously similar to the Memcached one, I know), which most cache/storage solution would support.
- PUT key, data, expiration
Will fail if item is already in cache
- GET key
Will return null if item is not in cache or if expired
- DEL key
Delete the key from the cache
- UPDATE key, data, version
Update the item if the version matches
We assume that the storage reside on multiple machines, which means that we have several interesting problems to deal with when we deal with the storage .
- We are, by definition, working in a threaded environment.
- The cache offer no way to lock items
- The set of items we need to perform a single operation may reside on different machines.
- As a result, Each operation is distinct from any other operation, there is no way to batch several operations into a distinct operation. In other words, we don't have transactions.
Think about the implications of this for a moment, will you?
Let us say that I have the following piece of code:
employee, employee_version = GET "employee#"+id salary, salary_version = GET "salaray#"+id # business logic UPDATE "employee#"+id, employee, employee_version UPDATE "salary#"+id, salary, salary_version
At any point, we might get interleaving from another client, which will lead to data corruption. Even the safe update version that we have is no good, assume that we fail to update the salary. At that point, we have already updated the employee, now we need to roll it back, but another client might have added their changes in the mean time...
To make things worse, the employee and salary might very well reside on different machines. Trying to build transactions around is possible, but extremely expensive in terms of performance.
We need a better solution, one that build on top of the existing API we have and give us a way to handle this properly.
I will touch on that in a future post.