That No SQL Thing - Key/Value stores

time to read 5 min | 834 words

The simplest No SQL databases are the Key/Value stores. They are simplest only in terms of their API, because the actual implementation may be quite complex. But let us focus on the API that is exposed to us for a second. Most of the Key/Value stores expose some variation on the following API:

void Put(string key, byte[] data);
byte[] Get(string key);
void Remove(string key);

There are many variations, but that is the basis for everything else. A key value store allows you to store values by key, as simple as that. The value itself is just a blob, as far as the data store is concerned, it just stores it, it doesn’t actually care about the content. In other words, we don’t have a data stored defined schema, but a client defined semantics for understanding what the values are. The benefits of using this approach is that it is very simple to build a key value store, and that it is very easy to scale it. It also tend to have great performance, because the access pattern in key value store can be heavily optimized.

Concurrency – In Key/Value Store, concurrency is only applicable on a single key, and it is usually offered as either optimistic writes or as eventually consistent. In highly scalable systems, optimistic writes are often not possible, because of the cost of verifying that the value haven’t changed (assuming the value may have replicated to other machines), there for, we usually see either a key master (one machine own a key) or the eventual consistency model.

Queries – there really isn’t any way to perform a query in a key value store, except by the key. Even range queries on the key are usually not possible.

Transactions – while it is possible to offer transaction guarantees in a key value store, those are usually only offer in the context of a single key put. It is possible to offer those on multiple keys, but that really doesn’t work when you start thinking about a distributed key value store, where different keys may reside on different machines. Some data stores offer no transaction guarantees.

Schema – key value stores have the following schema Key is a string, Value is a blob :-) beyond that, the client is the one that determines how to parse the data.

Scaling Up – In Key Value stores, there are two major options for scaling, the simplest one would be to shard the entire key space. That means that keys starting in A go to one server, while keys starting with B go to another server. In this system, a key is only stored on a single server. That drastically simplify things like transactions guarantees, but it expose the system for data loss if a single server goes down. At this point, we introduce replication.

Replication – In key value stores, the replication can be done by the store itself or by the client (writing to multiple servers). Replication also introduce the problem of divergent versions. In other words, two servers in the same cluster think that the value of key ‘ABC’ are two different things. Resolving that is a complex issue, the common approaches are to decide that it can’t happen (Scalaris) and reject updates where we can’t ensure non conflict or to accept all updates and ask the client to resolve them for us at a later date (Amazon Dynamo, Rhino DHT).

Usages – Key Value stores shine when you need to access the data by key :-)

More seriously, key based access is actually quite common. Things like user profiles, user sessions, shopping carts, etc. In all those cases, note, we are storing the entire thing as a single value in the data store, that makes it cheap to handle (one request to read, one request to write) easy to handle when you run into concurrency conflict (you only need to resolve a single key).

Because key based queries are practically free, by structuring our data access along keys, we can get significant performance benefit by structuring our applications to fit that need. It turns out that there is quite a lot that you can do with just key/value store. Amazon’s shopping cart runs on a key value store (Amazon Dynamo), so I think you can surmise that this is a highly scalable technique.

Data stores to look at:

  • The Amazon Dynamo paper is one of the best resources on the topic that one can ask.
  • Rhino DHT is a scalable, redundant, zero config, key value store on the .NET platform.