Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,969 | Comments: 44,490

filter by tags archive

That No SQL Thing - Key/Value stores


The simplest No SQL databases are the Key/Value stores. They are simplest only in terms of their API, because the actual implementation may be quite complex. But let us focus on the API that is exposed to us for a second. Most of the Key/Value stores expose some variation on the following API:

void Put(string key, byte[] data);
byte[] Get(string key);
void Remove(string key);

There are many variations, but that is the basis for everything else. A key value store allows you to store values by key, as simple as that. The value itself is just a blob, as far as the data store is concerned, it just stores it, it doesn’t actually care about the content. In other words, we don’t have a data stored defined schema, but a client defined semantics for understanding what the values are. The benefits of using this approach is that it is very simple to build a key value store, and that it is very easy to scale it. It also tend to have great performance, because the access pattern in key value store can be heavily optimized.

Concurrency – In Key/Value Store, concurrency is only applicable on a single key, and it is usually offered as either optimistic writes or as eventually consistent. In highly scalable systems, optimistic writes are often not possible, because of the cost of verifying that the value haven’t changed (assuming the value may have replicated to other machines), there for, we usually see either a key master (one machine own a key) or the eventual consistency model.

Queries – there really isn’t any way to perform a query in a key value store, except by the key. Even range queries on the key are usually not possible.

Transactions – while it is possible to offer transaction guarantees in a key value store, those are usually only offer in the context of a single key put. It is possible to offer those on multiple keys, but that really doesn’t work when you start thinking about a distributed key value store, where different keys may reside on different machines. Some data stores offer no transaction guarantees.

Schema – key value stores have the following schema Key is a string, Value is a blob :-) beyond that, the client is the one that determines how to parse the data.

Scaling Up – In Key Value stores, there are two major options for scaling, the simplest one would be to shard the entire key space. That means that keys starting in A go to one server, while keys starting with B go to another server. In this system, a key is only stored on a single server. That drastically simplify things like transactions guarantees, but it expose the system for data loss if a single server goes down. At this point, we introduce replication.

Replication – In key value stores, the replication can be done by the store itself or by the client (writing to multiple servers). Replication also introduce the problem of divergent versions. In other words, two servers in the same cluster think that the value of key ‘ABC’ are two different things. Resolving that is a complex issue, the common approaches are to decide that it can’t happen (Scalaris) and reject updates where we can’t ensure non conflict or to accept all updates and ask the client to resolve them for us at a later date (Amazon Dynamo, Rhino DHT).

Usages – Key Value stores shine when you need to access the data by key :-)

More seriously, key based access is actually quite common. Things like user profiles, user sessions, shopping carts, etc. In all those cases, note, we are storing the entire thing as a single value in the data store, that makes it cheap to handle (one request to read, one request to write) easy to handle when you run into concurrency conflict (you only need to resolve a single key).

Because key based queries are practically free, by structuring our data access along keys, we can get significant performance benefit by structuring our applications to fit that need. It turns out that there is quite a lot that you can do with just key/value store. Amazon’s shopping cart runs on a key value store (Amazon Dynamo), so I think you can surmise that this is a highly scalable technique.

Data stores to look at:

  • The Amazon Dynamo paper is one of the best resources on the topic that one can ask.
  • Rhino DHT is a scalable, redundant, zero config, key value store on the .NET platform.

That No SQL Things – How to evaluate?


In my posts about the No SQL options, I am going to talk about their usage in two major scenarios, first, at at a logical perspective, what kind of API and interface do they us, and second, what kind of scaling capabilities they have.

Almost all data stores need to handle things like:

  • Concurrency
  • Queries
  • Transactions
  • Schema
  • Replication
  • Scaling Up

And I am going to touch on each of those for each option.

One thing that should be made clear upfront is the major difference between performance and scalability, the two are often at odds and usually increasing one would decrease the other.

For performance, we ask: How can we execute the same set of requests, over the same set of data with:

  • less time?
  • less memory?

For scaling, we ask: How can we meet our SLA when:

  • we get a lot more data?
  • we get a lot more requests?

With relational databases, the answer is usually, you don’t scale. The No SQL alternatives are generally quite simple to scale, however.

That No SQL Thing


Probably the worst thing about relational databases is that they are so good in what they are doing. Good enough to conquer the entire market on data storage and hold it for decades.

Wait! That is a bad thing? How?

It is a bad thing because relational databases are appropriate for a wide range of tasks, but not for every task. Yet it is exactly that that caused them to be used in contexts where they are not appropriate. In the last month alone, my strong recommendation for two different client was that they need to switch to a non relational data store because it would greatly simplify the work that they need to do.

That met with some (justified) resistance, predictably. People think that RDBMS are the way to store data. I decided to write a series of blog posts about the topic, trying to explain why you might want to work with a No SQL database.

Relational Databases have the following properties:

  • ACID
  • Relational
  • Table / Row based
  • Rich querying capabilities
  • Foreign keys
  • Schema

Just about any of the No SQL approaches give up on some of those properties., usually, it gives up on all of those properties. But think about how useful an RDBMS is, how flexible it can be. Why give it up?

Indeed, the most common reason to want to move from a RDBMS is running into the RDBMS limitations. In short, RDBMS doesn’t scale. Actually, let me phrase that a little more strongly, RDBMS systems cannot be made to scale.

The problem is inherit into the basic requirements of the relational database system, it must be consistent, to handle things like foreign keys, maintain relations over the entire dataset, etc. The problem, however, is that trying to scale a relational database over a set of machine. At that point, you run head on into the CAP theorem, which state that if consistency is your absolute requirement, you need to give up on either availability or partition tolerance.

In most high scaling environments, it is not feasible to give up on either option, so relational databases are out. That leaves you with the No SQL options, I am going to dedicate a post for each of the following, let me know if I missed something:

  • Key/Value store
    • Key/Value store – sharding
    • Key/Value store - replication
    • Key/Value store – eventually consistent
  • Document Databases
  • Graph Databases
  • Column (Family) Databases

Other databases, namely XML databases and Object databases, exists. Object databases suffer from the same problem regarding CAP as relational databases, and XML databases are basically a special case of a document database.

How to setup dynamic groups in MSBuild without Visual Studio ruining them


One of the really annoying things about VS & MSBuild is that while MSBuild is perfectly capable of doing things like this:

<EmbeddedResource Include="WebUI\**\*.*"/>

Visual Studio would go ahead, resolve the expression and then hard code the current values.

That sort of defeat the way I want to make use of it, which would make it frictionless to add more elements. If you do it in a slightly more complex way, VS can’t resolve this easily, so it does the right thing and compute this at build time, rather than replacing with the hard coded values.

Here is how you do it:

    <Target Name="BeforeBuild">
        <CreateItem Include="WebUI\**\*.*">
            <Output ItemName="EmbeddedResource" TaskParameter="Include" />
        </CreateItem>
    </Target>

I love ConcurrentDictionary!


Not just because it is concurrent, because of this wonderful method:

public class Storage : IStorage
{
    private readonly ConcurrentDictionary<Guid, ConcurrentDictionary<int, List<object>>> values =
        new ConcurrentDictionary<Guid, ConcurrentDictionary<int, List<object>>>();

    public void Store(Guid taskId, int level, object value)
    {
        values.GetOrAdd(taskId, guid => new ConcurrentDictionary<int, List<object>>())
            .GetOrAdd(level, i => new List<object>())
            .Add(value);
    }
}

Doing this with Dictionary is always a pain, but this is an extremely natural way of doing things.

On languages and platforms


I got the following question in email:

Can you blog about why you chose c# and .net over other languages for all your projects. What was it that made you stick around the windows platform? Was it coincidence or a firm decision based on something? Was it because c# was the first language that you learned in your profession and then decided to become a pro in that?

I would expect to see you working on OSS in the lamp stack. You could have displayed your capabilities very well in Java. Just a bit curious…

It started as a coincidence, truth to be told. I was working on what I had available at the time, and that was Windows. I actually made the jump from hobbyist to professional in a C++ course. I wanted to learn C++ because that was where most of the action was at the time. It was a Visual C++ 6.0 course, not something that I was aware of at the time. If it was a GNU/C++ course, I would probably be in a very different position today.

That said, it stopped being a coincidence several years ago, when I made a conscious decision to stick to the .NET platform. That decision was based on several factors. Size of the market, acceptance for deployment, familiarity, future prospects, etc.

In essence, it boiled down to the fact that I really didn’t have any burning issues with the .Net platform, I am familiar with it, what you can do with it and what you can’t. Moving to another platform would introduce a very high cost during the switch, but so far there hadn’t been anything that was enough of a killer feature to convince me to move there.

Note, this post was written while taking an Erlang course.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Production postmortem (5):
    29 Jul 2015 - The evil licensing code
  2. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
  3. API Design (7):
    20 Jul 2015 - We’ll let the users sort it out
  4. What is new in RavenDB 3.5 (3):
    15 Jul 2015 - Exploring data in the dark
  5. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats