That No SQL ThingColumn (Family) Databases

time to read 12 min | 2298 words

Column family databases are probably most known because of Google’s BigTable implementation. The are very similar on the surface to relational database, but they are actually quite different beast. Some of the difference is storing data by rows (relational) vs. storing data by columns (column family databases). But a lot of the difference is conceptual in nature. You can’t apply the same sort of solutions that you used in a relational form to a column database.

That is because column databases are not relational, for that matter, they don’t even have what a RDBMS advocate would recognize as tables.

Nitpicker corner: this post is about the concept, I am going to ignore actual implementation details where they don’t illustrate the actual concepts.

Note: If you want more information, I highly recommend this post, explaining about data modeling in a column database.

The following concepts are critical to understand how column databases work:

  • Column family
  • Super columns
  • Column

Columns and super columns in a column database are spare, meaning that they take exactly 0 bytes if they don’t have a value in them. Column families are the nearest thing that we have for a table, since they are about the only thing that you need to define upfront. Unlike a table, however, the only thing that you define in a column family is the name and the key sort options (there is no schema).

Personally, I think that column family databases are probably the best proof of leaky abstractions. Just about everything in CFDB (as I’ll call them from now on) is based around the idea of exposing the actual physical model to the users so they can make efficient use of that.

  • Column families – A column family is how the data is stored on the disk. All the data in a single column family will sit in the same file (actually, set of files, but that is close enough). A column family can contain super columns or columns.
  • A super column is a dictionary, it is a column that contains other columns (but not other super columns).
  • A column is a tuple of name, value and timestamp (I’ll ignore the timestamp and treat it as a key/value pair from now on).

It is important to understand that when schema design in a CFDB is of outmost importance, if you don’t build your schema right, you literally can’t get the data out. CFDB usually offer one of two forms of queries, by key or by key range. This make sense, since a CFDB is meant to be distributed, and the key determine where the actual physical data would be located. This is because the data is stored based on the sort order of the column family, and you have no real way of changing the sorting (except choosing between ascending or descending).

The sort order, unlike in a relational database, isn’t affected by the columns values, but by the column names.

Let assume that in the Users column family, in the row “@ayende”, we have the column “name” set to “Ayende Rahine” and the column “location” set to “Israel”. The CFDB will physically sort them like this in the Users column family file:

@ayende/location = “Israel”
@ayende/name = “Ayende Rahien”

This is because the sort “location” is lower than “name”. If we had a super column involved, for example, in the Friends column family, and the user “@ayende” had two friends, they would be physically stored like this in the Friends column family file:

@ayende/friends/arava= 945
@ayende/friends/rose = 14

Remember that, this property is quite important to understanding how things work in a CFDB. Let us imagine the twitter model, as our example. We need to store: users and tweets. We define three column families:

  • Users – sorted by UTF8
  • Tweets – sorted by Sequential Guid
  • UsersTweets – super column family, sorted by Sequential Guid

  Let us create the user (a note about the notation: I am using named parameters to denote column’s name & value here. The key parameter is the row key, and the column family is Users):

cfdb.Users.Insert(key: “@ayende”, name: “Ayende Rahine”, location: “Israel”, profession: “Wizard”);

You can see a visualization of how below. Note that this doesn’t look at all like how we would typically visualize a row in a relational database.

image

Now let us create a tweet:

var firstTweetKey = “Tweets/” + SequentialGuid.Create();
cfdb.Tweets.Insert(key: firstTweetKey, application: “TweekDeck”, text: “Err, is this on?”, private: true);

var secondTweetKey = “Tweets/” + SequentialGuid.Create();
cfdb.Tweets.Insert(key: secondTweetKey, app: “Twhirl”, version: “1.2”, text: “Well, I guess this is my mandatory hello world”, public: true);

And here is how it actually looks:

image

There are several things to notice here:

  • In this case, the key doesn’t matter, but it does matter that it is sequential, because that will allow us to sort of it later.
  • Both rows have different data columns on them.
  • We don’t actually have any way to associate a user to a tweet.

That last bears some talking about. In a relational database, we would define a column called UserId, and that would give us the ability to link back to the user. Moreover, a relational will allow us to query the tweets by the user id, letting us get the user’s tweets. A CFDB doesn’t give us this option, there is no way to query by column value. For that matter, there is no way to query by column (which is a familiar trick if you are using something like Lucene).

Instead, the only thing that a CFDB gives us is a query by key. In order to answer that question, we need the UsersTweets column family:

cfdb.UsersTweets.Insert(key: “@ayende”,  timeline: { SequentialGuid.Create(): firstTweetKey } );
cfdb.UsersTweets.Insert(key: “@ayende”,  timeline: { SequentialGuid.Create(): secondTweetKey } ); 

On the CFDB, it looks like this:

image

And now we need more explanation about the notation. Here we insert into the UsersTweets column family, to the row with the key: “@ayende”, to the super column timeline two columns, the name of each column is a sequential guid, which means that we can sort by it. What this actually does is create a single row with a single super column, holding two columns, where each column name is a guid, and the value of each column is the key of a row in the Tweets table.

Question: Couldn’t we create a super column in the Users’ column family to store the relationship? Well, yes, we could, but a column family can contain either columns or super columns, it cannot contain both.

Now, in order to get tweets for a user, we need to execute:

var tweetIds = 
      cfdb.UsersTweets.Get(“@ayende”)
.Fetch(“timeline”) .Take(25)
.OrderByDescending() .Select(x=>x.Value); var tweets = cfdb.Tweets.Get(tweetIds);

In essence, we execute two queries, one on the UsersTweets column family, requesting the columns & values in the “timeline” super column in the row keyed “@ayende”, then execute another query against the Tweets column family to get the actual tweets.

Because the data is sorted by the column name, and because we choose to sort in descending order, we get the last 25 tweets for this user.

What would happen if I wanted to show the last 25 tweets overall (for the public timeline)? Well, that is actually very easy, all I need to do is to query the Tweets column family for tweets, ordering them by descending key order.

Nitpicker corner: No, there is not such API for a CFDB for .NET that I know of, I made it up so it would be easier to discuss the topic.

Why is a CFDB so limiting?

You might have noticed how many times I noted differences between RDBMS and a CFDB. I think that it is the CFDB that is the hardest to understand, since it is so close, on the surface to the relational model. But it seems to suffer from so many limitations. No joins, no real querying capability (except by primary key), nothing like the richness that we get from a relational database. Hell, Sqlite or Access gives me more than that. Why is it so limited?

The answer is quite simple. A CFDB is designed to run on a large number of machines, and store huge amount of information. You literally cannot store that amount of data in a relational database, and even multi-machine relational databases, such as Oracle RAC will fall over and die very rapidly on the size of data and queries that a typical CFDB is handling easily.

Do you remember that I noted that CFDB is really all about removing abstractions? CFDB is what happens when you take a database, strip everything away that make it hard to run in on a cluster and see what happens.

The reason that CFDB don’t provide joins is that joins require you to be able to scan the entire data set. That requires either someplace that has a view of the whole database (resulting in a bottleneck and a single point of failure) or actually executing a query over all machines in the cluster. Since that number can be pretty high, we want to avoid that.

CFDB don’t provide a way to query by column or value because that would necessitate either an index of the entire data set (or just in a single column family) which in again, not practical, or running the query on all machines, which is not possible. By limiting queries to just by key, CFDB ensure that they know exactly what node a query can run on. It means that each query is running on a small set of data, making them much cheaper.

It requires a drastically different mode of thinking, and while I don’t have practical experience with CFDB, I would imagine that migrations using them are… unpleasant affairs, but they are one of the ways to get really high scalability out of your data storage.

Waiting expectantly to the commenters who would say that relational databases are the BOMB and that I have no idea what I am talking about and that I should read Codd and that no one really need to use this sort of stuff except maybe Google and even then only because Google has no idea how RDBMS work (except maybe the team that worked on AdWords).

More posts in "That No SQL Thing" series:

  1. (03 Jun 2010) Video
  2. (14 May 2010) Column (Family) Databases
  3. (09 May 2010) Why do I need that again?
  4. (07 May 2010) Scaling Graph Databases
  5. (06 May 2010) Graph databases
  6. (22 Apr 2010) Document Database Migrations
  7. (21 Apr 2010) Modeling Documents in a Document Database
  8. (20 Apr 2010) The relational modeling anti pattern in document databases
  9. (19 Apr 2010) Document Databases – usages