That No SQL Thing – Document Databases

time to read 6 min | 1151 words

A document database is, at its core, a key/value store with one major exception. Instead of just storing any blob in it, a document db requires that the data will be store in a format that the database can understand. The format can be XML, JSON, Binary JSON (MongoDB), or just about anything, as long as the database can understand it.

Why is this such a big thing? Because when the database can understand the format of the data that you send it, it can now do server side operations on that data. In most doc dbs, that means that we can now allow queries on the document data.The known format also means that it is much easier to write tooling for the database, since it is possible to show, display and edit the data.

I am going to use Raven as the example for this post. Raven is a JSON based, REST enabled document database. Documents in Raven use the JSON format, and each document contains both the actual data and an additional metadata information that you can use. Here is an example of a document:

    "name": "ayende",
    "email": "",
    "projects": [
        "rhino mocks",
        "rhino service bus",
        "raven db",
        "rhino persistent hash table",
        "rhino distributed hash table",
        "rhino etl",
        "rhino security",
        "rampaging rhinos"

We can PUT this document in the database, under the key ‘ayende’. We can also GET the document back by using the key ‘ayende’.

A document database is schema free, that is, you don’t have to define your schema ahead of time and adhere to that. It also allow us to store arbitrarily complex data. If I want to store trees, or collections, or dictionaries, that is quite easy. In fact, it is so natural that you don’t really think about it.

It does not, however, support relations. Each document is standalone. It can refer to other documents by store their key, but there is nothing to enforce relational integrity.

The major benefit of using a document database comes from the fact that while it has all the benefits of a key/value store, you aren’t limited to just querying by key. By storing information in a form that the database can understand, we can ask the server to do things for us. Such as defining the following index for us:

from doc in docs
from prj in doc.projects
select new { Project = prj, Name = doc.Name }

With that in place, we can now make queries on the stored documents:

GET /indexes/ProjectAndName?query=Project:raven

This will give us all the documents that include a project with the name raven.

We can take it a bit further, though, and ask the database to store aggregations for us:

from doc in docs
from prj in doc.projects
select new { Count = 1, Project = prj}

from result in results
group result by result.Project into g
select new { Project = g.Key, Count = g.Sum(x=>x.Count) };

And now we can query for the number of users involved in the raven project:

GET /indexes/UsersCountByProject?query=Project:raven

In general, querying a document database falls into the prepared ahead of time (Raven, CouchDB) or at query time (MongoDB).

In the first case, you define an indexing function (in Raven’s case, a Linq query, in CouchDB case, a JavaScript function) and the server will run it to prepare the results, once the results are prepared, they can be served to the client with minimal computation. CouchDB and Raven differs in the method they use to update those indexes, Raven will update the index immediately on document change, and queries to indexes will never wait. The query may return a stale result (and is explicitly marked as such), but it will return immediately. With CouchDB, a view is updated at view query time, which may lead to a long wait time on the first time a view is accessed if there were a lot of changes in the meanwhile. The CouchDB approach avoids doing extra works that may not be needed, but Raven’s approach means that you don’t have to handle potentially large delays at queyring times. CouchDB allow to pass a parameter at the view querying time to allow it to return stale results, which seems to be the best of all worlds.

Note that in both CouchDB and Raven’s cases, indexes do not affect write speed, since in both cases this is done at a background task.

MongoDB, on the other hand, allows ad-hoc querying, and relies on indexes defined on the document values to help it achieve reasonable performance when the data size grows large enough. MongoDB’s indexes behave in much the same way RDBMS indexes behave, that is, they are updated as part or the insert process, so large number of indexes is going to affect write performance.

Other document databases:

  • ThruDB is the closest one that I could find, but it is basically a key/value store with manual Lucene indexing.
  • Terrastore seems to be a full fledged document database, but I can’t find much information about it. Requires Terracotta to work.
  • Riak is another Erlang based Doc DB. It seems to be close to MongoDB in terms of operation, but that is about as much as I know about it. Querying Riak is done by writing a map/reduce query in javascript or erlang. I think it is the originator of this.