That No SQL ThingModeling Documents in a Document Database

time to read 11 min | 2200 words

So, after telling you the wrong way to go about it, I intend to show the right way to design the data model using a document database. So let’s be about it.

As a reminder, those are the scenarios that I want to deal with:

  • Main page: show list of blogs
  • Main page: show list of recent posts
  • Main page: show list of recent comments
  • Main page: show tag cloud for posts
  • Main page: show categories
  • Post page: show post and all comments
  • Post page: add comment to post
  • Tag page: show all posts for tag
  • Categories page: show all posts for category

And here is a sample data that we are going to work with (again, using C# because any other representation pretty much dictate the storage format).

var user = new User("ayende");
var blog = new Blog("Ayende @ Rahien", user) { Tags = {".NET", "Architecture", "Databases" } };
var categoryRaven = new Category("Raven");
var categoryNoSQL = new Category("NoSQL");
var post = new Post(blog, "RavenDB", "... content ...")  
{
    Categories  = { categoryRaven, categoryNoSQL },
    Tags = {"RavenDB", "Announcements" }
};
var comment = new Comment(post, "Great news");

PersistAll(user, blog, categoryRaven, categoryNoSQL, post, comment);

When approaching a document database model design, I like to think in aggregates. What entities in the model above are aggregates? User and Blog, obviously, and Post as well. Each of those has a right to exist in an independent manner. But nothing else has a right to exist on its own. Tags are obviously Value Objects, and the only usage I can see for categories is to use them for indexing the data. Comments, as well, aren’t really meaningful outside their post. Given that decision, I decided on the following format:

// users/ayende
{
   "type": "user",
   "name": "ayende"
}

// blogs/1
{
   "type": "blog",
    "users": ["users/ayende"],
    "name": "Ayende @ Rahien",
    "tags": [".NET", "Architecture", "Databases"]
}
// posts/1
{
    "blog": "blogs/1",
    "title": "RavenDB",
    "content": "... content ...",
    "categories": ["Raven", "NoSQL"]
    "tags" : ["RavenDB", "Announcements"],
    "comments": [
        { "content": "Great News" }
    ]
}

That gives me a much smaller model, and it make things like pulling a post and its comments (a very common operation) extremely cheap. Let us see how we can handle each scenario using this model…

Main page: show list of blogs

This is simple, using a commonly used index:

var blogs = docDb.Query<Blog>("DocumentsByType", "type:blog");

Main page: show list of recent posts

var posts = docDb.Query<Post>("PostsByTime", orderBy:"-posted_at");

This is using an index for posts by time:

from doc in docs
where doc.type == "post"
select new {doc.posted_at}

Main page: show list of recent comments

This is more interesting, because we don’t have the concept of comments as a separate thing. We handle this using an index that extract the values from the post document:

from doc in docs
where doc.type == "post"
from comment in doc.comments
select new {comment.posted_at, comment.content }

And the query itself:

var recentComments = docDb.Query<Comment>("CommentsByTime", orderBy:"-posted_at");

Main page: show tag cloud for posts

And now, we have another interesting challenge, we need to do aggregation on top of the tags. This is done using Map/Reduce operation. It sounds frightening, but it is pretty easy, just two linq queries:

from doc in docs
where doc.type == "post"
select tag in doc.tags
select new { tag, count = 1}

from result in results
group result by result.tag into g
select new {tag = g.Key, count = g.Sum(x=>x.count) }

And given that, we can now execute a simple query on top of that:

var tagCloud = docDb.Query<TagAndCount>("TagCloud");

Main page: show categories

This is pretty much the same thing as tag cloud, because we want to extract the categories from the posts, and get the number of posts in each category.

from doc in docs
where doc.type == "post"
select category in doc.categories
select new { category, count = 1}

from result in results
group result by result.category into g
select new {category= g.Key, count = g.Sum(x=>x.count) }

And given that, we can now execute a simple query on top of that:

var categories = docDb.Query<CategoryAndCount>("CategoriesCount");

Post page: show post and all comments

Here we see how easy it is to manage object graphs, all we need to do is pull the post document, and we get everything we need to render the post page:

var post = docDb.Get<Post>("posts/1");

Post page: add comment to post

Remember, a comment cannot live outside a post, so adding a comment involves getting the post, adding the comment and saving:

var post = docDb.Get<Post>("posts/1");
post.Comments.Add(new Comment(...));
docDb.SaveChanges();

Note that this uses the Unit of Work pattern, the Doc DB keep track of the post entity and will persist it to the database when SaveChanges() is called.

Tag page: show all posts for tag

Again, we have a two step approach, define the index, and then query on it. It is pretty simple:

from doc in docs
where doc.type == "post"
from tag in doc.tags
select new { tag }

And the query itself is:

var posts = docDb.Query<Post>("PostsByTag", "tag:Raven");

Categories page: show all posts for category

This is the same as all posts for tag, so I’ll skip it.

There are a few things to notice about this approach.

  • A good modeling technique is to think about Aggregates in the DDD sense, an Aggregate and all its associations are stored as a single document.
  • If we need to query on an entity encapsulated in an Aggregate, we can extract that information using an index. That extraction, however, doesn’t change the way that way that we access the data.
  • We can create “virtual entities”, such as Tags or Categories, which doesn’t really exists, but are generated by the indexes we define. Those are usually things that are only useful for users to search / organize the aggregates on.
  • Schema doesn’t really matter, because we either search by key, or on well known indexes, which massage the data to the format that we expect it to be.

Thoughts?

More posts in "That No SQL Thing" series:

  1. (03 Jun 2010) Video
  2. (14 May 2010) Column (Family) Databases
  3. (09 May 2010) Why do I need that again?
  4. (07 May 2010) Scaling Graph Databases
  5. (06 May 2010) Graph databases
  6. (22 Apr 2010) Document Database Migrations
  7. (21 Apr 2010) Modeling Documents in a Document Database
  8. (20 Apr 2010) The relational modeling anti pattern in document databases
  9. (19 Apr 2010) Document Databases – usages