Ayende @ Rahien

Refunds available at head office

That No SQL Thing: The relational modeling anti pattern in document databases

I am going to demonstrate the design of the data model in a document database for a typical blog application.

The following is my default sample data model, showing a very simple blog:

image

The absolutely wrong approach with a document database is to try to take the relational model and apply it on a document level. This is especially wrong because for a while, it might actually work. Let us say that we want to store the following:

var user = new User("ayende");
var blog = new Blog("Ayende @ Rahien", user) { Tags = {".NET", "Architecture", "Databases" } };
var categoryRaven = new Category("Raven");
var categoryNoSQL = new Category("NoSQL");
var post = new Post(blog, "RavenDB", "... content ...")  
{
    Categories  = { categoryRaven, categoryNoSQL },
    Tags = {"RavenDB", "Announcements" }
};
var comment = new Comment(post, "Great news");

PersistAll(user, blog, categoryRaven, categoryNoSQL, post, comment);

Interestingly enough, I need to use code to represent the data without tying it to a particular storage format.

The wrong approach to store the data would be to store each object as its own document, similar to the way we would store each object as its own row in a relational database. That wrong approach would look like this:

// users/ayende
{
   "type": "user",
   "name": "ayende"
}

// tags/1
{
   "name": ".NET"
}

// tags/2
{
   "name": "Architecture"
}

// tags/3
{
   "name": "Databases"
}
// tags/4
{
   "name": "RavenDB"
}
// tags/5
{
   "name": "Announcements"
}
// categories/1
{
    "name": "Raven"
}
// categories/2
{
    "name" : "NoSQL"
}
// blogs/1
{
   "type": "blog",
    "users": ["users/ayende"],
    "name": "Ayende @ Rahien",
    "tags": ["tags/1", "tags/2", "tags/3"]
}

// posts/1
{
    "blog": "blogs/1",
    "title": "RavenDB",
    "content": "... content ...",
    "categories": ["categories/1", "categories/2"]
    "tags" : ["tags/4", "tags/5"]
}

// comments/1
{
    "post": "posts/1",
    "content": "Great News"
}

I know that I am repeating myself here, but I have seen people miss the point before. Do NOT try to model a document database in this way.

See full size imageThe main reason that this is wrong is that a document database has no real support for doing joins, unions or any of the things that make such a model work effectively in a relational model.

Let us try to analyze the scenarios where we need this data, okay?

  • Main page: show list of blogs
  • Main page: show list of recent posts
  • Main page: show list of recent comments
  • Main page: show tag cloud for posts
  • Main page: show categories
  • Post page: show post and all comments
  • Post page: add comment to post
  • Tag page: show all posts for tag
  • Categories page: show all posts for category

I am going to analyze each of those scenarios using SQL (and the above model) and the current (and bad, smelly, nasty) document database model. I’ll have another post showing how to correctly model this in a document database, this post is about how not to do it.

Main page: show list of blogs

Using SQL, this is pretty easy:

select * from blogs

Using DocDB, this is easy, we are using a built-in index to query for documents by their type:

docDb.Query<Blog>("DocumentsByType", query:"type:blog");

Main page: show list of recent posts

Using SQL, this is pretty easy:

select * from posts order by PostedAt desc

Using DocDB, we need to define our own index function, to allow use to sort on it. That is painless, even if I say so myself:

from doc in docs
where doc.type == "post"
select new {doc.posted_at}

And now we can query this using:

docDb.Query<Post>("Posts", orderBy:"-posted_at");

Main page: show list of recent comments

This is exactly the same as recent posts, so I’ll skip it.

Main page: show tag cloud for posts

Here the SQL grows interesting:

select Name, COUNT(*) as TagCount from tags
where ItemType = 'Posts'
group by Name

And with the document database we need to write a map/reduce index (“Did you just told me to go @%$# myself?”)

from doc in docs
where doc.type == "posts"
from tag in doc.tags
select new { tag, count = 1 }

from result in results
group result by result.tag into g
select new { tag = g.Key, count = g.Sum(x=>x.count) }

And now that we have the index, we can get the values from it using:

var tagCloud = new TagCloud();
var tagIds = docDb.Query<TagAndCount>("TagsCloud", orderBy:"+count");
foreach(var tagId in tagIds)
{
    var tag = docDb.Get<Tag>(tagId.Tag);
    tagCloud.Add(tag.Name, tagId.Count);
}

See full size imageNow this is ugly on many levels. First, we have the fairly complex index. Second, we have to merge the data ourselves at the client side. Third, we have to perform a SELECT N+1.

Yuck doesn’t being to cover it. There are actually ways to handle this more nicely, by making a multi get request, but I’ll not bother.

Main page: show categories

Exactly the same as show blogs, so I’ll skip it.

Post page: show post and all comments

Using stupid SQL:

select * from Posts where Id = 1

select * from Comments where PostId = 1

A more efficient method would be to use a join:

select * from Posts 
  join Comments 
    on Posts.Id = Comments.Id
where Posts.Id = 1

With the doc db, we can do:

var post = docDb.Get<Post>(1);
var comments = docDb.Query<Comment>("CommentsByPost", query:"post_id:1", orderBy:"+posted_at");

Which, of course, require us to define the comments by post index:

from doc in docs
where doc.type == "comment"
select new{doc.post_id, doc.posted_at}

Note that we have to make two calls here, because a document database has no notion of joins.

Post page: add comment to post

In SQL, it is a straightforward insert:

insert into comments (PostId, ... )
values(1, ...)

And with a document database, you can use:

docDb.Store(new Comment{ PostId = 1, ... });
docDb.SaveChanges();

Nothing much to look at here, using this flawed model.

See full size imageTag page: show all posts for tag

Using sql, that is slightly complex, because tags may be associated with blogs or with posts, so we need to do:

select * from Posts 
where Id in (
    select ItemId from tags
    where ItemType = 'Posts' and TagId = 1
)

Using a document database:

var posts = docDb.Query<Post>("PostsByTag", query:"tag:tags/1");

With the following index:

from doc in docs
where doc.type == "post"
from tag in doc.tags
select new {tag}

Categories page: show all posts for category

This is exactly like tags, so I’ll skip it.

As you’ve seen, by copying the relational model, we have created just the same sort of environment that we already had with RDBMS, but now we are faced with the problem that a document database can’t do things that a relational database can. In my eyes, what we have done is a net lose. Oh, we may gain some small benefit by being schemaless, but that isn’t really that beneficial in compared to the amount of effort that we have to go to by trying to be relational on a non relational database.

Comments

Demis Bellot
04/20/2010 11:41 AM by
Demis Bellot

I think your solution is very much tied to RavenDB because your trying to use querying abilities built into the api.

In other NoSql db's that support rich data constructs (e.g. Redis, MongoDb etc), the way I'd achieve the same result is by storing posts in a sorted set ordered by date.

For anywhere I require a reference, I would create a 2-way link (i.e. bi-directional reference) storing the tag with the post and at the same time a list of posts with each tag the post has. That way you can query both indexes for the tags a particular post has as well as all posts matching a particular tag.

This just goes back to the general NoSQL mindset you should have where you need to know what querying functionality you require up front so at the time you're storing data you can update the 'query indexes' accordingly.

Other benefits you've missed is the speed increase which depending on your situation allows you to dispense of any caching logic or real-time/accuracy of the data your displaying.

Antirez (the author of Redis) has an interesting post on his VM solution in which he talks about the Redis Virtual Machine implementation where Redis is like a 'Reversed Mysql+Memcached solution'. Where the data you frequently access is kept in memory and the Least Recently Used items are swapped out to disk:

antirez.com/post/redis-virtual-memory-story.html

Edin
04/20/2010 12:44 PM by
Edin

I guess, we can expect another post explaining how to actually do correct design. If a DocDb is used the way you described above, I really don't see any benefit of it. OK, maybe speed.

I think that Demis kind of showed some directions by saying that posts could be stored in sorted set and ordered by date. Also, I like the idea of 2-way link between tags and posts.

Ayende Rahien
04/20/2010 01:39 PM by
Ayende Rahien

Demis,

You did notice all the warning about this being the WRONG way about it, right?

Yes, I am using Raven's querying ability to sort of make it work, but I don't really like the solution.

There is no need to do a bidirectional association here with tag/posts, by the way, I outline how to do this in tomorrow's post.

I talk a lot about setting up the indexes based on your needs then.

And I don't want to talk about advantages based on this model, it is like saying that getting punched in the gut it better than being shot, I would rather avoid both

Ayende Rahien
04/20/2010 01:39 PM by
Ayende Rahien

Edin,

Of course you can, in fact, look at the future posts queue

Demis Bellot
04/20/2010 03:09 PM by
Demis Bellot

@Ayende I see you've articulated it being the 'wrong' way to try to replace a RDBMS solution but my conclusion I've drawn from the wording of your original post is that blog posts do not belong in a NoSQL database but are rather best left in an RDBMS, not that a more ideal solution was already written and in your future post queue (which I still don't understand how people can blog so much that they have a future post-queue, as once a month takes enough time out of me :).

it is like saying getting punched in the gut it better than being shot, I would rather avoid both

I assume by this that you have a more appropriate solution forthcoming, I would still be doing bi-directional referencing in NoSQL databases but am looking forward to your proposed solution as well.

Tyler Burd
04/20/2010 08:22 PM by
Tyler Burd

Ha. I mistook one sentence for "Oh, we may gain some small benefit by being shemales".

Can't wait to see the "right" way post.

Ayende Rahien
04/20/2010 08:37 PM by
Ayende Rahien

Tyler,

The right way certainly include no shemales.

Frank Quednau
04/20/2010 09:46 PM by
Frank Quednau

I prefer fembots to shemales, tbh.

Yes, that looked pretty ugle at times. Let's see what comes next.

Marco
04/29/2010 08:35 AM by
Marco

I know that thinking "relational" is not the best approach here, but I'm wondering how can you handle an N to N relation between 2 documents type which are indipendent (they are indipendent entity) but they can be linked, and when they are linked then someone can put one or more comments explaining why they are linked together...is it something that RavenDB can solve or I should stick with the rational DB?

Maybe a mix of the two could be the best in my opinion, model the documents with RavenDB but handle the relations using a "relation document" used to express the relation and which contain the comments...

It sounds like an interesting idea ;)

Ayende Rahien
04/29/2010 11:17 AM by
Ayende Rahien

Marco,

You are thinking about graph databases.

I will probably add some support for Raven to do so, but that is in the future.

Marco
04/29/2010 01:11 PM by
Marco

Opss, I didn't know about this kind of DB, shame on me.

They are exactly what I was looking for.

If you add this functionality to RavenDB then it will be just brilliant, looking forward to it.

Thanks a lot

Comments have been closed on this topic.