Ayende @ Rahien

Refunds available at head office

That No SQL Thing: Modeling Documents in a Document Database

So, after telling you the wrong way to go about it, I intend to show the right way to design the data model using a document database. So let’s be about it.

As a reminder, these are the scenarios that I want to deal with:

  • Main page: show list of blogs
  • Main page: show list of recent posts
  • Main page: show list of recent comments
  • Main page: show tag cloud for posts
  • Main page: show categories
  • Post page: show post and all comments
  • Post page: add comment to post
  • Tag page: show all posts for tag
  • Categories page: show all posts for category

And here is the sample data that we are going to work with (again, using C#, because any other representation pretty much dictates the storage format).

var user = new User("ayende");
var blog = new Blog("Ayende @ Rahien", user) { Tags = {".NET", "Architecture", "Databases" } };
var categoryRaven = new Category("Raven");
var categoryNoSQL = new Category("NoSQL");
var post = new Post(blog, "RavenDB", "... content ...")  
{
    Categories  = { categoryRaven, categoryNoSQL },
    Tags = {"RavenDB", "Announcements" }
};
var comment = new Comment(post, "Great news");

PersistAll(user, blog, categoryRaven, categoryNoSQL, post, comment);

When approaching a document database model design, I like to think in aggregates. What entities in the model above are aggregates? User and Blog, obviously, and Post as well. Each of those has a right to exist in an independent manner. But nothing else has a right to exist on its own. Tags are obviously Value Objects, and the only usage I can see for categories is to use them for indexing the data. Comments, as well, aren’t really meaningful outside their post. Given that decision, I decided on the following format:

// users/ayende
{
    "type": "user",
    "name": "ayende"
}

// blogs/1
{
    "type": "blog",
    "users": ["users/ayende"],
    "name": "Ayende @ Rahien",
    "tags": [".NET", "Architecture", "Databases"]
}

// posts/1
{
    "type": "post",
    "blog": "blogs/1",
    "title": "RavenDB",
    "content": "... content ...",
    "categories": ["Raven", "NoSQL"],
    "tags": ["RavenDB", "Announcements"],
    "comments": [
        { "content": "Great news" }
    ]
}

That gives me a much smaller model, and it makes things like pulling a post and its comments (a very common operation) extremely cheap. Let us see how we can handle each scenario using this model…
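To make the "one document per aggregate" idea concrete, here is a minimal sketch that builds the posts/1 document, stores it as a single JSON value, and reads the post and its comments back in one go. It assumes .NET 6 or later with System.Text.Json; the anonymous object and the variable names are illustrative only and are not part of any Raven API.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Anonymous object standing in for the posts/1 document.
var postDocument = new
{
    type = "post", // assumed: the indexes in this post filter on doc.type == "post"
    blog = "blogs/1", // reference to another aggregate, held as a key
    title = "RavenDB",
    content = "... content ...",
    categories = new[] { "Raven", "NoSQL" },
    tags = new[] { "RavenDB", "Announcements" },
    comments = new[] { new { content = "Great news" } } // embedded value objects
};

// The whole aggregate is stored as one JSON document.
var stored = JsonSerializer.Serialize(postDocument);

// A single read gives back the post together with its embedded comments.
using var loaded = JsonDocument.Parse(stored);
var root = loaded.RootElement;
var title = root.GetProperty("title").GetString();
var comments = root.GetProperty("comments")
    .EnumerateArray()
    .Select(c => c.GetProperty("content").GetString())
    .ToList();

Console.WriteLine($"{title}: {comments.Count} comment(s)");
```

Nothing here needs a join or a second lookup, which is the property the model is designed around.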

Main page: show list of blogs

This is simple, using a commonly used index:

var blogs = docDb.Query<Blog>("DocumentsByType", "type:blog");

Main page: show list of recent posts

var posts = docDb.Query<Post>("PostsByTime", orderBy:"-posted_at");

This is using an index for posts by time:

from doc in docs
where doc.type == "post"
select new {doc.posted_at}

Main page: show list of recent comments

This is more interesting, because we don’t have the concept of comments as a separate thing. We handle this using an index that extracts the values from the post document:

from doc in docs
where doc.type == "post"
from comment in doc.comments
select new {comment.posted_at, comment.content }

And the query itself:

var recentComments = docDb.Query<Comment>("CommentsByTime", orderBy:"-posted_at");
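The effect of that index can be tried out with plain LINQ to Objects. The sketch below is not Raven code: it runs the same query shape over an in-memory array, and the post titles, comment texts, and timestamps are made-up sample data.

```csharp
using System;
using System.Linq;

// Made-up sample documents; only the shape (posts with embedded comments) follows the post.
var docs = new[]
{
    new
    {
        type = "post",
        title = "RavenDB",
        comments = new[]
        {
            new { posted_at = new DateTime(2010, 4, 21, 9, 36, 0), content = "Great news" },
            new { posted_at = new DateTime(2010, 4, 21, 10, 0, 0), content = "Congratulations" }
        }
    },
    new
    {
        type = "post",
        title = "NoSQL",
        comments = new[]
        {
            new { posted_at = new DateTime(2010, 4, 21, 9, 46, 0), content = "Interesting" }
        }
    }
};

// Same shape as the index definition: one output entry per embedded comment.
var commentsByTime =
    from doc in docs
    where doc.type == "post"
    from comment in doc.comments
    select new { comment.posted_at, comment.content };

// Equivalent of querying the index with orderBy:"-posted_at".
var recentComments = commentsByTime.OrderByDescending(c => c.posted_at).ToList();

foreach (var entry in recentComments)
    Console.WriteLine($"{entry.posted_at:HH:mm} {entry.content}");
```

The second `from` clause is what flattens the embedded comments into separate index entries, even though they are never stored as separate documents.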

Main page: show tag cloud for posts

And now we have another interesting challenge: we need to do aggregation on top of the tags. This is done using a Map/Reduce operation. It sounds frightening, but it is pretty easy, just two linq queries:

from doc in docs
where doc.type == "post"
from tag in doc.tags
select new { tag, count = 1}

from result in results
group result by result.tag into g
select new {tag = g.Key, count = g.Sum(x=>x.count) }

And given that, we can now execute a simple query on top of that:

var tagCloud = docDb.Query<TagAndCount>("TagCloud");
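The two queries can be run as ordinary LINQ to Objects to see what the tag cloud looks like. This is only an in-memory sketch of the map and reduce steps, not how Raven executes them, and the second post and the blog document are made-up sample data.

```csharp
using System;
using System.Linq;

// Made-up sample documents.
var docs = new[]
{
    new { type = "post", title = "RavenDB", tags = new[] { "RavenDB", "Announcements" } },
    new { type = "post", title = "Raven indexes", tags = new[] { "RavenDB" } },
    new { type = "blog", title = "Ayende @ Rahien", tags = new[] { ".NET" } }
};

// Map: emit one (tag, 1) entry for every tag on every post.
var mapped =
    from doc in docs
    where doc.type == "post"
    from tag in doc.tags
    select new { tag, count = 1 };

// Reduce: group the entries by tag and sum the counts.
var tagCloud =
    (from result in mapped
     group result by result.tag into g
     select new { tag = g.Key, count = g.Sum(x => x.count) }).ToList();

foreach (var entry in tagCloud)
    Console.WriteLine($"{entry.tag}: {entry.count}");
```

The blog document is skipped by the `where` clause, so its ".NET" tag does not show up in the post tag cloud.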

Main page: show categories

This is pretty much the same thing as tag cloud, because we want to extract the categories from the posts, and get the number of posts in each category.

from doc in docs
where doc.type == "post"
from category in doc.categories
select new { category, count = 1}

from result in results
group result by result.category into g
select new {category= g.Key, count = g.Sum(x=>x.count) }

And given that, we can now execute a simple query on top of that:

var categories = docDb.Query<CategoryAndCount>("CategoriesCount");

Post page: show post and all comments

Here we see how easy it is to manage object graphs. All we need to do is pull the post document, and we get everything we need to render the post page:

var post = docDb.Get<Post>("posts/1");

Post page: add comment to post

Remember, a comment cannot live outside a post, so adding a comment involves getting the post, adding the comment and saving:

var post = docDb.Get<Post>("posts/1");
post.Comments.Add(new Comment(...));
docDb.SaveChanges();

Note that this uses the Unit of Work pattern: the Doc DB keeps track of the post entity and will persist it to the database when SaveChanges() is called.
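For readers who have not met the pattern, here is a deliberately tiny sketch of what "keep track and persist on SaveChanges()" means. It is an assumption-laden toy, not the Raven client: the store is a dictionary of JSON strings, `Get` and `SaveChanges` are hypothetical local functions, and it needs .NET 6 or later for System.Text.Json.Nodes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json.Nodes;

// In-memory stand-in for the document store: key -> JSON text.
var store = new Dictionary<string, string>
{
    ["posts/1"] = "{\"type\":\"post\",\"title\":\"RavenDB\",\"comments\":[]}"
};

// Documents loaded during this unit of work, tracked by key.
var tracked = new Dictionary<string, JsonNode>();

// Hypothetical Get: loads a document once and remembers the instance.
JsonNode Get(string key)
{
    if (!tracked.TryGetValue(key, out var doc))
    {
        doc = JsonNode.Parse(store[key])!;
        tracked[key] = doc;
    }
    return doc;
}

// Hypothetical SaveChanges: writes every tracked document back to the store.
void SaveChanges()
{
    foreach (var (key, doc) in tracked)
        store[key] = doc.ToJsonString();
}

var post = Get("posts/1");
JsonNode comment = new JsonObject { ["content"] = "Great news" };
post["comments"]!.AsArray().Add(comment);

// Up to this point the store still holds the old document.
var storedBeforeSave = store["posts/1"];
SaveChanges();
var storedAfterSave = store["posts/1"];

Console.WriteLine(storedAfterSave);
```

A real client would also detect which tracked documents actually changed and handle concurrent writers; the sketch leaves both out.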

Tag page: show all posts for tag

Again, we have a two step approach: define the index, and then query on it. It is pretty simple:

from doc in docs
where doc.type == "post"
from tag in doc.tags
select new { tag }

And the query itself is:

var posts = docDb.Query<Post>("PostsByTag", "tag:RavenDB");
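As a rough in-memory illustration of what such an index gives you, the sketch below builds a tag-to-post lookup with LINQ to Objects. The `id` property, the second post, and the blog document are assumptions added for the example; Raven's actual index storage is not shown here.

```csharp
using System;
using System.Linq;

// Made-up sample documents; "id" stands in for the document key.
var docs = new[]
{
    new { id = "posts/1", type = "post", tags = new[] { "RavenDB", "Announcements" } },
    new { id = "posts/2", type = "post", tags = new[] { "NoSQL" } },
    new { id = "blogs/1", type = "blog", tags = new[] { "RavenDB" } }
};

// One index entry per (tag, post) pair, grouped so that a tag leads to its posts.
var postsByTag =
    (from doc in docs
     where doc.type == "post"
     from tag in doc.tags
     select new { tag, doc.id })
    .ToLookup(entry => entry.tag, entry => entry.id);

// Equivalent of querying the index for a single tag.
var posts = postsByTag["RavenDB"].ToList();

Console.WriteLine(string.Join(", ", posts));
```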

Categories page: show all posts for category

This is the same as all posts for tag, so I’ll skip it.

There are a few things to notice about this approach.

  • A good modeling technique is to think about Aggregates in the DDD sense: an Aggregate and all its associations are stored as a single document.
  • If we need to query on an entity encapsulated in an Aggregate, we can extract that information using an index. That extraction, however, doesn’t change the way that we access the data.
  • We can create “virtual entities”, such as Tags or Categories, which don’t really exist, but are generated by the indexes we define. Those are usually things that are only useful for users to search / organize the aggregates on.
  • Schema doesn’t really matter, because we either search by key, or on well known indexes, which massage the data to the format that we expect it to be.

Thoughts?

Comments

Ralf Westphal
04/21/2010 09:36 AM by
Ralf Westphal

This is a nice simple example! Although I don´t know where to issue/store the linq queries, I guess I understand the thinking underlying this approach. It´s elegant and sufficient.

Since you´re referring to DDD aggregates a couple of times, though, I´d like to ask: what´s the aggregate root of the whole thing?

You might say, if you retrieve a blog it´s the blog with all its nested postings. Or you might say it´s the user with her blogs and postings. Depending on the context.

But then: Why isn´t that reflected in the data/document model? If I look at that, comments are the root objects. They get injected a posting, which get injected a blog, which gets injected a user. Strange world ;-)

This looks like a relational data model, where a comment table would contain a foreign key with posting table record ids. Not much object oriented, I´d say.

I´d expect a document oriented database to support my mental data model. That defines a user as having a collection of blogs having a collection of postings. And I´d like to model my data model classes according to that. All else seems pretty retro ;-)

-Ralf

Javi
04/21/2010 09:46 AM by
Javi

"Comments, as well, aren’t really meaningful outside their post."

By the same principle, you might as well argue that Posts aren't meaningful outside a blog, and then Posts would be Blog value objects.

Ayende Rahien
04/21/2010 09:47 AM by
Ayende Rahien

Ralf,

The linq queries are indexes inside Raven, actually :-)

There are three aggregates here:

  • Users

  • Blogs

  • Posts

There isn't an overall aggregate root.

What you describe, a User >> Blogs >> Posts >> Comments, is not a workable doc db model, because while it is possible to do so, it would make things like my blog a single document, with 4K+ posts and 27K+ comments.

That is hardly a good way of going about it.

The way you model things in a doc db is by defining aggregates based on predicted access patterns. Both for reads and writes.

Adding a post is a simple PUT, adding a comment is an update.

The association between Blog and Posts is maintained using an index, not inside the document, because otherwise that blog document would be a huge hotspot, and be huge as well :-)

That is why I chose several different scenarios and showed how you can resolve each of them: it shows how to solve a whole host of different problems using this model.

Ayende Rahien
04/21/2010 09:49 AM by
Ayende Rahien

Javi,

Not really. In a Blog application, you are usually dealing with Posts.

You refer to them, manage them, etc. The Blog is more a grouping, a tag for convenience, than an actual player.

In contrast, you rarely manage comments outside their posts, they have no meaning outside it.

A post has meaning on its own (cross posting).

Ralf Westphal
04/21/2010 10:00 AM by
Ralf Westphal

@Ayende: I understand the reason why you´re doing this. But this does not become clear when looking at the object model.

new Blog(..., user) sure looks like an object holding a permanent reference to another object. The loose coupling of the persistent data is not reflected in the runtime data.

So why not do it like this: new Blog(..., user.Id) ? This would make clear the nature of the relationship.

But even then: The relationships in the object model (loose or tight) seem odd. A comment references a posting? Hm...

Isn´t there any other way to consolidate the developer´s view of data (with its natural hierarchy User/Blog/Posting/Comment) with good "document modelling"?

Obviously the developer has to trade off something. But inverted relationships seem to be a very high price to pay for "document thinking".

-Ralf

Kristof Claes
04/21/2010 10:01 AM by
Kristof Claes

Ayende, how do you store the linq queries as indexes in Raven? That's something that still isn't very clear to me. I should say that my experience with document databases is non-existent though :-)

For example, how do you "add" this linq query to Raven so that you can do "var tagCloud = docDb.Query<TagAndCount>("TagCloud");"?

from doc in docs
where doc.type == "post"
from tag in doc.tags
select new { tag, count = 1}

from result in results
group result by result.tag into g
select new {tag = g.Key, count = g.Sum(x=>x.count) }

Ayende Rahien
04/21/2010 10:17 AM by
Ayende Rahien

Ralf,

The loose coupling of the persistent data is not reflected in the runtime data.

I intentionally chose to show the data in object form, because I didn't want the way that I am showing the data to affect the way people think about it.

It seems that the object code still shows some bias, but that was unintentional. Please treat the object form as solely a scratch, just a way to give you an idea about the data that we store.

And I am not sure that I am following what you mean by inverted relationship. I think that you are reading too much into the code that I put there. Look at the JSON docs, that is what I really mean in this post.

Ayende Rahien
04/21/2010 10:20 AM by
Ayende Rahien

Kristof,

In Raven, you create a new index by issuing a rest call.

In your example, it would look like this:

PUT /indexes/TagCloud
Content-Length: ???

{
    "Map": "from doc in docs where doc.type == \"post\" from tag in doc.tags select new { tag, count = 1 }",
    "Reduce": "from result in results group result by result.tag into g select new { tag = g.Key, count = g.Sum(x => x.count) }"
}

Raven will then compile and index this in the background, allowing you to make queries on it.
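Because the map query itself contains double quotes, writing the JSON body by hand is easy to get wrong. One way around it, sketched here under the assumption that the body only needs the "Map" and "Reduce" properties shown above, is to let a JSON serializer do the escaping. This only builds the request body with System.Text.Json; it does not send anything, and the exact escape sequences in the output depend on the serializer's encoder.

```csharp
using System;
using System.Text.Json;

// The two linq queries, as plain C# strings.
var map = "from doc in docs where doc.type == \"post\" from tag in doc.tags select new { tag, count = 1 }";
var reduce = "from result in results group result by result.tag into g select new { tag = g.Key, count = g.Sum(x => x.count) }";

// Let the serializer escape the embedded quotes.
var body = JsonSerializer.Serialize(new { Map = map, Reduce = reduce });
Console.WriteLine(body);

// Round-trip check: parsing the body gives back the original query text.
using var parsed = JsonDocument.Parse(body);
var roundTrippedMap = parsed.RootElement.GetProperty("Map").GetString();
var roundTrippedReduce = parsed.RootElement.GetProperty("Reduce").GetString();
```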

Edin
04/21/2010 10:28 AM by
Edin

-- What entities in the model above are entities?

Did you mean to say "What entities in the model above are aggregates?" or am I missing something?

I like the idea of writing map and reduce in LINQ. But is it possible to have simple indexes as well, like on a field? Also, how will the indexes affect the performance, especially when there are more writes than reads?

Anyway, do you know when Raven will be released?

Ayende Rahien
04/21/2010 10:33 AM by
Ayende Rahien

Edin,

Yes, you are correct, fixed.

Yes, you can define simple indexes, you can see an example of one in: PostsByTime

As for release date, soon, as in the next few weeks.

Ralf Westphal
04/21/2010 10:34 AM by
Ralf Westphal

@Ayende: Thx for the index sample.

As for the data model: Sure, we want to see how you think the documents should be "linked". For example that clearly shows post and blog to be entities/documents and comments to be values within an entity.

But again: If you want to sell the "world" document databases you need to address how the "world" is used to model runtime data. That´s why ORM sprang from RDBMS. And that´s why we need some more intuitive mapping from objects to documents.

The natural relationship between blog and posting is: a blog "contains" postings. It´s a 1:n relationship. That´s what I see when I look at your blog. I navigate to the blog and therein find postings. This hasn´t to do with RDBMS thinking. It´s just how the world is organized: buckets containing stuff which again are buckets etc. (Sure, there are other relationships, but for blog and posting 1:n seems to be quite natural.)

Now, if you assign a posting a blog, well, that´s inverting the natural relationship. Suddenly the posting links to the blog.

We all can deal with this kind of relationship. It´s (again) the RDBMS view of the world. But is that what you want? To me that´s thinking in technical terms. Low level of abstraction. The persistence platform shines through. That´s not very intuitive.

So the question remains: What price to pay in an object model to fit it to the document database world? Inverted relationships seem too high a price.

My perspective is a client perspective. I´m the customer of a document database. I want it to be of service to me. So I want a handy API along my way of thinking. That´s OO thinking. Not relational thinking, not document thinking, but OO thinking.

I´m willing to pay a certain price to use document databases. So I´m not off to using db4o. But inverted relationship... No, Sir ;-) Don´t think that will entice many developers.

-Ralf

Demis Bellot
04/21/2010 10:40 AM by
Demis Bellot

This blog example is actually a really sweet and simple example showing a good way to how to go about modelling an application in a NoSQL database.

Unfortunately most NoSQL databases usually don't have built-in support for querying (RavenDB is unique in this regard).

They do however provide other constructs that let you achieve the same result without the need for querying.

For those of you that are interested in how you would achieve the same result in Redis, I will maintain along with Ayende's posts on this page (it's currently a work-in-progress):

code.google.com/.../DesigningNoSqlDatabase

Torkel
04/21/2010 10:57 AM by
Torkel

Great post, exactly what I was hoping for after the last post.

Will try to play with Raven some more. Could you expand on what your goals and ambitions are for Raven, what its current state is, etc.?

Is there a dev mailing list?

Ayende Rahien
04/21/2010 11:10 AM by
Ayende Rahien

Ralf,

That really depends on the client. One of the things that I liked was something like an [Aggregate] attribute on certain classes, which the client can use to decide whether to embed or create a "reference".

And I don't think that you can say that a blog "contains" a post. Not in the DDD terms, at least, which is what I am trying to use here.

You don't need a blog to refer to a post, after all.

When you start thinking about it this way, it makes a lot more sense.

In DocDBs, you have the following options for references:

  • Containment - Value Objects - can only be discussed in the context of the aggregate.

  • Reference - hold the key of another document, only for other aggregates.

If you go and look at how DDD models associations, you'll see that it is a natural fit.

No, a DocDB isn't an OODB. It isn't meant to be. And while the impedance mismatch between the two is far smaller than with RDBMS, there are still differences.

And yes, you want to be aware of those issues. Leaky abstractions and all of that. By making it easy to handle those scenarios, and by making them explicit, you gain a big advantage in performance and simplicity.

Edin
04/21/2010 11:12 AM by
Edin

I kind of got a feeling that a DocDB requires a lot of planning ahead. One needs to estimate or predict the most common usage patterns to be able to design it effectively. I don't like that. Sure, one needs to do the same with an RDBMS, but it's different. In an RDBMS I'll use an ORM, have a separate object model, and I'm kind of isolated from the database. In NoSQL, my code is tightly coupled to the design of the database (collections etc).

Does my reasoning make any sense?

Ayende Rahien
04/21/2010 11:14 AM by
Ayende Rahien

Torkel,

The dev mailing list is here: http://groups.google.com/group/ravendb/

My goal is to create a .NET document database that is suitable for high performance usage in production.

The reason for creating Raven was, at the beginning, to get to grips with the NoSQL idea, but it became clear that there is a lack of a good Doc DB suitable to run on Windows and with good .NET integration.

Ayende Rahien
04/21/2010 11:22 AM by
Ayende Rahien

Edin,

I strongly disagree.

I usually suggest doing some forward thinking anyway, because it can save a lot of effort down the road, but a Doc DB actually reduces the amount of planning that you need to start with.

You don't need to define a schema, or worry about migrations during development.

I am not sure why your code is more tightly coupled to the database because of collections than because of table names.

So no, I just don't see it

Carsten Hess
04/21/2010 11:27 AM by
Carsten Hess

Very interesting post on the NoSQL issue indeed. I would like to hear your thoughts on versioning of data already stored in a document DB. In RDBMS's you can often get away with scripts in order to accommodate added functionality or schema changes, even with production data in it, but I don't see an easy way of doing this in the NoSQL world?

Ayende Rahien
04/21/2010 11:29 AM by
Ayende Rahien

Carsten,

Check tomorrow's post :-)

Edin
04/21/2010 11:34 AM by
Edin

Ayende,

What I meant to say is: in RDBMS I have mapping from objects to tables. My initial thinking was that in NoSQL I wouldn't need any mapping at all. I guess I was wrong...

Ayende Rahien
04/21/2010 11:36 AM by
Ayende Rahien

That really depends on how you structure things.

With Raven, we have a .NET client API that works by you giving it an instance, and it creates a document for you.

Matt Freeman
04/21/2010 11:46 AM by
Matt Freeman

"Map": "from doc in docs where doc.type == "post" select tag in doc.tags select new { tag, count = 1}",

What's the point of count = 1 above? Would the below work the same? Or am I reading into the example code too much?

"Reduce": "from result in results group result by result.tag into g select new { tag = g.Key, count = g.Sum( x=> 1)}"

What are your release plans for RavenDB? What license?

Are you intending to support 'dynamic' objects?

Ralf Westphal
04/21/2010 11:51 AM by
Ralf Westphal

@Ayende: Well, whether a blog contains posts or not might depend on your point of view (or use case). But as long as the existence of some entity depends on the existence of another entity I´d say you can model the relationship as a "contains relationship". (A user thus contains a blog; and a blog contains postings, its postings, which might resemble postings in another blog 100% (cross posting).)

We are on the same page with regard to blogs and posts being entities. But what you´re modelling are associations whereas I think of aggregation. But never mind...

The question remains: Who should know about the necessary inversion (comments pointing to postings)?

You might say: Nobody except the entity object code sees those details. If an entity hands out some kind of data model that can be linked in any way.

Ok. I can accept that. But that would require true DDD modelling. That´s not for the average developer who wants to throw some data on a doc db.

But if you think like that, why then not go a step further? Why link documents at all? Why put a reference to a posting into a comment? Wouldn´t it be even more flexible to externalize all relationships? With RDBMS you do that mostly just for n:m relationships. But why not be even more strict?

Currently you still need to think about which relationships to put where. But with "external relationships" documents would not need to be changed just because some relationship changes.

But that´s just an idea I´m entertaining... Why think in terms of documents when you could think of data as "granules", small pieces from which ever new object can be formed. Ah, but that´s touching on philosophy... ;-)

Bottom line: With doc dbs we still need to distinguish between:

  1. persistent data model

  2. representation of persistent data model in memory (within entities)

  3. data models/views of data in memory, but outside of entities

I guess, you have shown us 1. and 2. whereas I mistook your 2. for a 3.

-Ralf

Ayende Rahien
04/21/2010 12:02 PM by
Ayende Rahien

Matt,

Map & Reduce should have the same output.

You probably want to read this: ayende.com/.../...-ndash-a-visual-explanation.aspx
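One reading of "the same output" is that a reduce step may be run again over its own earlier results, for example when partial results from separate batches are combined. Under that assumption, the in-memory sketch below shows why `count = 1` in the map and `Sum(x => x.count)` in the reduce matter. The batching and both local functions are hypothetical; this is plain LINQ to Objects, not Raven's execution model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reduce in the shape of the post's index: sums the incoming counts.
IEnumerable<(string tag, int count)> Reduce(IEnumerable<(string tag, int count)> results) =>
    from result in results
    group result by result.tag into g
    select (tag: g.Key, count: g.Sum(x => x.count));

// Variant from the question: counts entries instead of summing their counts.
IEnumerable<(string tag, int count)> ReduceCountingEntries(IEnumerable<(string tag, int count)> results) =>
    from result in results
    group result by result.tag into g
    select (tag: g.Key, count: g.Sum(x => 1));

// Map output for three posts tagged "RavenDB", split into two batches.
var batch1 = new[] { (tag: "RavenDB", count: 1), (tag: "RavenDB", count: 1) };
var batch2 = new[] { (tag: "RavenDB", count: 1) };

// Reduce each batch on its own, then reduce the partial results again.
var summed = Reduce(Reduce(batch1).Concat(Reduce(batch2))).Single();
var counted = ReduceCountingEntries(
    ReduceCountingEntries(batch1).Concat(ReduceCountingEntries(batch2))).Single();

Console.WriteLine($"summing counts: {summed.count}, counting entries: {counted.count}");
```

Summing keeps the total at 3 across both passes, while counting entries sees only the two partial results in the second pass and reports 2.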

Raven will be released shortly, under an OSS license.

There will probably be commercial side as well.

Not sure that I understand what you mean with dynamic objects.

Ayende Rahien
04/21/2010 12:07 PM by
Ayende Rahien

Ralf,

Who should know about the necessary inversion (comments pointing to postings)?

Huh? Where do you see comments pointing to postings?

But that would require true DDD modelling. That´s not for the average developer who wants to throw some data on a doc db.

I am showing what I consider to be best practices in DocDB model design. If the average developer disagrees, they can just store documents in whatever form they like. It will still work.

And it doesn't really require a lot of thinking, all you need to do is consider "what do I want to be able to access independently".

Wouldn´t it be even more flexible to externalize all relationships?

Now you are talking about a graph database, that is a different beast altogether.

Ralf Westphal
04/21/2010 12:26 PM by
Ralf Westphal

@Ayende: Graph databases or associative databases... in the end I don´t care. I´m just musing over how best to store data to not couple my code to it too tightly, and to stay flexible.

RDBMS, Doc DBs are means to an end, not an end in themselves. So what do we want to accomplish? Data persistence is just, well, a side effect ;-) While persisted data is "dead". The interesting stuff happens after data is loaded into memory/objects (or some kind of data structure) to process it.

Since in the OO world the in-mem data model is pretty much set, how should the data look while persisted? You could say: persist it as objects if you want to work with objects in memory. Well, maybe that´s not so good because once persisted data has a certain form that might make it difficult to change it. The same is true for documents, I guess.

But I don´t want to delve deeper into this here.

Back to the "pointing comments": in document form (text) comments don´t point to postings. But the comment object receives a post object. Since this is the same as with postings receiving a blog, this looks like comment objects pointing to postings.

You modeled the value object inclusion of comments into postings the same way as setting up relationships between entities. That´s kinda strange, I´d say.

-Ralf

Demis Bellot
04/21/2010 12:44 PM by
Demis Bellot

@Carsten

The beauty of NoSQL databases is that they are schema-less. So believe it or not, when adding / removing fields there *is nothing to do*. My Redis Client also supports being able to change the type of most fields if it doesn't corrupt the data, i.e. you can change a List<int> to a List<double> (not the other way around), string to an Enum, etc. The beauty of strings is that it allows any List<T> to be converted to a List<string>.

There will be rare occasions when you want to store 'the version' of the entity as it represents the state of the entity as created by a particular version of your software. In these cases I find the best way to handle it is to store an additional property called something like 'schema version'.

Demis Bellot
04/21/2010 12:51 PM by
Demis Bellot

@Ralf Westphal

The interesting stuff happens after data is loaded into memory/objects (or some kind of data structure) to process it.

Bang on! So effectively we are just talking about the ease of creating/persisting your in-memory domain models from your data store.

There are other qualities like maintaining consistency, redundancy and performance so not all data stores are created equally.

Personally in most cases I find schema-less designs make a better 'programmatic fit' to C# POCO objects than mapping from an ORM. But having an Integrated query language, visual and reporting tools means your data is 'more open' to you via an RDBMS.

Demis Bellot
04/21/2010 12:55 PM by
Demis Bellot

@Me

... if it doesn't corrupt the data i.e. you can change a List to a List

Crap HTML angle brackets corrupted my data!

I meant you can change from List[of int] to List[of double] without probs.

Dennis
04/21/2010 01:10 PM by
Dennis

I know that you update the indexes later in a post commit step.

But what if you actually require some count to be visible immediately after doing something?

Barry
04/21/2010 01:21 PM by
Barry

I noticed a reference to ravendb.exe. Are there plans to allow RavenDb to be used in a shared hosting environment?

Aaron Carlson
04/21/2010 01:38 PM by
Aaron Carlson

Is there any concept inside a document DB of storing the data in a tree structure, similar to how an LDAP repository or file system stores data?

That way you could have Posts as child documents under a Blog Document which will give you the relationship but still allow you to load them independently.

Ayende Rahien
04/21/2010 02:56 PM by
Ayende Rahien

Ralf,

Data persistence is just, well, a side effect ;-)

Not really, the way the data is persisted is quite critical to the application.

If you don't access the data properly, you are going to hit a lot of issues.

But the comment object receives a post object.

As I said, that one is only to get you to understand what the source data is. It is a draft, it isn't real.

Ayende Rahien
04/21/2010 02:58 PM by
Ayende Rahien

Dennis,

You can block until the index returns a non stale result.

But what we expect is that in most circumstances, the delay in updating the index would be small enough you wouldn't care about it

Ayende Rahien
04/21/2010 02:59 PM by
Ayende Rahien

Barry,

No. At least not right now.

Raven uses Esent, which requires Full Trust anyway.

Ayende Rahien
04/21/2010 02:59 PM by
Ayende Rahien

Aaron,

You are talking about Graph databases, that is something different than a document database

Barry
04/21/2010 03:18 PM by
Barry

There are a lot of shared hosting providers that offer Full Trust - I am using one of them. I guess it could be done if the RavenDB.exe was an ASP.NET Service? Looking forward to the launch event on the 18th and seeing some performance stats on this. It's good to see .NET getting some non-relational-db love.

Ralf Westphal
04/21/2010 04:09 PM by
Ralf Westphal

@Ayende:

Not really, the way the data is persisted is quite critical to the application.

If you don't access the data properly, you are going to hit a lot of issues.

This is true - but nevertheless as an application programmer I want to focus on processing data, not structuring it in a way to serve some non-functional purposes.

So, yes, there is always mapping needed. But still I´d say: "persistence is overrated" ;-) I want to focus on in-mem data.

But the comment object receives a post object.

As I said, that one is only to get you to understand what the source data is. It is a draft, it isn't real.

So why not write the draft like this:

new Blog(..., new[] { new Comment(...), ... })

-Ralf

PS: Despite my "quibbling" here please be assured I highly appreciate your effort to implement a doc db on .NET. I like RavenDB for what it is pretty much.

Rafal
04/21/2010 05:51 PM by
Rafal

Ayende, have you done any real-world, production software based on a NoSQL document database? I'm asking because you are taking an expert's standpoint in advising on software architecture, therefore I'd like to know how successful this architecture was in your case.

Ralf Westphal
04/21/2010 07:40 PM by
Ralf Westphal

On the type field in the JSON objects: During serialization you introduce a new field in addition to an object's properties. "type" stores the name of the serialized type.

Why don´t you prefix it somehow to distinguish it from the object´s properties? What if an object contains a property called type? I´d either use well known GUIDs for RavenDB field names or at least prefix them with something like "ravendb", e.g. "ravendbtype".

Ayende Rahien
04/21/2010 07:53 PM by
Ayende Rahien

Barry,

Yes, an ASP.Net service is very possible, and should be very easy to do.

Ayende Rahien
04/21/2010 07:54 PM by
Ayende Rahien

Ralf,

new Blog(..., new[] { new Comment(...), ... })

There is no reason, both representations are valid.

Ayende Rahien
04/21/2010 07:55 PM by
Ayende Rahien

Rafal,

Yes, I built a relatively large system on top of a KeyValue store (Rhino DHT).

It worked very well, and it simplified what we needed to do drastically.

It also gave very low latency overall.

Ayende Rahien
04/21/2010 07:57 PM by
Ayende Rahien

Ralf,

I am using this to demonstrate things, not as the real implementation.

Raven supports additional metadata for a document, so the client API can do something like:

{

"@metadata": { "type": " Full .Net Type Name" },

// other content

}

Demis Bellot
04/21/2010 08:48 PM by
Demis Bellot

@Ayende:

I've noticed you've taken a rest-like scheme to storing entities, e.g. 'users/ayende' and 'blogs/1'. Although semantically correct, I think you may run into unnecessary problems, as:

  • the value on its own, 'blogs/1', is not distinguishable from a normal text value.

  • 'blogs' is not a direct correlation to its POCO type 'Blog', so you will need to maintain the link via some external configuration

  • by concatenating it with a string, 'blogs/1' requires effort to get the strongly-typed id of '1', which means having it automatically incremented is not possible.

My preference is to rely on convention, e.g. the 'User' model would still store the strongly typed id in the model, which makes it stricter and allows the NoSQL database to increment the 'id sequence' to guarantee uniqueness. In order to keep each entity unique within the same datastore I use a predictable urn as 'entity id', e.g.:

urn:User:1 =>

{

Id = 1,

Name = "Ayende",

BlogIds = [1,2,3]

}

urn:Blog:1 =>

{

Id = 1,

UserId = 1,

}

In this way, no other configuration/manual intervention is required and I can get automatic sequencing. The URNs can now be used outside of the application to identify an entity in a universal, standardized format. The URN also contains the exact type of the entity it identifies, which is useful when you want to handle a bag of mixed URNs generically.

You also have decided not to have the User dual-referencing its blog posts, which means you are forced to 'search' for your related entities through an external index. Personally I don't like this idea: searching is a very nice to have in a lot of places, I just don't think it belongs in the construction of your domain model.

Ryan Smith
04/21/2010 09:33 PM by
Ryan Smith

I'm pretty comfortable with my intellect...but man this discussion makes me feel dumb.

It's not 'clicking' at all.

Ayende Rahien
04/21/2010 09:43 PM by
Ayende Rahien

Demis,

  • the value on its own 'blogs/1' is not distinguishable from a normal text value.

I don't worry about it. That is something that the client can decide on, not something that the database enforces.

  • 'blogs' is not a direct correlation to its POCO type 'Blog' so you will need to maintain the link via some external configuration

That is actually intentional. You either already know that (because you are following a strongly typed POCO) or you need to get that document.

Having the type encoded as the relationship isn't useful.

  • by concatenating it with a string 'blogs/1' it requires effort to get the strongly-typed id of '1' which means being able to have it automatically incremented is not possible.

Raven supports this, actually. You can ask it to generate a key with a given prefix, which gives us the nice keys for the documents.

Admittedly, I implemented that feature after I wrote the post, because I had the same objection :-)

searching is a very nice to have in a lot of places I just don't think it belongs in the construction of your domain model

That is pretty much the point. I don't see a direct association between users & blogs as a good thing. I want you to search on them.
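To make the "generate a key with a given prefix" idea from the reply above concrete, here is a small in-memory sketch. It is not Raven's implementation; `PrefixKeyGenerator` and its members are hypothetical names, and a real server would persist the counters.

```csharp
using System.Collections.Generic;

// Hypothetical in-memory generator; it illustrates the idea only.
public class PrefixKeyGenerator
{
    private readonly object gate = new object();
    private readonly Dictionary<string, long> lastIdByPrefix = new Dictionary<string, long>();

    // Returns "blogs/1", "blogs/2", ... for the prefix "blogs/".
    // Each prefix has its own independent sequence.
    public string NextKey(string prefix)
    {
        lock (gate)
        {
            long last;
            lastIdByPrefix.TryGetValue(prefix, out last); // last stays 0 for a new prefix
            long next = last + 1;
            lastIdByPrefix[prefix] = next;
            return prefix + next;
        }
    }

    // Recovers the numeric part of a generated key, e.g. 1 from "blogs/1".
    public static bool TryGetNumericId(string key, out long id)
    {
        id = 0;
        int slash = key.LastIndexOf('/');
        return slash >= 0 && long.TryParse(key.Substring(slash + 1), out id);
    }
}
```

The keys stay readable ('blogs/1') while the sequence is still assigned by the store rather than by the caller, which addresses the auto-increment objection.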

Ayende Rahien
04/21/2010 09:43 PM by
Ayende Rahien

Ryan,

Can you explain where you lost the thread of the conversation?

Demis Bellot
04/21/2010 10:14 PM by
Demis Bellot

@Ayende

That is actually intentional. You either already know that (because you are following a strongly typed POCO) or you need to get that document.

In my experience, predictable, convention-based solutions are more elegant, less error-prone, require less effort and allow you to provide richer APIs that 'do more for free'.

Intention over convention means that you have a lot more intelligence about the data embedded in the application itself and the 'data speaks less about itself'.

I predict you will have a hard time providing a good generic GUI that lets you navigate the data store on its own (i.e. independently of any application logic).

Having the type encoded as the relationship isn't useful.

I'm not sure I would agree with that. Relationship-encoded URNs give you a chance to provide a generic solution to aggregate, fetch and instantiate the entities without internal knowledge of the data values.

That is pretty much the point, I don't see a direct association between users & blogs as a good thing. I want you to search on them

It is a unique perspective. On one hand, I like it from an SOA perspective, where one entity is the master reference and 'owns the relationship'. OTOH, the difference is that the relationship still exists in an external index; it's 'just not visible'.

I also prefer a GetByIds() to a manually-crafted LINQ statement to achieve the same result (although LINQ in a DHT datastore is cool :).

Ayende Rahien
04/21/2010 10:37 PM by
Ayende Rahien

Demis,

In my experience, predictable, convention-based solutions are more elegant, less error-prone, require less effort and allow you to provide richer APIs that 'do more for free'.

I get your point, and I agree. This is something that the client needs to worry about, and I am concentrating more on the server side for now.

the difference is that the relationship still exists in an external index

The relationship exists in a once removed sort of way.

I can put users on one machine and blogs on the other, and I don't have to have distributed transactions.

I also prefer a GetByIds() to a manually-crafted LINQ statement

I am not sure that I am following that.

The Linq statement is the "index definition".

Demis Bellot
04/21/2010 11:01 PM by
Demis Bellot

I can put users on one machine and blogs on the other, and I don't have to have distributed transactions.

Yeah that's the SOA-perspective part of the solution I liked.

If I had to cross a service boundary I would only keep one of the references which in this case will be on the Blog.

At the same time I would convert the ids to be urn's for any cross-service relationships.

I am not sure that I am following that. The Linq statement is the "index definition".

Yeah, I just meant the manual work of creating a LINQ statement to be able to query on the relationship.

With my setup you could skip the index definition and just do:

redisBlogs.GetByIds(user.BlogIds);
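The trade-off being debated here can be sketched with an in-memory stand-in. Nothing below is a real Redis or Raven client; `InMemoryBlogStore`, `GetByIds` and `FindByUser` are hypothetical names used only to contrast the two access paths.

```csharp
using System.Collections.Generic;
using System.Linq;

public class Blog
{
    public long Id { get; set; }
    public long UserId { get; set; }
    public string Name { get; set; }
}

// Hypothetical in-memory stand-in for a key/value or document store.
public class InMemoryBlogStore
{
    private readonly Dictionary<long, Blog> blogsById = new Dictionary<long, Blog>();

    public void Store(Blog blog)
    {
        blogsById[blog.Id] = blog;
    }

    // Dual-reference style: the caller already holds the ids (user.BlogIds),
    // so this is a series of key lookups. Unknown ids are skipped.
    public List<Blog> GetByIds(IEnumerable<long> ids)
    {
        var result = new List<Blog>();
        foreach (long id in ids)
        {
            Blog blog;
            if (blogsById.TryGetValue(id, out blog)) result.Add(blog);
        }
        return result;
    }

    // Single-reference style: only Blog.UserId exists, so the store has to search.
    // A real database would answer this from an index instead of a scan.
    public List<Blog> FindByUser(long userId)
    {
        return blogsById.Values.Where(b => b.UserId == userId).OrderBy(b => b.Id).ToList();
    }
}
```

Both calls return the same blogs for a consistent data set. The difference is where the relationship is maintained: `GetByIds` needs the id list on the user to be kept up to date, while `FindByUser` needs an index on the blog side.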

Tom Clarkson
04/22/2010 01:48 AM by
Tom Clarkson

Storing comments within the blog post is certainly simplest for reading, but introduces some risks on write - specifically what happens if two comments are added simultaneously.

A specific document database implementation may understand how to apply two add operations, but that's not necessarily true when applied to a generic document store.

Demis Bellot
04/22/2010 02:03 AM by
Demis Bellot

@Tom Clarkson

You can handle race conditions using 'application-level locks' (e.g. Redis's SETNX, which sets a key only if it does not already exist, so only one writer gets the lock). Redis is going to introduce the notion of 'watched variables' within a transaction, so that if any of them has been updated since the transaction started, the transaction is aborted.

Tom Clarkson
04/22/2010 02:34 AM by
Tom Clarkson

@Demis

Certainly it can be handled, though I prefer to avoid needing to handle locking if possible. If a comment is a separate object, you don't have anything that needs to be modified by multiple users. Implementing this in Redis, I would keep the post itself as a single value (only editable by the author) and use a list (with an atomic add operation) for the comments (either directly in the list or as separate objects if they are to be editable).

Implementation details aside, my point was mostly about the generic applicability of the model and what belongs in a best practice example where the required capabilities of the underlying platform are not specified.

Demis Bellot
04/22/2010 02:57 AM by
Demis Bellot

@Tom Clarkson

I see, you're explaining the pros and cons of each approach.

My preference would still be for comments to remain with the post, and personally judging by the quality of some of the comments on YouTube it wouldn't hurt to lose a few :)

Ayende Rahien
04/22/2010 07:04 AM by
Ayende Rahien

Tom,

That is another thing that you have to decide how to do, yes. And it affects how you design the documents.

In general, most DocDBs either know how to do a compare & swap or support locking, so that isn't really a big problem.

In Raven's case, you can ask it to perform an add to the comments collection, something which is safe to do concurrently.
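For readers unfamiliar with compare & swap, here is a minimal sketch of the idea applied to adding a comment. It uses an in-memory store with a version number per document; `VersionedStore`, `TryWrite` and `CommentWriter` are hypothetical names, not any database's API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory store that keeps a version number per document.
public class VersionedStore
{
    private class Entry
    {
        public int Version;
        public string[] Comments;
    }

    private readonly object gate = new object();
    private readonly Dictionary<string, Entry> entries = new Dictionary<string, Entry>();

    // Returns the current version (0 if the document does not exist yet)
    // together with a copy of the stored comments.
    public int Read(string key, out string[] comments)
    {
        lock (gate)
        {
            Entry entry;
            if (!entries.TryGetValue(key, out entry))
            {
                comments = new string[0];
                return 0;
            }
            comments = (string[])entry.Comments.Clone();
            return entry.Version;
        }
    }

    // Compare & swap: the write succeeds only if nobody else has written
    // since expectedVersion was read.
    public bool TryWrite(string key, int expectedVersion, string[] comments)
    {
        lock (gate)
        {
            Entry entry;
            int currentVersion = entries.TryGetValue(key, out entry) ? entry.Version : 0;
            if (currentVersion != expectedVersion) return false;

            entries[key] = new Entry
            {
                Version = currentVersion + 1,
                Comments = (string[])comments.Clone()
            };
            return true;
        }
    }
}

public static class CommentWriter
{
    // Read, modify, write; start over from the read if another writer got in first.
    public static void AddComment(VersionedStore store, string postKey, string comment)
    {
        while (true)
        {
            string[] existing;
            int version = store.Read(postKey, out existing);

            var updated = new string[existing.Length + 1];
            existing.CopyTo(updated, 0);
            updated[existing.Length] = comment;

            if (store.TryWrite(postKey, version, updated)) return;
        }
    }
}
```

Two simultaneous comments cannot overwrite each other here: the slower writer's `TryWrite` fails on the stale version, and the retry re-reads the document with the other comment already in it. A server-side "add to collection" operation avoids the retry loop entirely.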

Jon Dokulil
04/22/2010 02:30 PM by
Jon Dokulil

Like most, I'm coming from an RDBMS background. One thing that has been nice with using a relational DB for the app I'm currently working on is the ability to project the data in new ways, as new requirements/features come up.

Over a year into the project we added a feature that was a completely new view of the data. The SQL queries that are needed to generate that new view are pretty hairy (they really make me appreciate LINQ to SQL), but they work and seem to be no problem for SQL Server. It seems like, using a DocDB approach, we would have to write migration code more often to re-model our data in a way that suits both old and new features. I'm struggling to come up with a good example, but I'm imagining a case where we assumed that a piece of data only exists within the context of some 'owning' object, and then we want to add a feature for which that assumption no longer holds.

Essentially, blogs are very well understood, your example is easy to follow. However, many projects revolve around domains that are not as well understood and have shifting requirements. A normalized RDB model seems to work pretty well for this. Do you think DocDBs are better suited for refactoring already well understood problems (perhaps to gain simplicity or for performance or scalability improvements), or would you feel comfortable going with a DocDB on a project where you expect shifting requirements down the road?

Doron
04/22/2010 05:23 PM by
Doron

Isn't it rather inefficient that adding a comment requires retrieving the entire Post entity including all the other comments?

Ayende Rahien
04/22/2010 07:07 PM by
Ayende Rahien

Doron,

It might be. At that point, you can use partial updates to add a comment without retrieving the full document.

Demis Bellot
04/22/2010 10:58 PM by
Demis Bellot

@Jon Dokulil

I understand you may be concerned about projection, but personally I don't think this is a problem, as there is no more complete LINQ support than LINQ to Objects. You can instantiate the entity with a single call, then project it to a different POCO view using a combination of LINQ and a domain mapper tool like AutoMapper to do all the heavy lifting for you:

http://automapper.codeplex.com/

I cache different views of the same data all the time; as soon as the master entity is modified, all the dependent cached views are invalidated.

@Doron

There are potentially many optimizations you can make; for example, you can maintain a separate list of comments outside of the Post, as Ayende has suggested. Storing large blobs in a NoSQL datastore is nowhere near as inefficient as it is in an RDBMS, though. Most of the time is spent on the client, on CPU for de/serializing and on network I/O. In contrast, when an RDBMS handles large entities, most of the time is spent on the server: processing and parsing the request, extracting the data, and updating internal data structures and indexes on disk. Most NoSQL datastores don't parse the payload at all (and some, like Redis, don't even need to touch disk). This is part of the reason why NoSQL datastores are much quicker than RDBMSs for a lot of data access scenarios.

If performance were a problem and I had to optimize, my preferred approach would still be to maintain the 'latest page of comments' on the post entity while storing the rest of the comments in an external list. This way you can satisfy the data requirements for the post page view with a single request.
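The 'latest page of comments' layout described above could look roughly like the following sketch. Everything here is an assumption made for illustration: the `Post`, `CommentArchive` and `CommentPaging` names, the page size of 3, and the use of an in-memory dictionary as the external list.

```csharp
using System.Collections.Generic;

public class Post
{
    public string Id { get; set; }
    public string Title { get; set; }

    // Only the newest comments live inside the post document.
    public List<string> LatestComments { get; } = new List<string>();
}

// Stand-in for the external list of older comments, keyed by post id.
public class CommentArchive
{
    private readonly Dictionary<string, List<string>> olderByPost =
        new Dictionary<string, List<string>>();

    public void Append(string postId, string comment)
    {
        List<string> list;
        if (!olderByPost.TryGetValue(postId, out list))
        {
            list = new List<string>();
            olderByPost[postId] = list;
        }
        list.Add(comment);
    }

    public IReadOnlyList<string> Get(string postId)
    {
        List<string> list;
        return olderByPost.TryGetValue(postId, out list) ? list : new List<string>();
    }
}

public static class CommentPaging
{
    public const int PageSize = 3; // arbitrary value for illustration

    // Adds the comment to the post and spills the oldest embedded
    // comments to the archive once the page is full.
    public static void AddComment(Post post, CommentArchive archive, string comment)
    {
        post.LatestComments.Add(comment);
        while (post.LatestComments.Count > PageSize)
        {
            archive.Append(post.Id, post.LatestComments[0]);
            post.LatestComments.RemoveAt(0);
        }
    }
}
```

Rendering the post page then needs only the `Post` document; the archive is read only when someone pages back through older comments.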

Mike
04/26/2010 02:04 PM by
Mike

Thanks for writing all these posts, they have really helped me understand better.

Would you consider explaining a situation where the users are stored in a relational database?

Thanks!

Ayende Rahien
04/26/2010 02:14 PM by
Ayende Rahien

Mike,

I am not sure that I understand the question, can you expand?

Mike
04/27/2010 08:12 AM by
Mike

Sure, what I mean is more in general, how would you handle a situation where you need (or want) to store some information in a relational database. For example, user accounts.

Would you duplicate the user accounts in the document DB? If not, how would you relate posts to users and preserve some kind of integrity?

Thanks!

Michael J. Ryan
05/05/2010 01:34 AM by
Michael J. Ryan

@Mike, @Medyum

It really depends. The point of the layout in Post is that it is probably the most typical usage scenario. If you look at how the "related posts" are structured, it shows that the information that is most needed is copied over. If you don't really need the user data in your document database (DDB), then just save the user id from the RDBMS. There's nothing stopping you from persisting some data in an RDBMS, and other data in the DDB. In fact, depending on your reporting needs, it may behoove you to have a regular process that normalizes some of the DDB data and exports it into an SQL-RDBMS. It just depends on your needs, and your usage scenario.

I suggest the SQL-RDBMS export mainly for reporting needs, as the tooling available for reporting with SQL-RDBMS is so compelling compared to what is available to DDB, and the usage of SQL-RDBMS can support some of these types of scenarios better. There's nothing wrong with a hybrid approach, as long as it suits your needs, and is well defined/documented.

Matthew
05/18/2010 12:36 AM by
Matthew

I'm thinking that RavenDB isn't for me. But I want to be sure. Are documents always tied to class objects? What if I don't have a class associated with my data? My data comes from CSV files, which don't all have the same structure, so I can't construct a class for them (they have a similar structure, but sometimes different fields).

I know that I can load the data into a datatable and put it into RavenDB from there. I've tried it, it works. However, I'm having trouble creating a query/index for the data that is contained in the datatable. Can I do this? I know that with MongoDB this is no problem. But I think Raven is a little different.

Ayende Rahien
05/18/2010 12:39 AM by
Ayende Rahien

Matthew,

No, there is no requirement for a class. You can handle this totally dynamically.

The .NET Client API comes in two forms. The high-level one gives you an interface like NHibernate's session, entities, etc.

That level requires you to work with classes, yes.

But there is one layer down, where you can work directly against JSON documents and do whatever your heart desires with them.

Matthew
05/18/2010 12:45 AM by
Matthew

Wonderful. Could you provide an example of creating document without an associated class?

Ayende Rahien
05/18/2010 12:52 AM by
Ayende Rahien

Matthew,

It goes something like this:

documentStore.Commands.Put("foo", null, JObject.Parse("{ a: 1}"), new JObject(), null);
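For the CSV scenario raised above, the class-free document could be built along these lines before being handed to the store. This sketch uses `JsonObject` from System.Text.Json (.NET 6 or later) rather than the `JObject` type in the call above, and `CsvToDocument` is a hypothetical helper; the split on commas is deliberately naive and does not handle quoted fields.

```csharp
using System;
using System.Text.Json.Nodes;

// Hypothetical helper: turns one CSV row into a schema-free JSON document.
public static class CsvToDocument
{
    // Pairs each header with the value in the same position.
    // A row may have fewer fields than the header; the extra headers are skipped.
    public static JsonObject FromRow(string headerLine, string rowLine)
    {
        string[] headers = headerLine.Split(',');
        string[] values = rowLine.Split(',');

        var document = new JsonObject();
        int count = Math.Min(headers.Length, values.Length);
        for (int i = 0; i < count; i++)
        {
            // All values are stored as strings; no type inference is attempted.
            document[headers[i].Trim()] = values[i].Trim();
        }
        return document;
    }
}
```

Because each row becomes its own document, files with different columns simply produce documents with different properties, and no shared class is needed.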

Comments have been closed on this topic.