Ayende @ Rahien

filter by tags archive

architecture (612) rss
bugs (451) rss
challanges (123) rss
community (380) rss
databases (481) rss
design (895) rss
development (642) rss
hibernating-practices (71) rss
miscellaneous (592) rss
performance (397) rss
programming (1085) rss
raven (1450) rss
ravendb.net (534) rss
reviews (184) rss

2025
- June (7)
- May (10)
- April (10)
- March (10)
- February (7)
- January (12)
2024
- December (3)
- November (2)
- October (1)
- September (3)
- August (5)
- July (10)
- June (4)
- May (6)
- April (2)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- July (5)
- June (15)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- April (14)
- March (15)
- February (13)
- January (16)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- August (16)
- July (11)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
2020
- December (10)
- November (13)
- October (15)
- September (6)
- August (9)
- July (9)
- June (17)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- August (8)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- February (19)
- January (23)
2018
- December (15)
- November (14)
- October (19)
- September (18)
- August (23)
- July (20)
- June (20)
- May (23)
- April (15)
- March (23)
- February (19)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- August (23)
- July (21)
- June (24)
- May (21)
- April (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- October (22)
- September (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- February (33)
- January (36)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- August (96)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- May (115)
- April (73)
- March (124)
- February (102)
- January (68)
2006
- December (95)
- November (53)
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- March (143)
- February (78)
- January (64)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- February (84)
- January (31)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

RavenDB Workshops - Deep dive into practical use of Document Data Modeling

May 06 2010

That No SQL ThingGraph databases

time to read 3 min | 459 words

Tweet Share Share 6 comments

Tags:

After a short break, let us continue the discussion. Think about a graph database as a document database, with a special type of documents, relations. A simple example is a social network:

There are four documents and three relations in this example. Relations in a graph database are more than just a pointer. A relation can be unidirectional or bidirectional, but more importantly, a relation is typed, I may be associated to you in several ways, you may be a client, family or my alter ego. And the relation itself can carry information. In the case of the relation document in the example above, we simply record the type of the association.

And that is about it, mostly. Once you figured out that graph database can be seen as document databases with a special document type, you are pretty much done.

Except that graph database has one additional quality that make them very useful. They allow you to perform graph operations. The most basic graph operation is traversal. For example, let us say that I want to know who of my friends is in town so I can go and have a drink. That is pretty easy to do, right? But what about indirect friends? Using a graph database, I can define the following query:

new GraphDatabaseQuery
{
   SourceNode = ayende,
   MaxDepth = 3,
   RelationsToFollow = new[]{"As Known As", "Family", "Friend", "Romantic", "Ex"},
   Where = node => node.Location == ayende.Location,
   SearchOrder = SearchOrder.BreadthFirst
}.Execute();

I can execute more complex queries, filtering on the relation properties, considering weights, etc.

Graph databases are commonly used to solve network problems. In fact, most social networking sites use some form of a graph database to do things like “You might know…”.

Because graph databases are intentionally design to make sure that graph traversal is cheap, they also provide other operations that tend to be very expensive without it. For example, Shortest Path between two nodes. That turn out to be frequently useful when you want to do things like: “Who can recommend me to this company’s CTO so they would hire me”.

May 04 2010

Document Databases are not Relational

time to read 4 min | 692 words

Tweet Share Share 56 comments

Tags:

I got several similar questions regarding my post about modeling data for document databases:

…how would you handle a situation where you need (or want) to store some information in a relational database. For example, user accounts.
Would you duplicate the user accounts in the document db? If not, how would you relate posts to users and preserve some kind of integrity.

The most typical error people make when trying to design the data model on top of a document database is to try to model it the same way you would on top of a relational database. A document database is a non relational data store, and trying to hammer a relational model on top of it will produce sub optimal results. But you can get fantastic results by taking advantage on the documented oriented nature of Raven.

Documents, unlike a row in a RDBMS, are not flat, you are not limited to just storing keys and value. Instead, you can store complex object graphs as a single document. That includes arrays, dictionaries and trees. What it means, in turn, is that unlike a relational database, where a row can only contain simple values and more complex data structures need to be stored as relations, you don't need to work hard to get your data into Raven.

Let us take the following page as an example:

In a relational database, we would have to touch no less than 4 tables to show the data in this single page (Posts, Comments, Tags, RelatedPosts).

Using a document database, we can store all the data that we need to work with as a single document with the following format:

This format allows to get everything that we need to display the page shown above in a single request.

Documents are expected to be meaningful on their own. You can certainly store references to other documents, but if you need to refer to another document to understand what the current document means, you are probably using the document database wrongly.

With document database, you are encourage to include in your documents all the information they need. Take a look at the post example above. In a relational database, we would have a link table for RelatedPosts, which would contain just the ids of the linked posts. If we would have wanted to get the titles of the related posts, we would need to join to the Posts table again. You can do that in document database, but that isn't the recommended approach, instead, as shown in the example above, you should include all the details that you need inside the document. Using this approach, you can display the page with just a single request, leading to much better overall performance.

Nitpicker corner: Yes, it does mean that you need to update related posts if you edit the title of a post.

Once we established this context, we can try answering the actual question.

Assuming that we store users in a relational database, the question now becomes, what would we gain by replicating the users information to a document database?

If we were using a relational database, that would have given us the ability to join against the users. But a document database doesn’t support joins. Moreover, if we consider the apparent aim of the question “maintain some integrity”, we can see that it doesn’t really matter where we store the users’ data. A document database doesn’t support things like referential integrity in the first place, so putting the users inside the document database gives you no benefit.

Now, you may want to be able to put the users in the document database anyway, to benefit from the features that it brings to the table, but integrity isn’t one of those reasons.

Apr 22 2010

That No SQL ThingDocument Database Migrations

time to read 6 min | 1073 words

Tweet Share Share 15 comments

Tags:

One of the things that I get ask frequently about No SQL is how to handle migrations. First, I want to clarify something very important:

Migrations sucks.

It doesn’t matter what you are doing, the moment that you need to be forward / backward compatible, or just update existing data to a new format, you are going to run into a lot of unpleasant situations. Chief of which is that what you thought was true about your data isn’t, and probably never was.

But leaving that aside, in the RDBMS world, we have three major steps when we migrate from one database version to another, non destructive update to the schema (add columns / tables), update the data (or move it to its new location), destructive schema update (delete columns / tables, introduce indexes, etc). With a document database, the situation is much the same, while we don’t have a mandatory schema, we do have the data in a certain shape.

I am going to use three examples, which hopefully demonstrate enough about migrations to give you the basic idea.

Renaming a column – Posts.Text should be renamed to Posts.Content
Update column value – Update tax rate in all orders after 2010-01-01
Change many-to-one association to many-to-many association – Posts.Category to Posts Categories

Renaming a column

Using SQL, we can do this using:

-- non destructive update
alter table Posts add Content nvarchar(max) null

-- update the data
update Posts
set Content = Text

-- destructive schema update
alter table Posts drop column Text
alter table Posts alter column Content nvarchar(max) not null

Using a Document Database, we run the following Linq query on the server (this is effectively an update):

from doc in docs
where doc.type == "post"
select new 
{
   type = doc.type,
   content = doc.text, // rename a column
   posted_at = doc.posted_at
}

Update column value

In SQL, it is pretty easy:

-- update the data
update Orders
set Tax = 0.16
where OrderDate > '2010-01-01'

And using an update Linq query:

from doc in docs
where doc.type == "order" && doc.order_date > new DateTime(2010,1,1)
select new
{
   tax = 0.16,
   order_date = doc.order_date,
   order_lines = doc.order_lines
}

Change many-to-one association to many-to-many association

Using SQL, it is pretty complex (especially when you add the FKs):

create table CategoriesPosts (PostId int, CategoryId int)

insert into CategoriesPosts (PostId, CategoryId)
select Id, CategoryId from Posts

alter table Posts drop column CategoryId

And using a doc db:

from doc in docs
where doc.type == "post"
select new 
{
   type = doc.type,
   content = doc.content,
   posted_at = doc.posted_at,
   categories = new [] { doc.category }
}

It is pretty much the same pattern over and over again :-)

Apr 21 2010

That No SQL ThingModeling Documents in a Document Database

time to read 11 min | 2200 words

Tweet Share Share 68 comments

Tags:

So, after telling you the wrong way to go about it, I intend to show the right way to design the data model using a document database. So let’s be about it.

As a reminder, those are the scenarios that I want to deal with:

Main page: show list of blogs
Main page: show list of recent posts
Main page: show list of recent comments
Main page: show tag cloud for posts
Main page: show categories
Post page: show post and all comments
Post page: add comment to post
Tag page: show all posts for tag
Categories page: show all posts for category

And here is a sample data that we are going to work with (again, using C# because any other representation pretty much dictate the storage format).

var user = new User("ayende");
var blog = new Blog("Ayende @ Rahien", user) { Tags = {".NET", "Architecture", "Databases" } };
var categoryRaven = new Category("Raven");
var categoryNoSQL = new Category("NoSQL");
var post = new Post(blog, "RavenDB", "... content ...")  
{
    Categories  = { categoryRaven, categoryNoSQL },
    Tags = {"RavenDB", "Announcements" }
};
var comment = new Comment(post, "Great news");

PersistAll(user, blog, categoryRaven, categoryNoSQL, post, comment);

When approaching a document database model design, I like to think in aggregates. What entities in the model above are aggregates? User and Blog, obviously, and Post as well. Each of those has a right to exist in an independent manner. But nothing else has a right to exist on its own. Tags are obviously Value Objects, and the only usage I can see for categories is to use them for indexing the data. Comments, as well, aren’t really meaningful outside their post. Given that decision, I decided on the following format:

// users/ayende
{
   "type": "user",
   "name": "ayende"
}

// blogs/1
{
   "type": "blog",
    "users": ["users/ayende"],
    "name": "Ayende @ Rahien",
    "tags": [".NET", "Architecture", "Databases"]
}
// posts/1
{
    "blog": "blogs/1",
    "title": "RavenDB",
    "content": "... content ...",
    "categories": ["Raven", "NoSQL"]
    "tags" : ["RavenDB", "Announcements"],
    "comments": [
        { "content": "Great News" }
    ]
}

That gives me a much smaller model, and it make things like pulling a post and its comments (a very common operation) extremely cheap. Let us see how we can handle each scenario using this model…

Main page: show list of blogs

This is simple, using a commonly used index:

var blogs = docDb.Query<Blog>("DocumentsByType", "type:blog");

Main page: show list of recent posts

var posts = docDb.Query<Post>("PostsByTime", orderBy:"-posted_at");

This is using an index for posts by time:

from doc in docs
where doc.type == "post"
select new {doc.posted_at}

Main page: show list of recent comments

This is more interesting, because we don’t have the concept of comments as a separate thing. We handle this using an index that extract the values from the post document:

from doc in docs
where doc.type == "post"
from comment in doc.comments
select new {comment.posted_at, comment.content }

And the query itself:

var recentComments = docDb.Query<Comment>("CommentsByTime", orderBy:"-posted_at");

Main page: show tag cloud for posts

And now, we have another interesting challenge, we need to do aggregation on top of the tags. This is done using Map/Reduce operation. It sounds frightening, but it is pretty easy, just two linq queries:

from doc in docs
where doc.type == "post"
select tag in doc.tags
select new { tag, count = 1}

from result in results
group result by result.tag into g
select new {tag = g.Key, count = g.Sum(x=>x.count) }

And given that, we can now execute a simple query on top of that:

var tagCloud = docDb.Query<TagAndCount>("TagCloud");

Main page: show categories

This is pretty much the same thing as tag cloud, because we want to extract the categories from the posts, and get the number of posts in each category.

from doc in docs
where doc.type == "post"
select category in doc.categories
select new { category, count = 1}

from result in results
group result by result.category into g
select new {category= g.Key, count = g.Sum(x=>x.count) }

And given that, we can now execute a simple query on top of that:

var categories = docDb.Query<CategoryAndCount>("CategoriesCount");

Post page: show post and all comments

Here we see how easy it is to manage object graphs, all we need to do is pull the post document, and we get everything we need to render the post page:

var post = docDb.Get<Post>("posts/1");

Post page: add comment to post

Remember, a comment cannot live outside a post, so adding a comment involves getting the post, adding the comment and saving:

var post = docDb.Get<Post>("posts/1");
post.Comments.Add(new Comment(...));
docDb.SaveChanges();

Note that this uses the Unit of Work pattern, the Doc DB keep track of the post entity and will persist it to the database when SaveChanges() is called.

Tag page: show all posts for tag

Again, we have a two step approach, define the index, and then query on it. It is pretty simple:

from doc in docs
where doc.type == "post"
from tag in doc.tags
select new { tag }

And the query itself is:

var posts = docDb.Query<Post>("PostsByTag", "tag:Raven");

Categories page: show all posts for category

This is the same as all posts for tag, so I’ll skip it.

There are a few things to notice about this approach.

A good modeling technique is to think about Aggregates in the DDD sense, an Aggregate and all its associations are stored as a single document.

If we need to query on an entity encapsulated in an Aggregate, we can extract that information using an index. That extraction, however, doesn’t change the way that way that we access the data.

We can create “virtual entities”, such as Tags or Categories, which doesn’t really exists, but are generated by the indexes we define. Those are usually things that are only useful for users to search / organize the aggregates on.

Schema doesn’t really matter, because we either search by key, or on well known indexes, which massage the data to the format that we expect it to be.

Thoughts?

Apr 20 2010

That No SQL ThingThe relational modeling anti pattern in document databases

time to read 15 min | 2852 words

Tweet Share Share 11 comments

Tags:

I am going to demonstrate the design of the data model in a document database for a typical blog application.

The following is my default sample data model, showing a very simple blog:

The absolutely wrong approach with a document database is to try to take the relational model and apply it on a document level. This is especially wrong because for a while, it might actually work. Let us say that we want to store the following:

var user = new User("ayende");
var blog = new Blog("Ayende @ Rahien", user) { Tags = {".NET", "Architecture", "Databases" } };
var categoryRaven = new Category("Raven");
var categoryNoSQL = new Category("NoSQL");
var post = new Post(blog, "RavenDB", "... content ...")  
{
    Categories  = { categoryRaven, categoryNoSQL },
    Tags = {"RavenDB", "Announcements" }
};
var comment = new Comment(post, "Great news");

PersistAll(user, blog, categoryRaven, categoryNoSQL, post, comment);

Interestingly enough, I need to use code to represent the data without tying it to a particular storage format.

The wrong approach to store the data would be to store each object as its own document, similar to the way we would store each object as its own row in a relational database. That wrong approach would look like this:

// users/ayende
{
   "type": "user",
   "name": "ayende"
}

// tags/1
{
   "name": ".NET"
}

// tags/2
{
   "name": "Architecture"
}

// tags/3
{
   "name": "Databases"
}
// tags/4
{
   "name": "RavenDB"
}
// tags/5
{
   "name": "Announcements"
}
// categories/1
{
    "name": "Raven"
}
// categories/2
{
    "name" : "NoSQL"
}
// blogs/1
{
   "type": "blog",
    "users": ["users/ayende"],
    "name": "Ayende @ Rahien",
    "tags": ["tags/1", "tags/2", "tags/3"]
}

// posts/1
{
    "blog": "blogs/1",
    "title": "RavenDB",
    "content": "... content ...",
    "categories": ["categories/1", "categories/2"]
    "tags" : ["tags/4", "tags/5"]
}

// comments/1
{
    "post": "posts/1",
    "content": "Great News"
}

I know that I am repeating myself here, but I have seen people miss the point before. Do NOT try to model a document database in this way.

See full size image The main reason that this is wrong is that a document database has no real support for doing joins, unions or any of the things that make such a model work effectively in a relational model.

Let us try to analyze the scenarios where we need this data, okay?

Main page: show list of blogs
Main page: show list of recent posts
Main page: show list of recent comments
Main page: show tag cloud for posts
Main page: show categories
Post page: show post and all comments
Post page: add comment to post
Tag page: show all posts for tag
Categories page: show all posts for category

I am going to analyze each of those scenarios using SQL (and the above model) and the current (and bad, smelly, nasty) document database model. I’ll have another post showing how to correctly model this in a document database, this post is about how not to do it.

Main page: show list of blogs

Using SQL, this is pretty easy:

select * from blogs

Using DocDB, this is easy, we are using a built-in index to query for documents by their type:

docDb.Query<Blog>("DocumentsByType", query:"type:blog");

Main page: show list of recent posts

Using SQL, this is pretty easy:

select * from posts order by PostedAt desc

Using DocDB, we need to define our own index function, to allow use to sort on it. That is painless, even if I say so myself:

from doc in docs
where doc.type == "post"
select new {doc.posted_at}

And now we can query this using:

docDb.Query<Post>("Posts", orderBy:"-posted_at");

Main page: show list of recent comments

This is exactly the same as recent posts, so I’ll skip it.

Main page: show tag cloud for posts

Here the SQL grows interesting:

select Name, COUNT(*) as TagCount from tags
where ItemType = 'Posts'
group by Name

And with the document database we need to write a map/reduce index (“Did you just told me to go @%$# myself?”)

from doc in docs
where doc.type == "posts"
from tag in doc.tags
select new { tag, count = 1 }

from result in results
group result by result.tag into g
select new { tag = g.Key, count = g.Sum(x=>x.count) }

And now that we have the index, we can get the values from it using:

var tagCloud = new TagCloud();
var tagIds = docDb.Query<TagAndCount>("TagsCloud", orderBy:"+count");
foreach(var tagId in tagIds)
{
    var tag = docDb.Get<Tag>(tagId.Tag);
    tagCloud.Add(tag.Name, tagId.Count);
}

See full size image Now this is ugly on many levels. First, we have the fairly complex index. Second, we have to merge the data ourselves at the client side. Third, we have to perform a SELECT N+1.

Yuck doesn’t being to cover it. There are actually ways to handle this more nicely, by making a multi get request, but I’ll not bother.

Main page: show categories

Exactly the same as show blogs, so I’ll skip it.

Post page: show post and all comments

Using stupid SQL:

select * from Posts where Id = 1

select * from Comments where PostId = 1

A more efficient method would be to use a join:

select * from Posts 
  join Comments 
    on Posts.Id = Comments.Id
where Posts.Id = 1

With the doc db, we can do:

var post = docDb.Get<Post>(1);
var comments = docDb.Query<Comment>("CommentsByPost", query:"post_id:1", orderBy:"+posted_at");

Which, of course, require us to define the comments by post index:

from doc in docs
where doc.type == "comment"
select new{doc.post_id, doc.posted_at}

Note that we have to make two calls here, because a document database has no notion of joins.

Post page: add comment to post

In SQL, it is a straightforward insert:

insert into comments (PostId, ... )
values(1, ...)

And with a document database, you can use:

docDb.Store(new Comment{ PostId = 1, ... });
docDb.SaveChanges();

Nothing much to look at here, using this flawed model.

See full size image Tag page: show all posts for tag

Using sql, that is slightly complex, because tags may be associated with blogs or with posts, so we need to do:

select * from Posts 
where Id in (
    select ItemId from tags
    where ItemType = 'Posts' and TagId = 1
)

Using a document database:

var posts = docDb.Query<Post>("PostsByTag", query:"tag:tags/1");

With the following index:

from doc in docs
where doc.type == "post"
from tag in doc.tags
select new {tag}

Categories page: show all posts for category

This is exactly like tags, so I’ll skip it.

As you’ve seen, by copying the relational model, we have created just the same sort of environment that we already had with RDBMS, but now we are faced with the problem that a document database can’t do things that a relational database can. In my eyes, what we have done is a net lose. Oh, we may gain some small benefit by being schemaless, but that isn’t really that beneficial in compared to the amount of effort that we have to go to by trying to be relational on a non relational database.

Apr 19 2010

That No SQL ThingDocument Databases – usages

time to read 2 min | 381 words

Tweet Share Share 19 comments

Tags:

I described what a document database is, but I haven’t touched about why you would want to use it.

The major benefit, of course, is that you are dealing with documents. There is little or no impedance mismatch between DTOs and documents. That means that storing data in the document database is usually significantly easier than when using an RDBMS for most non trivial scenarios.

It is usually quite painful to design a good physical data model for an RDBMS, because the way the data is laid out in the database and the way that we think about it in our application are drastically different. Moreover, RDBMS has this little thing called Schemas. And modifying a schema can be a painful thing indeed.

Sidebar: One of the most common problems that I find when reviewing a project is that the first step (or one of them) was to build the Entity Relations Diagram, thereby sinking a large time/effort commitment into it before the project really starts and real world usage tells us what we actually need.

The schemaless nature of a document database means that we don’t have to worry about the shape of the data we are using, we can just serialize things into and out of the database. It helps that the commonly used format (JSON) is both human readable and easily managed by tools.

A document database doesn’t support relations, which means that each document is independent. That makes it much easier to shard the database than it would be in a relational database, because we don’t need to either store all relations on the same shard or support distributed joins.

Finally, I like to think about document databases as a natural candidate for DDD and DDDish (DDD-like?) applications. When using a relational database, we are instructed to think in terms of Aggregates and always go through an aggregate. The problem with that is that it tends to produce very bad performance in many instances, as we need to traverse the aggregate associations, or specialized knowledge in each context. With a document database, aggregates are quite natural, and highly performant, they are just the same document, after all.

I’ll post more about this issue tomorrow.

Apr 11 2010

That No SQL Thing – Document Databases

time to read 6 min | 1151 words

Tweet Share Share 10 comments

Tags:

A document database is, at its core, a key/value store with one major exception. Instead of just storing any blob in it, a document db requires that the data will be store in a format that the database can understand. The format can be XML, JSON, Binary JSON (MongoDB), or just about anything, as long as the database can understand it.

Why is this such a big thing? Because when the database can understand the format of the data that you send it, it can now do server side operations on that data. In most doc dbs, that means that we can now allow queries on the document data.The known format also means that it is much easier to write tooling for the database, since it is possible to show, display and edit the data.

I am going to use Raven as the example for this post. Raven is a JSON based, REST enabled document database. Documents in Raven use the JSON format, and each document contains both the actual data and an additional metadata information that you can use. Here is an example of a document:

{
    "name": "ayende",
    "email": "ayende@ayende.com",
    "projects": [
        "rhino mocks",
        "nhibernate",
        "rhino service bus",
        "raven db",
        "rhino persistent hash table",
        "rhino distributed hash table",
        "rhino etl",
        "rhino security",
        "rampaging rhinos"
    ]
}

We can PUT this document in the database, under the key ‘ayende’. We can also GET the document back by using the key ‘ayende’.

A document database is schema free, that is, you don’t have to define your schema ahead of time and adhere to that. It also allow us to store arbitrarily complex data. If I want to store trees, or collections, or dictionaries, that is quite easy. In fact, it is so natural that you don’t really think about it.

It does not, however, support relations. Each document is standalone. It can refer to other documents by store their key, but there is nothing to enforce relational integrity.

The major benefit of using a document database comes from the fact that while it has all the benefits of a key/value store, you aren’t limited to just querying by key. By storing information in a form that the database can understand, we can ask the server to do things for us. Such as defining the following index for us:

from doc in docs
from prj in doc.projects
select new { Project = prj, Name = doc.Name }

With that in place, we can now make queries on the stored documents:

GET /indexes/ProjectAndName?query=Project:raven

This will give us all the documents that include a project with the name raven.

We can take it a bit further, though, and ask the database to store aggregations for us:

from doc in docs
from prj in doc.projects
select new { Count = 1, Project = prj}

from result in results
group result by result.Project into g
select new { Project = g.Key, Count = g.Sum(x=>x.Count) };

And now we can query for the number of users involved in the raven project:

GET /indexes/UsersCountByProject?query=Project:raven

In general, querying a document database falls into the prepared ahead of time (Raven, CouchDB) or at query time (MongoDB).

In the first case, you define an indexing function (in Raven’s case, a Linq query, in CouchDB case, a JavaScript function) and the server will run it to prepare the results, once the results are prepared, they can be served to the client with minimal computation. CouchDB and Raven differs in the method they use to update those indexes, Raven will update the index immediately on document change, and queries to indexes will never wait. The query may return a stale result (and is explicitly marked as such), but it will return immediately. With CouchDB, a view is updated at view query time, which may lead to a long wait time on the first time a view is accessed if there were a lot of changes in the meanwhile. The CouchDB approach avoids doing extra works that may not be needed, but Raven’s approach means that you don’t have to handle potentially large delays at queyring times. CouchDB allow to pass a parameter at the view querying time to allow it to return stale results, which seems to be the best of all worlds.

Note that in both CouchDB and Raven’s cases, indexes do not affect write speed, since in both cases this is done at a background task.

MongoDB, on the other hand, allows ad-hoc querying, and relies on indexes defined on the document values to help it achieve reasonable performance when the data size grows large enough. MongoDB’s indexes behave in much the same way RDBMS indexes behave, that is, they are updated as part or the insert process, so large number of indexes is going to affect write performance.

Apr 10 2010

reNoSQL, meh

time to read 4 min | 604 words

Tweet Share Share 11 comments

Tags:

I was pointed to this blog post, it is written by the guy who wrote ZODB (a python object database) who isn’t excited about NoSQL:

But for me there was a moment of pure joy one morning when an absolutely awesome colleague I worked with at the time said to me something like: "There's a problem with this invoice, I traced it down to this table in the database which has errors in these columns. I've written a SQL statement to correct, or should it be done at the model level?". Not only had he found and analyzed the problem, he was offering to fix it.

Praise the gods. To do similar in Plone, he would have had to learn Python, read up on all the classes. Write a script to get those objects from the ZODB and examine them. Not a small undertaking by any means.

What was going on

The tools for the ZODB just weren't there and it wasn't transparent enough. If there was a good object browser for the ZODB (and yes a few simple projects showed up that tried to do that) that did all the work and exposed things that would really help. But setting up and configuring such a tool is hard and I never saw one that allowed you to do large scale changes.

I also got the following comment, on an unrelated post:

Not directly related, but I'm curious why you don't use Rhino Queues? The tooling with msmq?

Tooling are incredibly important! In fact, it is the tooling that can make or break a system. Rhino Queues is a really nice queuing system, it offer several benefits over MSMQ, but it has no tooling support, and as such, it has a huge disadvantage in comparison to MSMQ (or other Queuing technologies).

With databases, this is even more important. I can (usually) live without having direct access to queued messages, but for databases and data stores, having the ability to access the data in an easy manner is mandatory. Some data is unimportant (I couldn’t care less what the values are in user #12094’s session are), but for the most part, you really want the ability to read that data and look at it in a human readable fashion.

The problems that Andy run into with ZODB are related more to the fact that ZODB didn’t have any good tooling and that the ZODB storage format was tied directly to Python’s pickle.

With Raven, I flat out refuse to release it publically until we have a good tooling story. Raven comes with its own internal UI (accessible via any web browser), which allows you to define indexes, view/create/edit the documents, browse the database, etc.

I consider such abilities crucial, because without those, a database is basically a black hole, which requires special tooling to work with. By providing such tooling out of the box, you reduce the barrier to entry by a huge margin.

This image was sent to me by a lot of people:

This is funny, and true. There is another layer here, you don’t query a key/value store, that is like asking how to feed a car, because you are used to riding horses. If you need to perform queries on a key/value store, you are using the wrong tool, or perhaps you are trying to solve the problem in a non idiomatic way.

Apr 08 2010

That No SQL Thing – Key / Value stores – Usages

time to read 2 min | 276 words

Tweet Share Share 12 comments

Tags:

Now that we have a fairly good understanding of what Key / Value stores are, what are they good for? Well, they are really good if you need to lookup things by key :-)

Storing the user’s session is a common example, or the user’s shopping cart. In fact, per user’s data is probably the most common usage for a key value store. The reasons for that are fairly obvious, I would imagine, we have the user’s id already, and we need a well known data. That make it very cheap to access, modify and store at the end of a request.

Other usages include storing already processed data for quick retrieval, consider things like: Favorite Products, Best Of …, etc. The idea here is to have the application access the key / value store using well known keys for those data items, while background processes will update those periodically. It isn’t quite a cache, because there isn’t really a scenario where the value doesn’t exists, but it is a near thing.

One interesting use case that I run into was to store the product catalog in the key/value store, that allowed pulling a particular product’s details in a single call. The key/value store was also used to track how many units were physically in stock, so it was able to tell if the item was available for purchase or not.

Just remember, if you need to do things more complex than just access a bucket of bits using a key, you probably need to look at something else, and the logical next step in the chain in the Document Database.

Apr 07 2010

That No SQL Thing – Key / Value stores – Operations

time to read 2 min | 338 words

Tweet Share Share 1 comments

Tags:

I mentioned before that the traditional API for Key / Value stores is pretty simplistic, Get/Put/Delete over blob values. That doesn’t really give us much options,

I have a long series of posts talking about how to do a whole host of operations using this minimal API.

In practice, it is usually much better to have support for such things directly in the Key / Value store implementation. Rhino DHT has intrinsic support for lists, and Redis support plenty of operations over lists, sets and ordered sets. Most Key/Value stores offer at least the option to atomically increment a value.

For the most part, however, you can assume that if you need a lot of operations support from the Key/Value store, you are probably using the wrong backend for your tasks. Key/Value stores are just for that, if you need complex operations on the values stored there, you might be using the wrong tool.

Oren Eini

Oren Eini

CEO of RavenDB

That No SQL ThingGraph databases

Document Databases are not Relational

That No SQL ThingDocument Database Migrations

That No SQL ThingModeling Documents in a Document Database

That No SQL ThingThe relational modeling anti pattern in document databases

That No SQL ThingDocument Databases – usages

That No SQL Thing – Document Databases

reNoSQL, meh

What was going on

That No SQL Thing – Key / Value stores – Usages

That No SQL Thing – Key / Value stores – Operations

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication