Designing a document database: Views
One of the more interesting problems with document databases is how you handle views. But a lot of people already had some trouble understanding what I mean by a document database (hint: I am not talking about a Word docs repository), so I had better explain what I mean by this.
A document database stores documents. Those aren’t what most people would consider documents, however. They are not Excel or Word files. Rather, we are talking about storing data in a well-known format, but with no schema. Consider the case of storing an XML document or a JSON document. In both cases, we have a well-known format, but there is no required schema for it. That is, after all, one of the advantages of a document DB’s schema-less nature.
However, trying to query on top of schema-less data can be… problematic. Unless you are talking about Lucene, which I would consider to be a document indexer rather than a document DB, although it can be used as such. Even with Lucene, you have to specify the things that you are actually interested in to be able to search on them.
So, what are views? Views are a way to transform a document into some well-known and well-defined format. For example, let us say that I want to use my DB to store wiki information. I can do this easily enough by storing the document as a whole, but how do I look up a page by its title? Trying to do this on the fly is a recipe for disastrous performance. In most document databases, the answer is to create a view. For RDBMS people, a DDB view is roughly what an RDBMS calls a materialized view.
I thought about creating it like this:
Please note that this is only to demonstrate the concept; actually implementing the above syntax requires either on-the-fly rewrites or C# 4.0.
The code above can scan through the relevant documents and, in a very clean fashion (I think), generate the values that we actually care about. Basically, we have now created a view called “pagesByTitleAndVersion”, indexed by title (ascending) and version (descending). We can now query this view for a particular value and get it very quickly.
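To make the idea of a projection view concrete, here is a minimal Python sketch. It is not the C# syntax discussed above, just an illustration under stated assumptions: the in-memory `documents` dictionary, the field names `type`, `title` and `version`, and the helpers `build_view` and `query_by_title` are all hypothetical.

```python
from bisect import bisect_left, insort

# Hypothetical schema-less store: plain dicts keyed by document id.
documents = {
    "pages/1": {"type": "page", "title": "Home", "version": 1},
    "pages/2": {"type": "page", "title": "Home", "version": 2},
    "pages/3": {"type": "page", "title": "About", "version": 1},
    "users/1": {"type": "user", "name": "someone"},
}

def pages_by_title_and_version(doc_id, doc):
    """Projection: emit one index entry per page, nothing for other documents."""
    if doc.get("type") == "page":
        # Negating the version makes an ascending sort yield versions descending.
        yield (doc["title"], -doc["version"], doc_id)

def build_view(docs, projection):
    """Run the projection over every document and keep the entries sorted."""
    entries = []
    for doc_id, doc in docs.items():
        for entry in projection(doc_id, doc):
            insort(entries, entry)
    return entries

def query_by_title(view, title):
    """Binary-search to the first entry for the title, then read its run."""
    i = bisect_left(view, (title,))
    result = []
    while i < len(view) and view[i][0] == title:
        result.append(view[i][2])
        i += 1
    return result

view = build_view(documents, pages_by_title_and_version)
newest_first = query_by_title(view, "Home")
```

The point of the sketch is that the query never touches the documents themselves: it only searches the sorted entries the projection produced, which is why the lookup is cheap.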
Note that this means that updating views happens as part of a background process, so there is going to be some delay between updating the document and updating the view. That is BASE for you :-)
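The delay can be sketched in a few lines of Python. This is only an illustration, assuming a hypothetical `Store` class in which writes are queued and a separate `run_indexer` step (standing in for the background process) brings the view up to date. It ignores deletes and title changes.

```python
from collections import deque

class Store:
    """Hypothetical store with a lazily updated title view."""

    def __init__(self):
        self.documents = {}     # doc_id -> document (source of truth)
        self.title_index = {}   # view: title -> doc_id
        self.pending = deque()  # ids written but not yet indexed

    def put(self, doc_id, doc):
        """Write path: persist the document and defer the view update."""
        self.documents[doc_id] = doc
        self.pending.append(doc_id)

    def run_indexer(self):
        """Stand-in for the background process: drain pending view updates."""
        while self.pending:
            doc_id = self.pending.popleft()
            doc = self.documents[doc_id]
            if doc.get("type") == "page":
                self.title_index[doc["title"]] = doc_id

    def find_by_title(self, title):
        """Read path: consult only the view, never the documents."""
        return self.title_index.get(title)

store = Store()
store.put("pages/1", {"type": "page", "title": "Home", "version": 1})
stale = store.find_by_title("Home")  # the view has not caught up yet
store.run_indexer()
fresh = store.find_by_title("Home")  # the view now reflects the write
```

Between `put` and `run_indexer` the document exists but the view does not know about it yet, which is the window of staleness described above.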
Another important thing is that this syntax is for projections only. Those are actually very simple to build. Well, simple is relative: there is going to be some very funky LINQ stuff going on in there, but from my perspective it is fairly straightforward. The part that is going to be much harder to deal with is aggregation. I am going to deal with that separately, however.
More posts in "Designing a document database" series:
- (17 Mar 2009) What next?
- (16 Mar 2009) Remote API & Public API
- (16 Mar 2009) Looking at views
- (15 Mar 2009) View syntax
- (14 Mar 2009) Aggregation Recalculating
- (13 Mar 2009) Aggregation
- (12 Mar 2009) Views
- (11 Mar 2009) Replication
- (11 Mar 2009) Attachments
- (10 Mar 2009) Authorization
- (10 Mar 2009) Concurrency
- (10 Mar 2009) Scale
- (10 Mar 2009) Storage
Ok, now I see I want it. Are you going to release it open-sourcely?
And BTW, aren't views a good candidate to be based on RDBMS? It would save you implementation of joins, sorting, query capabilities... performance shouldn't be a problem since they could be optimized for querying.
one question, however, why would you have to wait until C# 4.0?
what are they doing to the CLR or to the code base that would allow for this to happen?
i guess the bigger question really would be... why would you have to wait for anything to be able to execute that kind of code?
wouldn't the view be "updated" anytime that a document was persisted? and if so wouldn't the information needed to update the view already be in the context that you are working in?
or is the issue that the document db doesn't know how to update the view
i am starting to lose my mind
Fascinating series... Thank you for sharing. It's not clear to me (from this post) where the updating of the view indexes is happening?
@Rafal, i think you'd have to be concerned about changes to a document's title in the doc db if that view was stored in an RDBMS.
meisinger: Perhaps dynamic is necessary, because the documents are untyped and you are accessing their properties (Type, Title, Version)?
strong type sucks
when it is built, probably
views aren't a good idea on top of an RDBMS. the view generation tends to take a lot of time in many scenarios. Remember that the data itself is schema-less, so a lot of the RDBMS advantages are just not there.
Wait for it...
You don't see where updating the views happens because I haven't discussed it yet, wait for it...
Ayende, for C# 3 wouldn't a Dictionaryish syntax be best
doc["Type"], doc["Title"], doc["Version"]
The disadvantage with this syntax is that it is not strongly typed - but neither is the dynamic syntax.
Another option is to use some sort of duck typing for your docs
For that query, all docs must support a certain interface (with Type, Title and Version as properties). I.e. it is possible to create an interface for your DDB where it is queried as such:
The interface as a type parameter causes a duck class to be generated by the DDB, mapping the given properties into their matching indexer.
First, I like this series of posts!
What you are trying to accomplish reminds me a lot of Lotus Notes, which doesn't bring back good memories. ;-)
This is a fascinating series of articles, Oren, thank you.
I have a similar requirement to document database views coming up in my current work. We need view support for a system built on top of db4o to support various reporting scenarios and things like incremental search in a performant way. Lucene is my toy of choice for things like this, and we are already using it for complex queries. It will be great to see how you tackle your query language / criteria for this.
And tuples can't come soon enough for me.
This is an interesting series, and now I finally grasp what it's all about :)
I have been reading about azure storage tables and one of the ideas is partitions -- queries over a single partition are fast, queries over multiple partitions are slow. The problem is that a partition that optimizes a query over one property might suck for a query over a different property.
A solution would be to replicate the data (or a reference to it) with different partitions optimized per property you want to query on. I have no idea if this is optimal (hoping some good practices for azure storage are shown at MIX). The problem is that if you want to add a property or a new query, or even insert new data, these multiple partitions can easily get out of sync.
Seeing this post about views finally makes it all click. This is very interesting stuff with regards to scaling.
But that syntax is just ugly
I am actually looking for someone to sponsor the development of this :-)