One of the more interesting problems with document databases is how you handle views. But a lot of people already had some issues with understanding what I mean with document database (hint, I am not talking about a word docs repository), so I have better explain what I mean by this.
A document database stores documents. Those aren’t what most people would consider as a document, however. It is not excel or word files. Rather, we are talking about storing data in a well known format, but with no schema. Consider the case of storing an XML document or a Json document. In both cases, we have a well known format, but there is not a required schema for those. That is, after all, one of the advantages of document db’s schema less nature.
However, trying to query on top of schema less data can be… problematic. Unless you are talking about lucene, which I would consider to be a document indexer rather than a document DB, although it can be used as such. Even with lucene, you have to specify the things that you are actually interested on to be able to search on them.
So, what are views? Views are a way to transform a document to some well known and well defined format. For example, let us say that I want to use my DB to store wiki information, I can do this easily enough by storing the document as a whole, but how do I lookup a page by its title? Trying to do this on the fly is a receipt for disastrous performance. In most document databases, the answer is to create a view. For RDMBS people, a DDB view is often called a materialized view in an RDMBS.
I thought about creating it like this:
Please note that this is only to demonstrate the concept, actually implementing the above syntax requires either on the fly rewrites or C# 4.0
The code above can scan through the relevant documents, and in a very clean fashion (I think), generate the values that we actually care about. Basically, we now have created a view called “pagesByTitleAndVersion”, index by title (ascending) and version (descending). We can now query this view for a particular value, and get it in a very quick manner.
Note that this means that updating views happen as part of a background process, so there is going to be some delay between updating the document and updating the view. That is BASE for you :-)
Another important thing is that this syntax is for projections only. Those are actually very simple to build. Well, simple is relative, there is going to be some very funky Linq stuff going on in there, but from my perspective, it is fairly straightforward. The part that is going to be much harder to deal with is aggregation. I am going to deal with that separately, however.
More posts in "Designing a document database" series:
- (17 Mar 2009) What next?
- (16 Mar 2009) Remote API & Public API
- (16 Mar 2009) Looking at views
- (15 Mar 2009) View syntax
- (14 Mar 2009) Aggregation Recalculating
- (13 Mar 2009) Aggregation
- (12 Mar 2009) Views
- (11 Mar 2009) Replication
- (11 Mar 2009) Attachments
- (10 Mar 2009) Authorization
- (10 Mar 2009) Concurrency
- (10 Mar 2009) Scale
- (10 Mar 2009) Storage