Ayende @ Rahien

Rhino Divan DB reboot idea

Divan DB is my pet database. I created it to scratch an itch [Nitpickers: please note this!], to see if I could create a Couch DB-like system in .NET. You can read all about it in the following series of posts.

It stalled for a while, mostly because I ran into the hard problems (building map/reduce views). But I think that I actually have a better idea: instead of trying to build something that would just mimic Couch DB, building a .NET based document DB on its own terms is a very achievable goal.

The way it would work is actually pretty simple: the server would accept JSON-formatted documents, like these:

[
    {
        "id": 153,
        "type": "book",
        "name": "Storm from the Shadows",
        "authors": [
            "David Weber"
        ],
        "categories": [
            "SciFi",
            "Awesome",
            "You gotta read it"
        ],
        "avg_stars": 4.5,
        "reviews": [13,5423,423,123,512]
    },
    {
        "id": 1337,
        "type": "book",
        "name": "DSLs in Boo",
        "authors": [
            "Ayende Rahien",
            "Oren Eini"
        ],
        "categories": [
            "DSL",
            ".NET",
            "You REALLY gotta read it"
        ],
        "avg_stars": 7,
        "reviews": [843,214,451]
    }
]

Querying could be done either by id, or using a query on an index. Indexes can be defined using the following syntax:

var booksByTitle = 
   from book in docs
   where book.type == "book"
   select new { title = book.name };

The fun part here is that this index would be translated into a Lucene index, which means that you could query it using Lucene query syntax:

Query("booksByTitle", "title:Boo") -> documents that match this query.

You can also apply any and all of the usual Lucene tricks.
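
To make the translation concrete, here is a rough sketch (not the actual engine code) of what that could look like once the projected field is fed into Lucene.NET and queried. It uses a Lucene.Net 3.x-style API, and the field names ("title", "docId") and class name are mine, chosen just for illustration:

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class BooksByTitleSketch
{
    static void Main()
    {
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // The "select new { title = book.name }" projection becomes a Lucene
        // document holding the projected field plus the id of the source document.
        using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var luceneDoc = new Document();
            luceneDoc.Add(new Field("title", "DSLs in Boo", Field.Store.YES, Field.Index.ANALYZED));
            luceneDoc.Add(new Field("docId", "1337", Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.AddDocument(luceneDoc);
        }

        // Query("booksByTitle", "title:Boo") boils down to something like this:
        var searcher = new IndexSearcher(directory, true);
        var query = new QueryParser(Version.LUCENE_30, "title", analyzer).Parse("title:Boo");
        foreach (var hit in searcher.Search(query, 10).ScoreDocs)
            Console.WriteLine(searcher.Doc(hit.Doc).Get("docId")); // ids of the matching documents
    }
}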

You don’t get Map/Reduce using this method, but the amount of complexity you have is quite low, and the implementation should take only several days to build.

Thoughts?

Posted By: Ayende Rahien

Comments

02/25/2010 01:21 PM by Kyle Szklenski

Don't think we missed your little ID number for your own book - I didn't miss it last time either, just didn't have time to comment on it. ;)

02/25/2010 01:35 PM by Rafal

It was and still is a very good idea to implement a lightweight document db in .NET, especially if it could be used both embedded and standalone and supported System.Transactions.

On the other hand, there are several JSON document databases like MongoDB mentioned above or the Persevere project, both offering very rich functionalities, so you must think what will be the niche for DivanDB and what features will make it a superior tool in that niche.

02/25/2010 02:42 PM by iLude

Those pointing out MongoDB and other NoSQL alternatives need to keep in mind that the .NET connectors for these products are still in their infancy. I know that Craig and Samus are working on a MongoDB connector, but it's not quite there yet, in my opinion. I'm not trying to downplay the effort in any way, though. I just don't think I could take that to my company and say we should build our apps using it. Yet!

So the idea of someone taking the time to put together an alternative that integrates well from the start and, in the case of Oren, supports their style of work, all while releasing the code and discussing the design choices, is a good thing. It's both a learning experience for all of us and could lead to some new insight into the problem that benefits us all.

02/25/2010 03:41 PM by Judah Himango

"...and the implementation should take only several days to build."

Famous last words. :-)

But hey, if anybody could pull it off, it's you, Oren.

02/25/2010 03:44 PM by Andrew Stewart

Hi

Purely from a learning perspective, I've been hacking up a document db in .NET, inspired by MongoDB. I've used Esent as the backing store; thanks to Oren for the idea.

If anyone would be interested in helping out or just looking over the codebase, you can find it below (still very hacky in places).

http://github.com/AndyStewart/docsharp

It's very early, about a week old, and currently only works in embedded mode, but I thought it would add to the conversation.

Andy

02/25/2010 03:48 PM by Ayende Rahien

Judah,

That post was written about a week ago. Over the past two days I completed the engine implementation.

It is working.

02/25/2010 05:26 PM by Judah Himango

Awesome, nice work! So when can we play with it? :-)

02/25/2010 06:27 PM by Rob Conery

How about instead of a Document, we build an Expression-based database :) that uses Linq internally to query the data?

You can do it... of all people :). I've mused on what it would take but... it's over my head. Your head is massive.

02/25/2010 07:00 PM by Matt

Can't wait to have a go at it - so is this built using Lucene.NET as a class library, i.e. could I even use it in a medium trust environment?

02/25/2010 07:30 PM by MattMc3

I guess I don't quite understand the difference between creating a JSON document DB versus just using XML or YAML. Yeah, XML isn't as readable and YAML isn't as widespread - but unless you're only ever going to parse the result with JavaScript - what's the advantage in storing the data as JSON?

02/25/2010 07:51 PM by Ayende Rahien

Rob,

Can you explain what you mean by expression-based?

02/25/2010 07:51 PM by Ayende Rahien

Matt,

This is using Esent, so probably not.

02/25/2010 07:53 PM by Ayende Rahien

JSON can be read anywhere, including in the browser; it is cheap to create and read, easy to understand, and easy to manipulate.

Compare the cost of reading an XML doc vs. a JSON doc and you'll see the issues.

It is also much more readable and ideally suited for object graph serialization.

02/25/2010 08:10 PM by Tracker1

@MattMc3 I don't think the point is to STORE the data in JSON, but to interact with it in JSON. Essentially the data store could keep whatever native or object representation of the data it likes. The real advantage is having a lightweight way to interact with the database's query/storage mechanism. Not to mention that queries can naturally translate via JSON-based calls to/from the database layer.

@Ayende Are you planning on using IronJS + DLR for a query engine at all? I'm just curious here, as it could allow a similar interactive ability to what is offered by MongoDB. I really like what I've seen in MongoDB, and would love to see the client interactions natively available in Jaxer (spidermonkey based) and node.js (V8 based).

02/25/2010 08:33 PM by Matt

Oren,

Darn (re: Esent) - oh well, still can't wait to see how it turns out. Question: would it be possible to do what you're looking to do with just Lucene.NET indexes?

02/25/2010 08:38 PM by Rob Conery

@Oren what I mean is that (and I'm arm-waving... again) the storage model here is JSON, which is groovy, but I was thinking it would be interesting to somehow store the data using an Expression Tree. The storage would be MemberAccess I suppose (not sure how to best leverage it... just musing), but instead of working up JSON calls, you could use LINQ directly.

Does this make sense?

02/25/2010 08:52 PM by Andrew Stewart

@RobConnery

Isn't all you're asking for a way to query the documents you store through LINQ? i.e.

using (var docDb = new DocDb())
{
    var doc = new Document<Company>();
    doc.Data = new Company { Name = "My Company" };
    docDb.Store(doc);

    var docsFound = docDb.Query<Company>().Where(q => q.Name == "My Company");
}

This is exactly the syntax I'm working on for my docdb; the docs' contents are still stored in JSON, though. They're just queried through LINQ and accessed via strongly typed objects.

Cheers

Andy

02/25/2010 09:26 PM by Andy Hitchman

I've had similar ideas about using Lucene as a KV store. Hit a bit of a roadblock when I realised that you have to re-open IndexReaders to pick up changes made by IndexWriters...

It looks like IndexReaders take a snapshot of the index when they open, which could be a massive scalability issue for a line-of-business transaction processing app.

I had a PoC working very quickly, but on realising the above, I switched my backing store to Berkeley DB.

I might be wrong about Lucene's behaviour though.

Andy

02/25/2010 11:53 PM by Ayende Rahien

Tracker,

Right now, no, but it actually shouldn't be hard at all.

Right now I am using Linq based expressions to do this, and it is fairly easy to work with.

02/25/2010 11:54 PM by Ayende Rahien

Matt,

What I actually need is just a way to store blobs by key in an ACID manner.

It isn't hard to write, but ESENT gives it to me for free, and is very easy to use.

It is separated into a distinct place in the app, so you can replace it if you really want to.

It might be interesting to do a BDB storage implementation.

And no, Lucene doesn't offer TX guarantees, so you need to handle this differently.
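
To sketch the kind of storage seam I mean (the names here are just for illustration, not the actual code), all the engine needs is something like:

using System;

// Hypothetical names, for illustration only: all the engine needs from the
// storage layer is ACID blob-by-key access.
public interface IStorageTransaction : IDisposable
{
    void Commit(); // rolled back if disposed without committing
}

public interface ITransactionalBlobStore : IDisposable
{
    IStorageTransaction BeginTransaction();
    void Put(string key, byte[] document);  // add or overwrite a document blob
    byte[] Get(string key);                 // null when the key does not exist
    void Delete(string key);
}

// The default implementation would sit on top of ESENT; a Berkeley DB backed
// one could plug in behind the same interface.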

02/25/2010 11:55 PM by Ayende Rahien

Rob,

You are hand-waving enough to make this impossible to pin down. I don't think that what you want is even desirable.

Can you show some code here to explain that?

02/25/2010 11:56 PM by Ayende Rahien

Andrew,

The problem is that your method requires an O(N) approach and requires running in the same address space.

02/25/2010 11:56 PM by Ayende Rahien

Andy,

You are correct about Lucene needing to re-open the index readers. That is actually a big plus, because it means that readers don't have to wait for writers, so I don't get the scalability issue.

Lucene is highly scalable.

02/26/2010 12:53 AM by Dan Finch

The redis project (http://code.google.com/p/redis/) is a document db where the documents have types (string, list, set, ordered set), allowing you to do interesting operations on the server (http://code.google.com/p/redis/wiki/CommandReference). Sharing this because if I were writing a document db I'd want to see it.

02/26/2010 10:33 AM by Demis Bellot

Redis lets you store and retrieve blobs atomically; it also lets you store and retrieve lists, sets, and ordered sets of blobs fast. I've developed an open source generic Redis client that can store and retrieve the entire Northwind database (3202 records) in less than 1.2 seconds (on my 3-year-old iMac) here:

code.google.com/.../ServiceStackRedis

I've opted for a smaller, faster serialization format that is over 5x faster than JSON (it's effectively JSON with the quotes and whitespace removed), as the JSON serializers in .NET were having a noticeable impact on performance:

Benchmarks:

www.servicestack.net/.../...-times.2010-02-06.html

Serialization format:

code.google.com/p/servicestack/wiki/TypeSerializer

Though I have to say adding search capabilities via Lucene is quite an interesting idea. Are you doing real-time searches with Lucene? I.e. are you adding a document to the Lucene index as soon as you've added it to Rhino DB? My old problem with Lucene was that it didn't use to handle updates very well (they recommended that you build a new index instead), but it now looks like the new version does.

02/26/2010 11:25 AM by Ayende Rahien

Demis,

I am doing background indexing, but we are talking about a 25 ms delay from insert to index update.
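
Roughly speaking (this is an illustrative sketch, not the actual engine code), the background indexing loop amounts to something like:

using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch: documents are committed to the document store first,
// then queued for Lucene indexing on a background thread, so the index
// trails the store by a few milliseconds.
public class BackgroundIndexer
{
    private readonly Queue<string> pendingDocumentIds = new Queue<string>();
    private readonly object sync = new object();

    public BackgroundIndexer()
    {
        new Thread(IndexLoop) { IsBackground = true }.Start();
    }

    // Called right after a document is committed to the document store.
    public void DocumentStored(string id)
    {
        lock (sync)
        {
            pendingDocumentIds.Enqueue(id);
            Monitor.Pulse(sync);
        }
    }

    private void IndexLoop()
    {
        while (true)
        {
            string id;
            lock (sync)
            {
                while (pendingDocumentIds.Count == 0)
                    Monitor.Wait(sync);
                id = pendingDocumentIds.Dequeue();
            }
            UpdateLuceneIndex(id); // load the document and (re)index it
        }
    }

    private void UpdateLuceneIndex(string id) { /* omitted in this sketch */ }
}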

02/26/2010 02:26 PM by Chris Smith

FYI, Lucene 2.9+ supports "near realtime" updates. You can get an IndexReader from an IndexWriter that will return documents that haven't been flushed yet.

You can also take a look at Zoie ( http://code.google.com/p/zoie/) which does something similar.
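
For reference, the near-real-time usage is roughly this (a sketch against a Lucene.Net 3.x-style API; the field name and class name are illustrative):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class NearRealTimeSketch
{
    static void Main()
    {
        // A reader obtained from the writer also sees documents that have
        // not been committed to disk yet.
        var directory = new RAMDirectory();
        var writer = new IndexWriter(directory, new StandardAnalyzer(Version.LUCENE_30),
                                     IndexWriter.MaxFieldLength.UNLIMITED);

        var doc = new Document();
        doc.Add(new Field("title", "DSLs in Boo", Field.Store.YES, Field.Index.ANALYZED));
        writer.AddDocument(doc);               // not committed yet

        var nrtReader = writer.GetReader();    // still sees the document above
        var searcher = new IndexSearcher(nrtReader);
        // ... search as usual; call writer.GetReader() again to pick up later writes
    }
}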

02/28/2010 03:45 AM by J Chris A

Very cool project to be doing this. I think there is definitely room for more diversity in the document database world.

I'd ask you to at least consider maintaining superficial compatibility with CouchDB (e.g. use _id instead of id, and _rev if you have MVCC), and consider reusing our JavaScript view engine (it can be imported without any Erlang code at all, etc.).

There is still a lot of room to add some new value beyond CouchDB. We consider CouchDB to be a protocol even more than a database. I've started a project to port it to Ruby here: http://github.com/jchris/booth. This would be a nice place to steal the map/reduce code from.

I'd be really curious to see how far you can deviate from CouchDB (implementation, API, targeted use cases) and still maintain, for instance, the ability to replicate with CouchDB, reuse CouchDB design documents, etc.

What do you think of that challenge?

02/28/2010 03:59 AM by Ayende Rahien

Chris,

I am actually thinking about having each document be composed of data & metadata; the id & rev would go in the metadata.

That would allow easier extensibility for things like adding security filters using the user's code. I believe I got the idea from Couch, but I am not sure.
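
To make the data & metadata split concrete, a stored document might end up looking something like this (purely illustrative; the exact shape isn't settled):

{
    "metadata": {
        "id": 1337,
        "rev": 1
    },
    "data": {
        "type": "book",
        "name": "DSLs in Boo",
        "authors": ["Ayende Rahien", "Oren Eini"]
    }
}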

Replication is something that I don't intend to deal with for the v1.0 release.

We have something very similar to design docs, basically a set of docs that uses Linq to define an index, which allows very efficient indexing.

Note that we don't even use the term view, mostly because it isn't a view like in couch, but rather just a way to define interesting indexes. We do allow filtering on the indexes, though, and some rather interesting document flattening.

Thanks for the pointer about booth, I'll look into that once the current crunch is over. I certainly find Ruby easier to grok than Erlang.

02/28/2010 04:18 AM by J Chris A

Ayende,

It sounds like you are on the right track.

One really cool benefit you have if you stay close enough to the CouchDB API is that Futon should "just work". If your API differs enough that Futon doesn't work out of the box, you could of course build an HTTP facade to mimic CouchDB.

There is, for instance, a Futon 4 Mongo project that does this. Pretty cool if you ask me: http://github.com/sbellity/futon4mongo. If they got cross-replication with CouchDB to work, then it'd be not just cool, but seriously useful.

03/03/2010 04:57 PM by RichB

Let's say I want to store and query over dates. How would I do that?

03/03/2010 05:08 PM by Ayende Rahien

A range query, probably; you need to be more specific.
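
For example, assuming dates are indexed in a lexicographically sortable form (say yyyyMMdd), and with made-up index and field names, it would look something like:

Query("booksByPublishDate", "published_at:[20100101 TO 20101231]") -> books published in 2010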

03/03/2010 06:34 PM by RichB

Are floats, strings and lists the only data types supported in the JSON representation? If I need a higher-level data type, such as Date, how does the index get generated? Or do I have to represent it as ISO-8601 and index that?

Comments have been closed on this topic.