Rhino Divan DB reboot idea
Divan DB is my pet database. I created it to scratch an itch [Nitpickers: please note this!], to see if I could create a CouchDB-like system in .NET. You can read all about it in the following series of posts.
It stalled for a while, mostly because I ran into the hard problems (building map/reduce views). But I think that I actually have a better idea: instead of trying to build something that would just mimic CouchDB, a .NET-based document DB is actually a very achievable goal.
The way it would work is actually pretty simple: the server would accept JSON-formatted documents, like these:
[
  {
    "id": 153,
    "type": "book",
    "name": "Storm from the Shadows",
    "authors": ["David Weber"],
    "categories": ["SciFi", "Awesome", "You gotta read it"],
    "avg_stars": 4.5,
    "reviews": [13, 5423, 423, 123, 512]
  },
  {
    "id": 1337,
    "type": "book",
    "name": "DSLs in Boo",
    "authors": ["Ayende Rahien", "Oren Eini"],
    "categories": ["DSL", ".NET", "You REALLY gotta read it"],
    "avg_stars": 7,
    "reviews": [843, 214, 451]
  }
]
Querying could be done either by id, or using a query on an index. Indexes can be defined using the following syntax:
var booksByTitle = from book in docs where book.type == "book" select new { book.name };
The fun part here is that this index would be translated into a Lucene index, which means that you could query the index using a query:
Query("booksByTitle", "name:Boo") => documents that match this query.
As well as apply any & all the usual Lucene tricks.
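To make the idea concrete, here is a minimal in-memory sketch of what such an index definition does, using plain LINQ-to-objects in place of the real Lucene translation. The Doc class and the query shape here are illustrative guesses, not the actual API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative document shape; the real server would store raw JSON.
public class Doc
{
    public int Id;
    public string Type;
    public string Name;
}

public static class IndexSketch
{
    public static void Main()
    {
        var docs = new List<Doc>
        {
            new Doc { Id = 153,  Type = "book", Name = "Storm from the Shadows" },
            new Doc { Id = 1337, Type = "book", Name = "DSLs in Boo" },
        };

        // The index definition from the post: filter by type, project the name.
        // In the real system, this projection is what gets fed into Lucene.
        var booksByTitle = from doc in docs
                           where doc.Type == "book"
                           select new { doc.Id, doc.Name };

        // A query like Query("booksByTitle", "name:Boo") would then
        // behave roughly like this in-memory filter:
        var matches = booksByTitle.Where(b => b.Name.Contains("Boo")).ToList();

        Console.WriteLine(matches.Count);    // 1
        Console.WriteLine(matches[0].Id);    // 1337
    }
}
```

The point of the design is that the LINQ expression only defines *what* gets indexed; the actual matching is delegated to Lucene rather than evaluated over all documents like this sketch does.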
You don’t get Map/Reduce using this method, but the amount of complexity you have is quite low, and the implementation should take only several days to build.
Thoughts?
Comments
Don't think we missed your little ID number for your own book - I didn't miss it last time either, just didn't have time to comment on it. ;)
It was and still is a very good idea to implement a lightweight document DB in .NET, especially if it could be used both embedded and standalone, and supported System.Transactions.
On the other hand, there are several JSON document databases like MongoDB mentioned above or the Persevere project, both offering very rich functionalities, so you must think what will be the niche for DivanDB and what features will make it a superior tool in that niche.
Those pointing out MongoDB and other NoSQL alternatives need to keep in mind that the .NET connectors for these products are still in their infancy. I know that Craig and Samus are working on a MongoDB connector, but it's not quite there yet, in my opinion. I'm not trying to downplay the effort in any way, though. I just don't think I could take that to my company and say we should build our apps using it. Yet!
So the idea of anyone taking some time to put together an alternative that integrates well from the start and, in the case of Oren, supports their style of work, all while releasing the code and discussing the design choices, is a good thing. It's both a learning experience for all of us and could lead to some new insight into the problem that benefits us all.
I'm all for it.
Famous last words. :-)
But hey, if anybody could pull it off, it's you, Oren.
Hi
Purely from a learning perspective I've been hacking up a documentdb in .net, inspired by mongodb. I've used esent as the backing store thanks to Oren for the idea.
If anyone would be interested in helping out or just looking over the codebase, you can find it below (it's still very hacky in places).
http://github.com/AndyStewart/docsharp
It's very early, only about a week old, and it currently works only in embedded mode, but I thought it would add to the conversation.
Andy
Judah,
That post was written about a week ago; over the last two days I completed the engine implementation.
It is working.
Awesome, nice work! So when can we play with it? :-)
How about instead of a Document, we build an Expression-based database :) that uses Linq internally to query the data?
You can do it... of all people :). I've mused on what it would take but... it's over my head. Your head is massive.
Can't wait to have a go at it - so is this built using Lucene.NET as a class library, i.e. I could use it in a medium trust environment even?
I guess I don't quite understand the difference between creating a JSON document DB versus just using XML or YAML. Yeah, XML isn't as readable and YAML isn't as widespread, but unless you're only ever going to parse the result with JavaScript, what's the advantage in storing the data as JSON?
Rob,
Can you explain what you mean by expression based?
Matt,
This is using Esent, so probably not.
JSON can be read anywhere, including in the browser; it is cheap to create and read, easy to understand, and easy to manipulate.
Compare the cost of reading an XML doc vs. a JSON doc and you'll see the issues.
It is also much more readable and ideally suited for object graph serialization.
@MattMc3 I don't think the point is to STORE the data in JSON, but to interact with it in JSON. Essentially the data store would be any level of native to object representation of said data. The real advantage is having a light interaction to the database query/storage mechanism. Not to mention that queries can naturally translate via JSON based calls to/from the database layer.
@Ayende Are you planning on using IronJS + DLR for a query engine at all? I'm just curious here, as it could allow a similar interactive ability to what is offered by MongoDB. I really like what I've seen in MongoDB, and would love to see the client interactions natively available in Jaxer (spidermonkey based) and node.js (V8 based).
Oren,
Darn (re: Esent) - oh well, still can't wait to see how it turns out. Question: would it be possible to do what you're looking to do with just Lucene.NET indexes?
@Oren what I mean is that (and I'm arm-waving... again) the storage model here is JSON, which is groovy, but I was thinking it would be interesting to somehow store the data using an Expression Tree. The storage would be MemberAccess I spose (not sure how to best leverage it... just musing) but instead of working up JSON calls, you could use LINQ directly.
Does this make sense?
@RobConnery
Isn't all you're asking for a way to query the documents you store through LINQ? i.e.
using (var docDb = new DocDb())
{
    var doc = new Document<Company>();
    doc.Data = new Company { Name = "My Company" };
    docDb.Store(doc);

    var docsFound = docDb.Query<Company>().Where(q => q.Name == "My Company");
}
This is exactly the syntax I'm working on for my docdb; the docs' contents are still stored in JSON, though. They're just queried through LINQ via strongly typed access.
Cheers
Andy
I've had similar ideas about using Lucene as a KV store. Hit a bit of a roadblock when I realised that you have to re-open IndexReaders to pick up changes made by IndexWriters...
It looks like IndexReaders take a snapshot of the index when they open, which could be a massive scalability issue for a line-of-business transaction processing app.
I had a PoC working very quickly, but on realising the above, I switched my backing store to Berkeley DB.
I might be wrong about Lucene's behaviour though.
Andy
Tracker,
Right now, no. It actually shouldn't be hard at all.
Right now I am using LINQ-based expressions to do this, and it is fairly easy to work with.
Matt,
What I actually need is just a way to store blobs by key in an ACID manner.
It isn't hard to write, but ESENT gives it to me for free, and is very easy to use.
It is separated into a distinct place in the app, so you can replace it if you really want to.
It might be interesting to do a BDB storage implementation.
And no, Lucene doesn't offer TX guarantees, so you need to handle this differently.
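The storage contract being described here is tiny. A hypothetical sketch of it, backed by an in-memory dictionary purely for illustration (the interface name and shape are made up; the real implementation sits on ESENT):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical contract: all the document engine needs from its storage
// is atomic put/get of blobs by key.
public interface IBlobStore
{
    void Put(string key, byte[] blob);   // atomic per operation
    byte[] Get(string key);              // null when the key is missing
}

// In-memory stand-in; a real version would wrap ESENT transactions,
// which is what makes the "replace the backing store" point cheap.
public class InMemoryBlobStore : IBlobStore
{
    private readonly Dictionary<string, byte[]> store =
        new Dictionary<string, byte[]>();

    public void Put(string key, byte[] blob)
    {
        lock (store) store[key] = blob;
    }

    public byte[] Get(string key)
    {
        lock (store) return store.TryGetValue(key, out var blob) ? blob : null;
    }
}

public static class BlobStoreDemo
{
    public static void Main()
    {
        IBlobStore db = new InMemoryBlobStore();
        db.Put("books/1337", Encoding.UTF8.GetBytes("{\"name\":\"DSLs in Boo\"}"));
        Console.WriteLine(Encoding.UTF8.GetString(db.Get("books/1337")));
    }
}
```

Because the engine only depends on an interface this small, swapping ESENT for BDB (or anything else transactional) is a matter of writing one adapter class.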
You are hand-waving enough that this is impossible to evaluate. I don't think that what you want is even desirable.
Can you show some code here to explain that?
Andrew,
The problem is that your method requires an O(N) approach and running in the same address space.
Andy,
You are correct about Lucene needing to re-open the index readers; that is actually a big plus, because it means that readers don't have to wait for writers, and I don't get the scalability issue.
Lucene is highly scalable.
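The snapshot behavior being discussed can be sketched with a copy-on-write pattern: a reader captures a frozen view, while the writer publishes replacements without ever blocking it. This is a conceptual illustration only, not Lucene's actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class SnapshotSketch
{
    // The "index": a writer never mutates a published dictionary.
    // It builds a copy and swaps the reference atomically, so any
    // reader opened earlier keeps its frozen view.
    private static Dictionary<string, string> index =
        new Dictionary<string, string> { ["153"] = "Storm from the Shadows" };

    public static Dictionary<string, string> OpenReader()
    {
        // Capturing the current reference freezes the view,
        // much like opening a Lucene IndexReader.
        return Volatile.Read(ref index);
    }

    public static void Write(string key, string value)
    {
        var copy = new Dictionary<string, string>(Volatile.Read(ref index))
        {
            [key] = value
        };
        Volatile.Write(ref index, copy);  // publish the new snapshot
    }

    public static void Main()
    {
        var reader = OpenReader();
        Write("1337", "DSLs in Boo");  // the writer never waits for the reader

        Console.WriteLine(reader.ContainsKey("1337"));        // False: frozen snapshot
        Console.WriteLine(OpenReader().ContainsKey("1337"));  // True: re-opened reader
    }
}
```

The trade-off is exactly the one in the comments above: readers are cheap and never block, at the cost of having to re-open to see new writes.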
The redis project ( http://code.google.com/p/redis/) is a document db where the documents have types (string, list, set, ordered set ) allowing you to do interesting operations at the server ( http://code.google.com/p/redis/wiki/CommandReference). Sharing this because if I were writing a document db I'd want to see it.
Dan,
I know of that.
My thought about this:
ayende.com/.../...alue-programming-model-to-a.aspx
Redis lets you store and retrieve blobs atomically; it also lets you store and retrieve lists, sets, and ordered sets of blobs fast. I've developed an open source generic Redis client that can store and retrieve the entire Northwind database (3,202 records) in less than 1.2 seconds (on my 3-year-old iMac) here:
code.google.com/.../ServiceStackRedis
I've opted for a smaller, faster serialization format that is over 5x faster than JSON (it's effectively JSON with the quotes and whitespace removed), as the JSON serializers in .NET were having a noticeable impact on performance:
Benchmarks:
www.servicestack.net/.../...-times.2010-02-06.html
Serialization format:
code.google.com/p/servicestack/wiki/TypeSerializer
Though I have to say, adding search capabilities via Lucene is quite an interesting idea. Are you doing real-time searches with Lucene? i.e. are you adding a document to the Lucene index as soon as you've added it to Rhino DB? My old problem with Lucene was that it didn't handle updates very well (the recommendation was to build a new index instead), but it looks like the new version does.
Demis,
I am doing background indexing, but we are talking about a 25 ms update time from insert to index update.
FYI, Lucene 2.9+ supports "near realtime" updates. You can get an IndexReader from an IndexWriter that will return documents that haven't been flushed yet.
You can also take a look at Zoie ( http://code.google.com/p/zoie/) which does something similar.
Very cool project to be doing this. I think there is definitely room for more diversity in the document database world.
I'd ask you to at least consider maintaining superficial compatibility with CouchDB (e.g. use _id instead of id, and _rev if you have MVCC), and consider reusing our JavaScript view engine (it can be imported without any Erlang code at all, etc.).
There is still a lot of room to add new value beyond CouchDB. We consider CouchDB to be a protocol even more than a database. I've started a project to port it to Ruby here: http://github.com/jchris/booth. This would be a nice place to steal the map/reduce code from.
I'd be really curious to see how far you can deviate from CouchDB (implementation, API, targeted use cases) and still maintain, for instance, the ability to replicate with CouchDB, reuse CouchDB design documents, etc.
What do you think of that challenge?
Chris,
I am actually thinking about having each document be composed of data & metadata; the id & rev would go in the metadata.
That would allow easier extensibility for things like adding security filters using the user's code. I believe I got the idea from Couch, but I am not sure.
Replication is something that I don't intend to deal with for the v1.0 release.
We have something very similar to design docs: basically a set of docs that use LINQ to define an index, which allows very efficient indexing.
Note that we don't even use the term view, mostly because it isn't a view like in couch, but rather just a way to define interesting indexes. We do allow filtering on the indexes, though, and some rather interesting document flattening.
Thanks for the pointer about booth, I'll look into that once the current crunch is over. I certainly find Ruby easier to grok than Erlang.
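A rough sketch of the data-plus-metadata split described above; the class shape and key names here are guesses for illustration, not the actual design:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the split: the body stays opaque JSON, while system-level
// concerns (id, rev, security info) live in a separate metadata bag.
public class Document
{
    public string Data;                          // the raw JSON body
    public Dictionary<string, string> Metadata = // id, rev, filters, ...
        new Dictionary<string, string>();
}

public static class MetadataDemo
{
    public static void Main()
    {
        var doc = new Document { Data = "{\"name\":\"DSLs in Boo\"}" };
        doc.Metadata["id"] = "books/1337";
        doc.Metadata["rev"] = "1";

        // A security filter can inspect the metadata without
        // ever parsing the document body.
        Console.WriteLine(doc.Metadata["id"]);
    }
}
```

The extensibility point is that new system concerns can be added as metadata keys without touching, or even understanding, the stored document body.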
Ayende,
It sounds like you are on the right track.
One really cool benefit you have if you stay close enough to the CouchDB API is that Futon should "just work". If your API differs enough that Futon doesn't work out of the box, you could of course build an HTTP facade to mimic CouchDB.
There is, for instance, a Futon 4 Mongo project that does this. Pretty cool, if you ask me: http://github.com/sbellity/futon4mongo. If they got cross-replication with CouchDB to work, then it'd be not just cool but seriously useful.
Let's say I want to store and query over dates. How would I do that?
A range query, probably; you'd need to be more specific.
Are floats, strings and lists the only data types supported in the JSON representation? If I need a higher-level data type, such as Date, how does the index get generated? Or do I have to represent it as iso-8601 and index that?
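One common way to handle this with a string-based index (my own assumption here, not something stated in the post) is exactly the ISO-8601 route: store dates in a format where lexicographic order matches chronological order, so string range queries behave like date range queries:

```csharp
using System;

public static class DateIndexDemo
{
    public static void Main()
    {
        // Fixed-width, big-endian date format: lexicographic order
        // on the strings agrees with chronological order on the dates.
        string Key(DateTime d) => d.ToString("yyyy-MM-ddTHH:mm:ss");

        string a = Key(new DateTime(2009, 12, 31));
        string b = Key(new DateTime(2010, 2, 7));

        // An ordinal string comparison now answers "is a before b?"
        Console.WriteLine(string.CompareOrdinal(a, b) < 0);  // True
    }
}
```

This is why formats like "12/31/2009" don't work as index keys: they compare month-first, so the string order no longer matches the date order.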