Ayende @ Rahien

Unnatural acts on source code

Designing a document database: Scale

In my previous post, I asked about designing a document DB, and brought up the issue of scale, along with a set of questions that affect the design of the system:

  • Do we start from the get go as a distributed DB?

Yes and no. I think that we should start from the get go assuming that a database is not alone, but we shouldn’t burden it with the costs that are associated with this. I think that simply building replication should be a pretty good task, which means that we can push more smarts regarding the distribution into the client library. Simpler server side code usually means goodness, so I think we should go with that.
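To make "smarts in the client library" a bit more concrete, here is a minimal sketch of a client that decides which server owns a document, so the servers never need to know about each other. Everything in it is an assumption for illustration: the `ShardedClient` class, the node URLs, and the choice of hashing the document id are hypothetical, not a description of any actual design.

```python
import hashlib


class ShardedClient:
    """Hypothetical client that routes each document id to one node."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = list(nodes)

    def node_for(self, doc_id):
        # Use a stable hash (not Python's per-process hash()) so that
        # every client maps the same id to the same node.
        digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]


# Placeholder node addresses, assumed for the example.
client = ShardedClient(["http://db1:8080", "http://db2:8080", "http://db3:8080"])
target = client.node_for("users/1")
```

Plain modulo hashing like this reshuffles most documents when a node is added or removed, so a real client library would probably want something smarter. The point is only that the routing decision can live entirely on the client side.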

  • Do we allow relations?
    • Joins?
    • Who resolves them?

Joins are usually not used in a document DB. They are very useful, however. The problem is how we resolve them, and who does it. This is especially true when we consider that a joined document may reside on a completely different server. I think that I am going to stick closely to the actual convention in other document databases, that is, joins are not supported. There is another idea that I am toying with, the notion of document attributes, which may be used to record this, but that is another aspect altogether. See the discussion about attachments for more details.

  • Do we assume data may reside on several nodes?

Yes and no. The database only cares about data that is stored locally. While it may reference data on other nodes, we don’t care about that.

  • Do we allow partial updates to a document?

That is a tricky question. The initial answer is yes, I want this feature. The complete answer is that while I want this feature, I am not sure how I can implement this.

Basically, this is desirable since we can use this to reduce the amount of data we send over the network. The problem is that we run into an interesting issue of how to express that partial update. My current thinking is that we can apply a diff to the initial Json version vs. the updated Json version, and send that. That is problematic since there is no standard way of actually diffing Json. We can just throw it into a string and compare that, of course, but that exposes us to Json format differences (whitespace, key order) that may cause problems.
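For illustration only, here is a minimal sketch of what a field-level diff could look like if we compare parsed documents instead of strings. The patch shape (`set` plus `remove`) and the function names are assumptions made up for this example, not a standard format. It only looks at top-level fields, so a nested object that changed is sent whole.

```python
import json


def diff_documents(old, new):
    """Describe how to turn `old` into `new` (top-level fields only)."""
    # Fields that are new, or whose parsed value differs. Note that
    # Python equality treats 1 and 1.0 as the same value.
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    # Fields that existed before and are gone now must be listed
    # explicitly; otherwise "missing" is ambiguous with "unchanged".
    removed = sorted(k for k in old if k not in new)
    return {"set": changed, "remove": removed}


def apply_patch(doc, patch):
    """Return a new document with the patch applied."""
    result = {k: v for k, v in doc.items() if k not in patch["remove"]}
    result.update(patch["set"])
    return result


# Same data, different whitespace and key order: a string comparison
# would see these as wildly different, the parsed comparison does not.
old = json.loads('{"name": "doc", "email": "a@example.com", "age": 30}')
new = json.loads('{ "age": 31,   "name": "doc" }')

patch = diff_documents(old, new)
# patch is {"set": {"age": 31}, "remove": ["email"]}
```

Applying `patch` to `old` with `apply_patch` gives back `new`. This sidesteps the formatting problem, but it is still a home-grown format that both client and server have to agree on.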

I think that I am going to put this issue as: postponed.

Comments

03/10/2009 05:12 AM by Yitzchok

I know that for now you are only supporting JSON but you might also support compressed JSON to save some space and then you will really get into problems with partial updates.

03/10/2009 01:58 PM by Thomas Eyde

Can't you use a partial json document for partial edits? Just pass along the fields that have changed. Missing fields are not updated.

03/10/2009 02:39 PM by josh

I'm sure you're already thinking this and probably say so in a more recent post, but you can of course use another table to index the documents, as is talked about in the FriendFeed article you posted a couple of days ago. That index table (or tables) doesn't have to reside on the same node as the document table; it just has to give you a way to find the ID of the document you are looking for.

03/10/2009 03:38 PM by Ayende Rahien

Thomas

How do you know that a removed field is not really removed?

03/10/2009 05:10 PM by configurator

json has a simple enough syntax to diff, doesn't it? You can either diff it per-field or normalize the string's whitespace to easily string-diff it. Where's the hard part with diffing json?

03/10/2009 11:39 PM by Thomas Eyde

I don't know why a field should be removed. If that is part of your spec, I guess I would use a special json document just for that.

My thought was that json, and javascript, is dynamic. So we could take advantage of that feature to build up documents which only have changed fields.

03/11/2009 03:03 AM by Ayende Rahien

configurator,

There is no standard way of diffing json

03/11/2009 03:22 AM by Ayende Rahien

A field can be removed for many reasons.

For instance, because its value is missing.

If I am planning to support generic Json docs, then it is pretty obvious that this is a requirement.

03/11/2009 04:32 AM by configurator

"How do you know that a removed field is not really removed?"

The format for partial updates should send a json change object and a list of fields that should be removed.

"There is no standard way of diffing json"

I understand that - I was just saying that it should be relatively trivial to implement as far as I understand json (and I don't, really).

03/11/2009 05:35 AM by Ayende Rahien

configurator,

Yes, it is pretty easy to create a diff format for json.

That is not the problem. I don't want to write one.
