Designing a document database: Scale
In my previous post, I asked about designing a document DB, and brought up the issue of scale, along with a set of questions that affect the design of the system:
- Do we start from the get go as a distributed DB?
Yes and no. I think that we should start from the get-go assuming that a database is not alone, but we shouldn’t burden it with the costs that are associated with this. I think that simply building replication should be a pretty good task, which means that we can push more of the smarts regarding distribution into the client library. Simpler server-side code is usually a good thing, so I think we should go with that.
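As a rough illustration of what "smarts in the client library" could mean, here is a minimal sketch of a client that decides which server owns a document, so the servers themselves never need to know about each other. The class name `ShardedClient`, the `node_for` method, and the node URLs are all hypothetical, and hashing the document id is only one possible routing strategy, not something this design has settled on.

```python
import hashlib


class ShardedClient:
    """Hypothetical client-side router: picks the server that owns a document."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = list(nodes)

    def node_for(self, doc_id: str) -> str:
        # Use a stable hash; Python's built-in hash() for strings is
        # randomized per process and would route differently on each run.
        digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]


# Assumed node addresses, for illustration only.
client = ShardedClient(["http://db1:8080", "http://db2:8080", "http://db3:8080"])
target = client.node_for("users/1")
```

The server code stays simple because it only ever sees requests for documents it stores. The obvious weakness of plain modulo hashing is that adding or removing a node remaps most document ids, which is the kind of cost a smarter client would have to deal with.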
- Do we allow relations?
- Joins?
- Who resolves them?
Joins are usually not used in a document DB. They are very useful, however. The problem is how we resolve them, and who does it. This is especially true when we consider that a joined document may reside on a completely different server. I think that I am going to stick closely to the convention in other document databases, that is, joins are not supported. There is another idea that I am toying with, the notion of document attributes, which may be used to record this, but that is another aspect altogether. See the discussion about attachments for more details.
- Do we assume data may reside on several nodes?
Yes and no. The database only cares about data that is stored locally. While it may reference data on other nodes, we don’t care about that.
- Do we allow partial updates to a document?
That is a tricky question. The initial answer is yes, I want this feature. The complete answer is that while I want this feature, I am not sure how I can implement this.
Basically, this is desirable since we can use it to reduce the amount of data we send over the network. The problem is that we run into an interesting issue of how to express that partial update. My current thinking is that we can compute a diff between the initial JSON version and the updated JSON version, and send that. That is problematic since there is no standard way of actually diffing JSON. We can just throw it into a string and compare that, of course, but that exposes us to JSON formatting differences (whitespace, field order) that may cause problems.
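To make the formatting problem concrete, here is a small sketch. It shows two JSON strings that describe the same document but differ as strings, and a naive field-level diff over the parsed documents. The function name `diff_documents` and its (changed, removed) result shape are assumptions made for illustration; it only looks at top-level fields and is not a proposal for the actual diff format.

```python
import json


def diff_documents(old: dict, new: dict):
    """Naive top-level diff: returns (changed_fields, removed_field_names)."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = sorted(k for k in old if k not in new)
    return changed, removed


# Same document, different whitespace and field order.
a = '{"name": "test", "tags": ["db", "json"]}'
b = '{ "tags": ["db","json"], "name": "test" }'

strings_equal = a == b                         # compares raw text
parsed_equal = json.loads(a) == json.loads(b)  # compares structure

old = json.loads('{"name": "test", "age": 30, "email": "x@example.com"}')
new = json.loads('{"name": "test", "age": 31}')
changed, removed = diff_documents(old, new)
```

Here `strings_equal` is False while `parsed_equal` is True, which is exactly why a string comparison is fragile. The field-level diff yields `{"age": 31}` as the changed fields and `["email"]` as the removed ones, but it says nothing about nested objects or arrays, and that is where the lack of a standard starts to hurt.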
I think that I am going to mark this issue as: postponed.
More posts in "Designing a document database" series:
- (17 Mar 2009) What next?
- (16 Mar 2009) Remote API & Public API
- (16 Mar 2009) Looking at views
- (15 Mar 2009) View syntax
- (14 Mar 2009) Aggregation Recalculating
- (13 Mar 2009) Aggregation
- (12 Mar 2009) Views
- (11 Mar 2009) Replication
- (11 Mar 2009) Attachments
- (10 Mar 2009) Authorization
- (10 Mar 2009) Concurrency
- (10 Mar 2009) Scale
- (10 Mar 2009) Storage
Comments
I know that for now you are only supporting JSON, but you might also support compressed JSON to save some space, and then you will really get into problems with partial updates.
Can't you use a partial json document for partial edits? Just pass along the fields that have changed. Missing fields are not updated.
I'm sure you're already thinking this and probably say so in a more recent post, but you can of course use another table to index the documents, as is talked about in the FriendFeed article you posted a couple of days ago. The index tables don't have to reside on the same node as the document table; they just give you a way to find the ID of the document you are looking for.
Thomas
How do you know that a removed field is not really removed?
json has a simple enough syntax to diff, doesn't it? You can either diff it per-field or normalize the string's whitespace to easily string-diff it. Where's the hard part with diffing json?
I don't know why a field should be removed. If that is part of your spec, I guess I would use a special json document just for that.
My thought was that json, and javascript, is dynamic. So we could take advantage of that feature to build up documents which only have changed fields.
configurator,
There is no standard way of diffing json
A field can be removed for many reasons.
For instance, because its value is missing.
If I am planning to support generic Json docs, then it is pretty obvious that this is a requirement.
"How do you know that a removed field is not really removed?"
The format for partial updates should send a json change object and a list of fields that should be removed.
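A minimal sketch of what applying such an update might look like, assuming the change object holds only top-level fields and the removals are given as a list of field names. The function name `apply_partial_update` and its parameters are hypothetical, not part of any agreed format.

```python
import copy


def apply_partial_update(document: dict, changes: dict, removed=()) -> dict:
    """Return a new document with `changes` merged in and `removed` fields dropped."""
    updated = copy.deepcopy(document)       # leave the stored document untouched
    updated.update(copy.deepcopy(changes))  # fields missing from `changes` are kept
    for field in removed:
        updated.pop(field, None)            # removing an absent field is a no-op
    return updated


stored = {"name": "test", "age": 30, "email": "old@example.com"}
result = apply_partial_update(stored, {"age": 31}, removed=["email"])
```

With this shape, a field that is simply absent from the change object stays as it was, and only a field named in the removal list is actually deleted, which answers the "how do you know a removed field is really removed" question for top-level fields.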
"There is no standard way of diffing json"
I understand that - I was just saying that it should be relatively trivial to implement as far as I understand json (and I don't, really).
configurator,
Yes, it is pretty easy to create a diff format for JSON.
That is not the problem; I don't want to write one.