Designing a document database


A while ago I started experimenting with building my own document DB, based on the concepts behind CouchDB. As it turns out, there isn’t really much to it at a conceptual level. A document DB requires the following features:

Store a document
Retrieve document by id
Add attachment to document
Replicate to a backup server
Create views on top of documents

The first two requirements are easily handled, and should generally take less than a day to develop. Indeed, after learning about the Esent database, it took me very little time to create this. I should mention that, as an interesting limitation of the DB, I made the decision to accept only documents in JSON format. That makes some things very simple, specifically views and partial updates.
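To make the first two requirements and the JSON-only restriction concrete, here is a minimal sketch in Python. It is not the Esent-based implementation described in the post: the `DocumentStore` class, its method names, and the in-memory dictionary standing in for real storage are all assumptions made for illustration.

```python
import json
import uuid


class DocumentStore:
    """Toy in-memory store; a real one would persist to disk (the post uses Esent)."""

    def __init__(self):
        self._docs = {}  # document id -> serialized JSON text

    def put(self, document, doc_id=None):
        """Store a JSON object (a dict) and return its id."""
        if not isinstance(document, dict):
            raise TypeError("only JSON objects are accepted as documents")
        doc_id = doc_id or uuid.uuid4().hex
        # Serializing up front rejects anything that cannot be expressed as JSON.
        self._docs[doc_id] = json.dumps(document)
        return doc_id

    def get(self, doc_id):
        """Return a fresh copy of the document, or None if the id is unknown."""
        raw = self._docs.get(doc_id)
        return None if raw is None else json.loads(raw)

    def patch(self, doc_id, changes):
        """Partial update: overwrite only the top-level fields named in `changes`."""
        document = self.get(doc_id)
        if document is None:
            raise KeyError(doc_id)
        document.update(changes)
        self._docs[doc_id] = json.dumps(document)
        return document


store = DocumentStore()
doc_id = store.put({"title": "Designing a document database", "tags": ["Databases"]})
store.patch(doc_id, {"status": "published"})  # hypothetical field, for illustration
print(store.get(doc_id))
```

Because every document is a JSON object, a partial update reduces to merging fields into a dictionary, which is the kind of simplification the JSON-only decision buys.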

Several topics here are worth discussing, because they represent non-trivial issues. I am going to raise them here as questions, and answer them in future posts.

Storage:

How do we physically store things?
How do we do backups?
How do we handle corrupted state?
Where do we store the views?

Should we store them in the same file as the actual data?

Are we transactional?
Do we allow multi-document operations to be transactional?

Scale:

Do we start from the get go as a distributed DB?
Do we allow relations?

Joins?
Who resolves them?

Do we assume data may reside on several nodes?
Do we allow partial updates to a document?

Concurrency:

What concurrency alternatives do we choose?
What about versioning?
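As one possible answer among several, and not necessarily the one this series settles on, optimistic concurrency ties the two questions together: every document carries a version number, and a write succeeds only if the writer saw the latest version. The `VersionedStore` and `ConcurrencyError` names below are hypothetical.

```python
class ConcurrencyError(Exception):
    """Raised when a writer's expected version no longer matches the stored one."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # document id -> (version, document)

    def get(self, doc_id):
        """Return (version, document); unknown ids are reported as version 0."""
        return self._docs.get(doc_id, (0, None))

    def put(self, doc_id, document, expected_version):
        """Write only if nobody else has written since the caller read the document."""
        current_version, _ = self.get(doc_id)
        if current_version != expected_version:
            raise ConcurrencyError(
                f"{doc_id}: expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(document))
        return new_version


store = VersionedStore()
v1 = store.put("posts/1", {"title": "draft"}, expected_version=0)
store.put("posts/1", {"title": "final"}, expected_version=v1)
try:
    # A second writer still holding version 1 is rejected instead of silently overwriting.
    store.put("posts/1", {"title": "stale"}, expected_version=v1)
except ConcurrencyError as error:
    print("conflict:", error)
```

The alternative is pessimistic locking, where a writer takes a lock before reading. The sketch does not try to decide between the two.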

Attachments:

Do we allow them at all?
How are they stored?

In the DB?
Outside the DB?

Are they replicated?
Should we even care about them at all? Can we apply SoC and say that this is the task of some other part of the system?

Replication:

How often should we replicate?

As part of the transaction?
Backend process?
Every X amount of time?
Manual?

Should we replicate only the documents?

What about attachments?
What about the generated view data?

Should we replicate to all machines?

To a specified set of machines for all documents?
Should we use some sharding algorithm?

Views:

How do we define views?
How do we define the conversion process from a document to a view item?
Do views have a fixed schema?
How often do we update views?
How do we remove view items from the DB when the origin document has been removed?

There are some very interesting challenges relating to doing the views. Again, I am interested in your opinions about this.
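To give the view questions something concrete to push against, here is a rough sketch of one way they could fit together. It assumes a view is defined by a map function that turns a document into zero or more (key, value) items, loosely in the spirit of CouchDB. The `View` class and the `posts_by_tag` function are hypothetical, and tracking view items by their origin document id is just one way to make removal possible.

```python
class View:
    """A view defined by a map function: document -> iterable of (key, value) items."""

    def __init__(self, map_fn):
        self._map_fn = map_fn
        self._items_by_doc = {}  # origin document id -> list of its view items

    def index(self, doc_id, document):
        """(Re)compute the view items for one document, replacing any older ones."""
        self._items_by_doc[doc_id] = list(self._map_fn(document))

    def remove(self, doc_id):
        """Drop every view item that originated from a deleted document."""
        self._items_by_doc.pop(doc_id, None)

    def query(self, key):
        """Return the values of all view items with the given key."""
        return [
            value
            for items in self._items_by_doc.values()
            for item_key, value in items
            if item_key == key
        ]


def posts_by_tag(document):
    # Hypothetical map function: one view item per tag on the document.
    for tag in document.get("tags", []):
        yield tag, document["title"]


view = View(posts_by_tag)
view.index("posts/1", {"title": "Designing a document database", "tags": ["Databases"]})
view.index("posts/2", {"title": "Esent notes", "tags": ["Databases", "Esent"]})
view.remove("posts/1")
print(view.query("Databases"))
```

This sketch updates the view eagerly, one document at a time, and scans all items on query. When to update and how to store the items efficiently are exactly the open questions listed above.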

Several other posts detailing my current design will follow, spaced about a day apart. I’ll also post a summary with all the relevant feedback.


Tags:

Databases

Comments

09 Mar 2009
01:53 AM

It's called Sharepoint :)

09 Mar 2009
07:11 AM

Uriel Katz

A really simple design would be to build the DB on top of a persistent DHT, so you get replication and scalability right away, or you could build it on top of some DFS.

For the views, you can run an incremental background process that updates views from documents, or shift the cost of updating views to the queries. I think that is called database cracking.

09 Mar 2009
08:49 AM

Rafal

Very nice, but for a starter I'd like to see some background:

what is a document
how does it differ from an attachment
what is a view
what does the document lifecycle look like
what will be the intended use of the software you are designing

09 Mar 2009
08:57 AM

Marco Dissel

And what about:

the meta-data on top of documents?
security
versioning
searching (full-text and on meta-data)
archiving (document lifecycle)

09 Mar 2009
10:07 AM

Ayende Rahien

Uriel,

The problem with layering the DB on top of a DHT is that you need to maintain additional data that the DHT itself doesn't let you maintain.

Lists of documents to be processed, for example.

Rafal,

Take a look at CouchDB, that is the source of much of the design.

09 Mar 2009
10:08 AM

Ayende Rahien

Marco,

meta data - I am addressing that

security - haven't thought about that, will need to address this

versioning - I am addressing that

searching - I am addressing that

archiving - can you explain more?

09 Mar 2009
12:11 PM

Marco Dissel

-- archiving - can you explain more?

Old documents that are never touched anymore can be moved to an archive database. Most DMS systems support some kind of document lifecycle workflow / records management (for example, a document with type x (some metadata field) should be kept for 10 years and then destroyed)

(see http://en.wikipedia.org/wiki/Records_management)

09 Mar 2009
12:16 PM

Marco Dissel

Another one:

working offline

local (partial) backup and commit changes/new items to the server

09 Mar 2009
14:15 PM

Ayende Rahien

Marco,

I don't think that I am going to handle archiving. You can run a view and do it as a separate process, outside of the DB.

Working offline, that seems like a special case of replication.

In this case, we replicate to the master, and we have to do it in an async manner

09 Mar 2009
16:15 PM

Szymon

Looking forward to your next posts on this topic. Right now I'm evaluating the same thing, but from the perspective of an offline client. We are building distributed clients that need to pull some data (mostly read-only) and image attachments. I was thinking about using the System.IO.Packaging format to store this as documents on the client (like Office docs). Do you think it's related? If so, please consider it in your series as well.

09 Mar 2009
16:45 PM

Peter

Looking forward to these posts, Ayende. Good outline.

09 Mar 2009
18:25 PM

Steve Campbell

My team and I developed a "document repository" late last year. We had different constraints, but some of the challenges for us were:

delivering documents to the web when they are stored in a secure repository inside our firewall
deciding sql blobs vs filesystem (we chose filesystem, even though we could have done a hybrid, i.e. SQL 2008)
workflow, i.e. allowing different people to see docs at different places in a workflow

Other challenges which I doubt you will have, related to being able to deal with groups of related docs:

determining duplicates (what makes a document an update vs a new doc)
versioning of groups of documents vs versioning individual docs

10 Mar 2009
18:18 PM

Charlie Barker

Re: Searching:

Could you compute hashes for documents as they were added to the db and return the hash value to the calling app to use later for retrieval?

You might also be able to use the hash to search for identical documents with different IDs.

Allowing calling applications to add tags to a document would be useful for searching also.

Oren Eini

CEO of RavenDB