Designing a document database


A while ago I started experimenting with building my own document DB, based on the concepts behind CouchDB. As it turns out, there isn’t really much to it at a conceptual level. A document DB requires the following features:

  • Store a document
  • Retrieve document by id
  • Add attachment to document
  • Replicate to a backup server
  • Create views on top of documents

The first two requirements are easily handled, and should generally take less than a day to develop. Indeed, after learning about the Esent database, it took me very little time to create this. I should mention one interesting limitation of the DB: I decided to accept only documents in JSON format. That makes some things very simple, specifically views and partial updates.
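
To make the first two requirements concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the actual implementation: a plain dict stands in for the Esent storage layer, and the names `DocumentStore`, `put`, `get`, and `patch` are hypothetical. It also shows why a JSON-only rule keeps partial updates simple, since a partial update becomes a merge of fields.

```python
import json
import uuid


class DocumentStore:
    """Toy in-memory store; a dict stands in for the real storage engine."""

    def __init__(self):
        self._docs = {}  # document id -> serialized JSON text

    def put(self, doc, doc_id=None):
        # Only JSON objects are accepted, mirroring the JSON-only restriction.
        if not isinstance(doc, dict):
            raise TypeError("document must be a JSON object")
        doc_id = doc_id or uuid.uuid4().hex
        # json.dumps raises TypeError for values that JSON cannot represent.
        self._docs[doc_id] = json.dumps(doc)
        return doc_id

    def get(self, doc_id):
        # Retrieve by id; unknown ids yield None rather than an error.
        raw = self._docs.get(doc_id)
        return None if raw is None else json.loads(raw)

    def patch(self, doc_id, changes):
        # Partial update: merge top-level fields into the stored document.
        doc = self.get(doc_id)
        if doc is None:
            raise KeyError(doc_id)
        doc.update(changes)
        self._docs[doc_id] = json.dumps(doc)
        return doc
```

Storing the serialized text rather than the live object means callers cannot mutate a stored document by accident, which is roughly what a disk-backed engine would give you for free.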

There are several topics here that are worth discussing, because they represent nontrivial issues. I am going to raise them here as questions, and answer them in future posts.

Storage:

  • How do we physically store things?
  • How do we do backups?
  • How do we handle corrupted state?
  • Where do we store the views?
    • Should we store them in the same file as the actual data? 
  • Are we transactional?
  • Do we allow multi-document operations to be transactional?

Scale:

  • Do we start from the get go as a distributed DB?
  • Do we allow relations?
    • Joins?
    • Who resolves them?
  • Do we assume data may reside on several nodes?
  • Do we allow partial updates to a document?

Concurrency:

  • What concurrency alternatives do we choose?
  • What about versioning?
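
As an illustration of how these two questions interact, here is a sketch of one possible alternative, optimistic concurrency driven by a per-document version number. This is only one candidate answer and not necessarily the one the design ends up with; `VersionedStore`, `ConflictError`, and `expected_version` are hypothetical names.

```python
class ConflictError(Exception):
    """Raised when a write is based on a stale version of the document."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # document id -> (version, document)

    def get(self, doc_id):
        # Returns (version, document), or (0, None) if the id is unknown.
        return self._docs.get(doc_id, (0, None))

    def put(self, doc_id, doc, expected_version):
        # The writer states which version it read; a mismatch means someone
        # else wrote in between, so the write is rejected instead of lost.
        current_version, _ = self.get(doc_id)
        if expected_version != current_version:
            raise ConflictError(
                f"{doc_id}: expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(doc))
        return new_version
```

A writer that hits `ConflictError` would re-read the document and retry. The pessimistic alternative would be to lock the document for the duration of the edit.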

Attachments:

  • Do we allow them at all?
  • How are they stored?
    • In the DB?
    • Outside the DB?
  • Are they replicated?
  • Should we even care about them at all? Can we apply SoC and say that this is the task of some other part of the system?

Replication:

  • How often should we replicate?
    • As part of the transaction?
    • As a background process?
    • Every X amount of time?
    • Manual?
  • Should we replicate only the documents?
    • What about attachments?
    • What about the generated view data?
  • Should we replicate to all machines?
    • To a specified set of machines for all documents?
    • Should we use some sharding algorithm?

Views:

  • How do we define views?
  • How do we define the conversion process from a document to a view item?
  • Do views have a fixed schema?
  • How often do we update views?
  • How do we remove view items from the DB when the origin document has been removed?
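
To show what the conversion and removal questions look like in practice, here is a rough sketch that assumes a view is defined by a map function turning one document into zero or more (key, value) items. The names `ViewIndex`, `on_put`, `on_delete`, and `posts_by_tag` are hypothetical, and updating the view eagerly on every write is just one of the possible answers to "how often".

```python
class ViewIndex:
    """Toy view: a map function turns each document into zero or more items."""

    def __init__(self, map_fn):
        self._map_fn = map_fn  # document -> iterable of (key, value)
        self._items = {}       # document id -> list of (key, value)

    def on_put(self, doc_id, doc):
        # Re-running the map replaces whatever the old version produced.
        self._items[doc_id] = list(self._map_fn(doc))

    def on_delete(self, doc_id):
        # Tracking items per source document makes removal straightforward.
        self._items.pop(doc_id, None)

    def query(self, key):
        # Linear scan for clarity; a real index would be sorted by key.
        return [
            (doc_id, value)
            for doc_id, items in sorted(self._items.items())
            for item_key, value in items
            if item_key == key
        ]


def posts_by_tag(doc):
    # Hypothetical map function: one view item per tag on the document.
    for tag in doc.get("tags", []):
        yield tag, doc.get("title")
```

The key design point is that each view item remembers which document it came from. Without that back-reference, removing a document would require rescanning the whole view.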

There are some very interesting challenges in implementing views. Again, I am interested in your opinions about this.

I have several more posts detailing my current design, which will be published about a day apart. I’ll also write a summary post with all the relevant feedback.