Designing a document database: Storage
In a previous post, I asked about designing a document DB and brought up the issue of storage, along with a set of questions that need to be handled:
- How do we physically store things?
There are several options, from building our own persistent format to using an RDBMS. I think that the most effective option would be to use Esent. It is small, highly efficient, requires no installation, and is very simple to use. It also neatly resolves a lot of the other questions that we have to ask.
- How do we do backups?
Esent already has the facilities to do that, so we have very little to worry about here.
- How do we handle corrupted state?
See above. Esent is also pretty good at automatic recovery, which is a nice plus.
- Where do we store the views?
- Should we store them in the same file as the actual data?
I think not. I think that the best alternative is to have a file per view. That should make things such as backing up just the DB easier, not to mention that it will reduce contention internally. Esent is built to handle that, but it is better to separate the files than not. All the data (including logs and temp dirs) should reside inside the same directory.
Crash recovery on startup should be enabled. Transactions should probably avoid crossing file boundaries. It is important that the files include a version table, which will allow us to detect invalid versions (this caused a whole bunch of problems with RDHT until we fixed it).
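To make the file-per-view layout and the version table a bit more concrete, here is a minimal sketch in Python. It uses SQLite purely as a stand-in for Esent, and everything named here (`view_file_path`, `check_version`, `EXPECTED_VERSION`, the `view_*.db` naming) is an assumption made for illustration, not part of the actual design.

```python
import os
import sqlite3

EXPECTED_VERSION = 1  # hypothetical on-disk format version


class VersionMismatch(Exception):
    """Raised when a file was written by an incompatible format version."""


def view_file_path(data_dir, view_name):
    # One file per view, kept in the same directory as the documents file.
    return os.path.join(data_dir, "view_" + view_name + ".db")


def check_version(conn, expected=EXPECTED_VERSION):
    # Single-row version table: stamp new files, verify existing ones.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS version ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), value INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT value FROM version WHERE id = 1").fetchone()
    if row is None:
        conn.execute("INSERT INTO version (id, value) VALUES (1, ?)", (expected,))
        conn.commit()
    elif row[0] != expected:
        raise VersionMismatch(
            "file has version %d, expected %d" % (row[0], expected)
        )
```

Calling `check_version` right after opening each file means a mismatch shows up at startup, before any document or view data is read.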
- Are we transactional?
Yes, we are transactional. But only for document writes. We are not transactional for document + views, for example, since view generation is done as a background service.
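What this means in practice is that a view can briefly lag behind the documents. The following is a small in-memory sketch of that behavior, not an implementation: the `Store` class, the `titles_view` view, and the explicit `index_pending` call (standing in for the background service) are all hypothetical.

```python
from collections import deque


class Store:
    def __init__(self):
        self.documents = {}      # committed documents, keyed by id
        self.pending = deque()   # ids waiting for the background indexer
        self.titles_view = {}    # hypothetical view: id -> title

    def put(self, doc_id, doc):
        # The write is complete once the document is stored; the view
        # is not touched here, only scheduled for a later update.
        self.documents[doc_id] = doc
        self.pending.append(doc_id)

    def index_pending(self):
        # Stand-in for the background service: drain the queue and
        # bring the view up to date with the committed documents.
        while self.pending:
            doc_id = self.pending.popleft()
            self.titles_view[doc_id] = self.documents[doc_id].get("title")


store = Store()
store.put("posts/1", {"title": "Storage"})
stale = dict(store.titles_view)   # view has not seen the write yet
store.index_pending()
fresh = dict(store.titles_view)   # view now reflects the document
```

Between `put` and `index_pending`, a reader of the view sees the old state even though the document write has already succeeded.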
- Do we allow multi document operation to be transactional?
Yes, depending on the operation. We allow submitting several document writes / deletes at the same time, and they will succeed or fail as a single unit. Beyond that, no.
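A batch that succeeds or fails as a single unit could look roughly like the sketch below. Again, SQLite stands in for Esent, and the `documents` table, `apply_batch`, and the `("put" | "delete", id, body)` operation tuples are assumptions made up for this example.

```python
import sqlite3


def open_documents():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    return conn


def apply_batch(conn, operations):
    # operations: list of ("put", id, body) or ("delete", id, None).
    # `with conn` commits if the block finishes and rolls back if it raises,
    # so a failure part way through undoes the earlier operations too.
    with conn:
        for kind, doc_id, body in operations:
            if kind == "put":
                conn.execute(
                    "INSERT OR REPLACE INTO documents (id, body) VALUES (?, ?)",
                    (doc_id, body),
                )
            elif kind == "delete":
                conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
            else:
                raise ValueError("unknown operation: %r" % (kind,))


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```

If the second operation in a batch is invalid, the first one is rolled back with it, so the caller never observes a half-applied batch.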
More posts in "Designing a document database" series:
- (17 Mar 2009) What next?
- (16 Mar 2009) Remote API & Public API
- (16 Mar 2009) Looking at views
- (15 Mar 2009) View syntax
- (14 Mar 2009) Aggregation Recalculating
- (13 Mar 2009) Aggregation
- (12 Mar 2009) Views
- (11 Mar 2009) Replication
- (11 Mar 2009) Attachments
- (10 Mar 2009) Authorization
- (10 Mar 2009) Concurrency
- (10 Mar 2009) Scale
- (10 Mar 2009) Storage
Comments
I'm really curious where you're going with this. one old client had us help write an electronic doc application, which had to address some similar questions. With the previous mention of a schema-less database, it seems you're going in a really different direction with this.
not sure I understand what you mean by view for this. could you clarify?
Josh,
document databases are not e-doc apps, nothing like SharePoint.
Take a look at Lucene or CouchDB for the details
ok. my bad. didn't read your couch db posts. this makes more sense after reading about couch db.
now i'm really curious how it works. how it finds the document you're looking for. i'll have to read more about couch db.
Any Esent support on Linux? If not, maybe it's best to separate the datastore part so that you can create a backend database/store implementation and the main system doesn't really know about the datastore. Or will this just make the system too slow?
Esent doesn't run on Linux
But I am not going to worry about that until there is a contributor that is actually going to make it run there.
ESENT has a 16TB size limit.
If you were storing scanned documents that averaged 1 MB, you could only store about 16 million documents in a single ESENT db. Not a horrific limitation but one worth noting now.
Charlie,
You are missing the point, we are not storing scanned documents. Look at couch db for example.
In addition to that, I really don't think that we will reach the 16 TB limit so quickly. If you do, run another instance.
Just had another look at Couch Db, yes I was missing the point :)