Ayende @ Rahien

It's a girl

World’s smallest No SQL Database: Persistent data

The second item to go over with the World’s smallest No SQL database is about persistence, and following that, I/O. Right now, there is no persistence. If you restart the process, all the data is gone. Now, there are actually quite a few real world dbs that behave in this fashion. But they are a rarity. For the most part, if I put data inside a db, I expect it to be there until I do something about it.

And at that point, you are in for quite a bit of complexity. How are you going to persist the data? The easiest way to do it, just create a file per every value in the db is going to be… problematic on most systems. So you need to put a lot of data in a small set of files. Which means that you have to decide how you are going to put the data together. In general, there is either the fixed size option, in which you divide the file(s) into pages and work around that. The good thing about this is that this gives you the ability to reuse space in the file after deletes / updates. The bad thing about that is that it is quite complex. Alternatively, you can just write the data out as needed, but then you can’t really update written data, and would need to run compactions.

And we haven’t talked about searching yet. Some DBs, like Bitcask / Munin, would actually store the keys in memory, and store the position on the disk for retrieving the value. But for the most part, both keys & values tend to be on disk in some form. In CouchDB, they are held inside an append only B+Tree. In LevelDB, they are held in Sorted String Tables. LMDB uses Copy-On-Write B+Tree. Esent use a B+Tree with a Log file.

In each of those cases, the actual semantics for persistent data involve at least two concerns. You need to actually be able to search the data (that usually mean more than just O(1) access, you want to be able to go back & forth on the keys) and you need to be able to do a transactional save. This is so you can recover  in case of a crash, most of the time.

But there are actually a lot more that goes into the selection of the proper persistence format. To start with, how you store the data on disk will have a big effect on your performance. If you store the data as a linked list, for example, you might as well kiss your performance goodbye. Beyond that, we also have issues with things like how is the format going to scale when we have concurrent readers. For example, if you have something that does a lot of seeks, and rely on the seeks always going forward to ensure performance, that is going to be very badly hit the moment that you have concurrent readers doing concurrent reads on different parts of the system. You would be forcing the system to do random seeks.

There are other considerations, as well. For example, if you are using something like B+Tree, it is likely that you’ll be overwriting the same section on the disk multiple times. That is fine with HDD, but SSD would just give up & die on you at some point. And we haven’t started talking about secondary indexes yet…

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

Jos van der Til
08/16/2013 10:49 AM by
Jos van der Til

Just touching on this comment: "That is fine with HDD, but SSD would just give up & die on you at some point."

This is not strictly true, as the SSD will use wear leveling to avoid this. A logical sector does not map 1-to-1 with a physical sector. The SSD will actually spread writes to the same part of the disk across the available space, thus this is not a problem at all.

Ayende Rahien
08/19/2013 08:46 AM by
Ayende Rahien

Jos, You are correct, but that brings back the issues of sequential vs. random writes.

Comments have been closed on this topic.