More thoughts about compression & storage
One of the things that I liked in LevelDB is that it pretty much had compression built in from day one. The entire format is built to save space just about wherever it can. From using 7 bit numbers to prefix compression to making compression core part of the on disk data structure. Considering the cost of I/O, it is usually worth trading CPU time for I/O costs. LMDB, in contrast, went quite the opposite way. It actively discourage compression, since you have no support in the format. But it is more than that, LMDB is meant to be used by directly getting a memory pointer, and it actively encourage you to do such things as have a pointer reference from one record to another record.
The idea is that you can just persist your in memory state and require zero difference between the on disk structure and the in memory structure. This allows you to take full advantage of the memory mapped nature of the database.
Voron, however, doesn’t really do that. And the idea of compressing the data can significantly cut the amount of I/O we use, and can allow us to pack much more data into the same space. Now, in a few places, Howard Chu (creator of LMDB) points out that you don’t need any support from LMDB to add compression, and points to this as proof. I don’t find this satisfactory. I think that this puts a lot of effort on the application side, and require a lot of special hacks.
Now, what do we talk about when we talk about compression. We can, of course, individually compress each record. But that just means that we remove redundancies in that single value. In practice, it is almost always better to do compression on multiple values, because then you can get much better compression ratios.
For reference, I have taken the opposite view a while ago, see here: http://ayende.com/blog/162466/some-thoughts-about-compression-storage
But let us say that we compress the whole page? Taking a sample page from one of our files, I can see that compressing that (using DeflateStream) is taking essentially zero time, and show that a 4KB set of sample data got turned into about 800 bytes.
That means a lot, because it means that I can shove more data into a single page. And if I can do that, it means that I can also (drastically) reduce the height of the tree, and since that is one of the major factors in I/O costs in B-Tree based systems, that might actually have a lot more performance implications that it would seem.
It is also a lot harder. For example, you would have a very hard time telling ahead of time if an addition / deletion is going to be within the page boundary, or if it is going to require to require a split. For now, it is a really nice idea, with some interesting perf implications for later on, but I don’t think we’ll explore this very closely in the near future.