Intersection of the complexities
As you can imagine, I’m really interested in what is going on with storage engines. So I read the RocksDB Tuning Guide with interest. Actually, mounting horror was more like it, to be frank.
It all culminated pretty nicely with the final quote:
Unfortunately, configuring RocksDB optimally is not trivial. Even we as RocksDB developers don't fully understand the effect of each configuration change. If you want to fully optimize RocksDB for your workload, we recommend experiments…
Um… okaaaay.
That is a pretty scary statement. Now, to be fair, RocksDB is supposed to be a low level embedded storage engine; it isn’t meant to be something that is directly exposed to the user or operations people.
And yet…
I’m literally writing databases for a living, and I had a hard time following all the options they had there. And it appears from the final thought that the developers of RocksDB are also at a loss here. It is a very problematic state to be in, because optimization, and knowing how to get the whole thing working, is a pretty important part of using a database engine. And if your optimization process relies on “change a bunch of settings and see what happens”, that is not kind at all.
Remember, RocksDB is actually based on LevelDB, a database that was designed to be the backend of IndexedDB, which runs in the browser and has a single threaded client and a single user, pretty much. LevelDB can do a lot more than that, but that was the primary design goal, at least initially.
The problems with LevelDB are pretty well known: it suffers from write amplification, it is known to hang if there is a compaction going on, and… well, you can read about that elsewhere.
RocksDB was supposed to take LevelDB and improve on all the issues that LevelDB had. But it appears that most of what was done was to actually create more configuration options (yes, I’m well aware that this is an unfair statement, but it sure seems so from the tuning guide). What is bad here is that the number of options is very high, and the burden it puts on the implementers is pretty high. Basically, it is a “change & pray” mindset.
Another thing that bugs me is the number of “mutex bottlenecks” that are mentioned in the tuning guide, with the suggestions to shard things to resolve them.
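To make the sharding suggestion concrete, here is a minimal sketch in Python of what "shard things to avoid a mutex bottleneck" usually means. This is not RocksDB code: `ShardedMap`, `increment`, `get` and `run_demo` are names invented for illustration, and the shard count is an arbitrary assumption.

```python
import threading


class ShardedMap:
    """Hypothetical illustration: a dict split into independently locked shards."""

    def __init__(self, shard_count=16):
        # One lock and one dict per shard, so threads touching different
        # shards never wait on the same mutex.
        self._locks = [threading.Lock() for _ in range(shard_count)]
        self._shards = [{} for _ in range(shard_count)]

    def _index(self, key):
        # hash() is stable within a single process, which is enough here.
        return hash(key) % len(self._shards)

    def increment(self, key, amount=1):
        i = self._index(key)
        with self._locks[i]:
            self._shards[i][key] = self._shards[i].get(key, 0) + amount

    def get(self, key, default=None):
        i = self._index(key)
        with self._locks[i]:
            return self._shards[i].get(key, default)


def run_demo(threads=8, per_thread=1000):
    """Hammer the map from several threads and return the total count."""
    m = ShardedMap(shard_count=4)

    def worker():
        for j in range(per_thread):
            m.increment(f"key-{j % 10}")

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    # Every increment must be accounted for if the per-shard locking is correct.
    return sum(m.get(f"key-{k}", 0) for k in range(10))
```

Note what this costs the host process: someone has to pick `shard_count`, and that is exactly one more knob of the kind the tuning guide is full of.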
Now, to be fair, an embedded database requires a bit of expertise, and cooperation from the host process, but that seems to be pushing it. It is possible that this is only required for the very high end usage, but that is still not fun.
Compared to that, Voron has about 15 configuration options, most of them about controlling the default sizes we use. And pretty much all of them are set to be auto tuned on the fly. Voron on its own actually has very few moving parts, which makes reasoning about it much simpler and more efficient.
Comments
Have you considered Basho's version of leveldb? Basho is making ongoing improvements to leveldb for server class systems storing terabytes of data:
https://github.com/basho/leveldb/wiki
And several of the critical configuration settings now have automated, runtime tuning. The developer can just ignore them ...
Matthew, Yes, I looked at that at the time. It was mostly (and I speak from memory of over two years ago) about splitting leveldb into multiple shards, to avoid a lot of the stalls that are built into the process. Yes, I'm aware that this is a gross oversimplification. But the number of moving parts is still pretty big.
In contrast, the design we ended up going with is a single file (1 writer, many readers, with concurrent merged transactions), and a tx log. The MVCC format we utilize allows us to get greater simplicity and avoid the common pitfalls of write amplification.
Just to give you some idea, we don't have an option for non sync writes. We typically can do over a million writes per second on commodity hardware (all fully synced and safe), and in certain workloads, we can do an order of magnitude over that. It also has great read speeds, it has a very minor cost to open, and the number of moving parts is actually quite small.
Oren, It would be nice if I could look up all the posts (that you have written here) regarding Voron, right from the motivations that led to it, to its inception, development, optimization, maturity etc. Right now you don't have a Voron tag. So I have to search and filter what comes through.
It would be a lot nicer to simply browse to http://ayende.com/blog/tags/voron
Giving it more thought, it's perhaps more than just a tag. Maybe a series that tells me what is the right order to read your Voron posts so they make the most sense.
Afif, This is actually something that I have been looking at since 2008. Doing a search on Voron in the blog is the easiest way to look at it. Here is an example of a predecessor to Voron: http://ayende.com/blog/4686/raven-munin
That is from 2010.