Intersection of the complexities

Aug 13 2015


As you can imagine, I’m really interested in what is going on with storage engines. So I read the RocksDB Tuning Guide with interest. Actually, mounting horror was more like it, to be frank.

It all culminated pretty nicely with the final quote:

Unfortunately, configuring RocksDB optimally is not trivial. Even we as RocksDB developers don't fully understand the effect of each configuration change. If you want to fully optimize RocksDB for your workload, we recommend experiments…

Um… okaaaay.

That is a pretty scary statement. Now, to be fair, RocksDB is supposed to be a low level embedded storage engine. It isn’t meant to be something that is directly exposed to the user or operations people.

And yet…

I’m literally writing databases for a living, and I had a hard time following all the options they had there. And it appears from that final thought that the developers of RocksDB are also at a loss here. That is a very problematic state to be in, because optimizing and knowing how to get the whole thing working is a pretty important part of using a database engine. And if your optimization process relies on “change a bunch of settings and see what happens”, that is not good at all.

Remember, RocksDB is actually based on LevelDB, a database that was designed to be the backend of IndexedDB, which runs in the browser and has a single threaded client and a single user, pretty much. LevelDB can do a lot more than that, but that was the primary design goal, at least initially.

The problems with LevelDB are pretty well known: it suffers from write amplification, it is known to hang if there is a compaction going on, and… well, you can read about that elsewhere.

RocksDB was supposed to take LevelDB and improve on all the issues that LevelDB had. But it appears that most of what was done was to actually create more configuration options (yes, I’m well aware that this is an unfair statement, but it sure seems so from the tuning guide). What is bad here is that the number of options is very high, and the burden it puts on the implementers is pretty high. Basically, it is a “change & pray” mindset.
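To make the cost of “change & pray” concrete, here is a minimal sketch of how quickly the experiment count grows. The knob names are used only as labels and the candidate values are made up for illustration; nothing here calls RocksDB, and the helper functions are hypothetical.

```python
from itertools import product

# Illustrative candidate values for a handful of tuning knobs.
# The values are assumptions picked for the example, not recommendations.
candidate_values = {
    "write_buffer_size_mb": [16, 64, 256],
    "max_write_buffer_number": [2, 4, 6],
    "level0_file_num_compaction_trigger": [2, 4, 8],
    "max_background_compactions": [1, 2, 4],
}


def count_experiments(knobs):
    """Number of benchmark runs needed to try every combination once."""
    total = 1
    for values in knobs.values():
        total *= len(values)
    return total


def enumerate_configs(knobs):
    """Yield every combination as a dict of knob name to value."""
    names = list(knobs)
    for combo in product(*(knobs[name] for name in names)):
        yield dict(zip(names, combo))


if __name__ == "__main__":
    # Four knobs with three values each already means 3**4 runs.
    print(count_experiments(candidate_values))
```

Every additional knob multiplies the number of runs, so an exhaustive search stops being practical well before you get to the size of a real tuning guide.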

Another thing that bugs me is the number of “mutex bottlenecks” that are mentioned in the tuning guide, with the suggestions to shard things to resolve them.

Now, to be fair, an embedded database requires a bit of expertise, and cooperation from the host process, but that seems to be pushing it. It is possible that this is only required for the very high end usage, but that is still not fun.

Compared to that, Voron has about 15 configuration options, most of them about controlling the default sizes we use. And pretty much all of them are set to be auto tuned on the fly. Voron on its own actually has very few moving parts, which makes reasoning about it much simpler and more efficient.


Comments

17 Aug 2015
00:12 AM

Matthew

Have you considered Basho's version of leveldb? Basho is making ongoing improvements to leveldb for server class systems storing terabytes of data:

https://github.com/basho/leveldb/wiki

And several of the critical configuration settings now have automated, runtime tuning. The developer can just ignore them ...

17 Aug 2015
07:28 AM

Oren Eini

Matthew, Yes, I looked at that at the time. It was mostly (and I speak from memory of over two years ago) about splitting leveldb into multiple shards, to avoid a lot of the stalls that are built into the process. Yes, I'm aware that this is a gross oversimplification. But the number of moving parts is still pretty big.

In contrast, the design we ended up going with is a single file (1 writer, many readers, with concurrent merged transactions), and a tx log. The MVCC format we utilize allows us to get greater simplicity and avoid the common pitfalls of write amplification.

Just to give you some idea, we don't have an option for non sync writes. We typically can do over a million writes per second on commodity hardware (all fully synced and safe), and in certain workloads, we can do an order of magnitude over that. It also has great read speeds, it has a very minor cost to open, and the number of moving parts is actually quite small.
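As a rough illustration of the “1 writer, many readers” idea with MVCC style snapshots, here is a toy sketch. It is not Voron's implementation: the `SnapshotStore` class, its methods, and the in-memory list standing in for a tx log are all hypothetical, and it copies a whole dict per commit where a real engine would work on pages in a file.

```python
import threading


class SnapshotStore:
    """Toy single-writer, many-reader store using copy-on-write versions."""

    def __init__(self):
        self._root = {}                      # committed version, never mutated in place
        self._write_lock = threading.Lock()  # allows only one writer at a time
        self._log = []                       # in-memory stand-in for a transaction log

    def read_snapshot(self):
        # Readers take a reference to the current version and keep seeing it,
        # even if a writer commits a newer version afterwards.
        return self._root

    def write(self, changes):
        with self._write_lock:
            self._log.append(dict(changes))  # record the change before publishing it
            new_root = dict(self._root)      # copy, so existing readers are unaffected
            new_root.update(changes)
            self._root = new_root            # publish the new version


if __name__ == "__main__":
    store = SnapshotStore()
    store.write({"users/1": "alice"})
    snapshot = store.read_snapshot()
    store.write({"users/1": "bob"})
    print(snapshot["users/1"], store.read_snapshot()["users/1"])
```

Readers never take the write lock in this sketch, which is the property that keeps the number of moving parts small.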

21 Aug 2015
01:25 AM

Afif

Oren, It would be nice if I could look up all the posts you have written here regarding Voron, from the motivations that led to it through its inception, development, optimization, maturity etc. Right now you don't have a Voron tag, so I have to search and filter what comes through.

It would be a lot nicer to simply browse to http://ayende.com/blog/tags/voron

21 Aug 2015
01:30 AM

Afif

Giving it more thought, it's perhaps more than just a tag. Maybe a series that tells me the right order in which to read your Voron posts so they make the most sense.

21 Aug 2015
01:41 AM

Oren Eini

Afif, This is actually something that I have been looking at since 2008. Doing a search on Voron in the blog is the easiest way to look at it. Here is an example of a predecessor to Voron: http://ayende.com/blog/4686/raven-munin

That is from 2010.


Oren Eini

CEO of RavenDB