Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969

, @ Q c

Posts: 6,124 | Comments: 45,470

filter by tags archive

Degenerate performance scenario for LMDB

time to read 6 min | 1076 words

I am working peacefully through some stuff with Voron, when I run into some really bad performance in a pretty important scenario. I decided that this is probably something that I am doing wrong, and set out to reproduce the same scenario in LMDB, to figure out what I am doing wrong.

Here it the code:

int main(int argc,char * argv[])
    int i = 0, j = 0, rc;
    UUID id;
    MDB_env *env;

    MDB_val key, data;

    MDB_stat mst;
    MDB_cursor *cursor;
    char sval[32];
    clock_t start = clock(), end;


    rc = mdb_env_create(&env);
    rc = mdb_env_set_mapsize(env, 1024*1024*768);
rc = mdb_env_set_flags(env, MDB_WRITEMAP, 1); rc = mdb_env_open(env, "E:\\data\\lmdb2", 0, 0664); key.mv_size = sizeof(UUID); data.mv_size = sizeof(sval); data.mv_data = sval; for (i=0;i<100;i++) { MDB_txn *txn; MDB_dbi dbi; rc = mdb_txn_begin(env, NULL, 0, &txn); rc = mdb_open(txn, NULL, 0, &dbi); for (j= 0; j < 100; j++) { UuidCreate(&id); key.mv_data = &id; rc = mdb_put(txn, dbi, &key, &data, 0); } rc = mdb_txn_commit(txn); mdb_close(env, dbi); } end = clock(); printf("%i", (end - start)); fgetc(stdin); mdb_env_close(env); return 0; }

As you can see, we are inserting 10,000 items (100 transactions of 100 items each). Each item has key of 16 bytes and a value of 100 bytes. Now, you might note that this is probably the worst case scenario for a B+Tree, UUID are unordered, and this generate a lot of fragmentation in the tree. Bad scenario, yes, but also a relatively common one, and one that needs to be handled properly for our needs. It took me a long while to narrow down what is actually going on. At first I was convinced that the problem was with Windows’ implementation of memory mapped files, but eventually I realized that select ain’t broken.

Here is the Process Monitor’s trace of the first few transactions. The highlighted calls are to FlushBuffersFile, which is how FlushFileBuffers look like, which indicate a commit.


As I said, those are the first few transactions. You can see that the OS is doing a pretty good job in merging writes and avoid seeks. However, as times goes by…


And as more times goes by, we get to… ( there are more rows at the top that I had to remove so I could take the snapshot).


In fact, I put the data in Excel and got:



And those are some really bad numbers, I think you’ll agree. After the very short period of time, the cost of committing a transaction goes to over 50 seeks and writing of 350KB.

Let us compare that to what happens when we are actually writing sequential data (using UuidCreateSequential instead of UuidCreate). The test complete in half the time, and the actual statistics are also very interesting.



We are writing a lot less, but much more important, we are doing a lot less seeks. For reference, here is the final transaction that we have with the sequential write scenario:


I run those tests on a HDD drive, so seeks times matter a lot, but I have gotten similar results when running on SSD.

Now, the interesting question here is what exactly is going on? And I think that a large part of this is the way LMDB allocate pages. It uses a copy on write method, which means that after the transaction is done, the previously used page are free. That means that they can be reclaimed. And the next transaction will do just that. The problem with this approach is that it tends to spread writes all over the file. This saves disk space, but require a lot more seeks when you need to commit a transaction.

That means that in order to optimize this particular scenario, I probably need to do some thinking about the free page allocation behavior. We want to be a lot more conservative about when / how we give out those pages, to make sure that we aren’t generating a really bad situation for the db when commit time comes.



Interesting. So now your problem is more like an efficient memory manager implementation, where you have to look for larger contiguous blocks in a pool of free pages. Maybe some kind of compacting 'garbage collector' will be the solution? BTW, is there a significant performance difference between Voron at its current stage and LMDB?

Ayende Rahien

Rafal, A major premise of LMDB is that it isn't doing ANY background work. Compaction would be such a thing, so we want to avoid it in Voron as well.


Doesn't look so good. Maybe the whole model is broken and cannot be reasonably fixed. A transaction log architecture with in-place lazy writes seems superior.


Ayende, I'm not talking about background work, some level of compaction/fragmentation management can be performed synchronously while commiting a transaction. For example, you can move additional pages (untouched by data modification) just to get a single contiguous block of data to flush, instead of several dirty pages separated by unmodified ones. You can move the data so you're in a better situation than malloc, for example.

Ayende Rahien

Tobi, There is a reason I said that this is a true degenerate performance test case. This is writing 100 random values 1000 times. It is a test case that would have issues for any sorted data structure. Buffering / tx log would help, sure, but that means that you have to deal with write amplification and secondary affects. Not to mention that recovery gets complex.

In addition, this turn out to be a really rare scenario.

Ayende Rahien

Rafal, Actually, I can't really move data. That data might be currently be read by another transaction, so moving it under that tx feet would cause corruption.


Use copy-on-write, Luke

Ayende Rahien

Rafal, Huh? I already do COW. That doesn't help me creating holes.


I mean there's no difference if you're copying a page because it's been modified or because you just want to move it to another place to keep the memory contiguous. Of course I don't know if it's worth it - I mean, if shuffling additional pages around would make writes less fragmented and thus more performant. I don't have any experience with such 'packing' algorithms.


oh, shit, there is a difference - you can't overwrite pages modified/copied in current transaction...

Ayende Rahien

Rafal, The problem is that now you are doing things for the future, and it gets complex when you have stuff that isn't actually on a single page basis. I think we have a better solution, but we will see...

Matt Warren

Is this because LMDB was written to be used with OpenLDAP, where this type of access scenarion isn't possible? I.e. is purely random inserts an access pattern for OpenLDAP?

If you look here you can see that LMDB doesn't do as well against LevelDB, in the Random Write benchmarks, http://symas.com/mdb/microbench/


Yeah, i'm quite curious what's your idea

Ayende Rahien

Matt, Random writes are always hard. LevelDB will do particularly well in that scenario simple because of the way it does this. It writes to log file (sequential) and to memory (fast). That means that it has a lot less work to do in order to random writes right. The downside is that down the road, LMDB is probably going to be still about the same speed, and leveldb would start hitting frequent compactions.

Howard Chu

Matt: yes, in a way. LMDB was written for OpenLDAP and we just didn't focus much effort on write performance. LevelDB wins most of the write benchmarks in the microbench, and that's what we expected to happen. Really, if you were paying attention to all the results, you should reasonably conclude that LMDB's write speed is not the best. It isn't, except in some very specific conditions (e.g. sequential bulk insert).

But ... just in the past couple days I tripped over some new information, which is going to require a lot of our benchmarks to be rerun.


This has no bearing on the Windows results Ayende is reporting, but all of my benchmarks were run on Linux 3.2 or 3.5, which are all affected by the kernel bug referenced in the above post. This means that switching to Linux 3.10 should give us a fairly significant write performance boost. (And has already been verified by another OpenLDAP team member, as you can see in the followup post on that thread.)

Comment preview

Comments have been closed on this topic.


  1. RavenDB 3.5 whirl wind tour: You want all the data, you can’t handle all the data - about one day from now
  2. The design of RavenDB 4.0: Making Lucene reliable - 3 days from now
  3. RavenDB 3.5 whirl wind tour: I’ll find who is taking my I/O bandwidth and they SHALL pay - 4 days from now
  4. The design of RavenDB 4.0: Physically segregating collections - 5 days from now
  5. RavenDB 3.5 Whirlwind tour: I need to be free to explore my data - 6 days from now

And 14 more posts are pending...

There are posts all the way to May 30, 2016


  1. RavenDB 3.5 whirl wind tour (14):
    29 Apr 2016 - A large cluster goes into a bar and order N^2 drinks
  2. The design of RavenDB 4.0 (13):
    28 Apr 2016 - The implications of the blittable format
  3. Tasks for the new comer (2):
    15 Apr 2016 - Quartz.NET with RavenDB
  4. Code through the looking glass (5):
    18 Mar 2016 - And a linear search to rule them
  5. Find the bug (8):
    29 Feb 2016 - When you can't rely on your own identity
View all series



Main feed Feed Stats
Comments feed   Comments Feed Stats