Degenerate performance scenario for LMDB

I am working peacefully through some stuff with Voron, when I run into some really bad performance in a pretty important scenario. I decided that this is probably something that I am doing wrong, and set out to reproduce the same scenario in LMDB, to figure out what I am doing wrong.

Here is the code:

#include <stdio.h>
#include <time.h>
#include <windows.h>
#include <rpc.h>      /* UUID, UuidCreate (link with Rpcrt4.lib) */
#include "lmdb.h"

int main(int argc, char *argv[])
{
    int i = 0, j = 0, rc;
    UUID id;
    MDB_env *env;

    MDB_val key, data;

    MDB_stat mst;        /* unused here (carried over from LMDB's mtest.c) */
    MDB_cursor *cursor;  /* unused here (carried over from LMDB's mtest.c) */
    char sval[32];
    clock_t start = clock(), end;

    srandom(time(NULL)); /* also carried over from mtest.c; random() isn't used below */

    rc = mdb_env_create(&env);
    rc = mdb_env_set_mapsize(env, 1024*1024*768);
    rc = mdb_env_set_flags(env, MDB_WRITEMAP, 1);
    rc = mdb_env_open(env, "E:\\data\\lmdb2", 0, 0664);

    key.mv_size = sizeof(UUID);
    data.mv_size = sizeof(sval);
    data.mv_data = sval;

    /* 100 transactions of 100 random-UUID inserts each */
    for (i = 0; i < 100; i++)
    {
        MDB_txn *txn;
        MDB_dbi dbi;
        rc = mdb_txn_begin(env, NULL, 0, &txn);
        rc = mdb_open(txn, NULL, 0, &dbi);

        for (j = 0; j < 100; j++)
        {
            UuidCreate(&id);
            key.mv_data = &id;
            rc = mdb_put(txn, dbi, &key, &data, 0);
        }

        rc = mdb_txn_commit(txn);
        mdb_close(env, dbi);
    }

    end = clock();
    printf("%i", (end - start));
    fgetc(stdin);

    mdb_env_close(env);
    return 0;
}

As you can see, we are inserting 10,000 items (100 transactions of 100 items each). Each item has a 16 byte key and a 32 byte value (the size of sval above). Now, you might note that this is probably the worst case scenario for a B+Tree: UUIDs are unordered, and this generates a lot of fragmentation in the tree. A bad scenario, yes, but also a relatively common one, and one that needs to be handled properly for our needs. It took me a long while to narrow down what is actually going on. At first I was convinced that the problem was with Windows’ implementation of memory mapped files, but eventually I realized that select ain’t broken.

Here is the Process Monitor trace of the first few transactions. The highlighted calls are to FlushBuffersFile, which is how a FlushFileBuffers call shows up in the trace, and which indicates a commit.

[Image: Process Monitor trace of the first few transactions, with the FlushBuffersFile calls highlighted]
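For context, durably flushing a write-mapped file on Windows boils down to roughly the following pair of calls. This is a minimal sketch rather than LMDB's actual commit path; the view pointer and file handle are stand-ins.

#include <windows.h>

/* Minimal sketch: durably flush a write-mapped file on Windows. Not LMDB's
   actual commit code; `view` is the base address of the mapped view and
   `file` is the file handle, both assumed to already exist. */
static BOOL flush_mapped_file(void *view, SIZE_T size, HANDLE file)
{
    if (!FlushViewOfFile(view, size))  /* push dirty mapped pages to the file system cache */
        return FALSE;
    return FlushFileBuffers(file);     /* the call that shows up as FlushBuffersFile above */
}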

As I said, those are the first few transactions. You can see that the OS is doing a pretty good job of merging writes and avoiding seeks. However, as time goes by…

[Image: Process Monitor trace after more transactions, showing noticeably more writes per commit]

And as more time goes by, we get to… (there are more rows at the top that I had to remove so I could take the snapshot).

[Image: Process Monitor trace late in the run, with many more writes per commit]

In fact, I put the data in Excel and got:

[Image: Excel chart of writes and seeks per commit as the test progresses]

And those are some really bad numbers, I think you’ll agree. After a very short period of time, the cost of committing a transaction climbs to over 50 seeks and about 350KB of writes.

Let us compare that to what happens when we are actually writing sequential data (using UuidCreateSequential instead of UuidCreate). The test completes in half the time, and the actual statistics are also very interesting.
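The sequential run differs from the random one only in the key generation call in the inner loop of the test above, roughly:

/* Sequential variant of the inner loop from the test above: the only change
   is using UuidCreateSequential instead of UuidCreate. */
for (j = 0; j < 100; j++)
{
    UuidCreateSequential(&id);   /* instead of UuidCreate(&id) */
    key.mv_data = &id;
    rc = mdb_put(txn, dbi, &key, &data, 0);
}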

[Image: Excel chart of writes and seeks per commit for the sequential (UuidCreateSequential) run]

We are writing a lot less, but much more importantly, we are doing far fewer seeks. For reference, here is the final transaction that we have in the sequential write scenario:

[Image: Process Monitor trace of the final transaction in the sequential write scenario]

I ran those tests on an HDD, so seek times matter a lot, but I have gotten similar results when running on an SSD.

Now, the interesting question here is what exactly is going on? And I think that a large part of it is the way LMDB allocates pages. It uses a copy on write method, which means that after a transaction is done, the previously used pages are free. That means that they can be reclaimed, and the next transaction will do just that. The problem with this approach is that it tends to spread writes all over the file. This saves disk space, but requires a lot more seeks when you need to commit a transaction.
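To make that concrete, here is a deliberately simplified sketch of the policy being described; this is not LMDB's actual allocator, and the names are made up for illustration:

#include <stddef.h>

typedef unsigned long pgno_t;   /* page number within the data file */

/* Simplified model of a copy-on-write free page allocator: every page the
   last transaction copied gets added to free_pages, and the next transaction
   reuses those numbers before growing the file. Because the freed pages come
   from all over the file, the dirty pages of a commit end up scattered, and
   the flush turns into many small, seeky writes instead of one contiguous one. */
typedef struct {
    pgno_t *free_pages;   /* page numbers freed by previous transactions */
    size_t  free_count;
    pgno_t  next_page;    /* first never-used page at the end of the file */
} page_allocator;

static pgno_t alloc_page(page_allocator *a)
{
    if (a->free_count > 0)
        return a->free_pages[--a->free_count];  /* reuse: saves space, scatters writes */
    return a->next_page++;                      /* grow the file: stays sequential */
}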

That means that in order to optimize this particular scenario, I probably need to do some thinking about the free page allocation behavior. We want to be a lot more conservative about when / how we give out those pages, to make sure that we aren’t generating a really bad situation for the db when commit time comes.
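One way such a policy could look, continuing the sketch above (this is an assumption on my part, not necessarily what Voron will end up doing, and it presumes the free list is kept sorted):

#include <string.h>

#define MIN_RUN 16   /* assumed threshold: shortest contiguous run worth reusing */

/* Only hand out freed pages when the sorted free list contains a run of at
   least MIN_RUN consecutive page numbers; otherwise keep allocating at the
   end of the file so the dirty pages of a commit stay mostly contiguous. */
static pgno_t alloc_page_conservative(page_allocator *a)
{
    size_t i, start = 0;

    for (i = 1; i <= a->free_count; i++) {
        /* a run ends when the numbers stop being consecutive or the list ends */
        if (i == a->free_count || a->free_pages[i] != a->free_pages[i - 1] + 1) {
            if (i - start >= MIN_RUN) {
                pgno_t page = a->free_pages[start];
                /* remove the reused page, keeping the remaining list sorted */
                memmove(&a->free_pages[start], &a->free_pages[start + 1],
                        (a->free_count - start - 1) * sizeof(pgno_t));
                a->free_count--;
                return page;
            }
            start = i;
        }
    }
    return a->next_page++;   /* no long enough run: allocate at the end of the file */
}

With something along these lines, a commit that reuses pages at least reuses them in bunches, which keeps the number of distinct write locations, and therefore seeks, per commit down.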

Comments

Rafal
09/11/2013 09:47 AM by
Rafal

Interesting. So now your problem is more like an efficient memory manager implementation, where you have to look for larger contiguous blocks in a pool of free pages. Maybe some kind of compacting 'garbage collector' will be the solution? BTW, is there a significant performance difference between Voron at its current stage and LMDB?

Ayende Rahien
09/11/2013 11:37 AM by
Ayende Rahien

Rafal, A major premise of LMDB is that it isn't doing ANY background work. Compaction would be such a thing, so we want to avoid it in Voron as well.

tobi
09/11/2013 11:42 AM by
tobi

Doesn't look so good. Maybe the whole model is broken and cannot be reasonably fixed. A transaction log architecture with in-place lazy writes seems superior.

Rafal
09/11/2013 11:51 AM by
Rafal

Ayende, I'm not talking about background work, some level of compaction/fragmentation management can be performed synchronously while committing a transaction. For example, you can move additional pages (untouched by data modification) just to get a single contiguous block of data to flush, instead of several dirty pages separated by unmodified ones. You can move the data, so you're in a better situation than malloc, for example.

Ayende Rahien
09/11/2013 11:52 AM by
Ayende Rahien

Tobi, There is a reason I said that this is a true degenerate performance test case. This is writing 100 random values 1000 times. It is a test case that would have issues for any sorted data structure. Buffering / tx log would help, sure, but that means that you have to deal with write amplification and secondary effects. Not to mention that recovery gets complex.

In addition, this turns out to be a really rare scenario.

Ayende Rahien
09/11/2013 11:53 AM by
Ayende Rahien

Rafal, Actually, I can't really move data. That data might currently be read by another transaction, so moving it under that tx's feet would cause corruption.

Rafal
09/11/2013 11:55 AM by
Rafal

Use copy-on-write, Luke

Ayende Rahien
09/11/2013 11:56 AM by
Ayende Rahien

Rafal, Huh? I already do COW. That doesn't help me avoid creating holes.

Rafal
09/11/2013 12:06 PM by
Rafal

I mean there's no difference if you're copying a page because it's been modified or because you just want to move it to another place to keep the memory contiguous. Of course I don't know if it's worth it - I mean, if shuffling additional pages around would make writes less fragmented and thus more performant. I don't have any experience with such 'packing' algorithms.

Rafal
09/11/2013 12:09 PM by
Rafal

oh, shit, there is a difference - you can't overwrite pages modified/copied in current transaction...

Ayende Rahien
09/11/2013 12:09 PM by
Ayende Rahien

Rafal, The problem is that now you are doing things for the future, and it gets complex when you have stuff that isn't actually on a single page basis. I think we have a better solution, but we will see...

Matt Warren
09/11/2013 12:26 PM by
Matt Warren

Is this because LMDB was written to be used with OpenLDAP, where this type of access scenario isn't possible? I.e., purely random inserts aren't an access pattern for OpenLDAP?

If you look here you can see that LMDB doesn't do as well against LevelDB, in the Random Write benchmarks, http://symas.com/mdb/microbench/

Rafal
09/11/2013 12:33 PM by
Rafal

Yeah, I'm quite curious what your idea is.

Ayende Rahien
09/11/2013 12:48 PM by
Ayende Rahien

Matt, Random writes are always hard. LevelDB will do particularly well in that scenario simply because of the way it works: it writes to a log file (sequential) and to memory (fast). That means that it has a lot less work to do in order to get random writes right. The downside is that down the road, LMDB is probably going to still be about the same speed, while LevelDB would start hitting frequent compactions.

Howard Chu
09/11/2013 07:04 PM by
Howard Chu

Matt: yes, in a way. LMDB was written for OpenLDAP and we just didn't focus much effort on write performance. LevelDB wins most of the write benchmarks in the microbench, and that's what we expected to happen. Really, if you were paying attention to all the results, you should reasonably conclude that LMDB's write speed is not the best. It isn't, except in some very specific conditions (e.g. sequential bulk insert).

But ... just in the past couple days I tripped over some new information, which is going to require a lot of our benchmarks to be rerun.

http://www.openldap.org/lists/openldap-devel/201309/msg00008.html

This has no bearing on the Windows results Ayende is reporting, but all of my benchmarks were run on Linux 3.2 or 3.5, which are all affected by the kernel bug referenced in the above post. This means that switching to Linux 3.10 should give us a fairly significant write performance boost. (And has already been verified by another OpenLDAP team member, as you can see in the followup post on that thread.)
