Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

Degenerate performance scenario for LMDB

time to read 6 min | 1076 words

I am working peacefully through some stuff with Voron, when I run into some really bad performance in a pretty important scenario. I decided that this is probably something that I am doing wrong, and set out to reproduce the same scenario in LMDB, to figure out what I am doing wrong.

Here is the code:

#include <stdio.h>
#include <time.h>
#include <rpc.h>    /* UuidCreate / UuidCreateSequential */
#include "lmdb.h"

int main(int argc, char *argv[])
{
    int i = 0, j = 0, rc;
    UUID id;
    MDB_env *env;
    MDB_val key, data;
    char sval[32];
    clock_t start = clock(), end;

    rc = mdb_env_create(&env);
    rc = mdb_env_set_mapsize(env, 1024 * 1024 * 768);
    rc = mdb_env_set_flags(env, MDB_WRITEMAP, 1);
    rc = mdb_env_open(env, "E:\\data\\lmdb2", 0, 0664);

    key.mv_size = sizeof(UUID);
    data.mv_size = sizeof(sval);
    data.mv_data = sval;

    for (i = 0; i < 100; i++)
    {
        MDB_txn *txn;
        MDB_dbi dbi;
        rc = mdb_txn_begin(env, NULL, 0, &txn);
        rc = mdb_open(txn, NULL, 0, &dbi);
        for (j = 0; j < 100; j++)
        {
            UuidCreate(&id);
            key.mv_data = &id;
            rc = mdb_put(txn, dbi, &key, &data, 0);
        }
        rc = mdb_txn_commit(txn);
        mdb_close(env, dbi);
    }

    end = clock();
    printf("%i", (int)(end - start));
    fgetc(stdin);
    mdb_env_close(env);
    return 0;
}

As you can see, we are inserting 10,000 items (100 transactions of 100 items each). Each item has a key of 16 bytes and a value of 32 bytes. Now, you might note that this is probably the worst-case scenario for a B+Tree: UUIDs are unordered, and this generates a lot of fragmentation in the tree. A bad scenario, yes, but also a relatively common one, and one that needs to be handled properly for our needs. It took me a long while to narrow down what is actually going on. At first I was convinced that the problem was with Windows’ implementation of memory mapped files, but eventually I realized that select ain’t broken.

Here is the Process Monitor trace of the first few transactions. The highlighted calls are to FlushBuffersFile, which is how FlushFileBuffers shows up in the trace, and which indicates a commit.


As I said, those are the first few transactions. You can see that the OS is doing a pretty good job of merging writes and avoiding seeks. However, as time goes by…


And as more time goes by, we get to… (there are more rows at the top that I had to remove so I could take the snapshot).


In fact, I put the data in Excel and got:



And those are some really bad numbers, I think you’ll agree. After a very short period of time, the cost of committing a transaction grows to over 50 seeks and 350KB of writes.

Let us compare that to what happens when we are actually writing sequential data (using UuidCreateSequential instead of UuidCreate). The test completes in half the time, and the actual statistics are also very interesting.



We are writing a lot less, but much more importantly, we are doing a lot fewer seeks. For reference, here is the final transaction that we have in the sequential write scenario:


I ran those tests on an HDD, so seek times matter a lot, but I have gotten similar results when running on an SSD.

Now, the interesting question here is: what exactly is going on? I think that a large part of this is the way LMDB allocates pages. It uses a copy-on-write method, which means that after a transaction is done, the previously used pages are freed. That means that they can be reclaimed, and the next transaction will do just that. The problem with this approach is that it tends to spread writes all over the file. This saves disk space, but requires a lot more seeks when you need to commit a transaction.

That means that in order to optimize this particular scenario, I probably need to do some thinking about the free page allocation behavior. We want to be a lot more conservative about when / how we give out those pages, to make sure that we aren’t generating a really bad situation for the db when commit time comes.



Rafal

Interesting. So now your problem is more like an efficient memory manager implementation, where you have to look for larger contiguous blocks in a pool of free pages. Maybe some kind of compacting 'garbage collector' would be the solution? BTW, is there a significant performance difference between Voron at its current stage and LMDB?

Ayende Rahien

Rafal, A major premise of LMDB is that it isn't doing ANY background work. Compaction would be such a thing, so we want to avoid it in Voron as well.


tobi

Doesn't look so good. Maybe the whole model is broken and cannot be reasonably fixed. A transaction log architecture with in-place lazy writes seems superior.


Rafal

Ayende, I'm not talking about background work; some level of compaction/fragmentation management can be performed synchronously while committing a transaction. For example, you can move additional pages (untouched by data modification) just to get a single contiguous block of data to flush, instead of several dirty pages separated by unmodified ones. You can move the data around, so you're in a better situation than malloc, for example.

Ayende Rahien

Tobi, There is a reason I said that this is a true degenerate performance test case. This is writing 100 random values, 100 times over. It is a test case that would cause issues for any sorted data structure. Buffering / tx log would help, sure, but that means that you have to deal with write amplification and secondary effects. Not to mention that recovery gets complex.

In addition, this turns out to be a really rare scenario.

Ayende Rahien

Rafal, Actually, I can't really move data. That data might currently be read by another transaction, so moving it under that tx's feet would cause corruption.


Rafal

Use copy-on-write, Luke

Ayende Rahien

Rafal, Huh? I already do COW. That doesn't help me avoid creating holes.


Rafal

I mean there's no difference whether you're copying a page because it's been modified or because you just want to move it to another place to keep the data contiguous. Of course I don't know if it's worth it - I mean, whether shuffling additional pages around would make writes less fragmented and thus more performant. I don't have any experience with such 'packing' algorithms.


Rafal

oh, shit, there is a difference - you can't overwrite pages modified/copied in the current transaction...

Ayende Rahien

Rafal, The problem is that now you are doing things for the future, and it gets complex when you have stuff that isn't actually on a single page basis. I think we have a better solution, but we will see...

Matt Warren

Is this because LMDB was written to be used with OpenLDAP, where this type of access scenario isn't possible? I.e., are purely random inserts simply not an access pattern for OpenLDAP?

If you look here, you can see that LMDB doesn't do as well against LevelDB in the Random Write benchmarks: http://symas.com/mdb/microbench/


Rafal

Yeah, I'm quite curious what your idea is.

Ayende Rahien

Matt, Random writes are always hard. LevelDB will do particularly well in that scenario simply because of the way it works: it writes to a log file (sequential) and to memory (fast). That means that it has a lot less work to do in order to get random writes right. The downside is that down the road, LMDB is probably going to still be about the same speed, while LevelDB would start hitting frequent compactions.

Howard Chu

Matt: yes, in a way. LMDB was written for OpenLDAP and we just didn't focus much effort on write performance. LevelDB wins most of the write benchmarks in the microbench, and that's what we expected to happen. Really, if you were paying attention to all the results, you should reasonably conclude that LMDB's write speed is not the best. It isn't, except in some very specific conditions (e.g. sequential bulk insert).

But ... just in the past couple days I tripped over some new information, which is going to require a lot of our benchmarks to be rerun.


This has no bearing on the Windows results Ayende is reporting, but all of my benchmarks were run on Linux 3.2 or 3.5, which are all affected by the kernel bug referenced in the above post. This means that switching to Linux 3.10 should give us a fairly significant write performance boost. (And has already been verified by another OpenLDAP team member, as you can see in the followup post on that thread.)
