
Raven Storage–Voron’s Performance

Voron is the code name for another storage engine that we are currently trying out, based on LMDB. After taking it for a spin for a while, it is now pretty complete, and I decided to give it some perf test runs. So far, there has been zero performance work. All the usual caveats apply here, with regards to short test runs, etc.

Just like before, we found that this was horribly slow on the first run. The culprit was our debug code, which verified the entire structure whenever we added or removed something. Once we ran it in release mode, we started getting more interesting results.
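As an aside, the usual way to keep this sort of exhaustive verification out of release builds in .NET is the [Conditional] attribute. Here is a minimal sketch of the idea; the Tree class and ValidateTree method here are hypothetical stand-ins, not the actual Voron code:

using System.Diagnostics;
using System.IO;

public partial class Tree
{
    [Conditional("DEBUG")]
    private void ValidateTree()
    {
        // Walk the entire structure and assert its invariants (key
        // ordering, page usage, etc.). Calls to this method are stripped
        // by the compiler when the DEBUG symbol is not defined, so
        // release builds pay nothing for it.
    }

    public void Add(string key, Stream value)
    {
        // ... actual insert work elided ...
        ValidateTree(); // compiled out entirely in release mode
    }
}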

Here is the test code:

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data", flushMode)))
{
    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        var value = new byte[100];
        new Random().NextBytes(value);
        var ms = new MemoryStream(value);
        for (long i = 0; i < Count; i++)
        {
            ms.Position = 0;
            env.Root.Add(tx, i.ToString("0000000000000000"), ms);
        }

        tx.Commit();
    }
     using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
     {
         DebugStuff.RenderAndShow(tx, tx.GetTreeInformation(env.Root).Root);

         tx.Commit();
     }
}

We write 1 million entries, each with a 100 byte value and a 16 byte key. We run it in three modes:

fill seq none                      :      9,578 ms    104,404 ops / sec
fill seq buff                      :     10,802 ms     92,575 ops / sec
fill seq sync                      :      9,387 ms    106,528 ops / sec

None means no flushing to disk at all, letting the OS deal with that completely. Buffers (buff) means flush to the OS, but not to disk, and sync means do a full fsync.
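For reference, here is roughly how those three modes could map to Win32 calls for a memory mapped file. This is a sketch of the general technique, not necessarily what Voron does internally; the FlushMode enum simply mirrors the flushMode parameter in the test code above:

using System;
using System.Runtime.InteropServices;

public enum FlushMode { None, Buffers, Full }

public static class Flusher
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushViewOfFile(IntPtr lpBaseAddress, UIntPtr dwNumberOfBytesToFlush);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushFileBuffers(IntPtr hFile);

    public static void Flush(IntPtr viewBase, UIntPtr viewSize, IntPtr fileHandle, FlushMode mode)
    {
        if (mode == FlushMode.None)
            return; // let the OS write back whenever it likes

        // push the dirty pages of the mapped view to the file system
        FlushViewOfFile(viewBase, viewSize);

        if (mode == FlushMode.Full)
            FlushFileBuffers(fileHandle); // force write-through to the physical disk
    }
}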

Now, this benchmark is a pretty stupid way to go about it, I have to say. It is doing everything in a single transaction, and we are actually counting the time to close & open the db here as well, but that is okay for now. We aren't interested in real numbers, just some rough ideas.
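Just to make that concrete, a more realistic version of the write loop would commit in batches rather than in one giant transaction. A sketch, using the same API as above (the 10,000 batch size is an arbitrary assumption):

var value = new byte[100];
new Random().NextBytes(value);
var ms = new MemoryStream(value);

const long batchSize = 10 * 1000;
for (long i = 0; i < Count; i += batchSize)
{
    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        for (long j = i; j < Math.Min(i + batchSize, Count); j++)
        {
            ms.Position = 0;
            env.Root.Add(tx, j.ToString("0000000000000000"), ms);
        }

        tx.Commit(); // one flush per batch, instead of a single huge one at the end
    }
}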

Now, let us see how we read it:

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    using (var tx = env.NewTransaction(TransactionFlags.Read))
    {
        var ms = new MemoryStream(100);
        for (int i = 0; i < Count; i++)
        {
            var key = i.ToString("0000000000000000");
            using (var stream = env.Root.Read(tx, key))
            {
                ms.Position = 0;
                stream.CopyTo(ms);
            }
        }

        tx.Commit();
    }
}

And this gives us:

read seq                           :      3,289 ms    304,032 ops / sec

And again, this is with opening & closing the entire db.

We could do better with pre-allocation of space on the disk, etc. But I wanted to keep things realistic and to allow us to grow.
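For what it's worth, the pre-allocation part is simple enough to sketch: grow the backing file to its expected final size before mapping it, so the engine never has to extend and re-map the file mid-run. The 512 MB figure here is an arbitrary assumption:

using (var fs = new FileStream("bench.data", FileMode.OpenOrCreate,
                               FileAccess.ReadWrite, FileShare.None))
{
    const long expectedSize = 512L * 1024 * 1024; // assumed final size
    if (fs.Length < expectedSize)
        fs.SetLength(expectedSize); // reserve the space in one go
}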

Next, I wanted to see how much we would gain by parallelizing everything, so I wrote the following code:

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    var countdownEvent = new CountdownEvent(parts);
    for (int i = 0; i < parts; i++)
    {
        var currentBase = i;
        ThreadPool.QueueUserWorkItem(state =>
        {
            using (var tx = env.NewTransaction(TransactionFlags.Read))
            {
                var ms = new MemoryStream(100);
                for (int j = 0; j < Count / parts; j++)
                {
                    var current = j * currentBase;
                    var key = current.ToString("0000000000000000");
                    using (var stream = env.Root.Read(tx, key))
                    {
                        ms.Position = 0;
                        stream.CopyTo(ms);
                    }
                }

                tx.Commit();
            }

            countdownEvent.Signal();
        });
    }
    countdownEvent.Wait();
}

I then ran it with 1 to 16 parts, to see how it behaves. Here are the details for this machine:

[Image: machine hardware details]

And the results pretty much match what I expected:

read seq                           :      3,317 ms    301,424 ops / sec
read parallel 1                    :      2,539 ms    393,834 ops / sec
read parallel 2                    :      1,950 ms    512,711 ops / sec
read parallel 4                    :      2,201 ms    454,172 ops / sec
read parallel 8                    :      2,139 ms    467,387 ops / sec
read parallel 16                   :      2,010 ms    497,408 ops / sec

We get a 2x perf improvement from running on two cores. Running on 4 threads requires some dancing around, which caused some perf drop, and beyond that we mostly see time spent in thread switching rather than in actual work. As you can see, we get a really nice jump in performance the more cores we use, until we reach the actual machine limitations.

Note that I made sure to clear the OS buffer cache before each test. If we don't do that, we get:

read seq                           :      2,562 ms    390,291 ops / sec
read parallel 1                    :      2,608 ms    383,393 ops / sec
read parallel 2                    :      1,868 ms    535,220 ops / sec
read parallel 4                    :      1,646 ms    607,283 ops / sec
read parallel 8                    :      1,673 ms    597,557 ops / sec
read parallel 16                   :      1,581 ms    632,309 ops / sec
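For those who want to reproduce this: one commonly cited way to evict a single file from the Windows cache (not necessarily what I used here, just an assumption about how you could do it) is to briefly open the file with FILE_FLAG_NO_BUFFERING, which is reported to drop that file's pages from the OS cache:

using System.IO;

static class CacheUtil
{
    // FILE_FLAG_NO_BUFFERING isn't exposed on the FileOptions enum,
    // so we cast the raw Win32 flag value
    const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

    public static void EvictFromOsCache(string path)
    {
        // opening & closing the file with no-buffering semantics is
        // enough; no reads are required
        using (new FileStream(path, FileMode.Open, FileAccess.Read,
                              FileShare.ReadWrite, 4096, FileFlagNoBuffering))
        {
        }
    }
}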

So far, I am pretty happy with those numbers. What I am not happy with is the current API, but I'll talk about that in a separate post.

Comments

Kelly Sommers
07/31/2013 08:50 PM

Interesting project! Looking forward to seeing how this evolves.

These numbers don't make sense to me, though, just like the last set of benchmarks.

The delta between fsync and non-fsync writes should be much greater. I suspect you're only writing to the disk cache most of the time and that the writes aren't safe.

Can you describe what kind of disk is being used? I would like to look up the cache details.

Also, to see proper CPU usage and disk characteristics while benchmarking, I suggest something for Windows that's similar to iostat or dstat. Task Manager is not going to tell much of a story that's worthwhile.

Ayende Rahien
08/01/2013 04:31 AM

Kelly, There is just one transaction, so just one set of fsync() calls. Also note that we close the db in the benchmark. That means that we are effectively flushing the file to the OS no matter what. Yes, we are using SSDs, those are pretty early results, just there to show that we are on the right track.

Rafal
09/20/2013 11:22 AM

Great results. BTW, do you employ the write scheduling algorithm you posted as the programming exercise?

Ayende Rahien
09/20/2013 11:24 AM

Rafal, Not quite, we are doing something different.

Joram
09/20/2013 08:21 PM

Don't know if you've had a chance to listen.

http://slideshot.epfl.ch/play/suri_stonebraker#2

Prof. Michael Stonebraker has some opinions on DBs; check out slide 19, where he shows where the major bottlenecks for DBs are. Might be of use.

Howard Chu
09/21/2013 02:43 AM

Preaching to the choir, Joram. Regarding that slide 19: LMDB does no locking, no latching, no recovery, and has no buffer pool. All of that crap is gone, totally unnecessary. All of that's a waste of time.

Howard Chu
09/21/2013 02:47 AM

All old news. LMDB's basic principle, single-level-store, is aimed at eliminating the difference between on-disk format and in-memory format.

Howard Chu
09/21/2013 02:57 AM

... sorry for the multiple one-line posts. Just came back from cocktail hour and the thoughts are only rolling in one at a time ...

It's important to remember, when talking about "eliminating the difference between on-disk and in-memory format" - this doesn't mean you can just use any arbitrary in-memory format. LMDB is still a page-oriented DB because memory pages are the fundamental unit of the virtual memory system. If you just throw architecture concepts out the window and use a free form in-memory data structure you will lose performance.

Slide 32, saying that disk-oriented data format is slower, is dead wrong. What is correct is to ensure that you have only one format, so there is no decode step needed to transfer from disk to memory, but as I pointed out, your format cannot be oblivious to the realities of the paged memory management system.

In LMDB we have the MDB_FIXEDMAP option for the true single-level-store use. The purpose of this option is to allow you to create records that use live pointers and store them in the memory map, and then use them directly during future reads. The presentation is at least correct in this regard - your aim is to eliminate the decode step.

Howard Chu
09/21/2013 03:08 AM

Heh, those slides are from 2010. We first published these principles of LMDB in 2009.

http://www.openldap.org/lists/openldap-devel/200905/msg00036.html

Johnny-come-latelies...

Joram
09/22/2013 03:47 PM

Very true, Howard.

I was, however, hoping the info about the cost of multithreading might be useful to Ayende.
