Ayende @ Rahien

It's a girl

HotSpot Shell Game

One of the annoying things about doing performance work is that you sometimes feel like you have found the problem. You fix it, and another one pops up.

In the previous round of work, we tackled two important perf issues:

  • Immutable collections – which were replaced with more memory-intensive but hopefully faster safe collections.
  • Transaction contention from background work – which we reduced and nearly eliminated.

Once we had done that, it was time to actually look at the code in the profiler again. Note that I am writing this while I am traveling on the train (on battery power), so you can pretty much ignore the actual numbers; we want to look at the percentages.

Let us look at the contention issue first. Here is the old code:

[image: profiler output, old contention code]

And the new one:

[image: profiler output, new contention code]

We still have some cost, but that appears to have been drastically cut down. I’ll call it good for now.

What about transaction commit? Here is the old code:

[image: profiler output, old commit code]

And here is the new one:

[image: profiler output, new commit code]

What we can see is that our change didn’t actually improve anything much. In fact, we are worse off. We’ll probably go with a more complex system of chained dictionaries, rather than the plain copying that appears to be causing so much trouble.

However, there is something else to note now. Both Commit and NewTransaction together are less than 60% of the cost of the operation. The major cost is actually in adding items to the in-memory tree:

[image: profiler output, in-memory tree insertion cost]

In particular, it looks like finding the right page to place the item in is what is costing us so much. We added 2 million items (once sequential, once random), which means that we need to do an O(log N) operation 2 million times. I don’t know if we can improve on the way it works for random writes, but we should be able to do something about how it behaves for sequential writes.
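One direction for the sequential case (this is a sketch of the idea, with hypothetical names, not actual Voron code) is to remember the last leaf page we inserted into and check it before doing the full descent from the root:

public class PageRef
{
    public string FirstKey;
    public string LastKey;
    public bool IsRightmost; // the rightmost leaf accepts any key greater than its first
}

public class SequentialInsertHint
{
    private PageRef _lastLeaf;

    // Returns the cached leaf if the new key clearly belongs to it; otherwise null,
    // and the caller falls back to the usual O(log N) search from the root.
    public PageRef TryGetCachedLeaf(string key)
    {
        if (_lastLeaf == null)
            return null;
        if (string.CompareOrdinal(key, _lastLeaf.FirstKey) < 0)
            return null;
        if (_lastLeaf.IsRightmost || string.CompareOrdinal(key, _lastLeaf.LastKey) <= 0)
            return _lastLeaf;
        return null;
    }

    public void Remember(PageRef leaf)
    {
        _lastLeaf = leaf;
    }
}

For a purely sequential load, the cached leaf would hit almost every time, so most of those 2 million O(log N) descents collapse into a couple of string comparisons; random writes would see no benefit, which matches the suspicion above.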


SafeList, SafeDictionary–fast immutable structures

After the issues we ran into with immutable structures, I decided that I still wanted the same guarantees that immutable data structures gave us, while keeping the speed of mutable structures.

I managed to do that by making different design choices when building my implementation. While the BCL immutable collections are based on binary trees, and trade CPU cycles to reduce GC pressure, I find that for a lot of the things that we do, we can spend the memory in favor of actually being faster overall. In particular, this is true since the immutable collections are an order of magnitude slower for reads than their mutable counterparts. I could have dealt with the slow writes, somehow. But we do a lot of reads, and that is utterly unacceptable.

Here is how I implemented SafeList:

public class SafeList<T> : IEnumerable<T>
{
    List<T> _inner = new List<T>();

    public static readonly SafeList<T> Empty = new SafeList<T>();

    private SafeList()
    {

    }

    public SafeList<T> AddRange(IEnumerable<T> items)
    {
        var inner = new List<T>(_inner);
        inner.AddRange(items);
        return new SafeList<T>
        {
            _inner = inner
        };
    }

    public SafeList<T> Add(T item)
    {
        return new SafeList<T>
        {
            _inner = new List<T>(_inner) {item}
        };
    }

    public int Count
    {
        get { return _inner.Count; }
    }

    public T this[int i]
    {
        get { return _inner[i]; }
    }

    public SafeList<T> RemoveAll(Func<T, bool> filter)
    {
        return new SafeList<T>
        {
            _inner = new List<T>(_inner.Where(x => filter(x) == false))
        };
    }

    public T Find(Predicate<T> predicate)
    {
        return _inner.Find(predicate);
    }

    public SafeList<T> RemoveAllAndGetDiscards(Predicate<T> filter, out List<T> discards)
    {
        var list = new List<T>();

        discards = list;

        return new SafeList<T>
        {
            _inner = new List<T>(_inner.Where(x =>
            {
                if (filter(x) == false)
                {
                    list.Add(x);
                    return false;
                }
                return true;
            }))
        };
    }
}

The implementation of SafeDictionary is pretty similar. Note that I have dedicated, optimized methods for the kind of things that I need in my code.

But the key part here is that I gain the safety by always creating another copy of the underlying implementation. We are never actually mutating anything.

Sure, this results in us having to deal with a lot more allocations, but we get the same speed for reads as you do for the standard mutable collections. And more importantly, we are still much faster for writes than the BCL immutable collections.
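For completeness, here is roughly what a SafeDictionary along the same lines might look like (a minimal sketch of the approach, not the actual implementation):

public class SafeDictionary<TKey, TValue>
{
    Dictionary<TKey, TValue> _inner = new Dictionary<TKey, TValue>();

    public static readonly SafeDictionary<TKey, TValue> Empty = new SafeDictionary<TKey, TValue>();

    private SafeDictionary()
    {
    }

    public SafeDictionary<TKey, TValue> SetItem(TKey key, TValue value)
    {
        // copy the whole dictionary, then mutate the private copy
        var inner = new Dictionary<TKey, TValue>(_inner);
        inner[key] = value;
        return new SafeDictionary<TKey, TValue>
        {
            _inner = inner
        };
    }

    public SafeDictionary<TKey, TValue> Remove(TKey key)
    {
        var inner = new Dictionary<TKey, TValue>(_inner);
        inner.Remove(key);
        return new SafeDictionary<TKey, TValue>
        {
            _inner = inner
        };
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        // reads go straight to the underlying mutable dictionary, so lookups cost
        // exactly the same as a plain Dictionary<TKey, TValue>
        return _inner.TryGetValue(key, out value);
    }

    public int Count
    {
        get { return _inner.Count; }
    }
}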


Immutable collections performance, take II

Why is an immutable list over 16 times slower than a standard list? I took a peek at what it was actually doing, and it made a lot of sense. In order to support efficient indexed access, the actual storage of the data in the immutable list is a binary tree, with the index being used as the key.

This results in a much higher cost for pretty much everything. Let us look at the following:

var listsp = Stopwatch.StartNew();
var list = new List<int>(Enumerable.Range(0, 10*1000*1000));

for (int i = 0; i < list.Count; i++)
{
    var _ = list[i];
}

Console.WriteLine(listsp.Elapsed);

var ilist = ImmutableList<int>.Empty.AddRange(list);
listsp.Restart();

for (int i = 0; i < ilist.Count; i++)
{
    var _ = ilist[i];
}
Console.WriteLine(listsp.Elapsed);

The List<T> takes 0.23 seconds, while the ImmutableList<T> takes 1.28 seconds. When I use foreach instead, we get 0.22 seconds vs. 2.29 seconds.
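The foreach numbers presumably come from swapping the indexer loops for plain enumeration, along these lines:

listsp.Restart();
foreach (var item in list)
{
}
Console.WriteLine(listsp.Elapsed);   // ~0.22 seconds

listsp.Restart();
foreach (var item in ilist)
{
}
Console.WriteLine(listsp.Elapsed);   // ~2.29 seconds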

As you can see from the blog post describing them, the immutable collections are mostly implemented as binary trees, so I don’t really think that there is a good way to improve this as is. The problem is that the immutable collections actually need to do a lot more than what I need them to do.

Now, it might have been acceptable to use them if the perf hit was limited to just writing to them. But as you can see, we have the same problem when we are reading, and that is quite unacceptable.


Immutable Collections performance

After finding out that a lot of our costs in Voron are actually related to immutable collections, I decided to run some isolated perf tests.

private static void Dictionary()
{
    var dic = new Dictionary<object, object>();
    var sp = Stopwatch.StartNew();

    for (int i = 0; i < 10*1000*1000; i++)
    {
        var obj = new object();
        dic[obj] = obj;
    }

    Console.WriteLine(sp.Elapsed);
}

private static void ImmutableDictionary()
{
    var dic = ImmutableDictionary<object,object>.Empty;
    var sp = Stopwatch.StartNew();

    for (int i = 0; i < 10 * 1000 * 1000; i++)
    {
        var obj = new object();
        dic = dic.SetItem(obj, obj);
    }

    Console.WriteLine(sp.Elapsed);
}

Dictionary: 3.639 seconds

ImmutableDictionary: 1 minute and 58 seconds

Yes, that is correct, the immutable version is over thirty times slower. And given that we are only using 10 million items, that is a ridiculously high cost.

I decided to do some tests to see if it is just the number of calls that we are making here:

private static void ImmutableDictionary()
{
    var dic = ImmutableDictionary<object,object>.Empty;
    var sp = Stopwatch.StartNew();

    dic = dic.SetItems(Enumerable.Range(0, 10*1000*1000).Select(i =>
    {
        var obj = new object();
        return new KeyValuePair<object, object>(obj, obj);
    }));

    Console.WriteLine(sp.Elapsed);
}

1 minute

So it appears that it isn’t the number of calls, but something intrinsic to the way it works. And how about lists?

private static void List()
{
    var list = new List<object>();
    var sp = Stopwatch.StartNew();

    for (int i = 0; i < 10 * 1000 * 1000; i++)
    {
        var obj = new object();
        list.Add(obj);
    }

    Console.WriteLine(sp.Elapsed);
}

private static void ImmutableList()
{
    var list = ImmutableList<object>.Empty;
    var sp = Stopwatch.StartNew();

    for (int i = 0; i < 10 * 1000 * 1000; i++)
    {
        var obj = new object();
        list = list.Add(obj);
    }

    Console.WriteLine(sp.Elapsed);
}

List: 0.9 seconds

ImmutableList: 16 seconds

Ouch.

I think we are going to need to do something about this. Even though I love the idea of immutable collections, 16–32 times slower is just going to be utterly unacceptable.

But this is for writes, how about reads?

I had written the following for both of them:

var sp = Stopwatch.StartNew();
for (int i = 0; i < 10*1000*1000; i++)
{
    object value;
    dic1.TryGetValue(i, out value);
}

Console.WriteLine(sp.Elapsed);

var sp1 = Stopwatch.StartNew();

for (int i = 0; i < 10 * 1000 * 1000; i++)
{
    object value;
    dic2.TryGetValue(i, out value);
}
Console.WriteLine(sp1.Elapsed);

Dictionary: 0.45 seconds

ImmutableDictionary: 3.11 seconds

So the summary is, the performance for our workloads is probably going to be simply unacceptable. We’ll have to find another way to handle this. The way to handle this nicely is to basically copy the values. So I wanted to know how long it would take to fully clone a 10 million item dictionary. The answer: 0.6 seconds. Doing the same with the immutable dictionary? 16 seconds.
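For reference, the clone timing is presumably just the copy constructor on one side and rebuilding from the same items on the other, along these lines (a sketch of how I read it, not the exact test code; dic1 and dic2 are the dictionaries from the read test above):

var sp = Stopwatch.StartNew();
var clone = new Dictionary<int, object>(dic1);    // full copy of the mutable dictionary, ~0.6 seconds
Console.WriteLine(sp.Elapsed);

sp.Restart();
var immutableClone = ImmutableDictionary<int, object>.Empty.SetItems(dic2);  // ~16 seconds
Console.WriteLine(sp.Elapsed);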

So we are going to need to do something else.

The difference between benchmarks & performance tests

Also known as: Please check the subtitle of this blog.

This post is in response to this one. Kelly took offence at this post about Voron performance. In particular, it appears that the major issues are:

This benchmark doesn’t actually provide much useful information. It is too short and compares fully featured DBMS systems to storage engines. I always stress very much that people never make decisions based on benchmarks like this.

These paint the fully featured DBMS systems in a negative light that isn’t a fair comparison. They are doing a LOT more work. I’m sure the FoundationDB folks will not be happy to know they were roped into an unfair comparison in a benchmark where the code is not even available.

This isn’t a benchmark. This is just an interim step along the way of developing Voron. It is a way for us to see where we stand and where we need to go. A benchmark includes full details about what you did (machine specs, running environment, full source code, etc). This is just us putting stress on our machine and comparing where we are at. And yes, we could have done it in isolation, but that wouldn’t really give us any major advantage. We need to see how we compare to other databases.

And yes, we compare apples to oranges here when we compare a low level storage engine like Voron to SQL Server. I am well aware of that. But that isn’t the point. For the same reason, we are currently doing a lot of micro benchmarks rather than the 48 hour ones we have in the pipeline.

I am trying to see how users will evaluate Voron down the road. A lot of the time, that means users doing micro benchmarks to see how good we are. Yes, those aren’t very useful, but they are a major way people make decisions. And I want to make sure that we come out in a good light under that scenario.

With regards to FoundationDB, I am sure they are as happy about it as I am about them making silly claims about RavenDB transaction support. And the source code is available if you really want it; in fact, we included FoundationDB because we had an explicit customer request, and because they contributed the code for that.

Next, let us move to something else equally important. This is my personal blog. I publish here things that I do on a daily basis. And if I am currently in a performance boost stage, you’re going to be getting a lot of details on that. Those are the results of performance runs, they aren’t benchmarks. They don’t get anywhere beyond this blog. When we put the results on ravendb.net, or something like that, then it will be a proper benchmark.

And while I fully agree that making decisions based on micro benchmarks is a silly way to go about doing so, the reality is that many people do just that. So one of the things that I’m focusing on is exactly those things. It helps that we currently see a lot of places to improve in those micro benchmarks. We already have a plan (and code) to see how we do on a 24–48 hour benchmark, which would also allow us to see all sorts of interesting things (mixed reads & writes, what happens when you go beyond physical memory size, longevity issues, etc).

Hunting the performance bottleneck

We got a support request from a customer, and this time it was quite a doozie. They were using the FreeDB dataset as a test bed for a lot of experiments, and they found very slow indexing speed with it. In fact, what they found was, to be frank, utterly unacceptable indexing speed. It took days to complete. However, that ran contrary to all of the optimizations that we have done in the past few years.

So something there was quite fishy. Luckily, it was fairly simple to reproduce. And the culprit was very easily identified. It was the SQL Replication Bundle. But why? That turned out to be a pretty complex answer. 

The FreeDB dataset currently contains over 3.1 million records, and the sample the customer sent us had about 8 indexes, varying in complexity from trivial to full-text analyzed and map/reduce indexes. We expect such a workload, plus the replication to SQL, to take a while. But it should be a pretty fast process.

What turned out to be the problem was the way the SQL Replication bundle works. Since we don’t know what changes you are going to replicate to SQL, the first thing we do is delete all the data that might have previously been replicated. In practice, that means we execute something like:

DELETE FROM Disks WHERE DocumentId in (@p1, @p2, @p3)

INSERT INTO Disks (...) VALUES( ... )
INSERT INTO Disks (...) VALUES( ... )
INSERT INTO Disks (...) VALUES( ... )

And over time, this became slower & slower. Now, SQL Replication needs access to a lot of documents (potentially all of them), so we use the same prefetcher technique that we use for indexing. And we also use the same optimizations to decide how much to load.

However, in this case, we had the SQL Replication being slow, and because we use the same optimization, to the optimizer it looked like we had a very slow index. That calls for reducing the load on the server, so we can have greater responsiveness and reduce overall resource utilization. And that impacted the indexes. In effect, SQL Replication being slow forced us to feed the data into the indexes in small tidbits, and that drastically increased the I/O costs that we had to pay.

So the first thing to do was to actually break it apart; we now have separate optimizer instances for indexes and SQL Replication (and RavenDB Replication, for that matter), and they cannot impact one another.

But the root cause was that SQL Replication was slow. And I think you should be able to figure out why from the information outlined above.

As we replicated more and more data into SQL, we increased the table size. And as we increased the table size, statements like our DELETE would take more and more time. SQL was doing a lot of table scans.

To be honest, I never really thought about it. RavenDB in the same circumstances would just self-optimize and things would get faster. SQL Server (and any other relational database) will just be dog slow until you come and add the appropriate index.

Once that was done, our performance was back on track and we could run things speedily both for indexes and for SQL Replication.

What is your control group?

One of the areas where we think Voron can be improved is the free space utilization policy. In particular, smarter free space utilization can lead to better performance, since we won’t have to seek so much.

I spent some time working on that, and I got something that, on paper at least, looks much better performance wise. But… actual benchmarks showed little to no improvement, and in some cases, actual degradation! That was the point when I realized that I actually needed to have some sort of control, to see what the absolute optimal scenario for us would be. So I wrote a null free space policy. With no free space reuse, Voron will always go to the end of the file, giving us the best case scenario of sequential writes.
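The null policy itself is trivial. Conceptually it is something along these lines (hypothetical names; the real interface in Voron is richer):

public interface IFreeSpaceRepository
{
    long? TryAllocateFreePage(int numberOfPages);
    void RegisterFreePages(IEnumerable<long> pages);
}

public class NullFreeSpacePolicy : IFreeSpaceRepository
{
    // never hand out a reclaimed page, so the allocator always falls through
    // to allocating at the end of the file - pure sequential writes
    public long? TryAllocateFreePage(int numberOfPages)
    {
        return null;
    }

    public void RegisterFreePages(IEnumerable<long> pages)
    {
        // deliberately discard the freed pages; we leak space in exchange for zero seeks
    }
}

The point of the control is that if even this policy doesn’t speed things up, then free space handling isn’t the bottleneck.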

This gives us the following behavior:

Flush      1 with   2 pages -   8 kb writes and   1 seeks (  2 leaves,   0 branches,   0 overflows)
Flush      2 with   8 pages -  32 kb writes and   1 seeks (  7 leaves,   1 branches,   0 overflows)
Flush      3 with  10 pages -  40 kb writes and   1 seeks (  9 leaves,   1 branches,   0 overflows)
Flush     27 with  74 pages - 296 kb writes and   1 seeks ( 72 leaves,   2 branches,   0 overflows)
Flush     28 with  74 pages - 296 kb writes and   1 seeks ( 72 leaves,   2 branches,   0 overflows)
Flush     29 with  72 pages - 288 kb writes and   1 seeks ( 70 leaves,   2 branches,   0 overflows)
Flush  1,153 with 155 pages - 620 kb writes and   1 seeks (102 leaves,  53 branches,   0 overflows)
Flush  1,154 with 157 pages - 628 kb writes and   1 seeks (104 leaves,  53 branches,   0 overflows)
Flush  1,155 with 165 pages - 660 kb writes and   1 seeks (108 leaves,  57 branches,   0 overflows)
Flush  4,441 with 191 pages - 764 kb writes and   1 seeks (104 leaves,  87 branches,   0 overflows)
Flush  4,442 with 196 pages - 784 kb writes and   1 seeks (107 leaves,  89 branches,   0 overflows)
Flush  4,443 with 198 pages - 792 kb writes and   1 seeks (108 leaves,  90 branches,   0 overflows)
Flush  7,707 with 200 pages - 800 kb writes and   1 seeks (106 leaves,  94 branches,   0 overflows)
Flush  7,708 with 204 pages - 816 kb writes and   1 seeks (106 leaves,  98 branches,   0 overflows)
Flush  7,709 with 211 pages - 844 kb writes and   1 seeks (113 leaves,  98 branches,   0 overflows)
Flush  9,069 with 209 pages - 836 kb writes and   1 seeks (107 leaves, 102 branches,   0 overflows)
Flush  9,070 with 205 pages - 820 kb writes and   1 seeks (106 leaves,  99 branches,   0 overflows)
Flush  9,071 with 208 pages - 832 kb writes and   1 seeks (108 leaves, 100 branches,   0 overflows)

And with this, 10,000 transactions with 100 random values each give us:

fill rnd buff separate tx          :    106,383 ms      9,400 ops / sec

And that tells me that even in the best case scenario, there is something else that is causing this problem, and it ain’t the cost of doing seeks. I dropped the number of transactions to 500 and ran it through a profiler, and I got the following:

[images: profiler output for the 500 transaction run]

In other words, pretty much the entire time was spent just calling FlushViewOfFile. However, I think that we optimized that enough already, didn’t we? Looking at the calls, it seems that we have just one FlushViewOfFile per transaction in this scenario.

In fact, looking at the actual system behavior, we can see:

[image: Process Monitor output]

So seek-wise, we are good. What I can’t understand, however, is why we see those ReadFile calls. Looking at the data, it appears that we run into this whenever we access a new portion of the file, so this is the mmap subsystem paging the file contents into memory before we write to it. It is actually pretty great that it is able to page in 1 MB at a time.
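If we wanted to take that paging cost out of the measured run, one option (a sketch, not something the test above does) would be to touch each page of the mapped range before starting the clock, forcing the OS to fault everything in up front:

unsafe static long WarmUpMapping(byte* mapBase, long length)
{
    const int pageSize = 4096;
    long sum = 0;
    for (long i = 0; i < length; i += pageSize)
    {
        sum += mapBase[i]; // one read per page is enough to page it in
    }
    return sum; // returned so the loop isn't optimized away
}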

Next, let us see what else we can do here. I ran the 500 tx test on an HDD, and it gave me the following results.

fill rnd sync separate tx          :     25,540 ms      1,958 ops / sec

But note that each transaction does two writes: one at the end of the file, and one at the beginning of the file (which is the actual final act of the commit). What happens if we just remove that part?

This gives me a very different number:

fill rnd sync separate tx          :     21,764 ms      2,297 ops / sec

So just seeking and writing a single page costs us about 17% of our performance. Here are the details from running this test:

[image: Process Monitor details for this run]

Now, this is a meaningless test, added just to check what the relative costs are. We have to do the header write, otherwise we can’t do real transactions.

For fun, I ran the same thing using sequential writes, giving me 3,619 ops / sec. Since in both cases we are actually doing sequential writes, the major difference was how much we actually wrote. This is the view of writing sequentially:

[image: Process Monitor view of sequential writes]

As you can see, we only have to write 8–10 pages per transaction, compared to 110–130 in the random case. And that obviously has a lot of implications.

All of this has taught me something very important. In the end, the actual free space policy matters, but not that much. So I need to select something that is good, and that is about it.

Raven Storage–Voron’s Performance

Voron is the code name for another storage engine that we are currently trying, based on LMDB. After taking it for a spin for a while, it is now pretty complete, and I decided to give it some perf test runs. So far, there has been zero performance work. All the usual caveats apply here, with regard to short test runs, etc.

Just like before, we found that this is horribly slow on the first run. The culprit was our debug code that verified the entire structure whenever we added or removed something. Once we ran it in release mode, we started getting more interesting results.

Here is the test code:

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data", flushMode)))
{
    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        var value = new byte[100];
        new Random().NextBytes(value);
        var ms = new MemoryStream(value);
        for (long i = 0; i < Count; i++)
        {
            ms.Position = 0;
            env.Root.Add(tx, i.ToString("0000000000000000"), ms);
        }

        tx.Commit();
    }
     using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
     {
         DebugStuff.RenderAndShow(tx, tx.GetTreeInformation(env.Root).Root);

         tx.Commit();
     }
}

We write 1 million entries, each with a 100 byte value and a 16 character key. We run it in three modes:

fill seq none                      :      9,578 ms    104,404 ops / sec
fill seq buff                      :     10,802 ms     92,575 ops / sec
fill seq sync                      :      9,387 ms    106,528 ops / sec

None means no flushing to disk, letting the OS deal with that completely. Buffers means flush to the OS, but not to disk, and sync means do a full fsync.

Now, this is a pretty stupid way to go about it, I have to say. This is doing everything in a single transaction, and we are actually counting the time to close & open the db here as well, but that is okay for now. We aren’t interested in real numbers, just some rough ideas.

Now, let us see how we read it:

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    using (var tx = env.NewTransaction(TransactionFlags.Read))
    {
        var ms = new MemoryStream(100);
        for (int i = 0; i < Count; i++)
        {
            var key = i.ToString("0000000000000000");
            using (var stream = env.Root.Read(tx, key))
            {
                ms.Position = 0;
                stream.CopyTo(ms);
            }
        }

        tx.Commit();
    }
}

And this gives us:

read seq                           :      3,289 ms    304,032 ops / sec

And again, this is with opening & closing the entire db.

We could do better with pre-allocation of space on the disk, etc. But I wanted to keep things realistic and to allow us to grow.

Next, I wanted to see how much we would gain by parallelizing everything, so I wrote the following code:

using (var env = new StorageEnvironment(new MemoryMapPager("bench.data")))
{
    var countdownEvent = new CountdownEvent(parts);
    for (int i = 0; i < parts; i++)
    {
        var currentBase = i;
        ThreadPool.QueueUserWorkItem(state =>
        {
            using (var tx = env.NewTransaction(TransactionFlags.Read))
            {
                var ms = new MemoryStream(100);
                for (int j = 0; j < Count / parts; j++)
                {
                    var current = j * currentBase;
                    var key = current.ToString("0000000000000000");
                    using (var stream = env.Root.Read(tx, key))
                    {
                        ms.Position = 0;
                        stream.CopyTo(ms);
                    }
                }

                tx.Commit();
            }

            countdownEvent.Signal();
        });
    }
    countdownEvent.Wait();
}

I then ran it with 1–16 parts, to see how it behaves. Here are the details for this machine:

[image: machine specs]

And the results pretty much match what I expected:

read seq                           :      3,317 ms    301,424 ops / sec
read parallel 1                    :      2,539 ms    393,834 ops / sec
read parallel 2                    :      1,950 ms    512,711 ops / sec
read parallel 4                    :      2,201 ms    454,172 ops / sec
read parallel 8                    :      2,139 ms    467,387 ops / sec
read parallel 16                   :      2,010 ms    497,408 ops / sec

We get a 2x perf improvement from running on two cores; running on 4 threads requires some dancing around, which caused some perf drop, and beyond that we see more time spent in thread switching than anything else, pretty much. As you can see, we get a pretty nice jump in performance the more cores we use, until we reach the actual machine limitations.

Note that I made sure to clear the OS buffer cache before each test. If we don't do that, we get:

read seq                           :      2,562 ms    390,291 ops / sec
read parallel 1                    :      2,608 ms    383,393 ops / sec
read parallel 2                    :      1,868 ms    535,220 ops / sec
read parallel 4                    :      1,646 ms    607,283 ops / sec
read parallel 8                    :      1,673 ms    597,557 ops / sec
read parallel 16                   :      1,581 ms    632,309 ops / sec

So far, I am pretty happy with those numbers. What I am not happy with is the current API, but I’ll talk about that in a separate post.

Optimizing writes in Voron

As I mentioned, one of the things that I have been working on with Voron is optimizing the sad case of random writes. I have already discussed some of the issues that we had, and now I want to explain how we approach resolving them.

With LMDB, free space occurs on every write, because we don’t make modifications in place; instead, we make modifications to a copy, and free the existing page to be reclaimed later. The way free space reclamation works, a new page can be allocated anywhere in the file. That can lead to a lot of seeks. With Voron, we use a more complex policy. The file is divided into 4 MB sections, and we aggregate the free space in each section. When we need more space, we will find a section with enough free space and use that, and we will continue to use that section for as long as we can. The end result is that we tend to be much more local in the way we reuse space.
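Here is a rough sketch of that section-based policy (hypothetical types, much simplified compared to the real thing):

public class SectionedFreeSpace
{
    const int PageSize = 4096;
    const int PagesPerSection = (4 * 1024 * 1024) / PageSize; // 4 MB sections

    readonly Dictionary<long, SortedSet<long>> _freePagesBySection = new Dictionary<long, SortedSet<long>>();
    long _currentSection = -1;

    public void Free(long pageNumber)
    {
        var section = pageNumber / PagesPerSection;
        SortedSet<long> pages;
        if (_freePagesBySection.TryGetValue(section, out pages) == false)
            _freePagesBySection[section] = pages = new SortedSet<long>();
        pages.Add(pageNumber);
    }

    // Returns a reusable page number, preferring the section we are already writing to,
    // or null if the caller should just allocate at the end of the file.
    public long? TryAllocate()
    {
        SortedSet<long> pages;
        if (_currentSection != -1 &&
            _freePagesBySection.TryGetValue(_currentSection, out pages) &&
            pages.Count > 0)
            return Take(pages);

        // current section exhausted - switch to the section with the most free space
        long best = -1;
        var bestCount = 0;
        foreach (var kvp in _freePagesBySection)
        {
            if (kvp.Value.Count > bestCount)
            {
                best = kvp.Key;
                bestCount = kvp.Value.Count;
            }
        }
        if (best == -1)
            return null;
        _currentSection = best;
        return Take(_freePagesBySection[best]);
    }

    static long Take(SortedSet<long> pages)
    {
        var page = pages.Min;
        pages.Remove(page);
        return page;
    }
}

Because allocations keep coming from the same 4 MB window until it is exhausted, consecutive writes within a transaction tend to land close together, which is exactly what the seek counts below show.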

Here are the original results:

Flush     1 with  12 pages   - 48 kb writes and 1  seeks   (11 leaves, 1 branches, 0 overflows)
Flush     2 with  13 pages   - 52 kb writes and 1  seeks   (12 leaves, 1 branches, 0 overflows)
Flush     3 with  21 pages   - 84 kb writes and 1  seeks   (20 leaves, 1 branches, 0 overflows)
 
Flush    27 with  76 pages   - 304 kb writes and 1 seeks  (75 leaves,  1 branches, 0 overflows)
Flush    28 with  73 pages   - 292 kb writes and 1 seeks  (72 leaves,  1 branches, 0 overflows)
Flush    29 with  84 pages   - 336 kb writes and 1 seeks  (80 leaves,  4 branches, 0 overflows)
 
Flush 1,153 with 158 pages - 632 kb writes and 67  seeks (107 leaves, 51 branches, 0 overflows)
Flush 1,154 with 168 pages - 672 kb writes and 65  seeks (113 leaves, 55 branches, 0 overflows)
Flush 1,155 with 165 pages - 660 kb writes and 76  seeks (113 leaves, 52 branches, 0 overflows)
 
Flush 4,441 with 199 pages - 796 kb writes and 146 seeks (111 leaves, 88 branches, 0 overflows)
Flush 4,442 with 198 pages - 792 kb writes and 133 seeks (113 leaves, 85 branches, 0 overflows)
Flush 4,443 with 196 pages - 784 kb writes and 146 seeks (109 leaves, 87 branches, 0 overflows)
 
Flush 7,707 with 209 pages - 836 kb writes and 170 seeks (111 leaves, 98 branches, 0 overflows)
Flush 7,708 with 217 pages - 868 kb writes and 169 seeks (119 leaves, 98 branches, 0 overflows)
Flush 7,709 with 197 pages - 788 kb writes and 162 seeks (108 leaves, 89 branches, 0 overflows)
 
Flush 9,069 with 204 pages - 816 kb writes and 170 seeks (108 leaves, 96 branches, 0 overflows)
Flush 9,070 with 206 pages - 824 kb writes and 166 seeks (112 leaves, 94 branches, 0 overflows)
Flush 9,071 with 203 pages - 812 kb writes and 169 seeks (105 leaves, 98 branches, 0 overflows)

And here are the improved results:

Flush      1 with   2 pages -     8 kb writes and   1 seeks (  2 leaves,   0 branches,   0 overflows)
Flush      2 with   8 pages -    32 kb writes and   1 seeks (  7 leaves,   1 branches,   0 overflows)
Flush      3 with  10 pages -    40 kb writes and   1 seeks (  9 leaves,   1 branches,   0 overflows)
  
Flush     27 with  73 pages -   292 kb writes and   1 seeks ( 72 leaves,   1 branches,   0 overflows)
Flush     28 with  72 pages -   288 kb writes and   1 seeks ( 71 leaves,   1 branches,   0 overflows)
Flush     29 with  71 pages -   284 kb writes and   1 seeks ( 70 leaves,   1 branches,   0 overflows)
  
Flush  1,153 with 157 pages -   628 kb writes and  11 seeks (105 leaves,  52 branches,   0 overflows)
Flush  1,154 with 159 pages -   636 kb writes and   2 seeks (107 leaves,  52 branches,   0 overflows)
Flush  1,155 with 167 pages -   668 kb writes and  17 seeks (111 leaves,  56 branches,   0 overflows)
  
Flush  4,441 with 210 pages -   840 kb writes and  11 seeks (121 leaves,  86 branches,   3 overflows)
Flush  4,442 with 215 pages -   860 kb writes and   1 seeks (124 leaves,  88 branches,   3 overflows)
Flush  4,443 with 217 pages -   868 kb writes and   9 seeks (126 leaves,  89 branches,   2 overflows)
  
Flush  7,707 with 231 pages -   924 kb writes and   7 seeks (136 leaves,  93 branches,   2 overflows)
Flush  7,708 with 234 pages -   936 kb writes and   9 seeks (136 leaves,  97 branches,   1 overflows)
Flush  7,709 with 241 pages -   964 kb writes and  13 seeks (140 leaves,  97 branches,   4 overflows)

Flush  9,069 with 250 pages - 1,000 kb writes and   6 seeks (144 leaves, 101 branches,   5 overflows)
Flush  9,070 with 250 pages - 1,000 kb writes and  13 seeks (145 leaves,  98 branches,   7 overflows)
Flush  9,071 with 248 pages -   992 kb writes and  12 seeks (143 leaves,  99 branches,   6 overflows)

Let us plot this in a chart, so we can get a better look at things:

[image: chart of seeks per transaction, before and after]

As you can see, this is a pretty major improvement. But it came at a cost; let us look at the write size per transaction…

[image: chart of write size per transaction, before and after]

So we improved on the seeks / tx, but got worse on the size / tx. That is probably because of the overhead of keeping the state around, but it also relates to some tunable configuration that we added (the amount of free space in a section that makes it eligible for use).

Annoyingly, after spending quite a bit of time & effort on this, we don’t see a major perf boost here. But I am confident that it’ll come.

Seek, and you shall find, or maybe delay a lot

I have been doing a lot more analysis of the actual disk access patterns that we saw in Voron. In the beginning, I strongly suspected that the problem was with how I was using memory mapped files. In particular, some experiments using my better knowledge of the environment have led to substantial performance gains. Instead of calling FlushViewOfFile(0, fileLen), I was able to call FlushViewOfFile on just the ranges that I knew had changed.

That helped, but it wasn’t nearly enough. So I ran some quick tests using a file stream, and concluded that the fault was obviously with the broken way Windows handles memory mapped files. So I decided (no, I didn’t write my own mmap impl, thank you very much) to take manual care of how Voron writes to disk. I used WriteFile with all the bells and whistles, even had async I/O for a while there. I was able to directly pinpoint the exact locations where we needed to write, pass that to Windows in an efficient manner, and be done with it. It required me to write a malloc implementation in managed code, but it was quite efficient-looking code.

And then I ran it. Just to give you some perspective, the scenario in question here is 10,000 transactions doing 100 random writes each. Using the memory map approach, after the FlushViewOfFile range optimization, I got roughly 7,000 operations / second. Using my own hand-written, optimized I/O, I got… 262 operations / second. Okaaaay… so maybe I need to go back to the drawing board.

I sat down and started figuring out what is actually going on. I looked at the actual output that we had, in particular at how many writes we had per transaction. We are writing 100 records, each with a 16 byte key and a 100 byte value. That means that the absolute minimum amount we can write would be 11,600 bytes. However, we are writing in 4Kb pages, which brings us to 3 pages and 12,288 bytes per transaction. Of course, this ignores things like the writing of branch pages in the B+Tree, so let us see what the real numbers are.

   1: Flush     1 with  12 pages   - 48 kb writes and 1  seeks   (11 leaves, 1 branches, 0 overflows)
   2: Flush     2 with  13 pages   - 52 kb writes and 1  seeks   (12 leaves, 1 branches, 0 overflows)
   3: Flush     3 with  21 pages   - 84 kb writes and 1  seeks   (20 leaves, 1 branches, 0 overflows)
   4:  
   5: Flush    27 with  76 pages   - 304 kb writes and 1 seeks  (75 leaves,  1 branches, 0 overflows)
   6: Flush    28 with  73 pages   - 292 kb writes and 1 seeks  (72 leaves,  1 branches, 0 overflows)
   7: Flush    29 with  84 pages   - 336 kb writes and 1 seeks  (80 leaves,  4 branches, 0 overflows)
   8:  
   9: Flush 1,153 with 158 pages - 632 kb writes and 67  seeks (107 leaves, 51 branches, 0 overflows)
  10: Flush 1,154 with 168 pages - 672 kb writes and 65  seeks (113 leaves, 55 branches, 0 overflows)
  11: Flush 1,155 with 165 pages - 660 kb writes and 76  seeks (113 leaves, 52 branches, 0 overflows)
  12:  
  13: Flush 4,441 with 199 pages - 796 kb writes and 146 seeks (111 leaves, 88 branches, 0 overflows)
  14: Flush 4,442 with 198 pages - 792 kb writes and 133 seeks (113 leaves, 85 branches, 0 overflows)
  15: Flush 4,443 with 196 pages - 784 kb writes and 146 seeks (109 leaves, 87 branches, 0 overflows)
  16:  
  17: Flush 7,707 with 209 pages - 836 kb writes and 170 seeks (111 leaves, 98 branches, 0 overflows)
  18: Flush 7,708 with 217 pages - 868 kb writes and 169 seeks (119 leaves, 98 branches, 0 overflows)
  19: Flush 7,709 with 197 pages - 788 kb writes and 162 seeks (108 leaves, 89 branches, 0 overflows)
  20:  
  21: Flush 9,069 with 204 pages - 816 kb writes and 170 seeks (108 leaves, 96 branches, 0 overflows)
  22: Flush 9,070 with 206 pages - 824 kb writes and 166 seeks (112 leaves, 94 branches, 0 overflows)
  23: Flush 9,071 with 203 pages - 812 kb writes and 169 seeks (105 leaves, 98 branches, 0 overflows)

The very first transactions are already showing something very interesting: we are actually writing 12–21 pages, or 48–84 Kb of data, instead of 12 Kb. Why do we write 4 times as much data as we wanted?

The answer is that we are writing data that is random in nature, so it can’t all sit in the same page; we get a lot of page splits and very quickly end up with a lot of pages. This is pretty much inevitable, since this is how trees work. But look at what happens down the road. In particular, look at lines 9–10. You can see that we are now at a pretty stable state. We are writing ~160 pages per transaction. And since we write random data, we tend to touch about 100 leaf pages per transaction (the stuff beyond 100 is usually page splits). But something that is much more interesting can be seen in the count of seeks.

The way LMDB works, we use copy-on-write, so whenever we modify a page, we actually modify a copy and mark the original page as available for future transactions to re-use. This has the great advantage that we don’t ever need to do compaction, but it also means that when we do want to do writes, we have to make them pretty much all over the place. And it actually gets worse as time goes by.

Now, you have to realize that this is pretty much the worst case scenario, a transaction that does a lot of writes all over the place. But it means that toward the end, we are really hurting.

You already know the numbers, right? What about just straight out writing 1MB? What is the cost of seeks? I wrote the following code:

var buffer = new byte[1024*1024];
var random = new Random();
random.NextBytes(buffer);
if(File.Exists("test.bin"))
    File.Delete("test.bin");
using (var fs = new FileStream("test.bin", FileMode.CreateNew, FileAccess.ReadWrite))
{
    fs.SetLength(1024 * 1024 * 768);
    // warm up
    for (int i = 0; i < 200; i++)
    {
        fs.Position = random.Next(0, (int)fs.Length);
        fs.Write(buffer, 0, random.Next(0, 1024));
    }

    var sp = Stopwatch.StartNew();
    for (int i = 0; i < 200; i++)
    {
        fs.Position = random.Next(0, (int)fs.Length);
        fs.WriteByte(1);
    }
    fs.Flush(true);
    sp.Stop();
    Console.WriteLine("200 seeks & 200 bytes {0:#,#} ms", sp.ElapsedMilliseconds);

    sp = Stopwatch.StartNew();
    fs.Position = random.Next(0, (int)fs.Length);
    fs.Write(buffer, 0, buffer.Length);
    fs.Flush(true);
    sp.Stop();
    Console.WriteLine("1 MB write {0:#,#} ms", sp.ElapsedMilliseconds);
}

The results are quite interesting:

200 seeks & 200 bytes 146 ms
1 MB write 6 ms

Just to note, this is when running on an SSD; the numbers are supposed to be a lot worse when running on an HDD.

In other words, it ain’t the size of the write, but how you spread it around that really matters. Just to compare, here are the numbers for when we are doing sequential writes:

Flush      1 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
Flush      2 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
Flush      3 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)

Flush    159 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
Flush    160 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
Flush    161 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
Flush    162 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)

Flush  1,320 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
Flush  1,321 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
Flush  1,322 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)

Flush  4,316 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
Flush  4,317 with   7 pages -  28 kb writes and   2 seeks (  5 leaves,   2 branches,   0 overflows)
Flush  4,318 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)

Flush  7,409 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)
Flush  7,410 with   9 pages -  36 kb writes and   2 seeks (  6 leaves,   3 branches,   0 overflows)
Flush  7,411 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)

Flush  9,990 with   8 pages -  32 kb writes and   3 seeks (  5 leaves,   3 branches,   0 overflows)
Flush  9,991 with   9 pages -  36 kb writes and   4 seeks (  6 leaves,   3 branches,   0 overflows)
Flush  9,992 with   8 pages -  32 kb writes and   3 seeks (  5 leaves,   3 branches,   0 overflows)
Flush  9,993 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)
Because we are always writing at the end, we only need to touch very few pages. Note that even with the small number of pages, we still need to do quite a few seeks, relative to the number of pages we write.

I have some ideas about this, but they are still unformed. Mostly, we need to balance the amount of free space that is reused against the number of seeks allowed. I think we can get there by being smart about tracking the number of pages modified by a transaction, and waiting until free space becomes available in sufficient amounts to be relevant. Something like that would allow auto-tuning of the amount of garbage we accumulate vs. the number of seeks we require.
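To make that unformed idea a little more concrete, the kind of heuristic I have in mind looks roughly like this (hypothetical code, nothing like this exists in Voron yet):

public class DeferredReuseHeuristic
{
    // rough moving average of how many pages a transaction modifies
    int _avgPagesPerTx = 16;

    public void RecordTransaction(int pagesModified)
    {
        // cheap exponential moving average, weighted toward history
        _avgPagesPerTx = (_avgPagesPerTx * 7 + pagesModified) / 8;
    }

    // Only start reusing free space once a section has accumulated enough pages
    // to absorb a few whole transactions; until then, keep appending at the end
    // of the file and accept the wasted space.
    public bool ShouldReuse(int freePagesInSection)
    {
        return freePagesInSection >= _avgPagesPerTx * 4;
    }
}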


Degenerate performance scenario for LMDB

I was working peacefully through some stuff with Voron when I ran into some really bad performance in a pretty important scenario. I decided that this is probably something that I am doing wrong, and set out to reproduce the same scenario in LMDB, to figure out what I am doing wrong.

Here is the code:

int main(int argc,char * argv[])
{
    int i = 0, j = 0, rc;
    UUID id;
    MDB_env *env;

    MDB_val key, data;

    MDB_stat mst;
    MDB_cursor *cursor;
    char sval[32];
    clock_t start = clock(), end;

    srandom(time(NULL));

    rc = mdb_env_create(&env);
    rc = mdb_env_set_mapsize(env, 1024*1024*768);
    rc = mdb_env_set_flags(env, MDB_WRITEMAP, 1);
    rc = mdb_env_open(env, "E:\\data\\lmdb2", 0, 0664);

    key.mv_size = sizeof(UUID);
    data.mv_size = sizeof(sval);
    data.mv_data = sval;

    for (i = 0; i < 100; i++)
    {
        MDB_txn *txn;
        MDB_dbi dbi;
        rc = mdb_txn_begin(env, NULL, 0, &txn);
        rc = mdb_open(txn, NULL, 0, &dbi);
        for (j = 0; j < 100; j++)
        {
            UuidCreate(&id);
            key.mv_data = &id;
            rc = mdb_put(txn, dbi, &key, &data, 0);
        }
        rc = mdb_txn_commit(txn);
        mdb_close(env, dbi);
    }

    end = clock();
    printf("%i", (end - start));
    fgetc(stdin);

    mdb_env_close(env);
    return 0;
}

As you can see, we are inserting 10,000 items (100 transactions of 100 items each). Each item has a key of 16 bytes and a value of 100 bytes. Now, you might note that this is probably the worst case scenario for a B+Tree: UUIDs are unordered, and this generates a lot of fragmentation in the tree. A bad scenario, yes, but also a relatively common one, and one that needs to be handled properly for our needs. It took me a long while to narrow down what is actually going on. At first I was convinced that the problem was with Windows’ implementation of memory mapped files, but eventually I realized that select ain’t broken.

Here is the Process Monitor trace of the first few transactions. The highlighted calls are to FlushBuffersFile, which is how FlushFileBuffers shows up, and which indicates a commit.

[image: Process Monitor trace of the first few transactions]

As I said, those are the first few transactions. You can see that the OS is doing a pretty good job of merging writes and avoiding seeks. However, as time goes by…

[image: Process Monitor trace, later transactions]

And as more time goes by, we get to… (there are more rows at the top that I had to remove so I could take the snapshot).

[image: Process Monitor trace, much later transactions]

In fact, I put the data in Excel and got:

 

[image: Excel chart of writes and seeks per commit]

And those are some really bad numbers, I think you’ll agree. After a very short period of time, the cost of committing a transaction goes up to over 50 seeks and 350KB of writes.

Let us compare that to what happens when we are actually writing sequential data (using UuidCreateSequential instead of UuidCreate). The test completes in half the time, and the actual statistics are also very interesting.

[image: Excel chart for the sequential write scenario]

 

We are writing a lot less, but much more importantly, we are doing a lot fewer seeks. For reference, here is the final transaction that we have in the sequential write scenario:

[image: Process Monitor trace of the final sequential-write transaction]

I ran those tests on an HDD, so seek times matter a lot, but I have gotten similar results when running on an SSD.

Now, the interesting question here is what exactly is going on? And I think that a large part of this is the way LMDB allocates pages. It uses a copy-on-write method, which means that after the transaction is done, the previously used pages are free. That means that they can be reclaimed, and the next transaction will do just that. The problem with this approach is that it tends to spread writes all over the file. This saves disk space, but requires a lot more seeks when you need to commit a transaction.

That means that in order to optimize this particular scenario, I probably need to do some thinking about the free page allocation behavior. We want to be a lot more conservative about when / how we give out those pages, to make sure that we aren’t generating a really bad situation for the db when commit time comes.


With malice aforethought, we can try even better

Continuing with the same theme from the last post: how can we improve the speed at which we write to disk? In particular, I am currently focused on my worst case scenario:

fill rnd buff 10,000 tx            :    161,812 ms      6,180 ops / sec

This is 10,000 transactions all running one after another, and taking way too long to do their thing. Now, we made some improvements and got it all the way to 6,340 ops / sec, but I think you’ll agree that even with this optimization it is probably still bad. We spent more time there, trying to figure out exactly what micro optimizations we could do, and we got all the way up to 8,078 ops / sec.

That is the point where I decided that I would really like to look at the raw numbers that I can get from this system. So I wrote the following code:

   1: var key = Guid.NewGuid().ToByteArray();
   2: var buffer = new byte[100];
   3: new Random().NextBytes(buffer);
   4:  
   5: using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
   6: {
   7:     fs.SetLength(1024*1024*768);
   8:  
   9:     var sp = Stopwatch.StartNew();
  10:  
  11:     for (int i = 0; i < 10*1000; i++)
  12:     {
  13:         for (int j = 0; j < 100; j++)
  14:         {
  15:             fs.Write(key,0, 16);
  16:             fs.Write(buffer, 0, 100);
  17:         }
  18:         fs.Flush(true);
  19:     }
  20:     Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000*1000)/sp.Elapsed.TotalSeconds);
  21: }

This code mimics the absolute best scenario we could hope for: zero cost for managing the data, pure sequential writes. Note that we call Flush(true) to simulate 10,000 transactions. This code gives me 147,201 ops / sec.

This is interesting, mostly because I thought the reason our random writes with 10K transactions were so bad was the calls to Flush(), but it appears that this is actually working very well. I then tested this with some random writes, by adding the following lines before line 13:

var next = random.Next(0, 1024*1024*512);
fs.Position = next - next%4096;

I then decided to try it with memory mapped files, and I wrote:

   1: using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
   2: {
   3:     fs.SetLength(1024 * 1024 * 768);
   4:  
   5:     var memoryMappedFile = MemoryMappedFile.CreateFromFile(fs,
   6:                                     "test", fs.Length, MemoryMappedFileAccess.ReadWrite,
   7:                                     null, HandleInheritability.None, true);
   8:     var memoryMappedViewAccessor = memoryMappedFile.CreateViewAccessor();
   9:  
  10:     byte* p = null;
  11:     memoryMappedViewAccessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
  12:  
  13:     var sp = Stopwatch.StartNew();
  14:  
  15:     for (int i = 0; i < 10 * 1000; i++)
  16:     {
  17:         var next = random.Next(0, 1024 * 1024 * 512);
  18:         byte* basePtr = p + next;
  19:         using (var ums = new UnmanagedMemoryStream(basePtr, 12 * 1024,12*1024, FileAccess.ReadWrite))
  20:         {
  21:             for (int j = 0; j < 100; j++)
  22:             {
  23:                 ums.Write(key, 0, 16);
  24:                 ums.Write(buffer, 0, 100);
  25:             }
  26:         }
  27:     }
  28:     Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000 * 1000) / sp.Elapsed.TotalSeconds);
  29: }

You’ll note that I am not doing any flushing here. That is intentional for now. Using this, I am getting 5+ million ops per second. But since I am not doing any flushing, this is pretty much me testing how fast I can write to memory.

Adding a single flush costs us 1.8 seconds for a 768MB file. And what about doing the right thing? Adding the following at line 26 means that we are actually flushing the buffers:

FlushViewOfFile(basePtr, new IntPtr(12 * 1024));

Note that this does not flush to disk; we still need to do that. But for now, let us try just this. This single line changed the code from 5+ million ops / sec to 170,988 ops / sec. And that does NOT include actual flushing to disk. When we do that, too, we get a truly ridiculous number: 20,547 ops / sec. And that explains quite a lot, I think.

For reference, here is the full code:

unsafe class Program
{
    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    extern static bool FlushViewOfFile(byte* lpBaseAddress, IntPtr dwNumberOfBytesToFlush);

    static void Main(string[] args)
    {
        var key = Guid.NewGuid().ToByteArray();
        var buffer = new byte[100];
        var random = new Random();
        random.NextBytes(buffer);

        using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
        {
            fs.SetLength(1024 * 1024 * 768);

            var memoryMappedFile = MemoryMappedFile.CreateFromFile(fs,
                                            "test", fs.Length, MemoryMappedFileAccess.ReadWrite,
                                            null, HandleInheritability.None, true);
            var memoryMappedViewAccessor = memoryMappedFile.CreateViewAccessor();

            byte* p = null;
            memoryMappedViewAccessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);

            var sp = Stopwatch.StartNew();

            for (int i = 0; i < 10 * 1000; i++)
            {
                var next = random.Next(0, 1024 * 1024 * 512);
                byte* basePtr = p + next;
                using (var ums = new UnmanagedMemoryStream(basePtr, 12 * 1024, 12 * 1024, FileAccess.ReadWrite))
                {
                    for (int j = 0; j < 100; j++)
                    {
                        ums.Write(key, 0, 16);
                        ums.Write(buffer, 0, 100);
                    }
                }
                FlushViewOfFile(basePtr, new IntPtr(12 * 1024));
                fs.Flush(true);
            }
            Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000 * 1000) / sp.Elapsed.TotalSeconds);
        }
    }
}

This is about as efficient as you can get for writing to the disk with memory mapped files in a transactional manner. And that is pretty much the absolute best case scenario, where we know exactly what we wrote and where we wrote it, and we always write a single entry, of a fixed size, etc. In Voron’s case, we might write to multiple pages in the same transaction (in fact, we are pretty much guaranteed to do just that).

This means that I need to think about other ways of doing that.

With a little knowledge and a profiler, let us optimize!

As part of the work we have been doing on Voron, I wrote a few benchmarks and looked at where the hot spots are. One of the major ones was this function:

public override void Flush()
{
    if (_flushMode == FlushMode.None)
        return;

    PagerState.Accessor.Flush();
    _fileStream.Flush(_flushMode == FlushMode.Full);
}

This is, effectively, the “fsync()” call. The Accessor.Flush() call will resolve to FlushViewOfFile(0, size), and _fileStream.Flush(true) will resolve to FlushFileBuffers on Windows.

It isn’t surprising that this would be THE hotspot; it is the part where we actually have to wait for the hardware to do stuff, after all. But further investigation revealed that it wasn’t the FlushFileBuffers that was really costly, it was the FlushViewOfFile. What FlushViewOfFile will do is scan all of the pages in the range, and flush them to the OS (not to disk) if they are dirty. That is great, but it is effectively an O(N) operation. We have more knowledge about what is going on, so we can do better. We already know which pages are dirty, so we can actually use that, instead of letting the OS do all the work.

But then we run into another problem. If we actually just call FlushViewOfFile for every page separately, we are going to have to spend a lot of time just calling into the OS when we have to do a large write. So we need to balance the amount of data we send to FlushViewOfFile with the number of times we are going to call FlushViewOfFile. Therefore, I came up with the following logic: we are going to group calls to FlushViewOfFile, as long as the pages are nearby (within 256KB of one another). This gives us the best balance between reducing the number of pages that FlushViewOfFile needs to scan and the number of times we call FlushViewOfFile.

This now looks like this:

public override void Flush(List<long> sortedPagesToFlush)
{
    if (_flushMode == FlushMode.None || sortedPagesToFlush.Count == 0)
        return;

    // here we try to optimize the amount of work we do, we will only 
    // flush the actual dirty pages, and we will do so in sequential order
    // ideally, this will save the OS the trouble of actually having to flush the 
    // entire range
    long start = sortedPagesToFlush[0];
    long count = 1;
    for (int i = 1; i < sortedPagesToFlush.Count; i++)
    {
        var difference = sortedPagesToFlush[i] - sortedPagesToFlush[i - 1];
        // if the difference between them is not _too_ big, we will just merge it into a single call
        // we are trying to minimize both the size of the range that we flush AND the number of times
        // we call flush, so we need to balance those needs.
        if (difference < 64)
        {
            count += difference;
            continue;
        }
        FlushPages(start, count);
        start = sortedPagesToFlush[i];
        count = 1;
    }
    FlushPages(start, count);

    if (_flushMode == FlushMode.Full)
        _fileStream.Flush(true);
}
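
FlushPages itself isn’t shown here; presumably it turns a page range into a byte range inside the mapped view and hands it to FlushViewOfFile. As a rough sketch of what such a helper could look like (the _baseAddress field and the PageSize constant are assumptions for illustration, not the actual Voron code):

[DllImport("kernel32.dll", SetLastError = true)]
static extern bool FlushViewOfFile(IntPtr lpBaseAddress, IntPtr dwNumberOfBytesToFlush);

// Assumed: _baseAddress is the start of the mapped view, PageSize is the page size in bytes.
private void FlushPages(long startPage, long count)
{
    var address = new IntPtr(_baseAddress.ToInt64() + (startPage * PageSize));
    var bytesToFlush = new IntPtr(count * PageSize);
    if (FlushViewOfFile(address, bytesToFlush) == false)
        throw new Win32Exception(Marshal.GetLastWin32Error());
}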

A side effect of this is that we are more likely to be writing to the disk in a sequential fashion.

The end result of this change ranged from doubling the performance of the system under the worst case scenario, to “just” 25% faster under the best conditions.


Memory Mapped Files, File I/O & Performance

I have been testing out several approaches for writing out to files, and I thought that the results are interesting enough to share. In all cases, I was writing a 128KB buffer of random data to a file with a size of 256MB.

The first thing that I wanted to try was the trivial managed memory map approach:

// the 128KB buffer of random data mentioned above
var buffer = new byte[128 * 1024];
new Random().NextBytes(buffer);

using (var mmf = MemoryMappedFile.CreateFromFile("test.bin", FileMode.Create, "test", 1024*1024*256))
{
    using (var accessor = mmf.CreateViewAccessor())
    {
        for (int i = 0; i < accessor.Capacity; i += buffer.Length)
        {
            accessor.WriteArray(i, buffer, 0, buffer.Length);
        }
        accessor.Flush();
    }
}

This completed in 3.871 seconds.

Next, I wanted to see what would happen if I were using direct memory access, and used CopyMemory to do that:

[DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
static extern void CopyMemory(byte* dst, byte* src, long size);

using (var mmf = MemoryMappedFile.CreateFromFile("test.bin", FileMode.Create, "test", 1024*1024*256))
{
    using (var accessor = mmf.CreateViewAccessor())
    {
        byte* p = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
        fixed (byte* src = buffer)
        {
            for (int i = 0; i < accessor.Capacity; i += buffer.Length)
            {
                CopyMemory(p + i, src, buffer.Length);
            }
        }
        accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        accessor.Flush();
    }
}

As you can see, this is somewhat more complex, and requires unsafe code. But this completed in 2.062 seconds. Nearly twice as fast.

Then I decided to try with raw file IO:

using (var f = new FileStream("test.bin",FileMode.Create))
{
    f.SetLength(1024*1024*256);
    for (int i = 0; i < f.Length; i += buffer.Length)
    {
        f.Write(buffer, 0, buffer.Length);
    }
    f.Flush(true);
}

This is about the most trivial code that you can think of, and it completed in about 1.956 seconds. Slightly faster, but within the margin of error (note: in repeated tests, the two were consistently very close to one another).

So, in other words, the accessor code adds a lot of overhead when using Memory Mapped Files.

Raven Storage, early perf numbers

So, our managed implementation of leveldb is just about ready to go out and socialize. Note that these numbers are from relatively short runs, but they give us a good indicator of where we are. We are still running longer term perf tests now. Also note that these are early numbers; we did performance work, but it is still not done.

The following tests were done on an HDD, and all include writing a million records (16 byte keys, 100 byte values) to storage.

  • Writing 1 million sequential keys - 52,152 op/s
  • Writing 1 million random keys -  11,986 op/s
  • Writing 1 million sequential keys with fsync - 17,225 op/s

And now, for the reads portion:

  • Sequential reads - 104,620 op/s
  • Reverse sequential reads - 57,932 op/s
  • Random reads - 3,191 op/s

Note that I am pretty sure that the reason for the poor random read performance is that this is running on an HDD, rather than an SSD.

Some thoughts about compression & storage

One of the advantages that keeps showing up with leveldb is the notion that it compresses the data on disk by default. Since reading data from disk is way more expensive than the CPU cost of compression & decompression, that is a net benefit.

Or is it? In the managed implementation we are currently working on, we chose to avoid this for now, for a very simple reason. Storing the compressed data on disk means that you cannot just give a user the memory mapped buffer and be done with it; you actually have to decompress the data yourself, then hand the user the buffer to the decompressed memory. In other words, instead of having a single read only buffer that the OS will manage / page / optimize for you, you are going to have to allocate memory over & over again, and you’ll pay the cost of decompressing again and again.

I think that it would be better to have the client make that decision. They can send us data that is already compressed, so we won’t need to do anything else, and we would still be able to just hand them a buffer of data. Sure, it sounds like we are just moving the cost around, doesn’t it? But what actually happens is that you have a better chance to do optimizations. For example, if I am storing the data compressed via gzip, and I’m exposing the data over the wire, I can just stream the results from the storage directly to the HTTP stream, without having to do anything with it. It can be decompressed on the client.

On the other hand, if I have storage level compression, I am paying for the cost of reading the compressed data from disk, then allocating a new buffer, decompressing the data, then going right ahead and compressing it again for sending over the wire.
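
To make the difference concrete, here is a minimal sketch of the client side decision approach: the value is stored already gzip compressed, so serving it over the wire is just a matter of copying bytes and setting a header. The storage call (GetRawStream) and the response object are made up for illustration, not a real API.

// Hypothetical storage call that returns the raw, already gzip compressed value.
Stream compressed = storage.GetRawStream(key);

// We never decompress on the server; we just declare the encoding
// and stream the stored bytes straight to the client.
response.Headers["Content-Encoding"] = "gzip";
compressed.CopyTo(response.OutputStream);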

What comes after leveldb?

Kellabyte rightfully points out that leveldb was designed primarily for mobile devices, and that there are many things that could have been done but weren’t because of that design goal. I had already been pointed to HyperDex’s port of leveldb, and I am also looking into what Basho is doing with it.

You can read a bit about what HyperDex did to improve things here, and Basho’s challenges are detailed in the following presentation. The interesting thing about all of this is why people want to start from leveldb in the first place.

To put it simply, it is a really good product, in the sense that it does what it needs to do, and it is small enough that you can understand it in reasonably short order. The fun starts when you disagree with the decisions leveldb made for you; it is actually quite easy to make changes to the core behavior of the project without making significant changes to the overall architecture. Looking over the HyperDex changes overview, it appears that they were able to make very targeted changes. All the hard problems (storing, searching, atomicity, etc.) were already solved; now the problem is optimizing them.

I’m going to review the changes made by both Basho & HyperDex to leveldb, because I think it would be a fascinating object of study. In particular, I was impressed with how leveldb merges transactions, and I am looking forward to seeing how HyperDex parallelized them.


The RavenDB indexing process: Optimization–Tuning? Why, we have auto tuning

The final aspect of RavenDB’s x7 jump in indexing performance is the fact that we made it freakishly smart.

During standard operation, most indexes only update when new information comes in; we are usually talking about a small number of documents for every indexing run. The problem is what happens when you have a sudden outpouring of documents into RavenDB? For example, during a nightly ETL batch, or just if you suddenly have a flood of users doing write operations.

The problem here is that we actually have to balance a lot of variables at the same time:

  • The number of documents that we have to index*.
  • The current memory utilization**.
  • How many cores do I have available to do the indexing work?
  • How much time do I have to do this?

Basically, the idea goes like this: if I have a small batch size, I am able to index more quickly, ensuring that we have fresher results. If I have a big batch size, I am able to index more documents, and my overall indexing time goes down.

There is a non trivial cost associated with every indexing run, so reducing the number of indexing runs is good, but the more documents I shove into a single run, the more memory I will use, and the more time it will take before the results are visible to the users.

* It is non trivial because there is no easy way for us to even know how many documents we have left to index (finding out is costly).

** Memory utilization is hard to figure out in a managed world. I don’t actually have a way to know how much memory I am using for indexing and how much for other stuff, and there is no real way to say “free the memory from the last indexing run”, or even estimate how much memory that took.

What we have decided on doing is to start from a very small (low hundreds) indexing batch size, and see what is actually going on live. If we see that we have more documents to index than the current batch size, we will slowly double the size of the batch. Slowly, because bigger batches require more memory, and we also have to take into account current utilization, memory usage, and a bunch of other factors as well. We can also go the other way, reducing the indexing batch size on demand based on how much work we actually have to do right now.

We also provide an upper limit, because at some point it makes more sense to just do a big batch and make the indexing results visible than to try to do everything all at once.
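
To give a rough idea of the shape of this, here is a minimal sketch of that kind of auto tuning logic. The class name, the numbers, and the memory pressure signal are all made up for illustration; this is not the actual RavenDB code.

public class IndexBatchSizeTuner
{
    private int _batchSize = 128;               // start small (low hundreds)
    private const int MinBatchSize = 128;
    private const int MaxBatchSize = 64 * 1024; // upper limit, at some point just run the big batch

    public int CurrentBatchSize
    {
        get { return _batchSize; }
    }

    // Called after every indexing run with how much work was actually waiting.
    public void RecordBatch(int documentsWaiting, bool memoryPressure)
    {
        if (memoryPressure)
        {
            // back off when memory is tight
            _batchSize = Math.Max(MinBatchSize, _batchSize / 2);
            return;
        }

        if (documentsWaiting > _batchSize)
        {
            // more work than a single batch can take, so slowly double the batch size
            _batchSize = Math.Min(MaxBatchSize, _batchSize * 2);
        }
        else
        {
            // we are keeping up, shrink back toward small, fresh batches
            _batchSize = Math.Max(MinBatchSize, _batchSize / 2);
        }
    }
}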

The fun part in all of that is that once we found the appropriate algorithm for this, RavenDB will automatically adjust itself based on real production load. If you have a low update rate, it will favor small indexing batches and immediately execute indexing on the new documents. However, if you suddenly have a spike in traffic and the update rate goes up, RavenDB will adjust the indexing batch size so it will be able to keep up with your rate.

We have done some (read: a huge amount of) testing with regards to this new optimization, and it turns out that under a slow update frequency, we are seeing an average of 15 – 25 ms between a document update and it showing up in the indexes. That is pretty good, but what happens when we have data just pouring in?

We tested this with 3 million documents and 3 indexes. And it turns out that under this scenario, where we are trying to shove data into RavenDB as fast as it can accept it, we do see an increase in index latency. Under those conditions, latency rose all the way to 1.5 seconds.

This is actually something that I am very happy about, because we were able to automatically adjust to the changing conditions, and were still able to index things at a reasonable rate (note that under this scenario, the batch size was usually 8 – 16 thousand documents, vs. the 128 – 256 that it is normally).

Because we were able to adjust the batch size on the fly, we could handle sustained writes at this rate with no interruption in service and no real need for the user to think about it. Exactly what the RavenDB philosophy calls for.

The RavenDB indexing process: Optimization–Getting documents from disk

As I noted in my previous post, we have done major optimizations for RavenDB. One of the areas where we improved the performance was reading the documents from the disk for indexing.

In pseudo code, it looks like this:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
  

As it turned out, we had a major optimization option here, because of the way the data is actually structured on disk. In simple terms, we have an on disk index that lists the documents in the order in which they were updated, and then we have the actual documents themselves, which may be anywhere on the disk.

Instead of loading the documents in the order in which they were modified, we decided to try something different. We first query the index for the information we need to find each document on disk, then we sort them based on the optimal access pattern, to reduce disk movement and get reads that are as sequential as possible. Then we take those results in memory and sort them back by their last update time.
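
In code, the idea looks roughly like this. The DocumentPointer type, the Document class, and the ReadDocumentAt helper are invented stand-ins for the example, not the actual RavenDB types.

public class DocumentPointer
{
    public Guid Etag;
    public long PositionOnDisk;
}

public List<Document> LoadForIndexing(List<DocumentPointer> pointers)
{
    // pointers arrive in last-update (etag) order from the on disk index
    var docs = new Document[pointers.Count];

    // read from disk in disk-position order, to get reads that are as sequential as possible...
    foreach (var item in pointers
        .Select((pointer, originalIndex) => new { pointer, originalIndex })
        .OrderBy(x => x.pointer.PositionOnDisk))
    {
        // ...but put each document back into its original (update order) slot
        docs[item.originalIndex] = ReadDocumentAt(item.pointer.PositionOnDisk);
    }

    return docs.ToList();
}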

This seems like a perfectly obvious thing to do, assuming that you are aware of such things, but it is actually something that is very easy not to notice. The end result is quite promising, and it contributed to the 7+ times improvement in perf that we had for indexing costs.

But surprisingly, it wasn’t the major factor; I’ll discuss a huge perf boost in this area tomorrow.

The RavenDB indexing process: Optimization–De-parallelizing work

One of the major dangers in doing perf work is that you have a scenario, and you optimize the hell out of that scenario. It is actually pretty easy to do without even noticing it. The problem is that when you do things like that, you are likely to be optimizing a single scenario to perform really well, but you are hurting the overall system performance.

In this example, we have moved heaven and earth to make sure that we are indexing things as fast as possible, and we tested with 3 indexes on a 4 core machine. As it turned out, we actually had improved things, for that particular scenario.

Using the same test case on a single core machine was suddenly far more heavyweight, because we were pushing a lot of work at the same time. More than the machine could process. The end result was that it actually got there, but much more slowly than if we had run things sequentially.

Of course, I am giving you the outliers, but those are good indicators of what we found. Initially, we thought that we could resolve this by using the TPL’s MaxDegreeOfParallelism, but it turned out to be more complex than that. We have IO bound and CPU bound tasks that we need to execute, and trying to run the IO heavy tasks under the same limit would actually cause issues in this scenario.

We had to manually throttle things ourselves, both to ensure a limited amount of parallel work, and because we have a lot more information about the actual tasks than the TPL has. We can schedule them in a way that is far more efficient, because we can tell what is actually going on.
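
As a very rough sketch of what “throttle things ourselves” can mean, here is one way to keep separate limits for CPU bound and IO bound work instead of a single MaxDegreeOfParallelism. The limits and the shape of the API are assumptions for illustration, not the actual RavenDB scheduler.

public class SimpleThrottler
{
    // separate limits, so IO heavy tasks don't compete with CPU heavy ones for the same slots
    private readonly SemaphoreSlim _cpuSlots = new SemaphoreSlim(Environment.ProcessorCount);
    private readonly SemaphoreSlim _ioSlots = new SemaphoreSlim(2);

    public Task Schedule(Action work, bool isIoBound)
    {
        var slots = isIoBound ? _ioSlots : _cpuSlots;
        return Task.Factory.StartNew(() =>
        {
            slots.Wait();   // limit how many tasks of this kind actually run at once
            try
            {
                work();
            }
            finally
            {
                slots.Release();
            }
        });
    }
}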

The end result is that we are actually using less parallelism, overall, but in a more efficient manner.

In my next post, I’ll discuss the auto batch tuning support, which allows us to do some really amazing things from the point of view of system performance.

The RavenDB indexing process: Optimization–Parallelizing work

One of the things that we are doing during the indexing process for RavenDB is applying triggers and deciding if and how a document will be indexed. The actual process is a bit more involved, because we have to do additional things (like figure out which indexes have already indexed those particular documents).

At any rate, the interesting thing is that this is a process which is pretty basic:

for doc in docs:
    matchingIndexes = FindIndexesFor(doc)
    if matchingIndexes.Count > 0:
       doc = ExecuteTriggers(doc) 
       if doc != null:
          yield doc

The interesting thing about this is that it is a set of operations that only works on a single document at a time, and the result is the set of modified documents.

We were able to gain a significant perf boost by simply moving to a Parallel.ForEach call. This seems simple enough, right? Parallelize the work, get better benefits.
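
In rough C# terms, the change is just switching the plain loop for a Parallel.ForEach over the same per document work. FindIndexesFor, ExecuteTriggers, and the Document type here are stand-ins for the actual RavenDB methods, and the concurrent queue is just one way to collect the results safely.

var results = new ConcurrentQueue<Document>();

Parallel.ForEach(docs, doc =>
{
    // each document is independent, so the per document work can run in parallel
    var matchingIndexes = FindIndexesFor(doc);
    if (matchingIndexes.Count == 0)
        return;

    var processed = ExecuteTriggers(doc);
    if (processed != null)
        results.Enqueue(processed);
});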

Except that there are issues with this as well, which I’ll touch on in my next post.

The RavenDB indexing process: Optimization

The actual process RavenDB uses to index documents is a fairly complex one. In order to understand exactly what happens, I decided to break it apart into pseudo code.

It looks something like this:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
  
  filtered_docs = execute_read_filters(docs_to_index)
  
  indexing_work = []
  
  for index in stale:
    
    index_docs = select_matching_docs(index, filtered_docs)
    
    if index_docs.empty:
      set_indexed(index, lastIndexedEtag)
    else
      indexing_work.add(index, index_docs)
      
  for work in indexing_work:
  
     work.index(work.index_docs)

And now let me show you the areas in which we did some perf work:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)   # auto tuned batch_size, optimized disk access
  
  filtered_docs = execute_read_filters(docs_to_index)                # parallelized per document
  
  indexing_work = []
  
  for index in stale:
    
    index_docs = select_matching_docs(index, filtered_docs)
    
    if index_docs.empty:
      set_indexed(index, lastIndexedEtag)
    else
      indexing_work.add(index, index_docs)
      
  for work in indexing_work:
  
     work.index(work.index_docs)                                     # throttled and scheduled by us, not the TPL

All of which gives us a major boost in the system performance. I’ll discuss each part of that work in detail, don’t worry ;)

Performance implications of method signatures

In my previous post, I asked: What are the performance implications of the two options?

image

Versus:

image

And the answer is quite simple: the chance to optimize how it works.

In the first example, we have to return an unknown amount of information. In the second example, we know how much data we need to return. That means that we can optimize ourselves based on that.
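
To make the comparison concrete, a plausible pair of signatures (illustrative only, not the actual API) would be something like:

// first option: the caller pulls an open ended stream of documents
IEnumerable<Document> GetDocumentsAfter(Guid etag);

// second option: the caller states up front how many documents it wants
IEnumerable<Document> GetDocumentsAfter(Guid etag, int take);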

What do I mean by that?

Look at the method signatures. Those require us to scan a secondary index, and get the results back. From there, we need to get back to the actual data. If we knew the size of the data that we need to return, we could fetch just the locations from the index, then optimize our disk access pattern to take advantage of sequential reads.

In the first example, we have to assume that every read is the last read. Callers may request one item, or 25 or 713, so we don’t really have a way to optimize things. The moment that we have the amount that the caller wants, things change.

We can scan the index to get just the actual positions of the documents on disk, and then load the documents from the disk using the optimal access pattern in terms of disk access. It is a very small change, but it allowed us to make a big optimization.