Seek, and you shall find, or maybe delay a lot


I have been doing a lot more analysis of the actual disk access patterns that we see in Voron. In the beginning, I strongly suspected that the problem was with how I was using memory mapped files. In particular, some experiments that made use of my better knowledge of the environment led to substantial performance gains. Instead of calling FlushViewOfFile(0, fileLen), I was able to call FlushViewOfFile on just the ranges that I knew had changed.
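
To make that concrete, here is a minimal sketch of what flushing only the dirty ranges might look like via P/Invoke. This is not Voron's actual code; FlushDirtyRanges, viewBase and dirtyRanges are made-up names, and the dirty-range bookkeeping is assumed to come from whatever the transaction tracks.

    using System;
    using System.Runtime.InteropServices;

    static class RangeFlusher
    {
        // Win32 API: flush a specific range of a mapped view instead of the whole file.
        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool FlushViewOfFile(IntPtr lpBaseAddress, UIntPtr dwNumberOfBytesToFlush);

        // viewBase is the base address of the mapped view; dirtyRanges is the (hypothetical)
        // bookkeeping the transaction keeps about the byte ranges it actually touched.
        public static void FlushDirtyRanges(IntPtr viewBase, (long Offset, long Length)[] dirtyRanges)
        {
            foreach (var range in dirtyRanges)
            {
                var start = new IntPtr(viewBase.ToInt64() + range.Offset);
                if (FlushViewOfFile(start, new UIntPtr((ulong)range.Length)) == false)
                    throw new System.ComponentModel.Win32Exception();
            }
        }
    }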

That helped, but it wasn’t nearly enough. So I ran some quick tests using a file stream, and realized that the fault was obviously with the broken way Windows is using memory mapped files. So I decided (no, I didn’t write my own mmap impl, thank you very much) to take manual control of how Voron is writing to disk. I used WriteFile with all the bells and whistles, and even had async I/O in there for a while. I was able to directly pinpoint the exact locations where we needed to write, pass that to Windows in an efficient manner, and be done with it. It required me to write a malloc implementation in managed code, but the result was quite efficient looking code.
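
Roughly, the approach looked something like the sketch below (a simplification, not the actual WriteFile / async I/O code): seek to each dirty page’s exact offset, write just that page, and push it through the OS cache. The WriteDirtyPages name and the dirtyPages structure are hypothetical.

    using System.IO;

    static void WriteDirtyPages(string path, (long Offset, byte[] Data)[] dirtyPages)
    {
        // Open for writing only, asking the OS to write through its cache.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Write, FileShare.None,
                                       bufferSize: 4096, FileOptions.WriteThrough))
        {
            foreach (var page in dirtyPages)
            {
                fs.Position = page.Offset;      // one seek per non-contiguous page...
                fs.Write(page.Data, 0, page.Data.Length);
            }
            fs.Flush(true);                     // ...and a final flush to the physical disk
        }
    }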

And then I ran it. Just to give you some perspective, the scenario in question here is 10,000 transactions doing 100 random writes each. Using the memory map approach, after the FlushViewOfFile range optimization, I got roughly 7,000 operations / second. Using my own hand written, optimized I/O, I got… 262 operations / second. Okaaaay… so maybe I need to go back to the drawing board.

I sat down and started working on figuring out what is actually going on. I looked at the actual output that we had, and in particular at how many writes we have per transaction. We are writing 100 records, each with a 16 byte key and a 100 byte value. That means that the absolute minimum amount we could write would be 11,600 bytes. However, we are writing in 4Kb pages, which brings us to 3 pages and 12,288 bytes per transaction. Of course, this ignores things like the writing of branch pages in the B+Tree.
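
Spelling out that back-of-the-envelope math (the constants simply mirror the scenario above; nothing here is Voron code):

    using System;

    const int Records = 100;
    const int KeySize = 16;          // bytes per key
    const int ValueSize = 100;       // bytes per value
    const int PageSize = 4 * 1024;   // we write in 4Kb pages

    int rawBytes = Records * (KeySize + ValueSize);     // 11,600 bytes of actual data
    int pages = (rawBytes + PageSize - 1) / PageSize;   // rounds up to 3 pages
    int pagedBytes = pages * PageSize;                  // 12,288 bytes actually written

    Console.WriteLine($"{rawBytes:N0} bytes -> {pages} pages -> {pagedBytes:N0} bytes");

With that baseline in mind, let us see what the real numbers are: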

    Flush     1 with  12 pages -  48 kb writes and   1 seeks ( 11 leaves,  1 branches, 0 overflows)
    Flush     2 with  13 pages -  52 kb writes and   1 seeks ( 12 leaves,  1 branches, 0 overflows)
    Flush     3 with  21 pages -  84 kb writes and   1 seeks ( 20 leaves,  1 branches, 0 overflows)

    Flush    27 with  76 pages - 304 kb writes and   1 seeks ( 75 leaves,  1 branches, 0 overflows)
    Flush    28 with  73 pages - 292 kb writes and   1 seeks ( 72 leaves,  1 branches, 0 overflows)
    Flush    29 with  84 pages - 336 kb writes and   1 seeks ( 80 leaves,  4 branches, 0 overflows)

    Flush 1,153 with 158 pages - 632 kb writes and  67 seeks (107 leaves, 51 branches, 0 overflows)
    Flush 1,154 with 168 pages - 672 kb writes and  65 seeks (113 leaves, 55 branches, 0 overflows)
    Flush 1,155 with 165 pages - 660 kb writes and  76 seeks (113 leaves, 52 branches, 0 overflows)

    Flush 4,441 with 199 pages - 796 kb writes and 146 seeks (111 leaves, 88 branches, 0 overflows)
    Flush 4,442 with 198 pages - 792 kb writes and 133 seeks (113 leaves, 85 branches, 0 overflows)
    Flush 4,443 with 196 pages - 784 kb writes and 146 seeks (109 leaves, 87 branches, 0 overflows)

    Flush 7,707 with 209 pages - 836 kb writes and 170 seeks (111 leaves, 98 branches, 0 overflows)
    Flush 7,708 with 217 pages - 868 kb writes and 169 seeks (119 leaves, 98 branches, 0 overflows)
    Flush 7,709 with 197 pages - 788 kb writes and 162 seeks (108 leaves, 89 branches, 0 overflows)

    Flush 9,069 with 204 pages - 816 kb writes and 170 seeks (108 leaves, 96 branches, 0 overflows)
    Flush 9,070 with 206 pages - 824 kb writes and 166 seeks (112 leaves, 94 branches, 0 overflows)
    Flush 9,071 with 203 pages - 812 kb writes and 169 seeks (105 leaves, 98 branches, 0 overflows)

The very first transactions already show something very interesting: we are actually writing 12 – 21 pages, or 48 – 84 Kb of data, instead of the expected 3 pages and 12 Kb. Why are we writing four to seven times as much data as we wanted?

The answer is that we are writing data that is random in nature, so it can’t all sit in the same page; we get a lot of page splits, and very quickly we end up with a lot of pages. This is pretty much inevitable, since this is how trees work. But look at what happens down the road, in particular at Flush 1,153 and 1,154 above. You can see that we are now in a pretty stable state. We are writing ~160 pages per transaction, and since we write random data, we tend to touch about 100 leaf pages per transaction (anything beyond that is usually page splits). But something much more interesting can be seen in the count of seeks.

The way LMDB works, we use copy-on-write, so whenever we modify a page, we actually modify a copy and mark the original page as available for future transactions to re-use. This has the great advantage that we never need to do compaction, but it also means that when we do want to write, we have to write pretty much all over the place. And it actually gets worse as more time goes by.
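
Here is a toy sketch of why that scatters the writes (this is not the LMDB or Voron allocator, just an illustration of the copy-on-write idea): every modified page is copied to whatever page number happens to be free, and the freed page numbers that accumulate over time are spread across the whole file.

    using System.Collections.Generic;

    class CopyOnWritePages
    {
        readonly SortedSet<long> _freePages = new SortedSet<long>();
        long _nextNewPage;

        // Returns the page number the modified copy will be written to.
        public long ModifyPage(long oldPageNumber)
        {
            long target;
            if (_freePages.Count > 0)
            {
                // Re-using a freed page keeps the file from growing, but those freed
                // pages come from many earlier transactions and sit all over the file,
                // so each write lands somewhere different - roughly one seek per page.
                target = _freePages.Min;
                _freePages.Remove(target);
            }
            else
            {
                target = _nextNewPage++;
            }

            // In the real thing the old page only becomes reusable once no open
            // transaction can still see it; this toy skips that detail.
            _freePages.Add(oldPageNumber);
            return target;
        }
    }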

Now, you have to realize that this is pretty much the worst case scenario: a transaction that does a lot of writes all over the place. But it means that toward the end, we are really hurting.

You already know the numbers, right? What about just writing 1MB straight out? What is the cost of the seeks? I wrote the following code:

    using System;
    using System.Diagnostics;
    using System.IO;

    var buffer = new byte[1024 * 1024];
    var random = new Random();
    random.NextBytes(buffer);

    if (File.Exists("test.bin"))
        File.Delete("test.bin");

    using (var fs = new FileStream("test.bin", FileMode.CreateNew, FileAccess.ReadWrite))
    {
        fs.SetLength(1024 * 1024 * 768);

        // warm up - touch random positions so the timings below aren't dominated by first-access costs
        for (int i = 0; i < 200; i++)
        {
            fs.Position = random.Next(0, (int)fs.Length);
            fs.Write(buffer, 0, random.Next(0, 1024));
        }

        // 200 random seeks, each writing a single byte
        var sp = Stopwatch.StartNew();
        for (int i = 0; i < 200; i++)
        {
            fs.Position = random.Next(0, (int)fs.Length);
            fs.WriteByte(1);
        }
        fs.Flush(true); // flush all the way to disk
        sp.Stop();
        Console.WriteLine("200 seeks & 200 bytes {0:#,#} ms", sp.ElapsedMilliseconds);

        // a single seek, then writing 1MB sequentially
        sp = Stopwatch.StartNew();
        fs.Position = random.Next(0, (int)fs.Length);
        fs.Write(buffer, 0, buffer.Length);
        fs.Flush(true);
        sp.Stop();
        Console.WriteLine("1 MB write {0:#,#} ms", sp.ElapsedMilliseconds);
    }

The results are quite interesting:

    200 seeks & 200 bytes 146 ms
    1 MB write 6 ms

Just to note, this was run on an SSD; the numbers are expected to be a lot worse when running on an HDD.

In other words, it ain’t the size of the write, but how you spread it around that really matters. Just to compare, here are the numbers when we are doing sequential writes:

    Flush      1 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
    Flush      2 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
    Flush      3 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)

    Flush    159 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
    Flush    160 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
    Flush    161 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
    Flush    162 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)

    Flush  1,320 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
    Flush  1,321 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
    Flush  1,322 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)

    Flush  4,316 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
    Flush  4,317 with   7 pages -  28 kb writes and   2 seeks (  5 leaves,   2 branches,   0 overflows)
    Flush  4,318 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)

    Flush  7,409 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)
    Flush  7,410 with   9 pages -  36 kb writes and   2 seeks (  6 leaves,   3 branches,   0 overflows)
    Flush  7,411 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)

    Flush  9,990 with   8 pages -  32 kb writes and   3 seeks (  5 leaves,   3 branches,   0 overflows)
    Flush  9,991 with   9 pages -  36 kb writes and   4 seeks (  6 leaves,   3 branches,   0 overflows)
    Flush  9,992 with   8 pages -  32 kb writes and   3 seeks (  5 leaves,   3 branches,   0 overflows)
    Flush  9,993 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)

Because we are always writing at the end, we only need to touch very few pages. Note that even with the small number of pages, we still need to do quite a few seeks relative to the number of pages we write.

I have some ideas about this, but they are still unformed. Mostly, we need to balance the amount of free space that we re-use against the number of seeks we allow. I think we can get there by being smart about tracking the number of pages modified by a transaction, and waiting until free space becomes available in sufficient amounts to be relevant. Something like that would allow auto tuning of the amount of garbage we accumulate vs. the number of seeks we require.
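
To make that a bit less abstract, here is one entirely hypothetical shape the idea could take (a sketch of my interpretation, not anything Voron actually does): only re-use freed pages when a contiguous run of them is large enough to absorb a whole transaction’s writes in one seek; otherwise keep appending at the end of the file and let the garbage sit until it becomes cheap to consume.

    using System.Collections.Generic;

    class FreeSpacePolicy
    {
        // Returns the first page of a contiguous run of free pages that can hold
        // 'pagesNeeded' pages, or null if no such run exists yet. The SortedSet of
        // free page numbers is assumed to be maintained elsewhere.
        public long? TryGetContiguousRun(SortedSet<long> freePages, int pagesNeeded)
        {
            long runStart = -1, runLength = 0;
            foreach (var page in freePages)
            {
                if (runLength > 0 && page == runStart + runLength)
                {
                    runLength++;
                }
                else
                {
                    runStart = page;    // start a new candidate run
                    runLength = 1;
                }

                if (runLength >= pagesNeeded)
                    return runStart;    // one seek can cover the whole batch
            }
            return null;                // not enough contiguous garbage yet - append at the end
        }
    }

Whether the threshold should be a whole transaction’s worth of pages or something larger is exactly the kind of auto-tuning question mentioned above.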