Ayende @ Rahien

Refunds available at head office

Optimizing writes in Voron

As I mentioned, one of the things that I have been working on with Voron is optimizing the sad case of random writes.  I discussed some of the issues that we had already, and now I want to explain how we approach resolving them.

With LMDB, free space occur on every write, because we don’t make modifications in place, instead, we make modifications to a copy, and free the existing page to be reclaimed later. The way the free space reclamation work, a new page can be allocated anywhere on the file. That can lead to a lot of seeks. With Voron, we used a more complex policy. The file is divided in 4 MB sections. And we will aggregate free space in each section. When we need more space, we will find a section with enough free space and use that, and we will continue to use that for as long as we can. The end result is that we tend to be much more local in the way we are reusing space.

Here are the original results:

Flush     1 with  12 pages   - 48 kb writes and 1  seeks   (11 leaves, 1 branches, 0 overflows)
Flush     2 with  13 pages   - 52 kb writes and 1  seeks   (12 leaves, 1 branches, 0 overflows)
Flush     3 with  21 pages   - 84 kb writes and 1  seeks   (20 leaves, 1 branches, 0 overflows)
 
Flush    27 with  76 pages   - 304 kb writes and 1 seeks  (75 leaves,  1 branches, 0 overflows)
Flush    28 with  73 pages   - 292 kb writes and 1 seeks  (72 leaves,  1 branches, 0 overflows)
Flush    29 with  84 pages   - 336 kb writes and 1 seeks  (80 leaves,  4 branches, 0 overflows)
 
Flush 1,153 with 158 pages - 632 kb writes and 67  seeks (107 leaves, 51 branches, 0 overflows)
Flush 1,154 with 168 pages - 672 kb writes and 65  seeks (113 leaves, 55 branches, 0 overflows)
Flush 1,155 with 165 pages - 660 kb writes and 76  seeks (113 leaves, 52 branches, 0 overflows)
 
Flush 4,441 with 199 pages - 796 kb writes and 146 seeks (111 leaves, 88 branches, 0 overflows)
Flush 4,442 with 198 pages - 792 kb writes and 133 seeks (113 leaves, 85 branches, 0 overflows)
Flush 4,443 with 196 pages - 784 kb writes and 146 seeks (109 leaves, 87 branches, 0 overflows)
 
Flush 7,707 with 209 pages - 836 kb writes and 170 seeks (111 leaves, 98 branches, 0 overflows)
Flush 7,708 with 217 pages - 868 kb writes and 169 seeks (119 leaves, 98 branches, 0 overflows)
Flush 7,709 with 197 pages - 788 kb writes and 162 seeks (108 leaves, 89 branches, 0 overflows)
 
Flush 9,069 with 204 pages - 816 kb writes and 170 seeks (108 leaves, 96 branches, 0 overflows)
Flush 9,070 with 206 pages - 824 kb writes and 166 seeks (112 leaves, 94 branches, 0 overflows)
Flush 9,071 with 203 pages - 812 kb writes and 169 seeks (105 leaves, 98 branches, 0 overflows)

And here are the improved results:

Flush      1 with   2 pages -     8 kb writes and   1 seeks (  2 leaves,   0 branches,   0 overflows)
Flush      2 with   8 pages -    32 kb writes and   1 seeks (  7 leaves,   1 branches,   0 overflows)
Flush      3 with  10 pages -    40 kb writes and   1 seeks (  9 leaves,   1 branches,   0 overflows)
  
Flush     27 with  73 pages -   292 kb writes and   1 seeks ( 72 leaves,   1 branches,   0 overflows)
Flush     28 with  72 pages -   288 kb writes and   1 seeks ( 71 leaves,   1 branches,   0 overflows)
Flush     29 with  71 pages -   284 kb writes and   1 seeks ( 70 leaves,   1 branches,   0 overflows)
  
Flush  1,153 with 157 pages -   628 kb writes and  11 seeks (105 leaves,  52 branches,   0 overflows)
Flush  1,154 with 159 pages -   636 kb writes and   2 seeks (107 leaves,  52 branches,   0 overflows)
Flush  1,155 with 167 pages -   668 kb writes and  17 seeks (111 leaves,  56 branches,   0 overflows)
  
Flush  4,441 with 210 pages -   840 kb writes and  11 seeks (121 leaves,  86 branches,   3 overflows)
Flush  4,442 with 215 pages -   860 kb writes and   1 seeks (124 leaves,  88 branches,   3 overflows)
Flush  4,443 with 217 pages -   868 kb writes and   9 seeks (126 leaves,  89 branches,   2 overflows)
  
Flush  7,707 with 231 pages -   924 kb writes and   7 seeks (136 leaves,  93 branches,   2 overflows)
Flush  7,708 with 234 pages -   936 kb writes and   9 seeks (136 leaves,  97 branches,   1 overflows)
Flush  7,709 with 241 pages -   964 kb writes and  13 seeks (140 leaves,  97 branches,   4 overflows)

Flush  9,069 with 250 pages - 1,000 kb writes and   6 seeks (144 leaves, 101 branches,   5 overflows)
Flush  9,070 with 250 pages - 1,000 kb writes and  13 seeks (145 leaves,  98 branches,   7 overflows)
Flush  9,071 with 248 pages -   992 kb writes and  12 seeks (143 leaves,  99 branches,   6 overflows)

Let us plot this in a chart, so we can get a better look at things:

image

As you can see, this is a pretty major improvement. But it came at a cost, let us see the cost of size per transaction…

image

So we improved on the seeks / tx, but got worse on the size / tx. That is probably because of the overhead of keeping the state around, but it also relates to some tunable configuration that we added (the amount of free space in a section that will make it eligible for use.

Annoyingly, after spending quite a bit of time & effort on this, we don’t see a major perf boost here. But I am confident that it’ll come.

Comments

Catalin Pop
09/19/2013 10:57 AM by
Catalin Pop

4 MB seems a bit much to me.

For example the space organizational unit used by SQL Server above page level is the extent that has 64KB (or 8 pages). My guess is that older processors had 64KB max L2 cache lines and they used this also for memory efficiency (maybe).

I'd definitely try to use as low as 256KB or 128KB blocks in order to determine effects (The max L2 cache line sizes of modern x64 CPUs)

Also, with this new optimisation, how well does the storage engine handle random writes with mixed size documents ( < 1KB up to >4MB). I'd imagine a form of fragmentation might arise.

Howard Chu
09/19/2013 03:51 PM by
Howard Chu

I agree with Catalin, 4MB seems too much.

Peak transfer speed on a HDD tends to be around 100MB/s, or 100KB/ms. So, when you want to trade away seeks for sequential performance, and your average seek time is ~12ms, that means the break-even point is around 1.2MB. The exact point depends on the specific HDD, but it's much less than 4MB.

Comments have been closed on this topic.