With a little knowledge and a profiler, let us optimize!
As part of the work we have been doing on Voron, I wrote a few benchmarks and looked at where the hot spots are. One of the major ones was this function:
public override void Flush()
{
    if (_flushMode == FlushMode.None)
        return;

    PagerState.Accessor.Flush();
    _fileStream.Flush(_flushMode == FlushMode.Full);
}
This is effectively the “fsync()” call. The Accessor.Flush() call will resolve to FlushViewOfFile(0, size), and _fileStream.Flush(true) will resolve to FlushFileBuffers on Windows.
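For reference, those two native calls have roughly the following P/Invoke shape. This is a sketch for illustration only; the actual interop declarations in Voron may be organized differently:

using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

internal static class NativeMethods
{
    // Writes the dirty pages within the given range of a mapped view to the
    // file system cache; this does not guarantee they reach the physical disk.
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool FlushViewOfFile(IntPtr lpBaseAddress, IntPtr dwNumberOfBytesToFlush);

    // Forces buffered data for the file to be written to the physical disk,
    // which is the expensive "wait for the hardware" part.
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool FlushFileBuffers(SafeFileHandle hFile);
}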
It isn’t surprising that this would be THE hotspot; it is the part where we actually have to wait for the hardware to do stuff, after all. But further investigation revealed that it wasn’t the FlushFileBuffers that was really costly, it was the FlushViewOfFile. What FlushViewOfFile will do is scan all of the pages in the range and flush them to the OS (not to disk) if they are dirty. That is great, but it is effectively an O(N) operation. We have more knowledge about what is going on, so we can do better. We already know which pages are dirty, so we can use that knowledge instead of letting the OS do all the work.
But then we run into another problem. If we just call FlushViewOfFile for every page separately, we are going to spend a lot of time calling into the OS whenever we have to do a large write. So we need to balance the amount of data we send to FlushViewOfFile against the number of times we call it. Therefore, I came up with the following logic: we are going to group calls to FlushViewOfFile as long as the pages are nearby (within 256KB of one another). This gives us the best balance between reducing the number of pages that FlushViewOfFile needs to scan and the number of times we call FlushViewOfFile.
This now looks like this:
public override void Flush(List<long> sortedPagesToFlush)
{
    if (_flushMode == FlushMode.None || sortedPagesToFlush.Count == 0)
        return;

    // here we try to optimize the amount of work we do, we will only
    // flush the actual dirty pages, and we will do so in sequential order
    // ideally, this will save the OS the trouble of actually having to flush the
    // entire range
    long start = sortedPagesToFlush[0];
    long count = 1;
    for (int i = 1; i < sortedPagesToFlush.Count; i++)
    {
        var difference = sortedPagesToFlush[i] - sortedPagesToFlush[i - 1];
        // if the difference between them is not _too_ big, we will just merge it into a single call
        // we are trying to minimize both the size of the range that we flush AND the number of times
        // we call flush, so we need to balance those needs.
        if (difference < 64)
        {
            count += difference;
            continue;
        }
        FlushPages(start, count);
        start = sortedPagesToFlush[i];
        count = 1;
    }
    FlushPages(start, count);

    if (_flushMode == FlushMode.Full)
        _fileStream.Flush(true);
}
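FlushPages itself isn’t shown in the post. As a rough sketch of what it presumably does (this is not the actual Voron code; it assumes the NativeMethods.FlushViewOfFile declaration sketched above, a _baseAddress field holding the start of the mapped view, and 4KB pages):

private void FlushPages(long startPage, long numberOfPages)
{
    const long pageSize = 4096; // assumed page size

    // Translate the page range into a byte range inside the mapped view,
    // then ask the OS to flush only that range instead of the entire file.
    var rangeStart = new IntPtr(_baseAddress.ToInt64() + startPage * pageSize);
    var rangeSize = new IntPtr(numberOfPages * pageSize);

    if (NativeMethods.FlushViewOfFile(rangeStart, rangeSize) == false)
        throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());
}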
A side effect of this is that we are more likely to be writing to the disk in a sequential fashion.
The end result of this change ranged from doubling the performance of the system under the worst case scenario to “just” 25% faster under the best conditions.
Comments
Hey, I thought that the main benefit of using mmapped files is that the OS knows which pages are dirty and need to be written to the disk. So, are you telling us that Windows can't efficiently flush modified pages to disk? This is quite an important operation in all systems, so I wonder why it isn't optimized for performance.
Rafal, In order to check, it likely needs to do an O(n) operation. If I know what is different in advance, I can save that O(n) operation. It isn't an issue with flushing efficiently, it is an issue of knowing what to flush.
I have read Windows Internals some time ago and vaguely remember how virtual memory works in Windows - the state of all virtual memory pages is kept in a page table, which is a tree-like data structure with each entry describing a memory page in process virtual memory space. So checking for dirty pages requires only a traversal of the page map and should be pretty fast imho. I'm shocked you had to optimize it and even more shocked that your optimization works.
... btw, here's an article by OSR guys, who certainly know how everything works in Windows: http://www.osronline.com/article.cfm?article=60
Ayende, are you planning to make an async version of the API in Voron?
Sergey, Probably not. There is no way to do a memory flush async in Windows, so that wouldn't work
Rafal, Imagine that I am talking about 1GB of memory, giving us 262,144 pages. I have the option of saying:
Flush everything, meaning that the OS needs to go and check 262,144 pages to see if they are dirty or not. Or I have a list of the dirty pages already (let us say 100 or so), so I can tell it directly what those pages are. Even though checking if a page is dirty or not is a very cheap operation, you need to do it 262,144 times vs 100 times. The rest of the work is pretty much identical. No wonder this way is more efficient.
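To make the arithmetic explicit (assuming the usual 4KB page size, which is where the 262,144 figure comes from): 1GB = 1,073,741,824 bytes, and 1,073,741,824 / 4,096 = 262,144 pages. Flushing the whole range therefore means 262,144 dirty checks, while passing the known dirty pages means roughly 100 checks, about 2,600 times fewer, even though the actual write work is the same.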
Yeah, I know what the problem is, but I'm still surprised your CLR code runs faster than the OS internals. Probably no one really thought about what happens if FlushViewOfFile is called that frequently - this isn't the behavior of a typical Windows application.
I wonder if this would be useful in Linux, may try it later.
So far I've found that even running with NOSYNC, Linux will still flush every 5 seconds by default. If you tweak the writeback parameters to delay for a longer interval, you can boost write throughput by up to 20%. It's a shame that flushing has such an impact.
Ah, I forgot, there's no point in trying this in Linux. Linux ignores the addr,len params of msync and always just syncs the entire file.
Rafal, Think about this from the POV of the OS. There is really no way for it to know without checking. So it has to do an O(n) operation.
Howard, I think that in most cases, you don't have that impact, since you are using write() to do the actual write, most often, instead of direct memory flushes. That said, you could probably benefit from page merging, but it is likely that the buffers that you already have in the write() call would take care of that.
I know I'm not bringing anything new here, but the process virtual memory table is a data structure, just like your btree. So it could be designed to handle the dirty bit check in a more efficient way - I could think of a log(n) procedure (of course, if the algorithm isn't hard-wired into the CPU). The OS knows just as much about modified pages as your code does, but doesn't use this information as efficiently.
The process table is hardwired into the CPU. Now, the OS can keep track of dirty pages in a separate data structure, but 1GB has 262,144 pages, which means that even if we use just a single bit to mark dirty yes/no, you would need 32KB of memory to track it. For this scenario, it is a perfectly valid answer. But for the general case? Where the OS can figure it out on its own and not waste so much memory? Why do that?
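(The 32KB figure works out as follows, again assuming 4KB pages: 262,144 pages × 1 bit = 262,144 bits = 32,768 bytes = 32KB of bookkeeping for every 1GB of mapped memory.)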