Is select() broken? Memory-mapped files with unbuffered writes == race condition?
Let me start this post by stating that I am not even sure if what I am trying to do is legal here. But from reading the docs, it does appear to be a valid use of the API, and it does work, most of the time.
The full code can be found here: https://gist.github.com/ayende/7495987
The gist of it is that I am trying to do two things:
- Write to a file opened with FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING.
- Read from this file using a memory map.
Occasionally, I get into situations where, after I have written to the file, I am not reading back what I wrote.
I have a repro, and we have reproduced this on multiple machines, on both Windows 7 and Windows 8.
Here is the relevant code (the full code is in the link); an explanation follows below:
```
const uint nNumberOfBytesToWrite = 4096 * 3;
var buffer = (byte*)(VirtualAlloc(IntPtr.Zero, new UIntPtr(nNumberOfBytesToWrite),
        AllocationType.COMMIT, MemoryProtection.READWRITE)
    .ToPointer());

for (int i = 0; i < nNumberOfBytesToWrite; i++)
{
    *(buffer + i) = 137;
}

var g = Guid.NewGuid().ToString();

var safeHandle = CreateFile(g,
    NativeFileAccess.GenericRead | NativeFileAccess.GenericWrite,
    NativeFileShare.Read, IntPtr.Zero,
    NativeFileCreationDisposition.OpenAlways,
    NativeFileAttributes.Write_Through | NativeFileAttributes.NoBuffering | NativeFileAttributes.DeleteOnClose,
    IntPtr.Zero);

var fileStream = new FileStream(safeHandle, FileAccess.ReadWrite);
fileStream.SetLength(1024 * 1024 * 1024); // 1gb

if (safeHandle.IsInvalid)
{
    throw new Win32Exception();
}

FileStream mms = fileStream;
//mms = new FileStream(g, FileMode.Open, FileAccess.Read, FileShare.ReadWrite | FileShare.Delete);
var mmf = MemoryMappedFile.CreateFromFile(mms, Guid.NewGuid().ToString(), fileStream.Length,
    MemoryMappedFileAccess.Read, null, HandleInheritability.None, true);

MemoryMappedViewAccessor accessor = mmf.CreateViewAccessor(0, fileStream.Length, MemoryMappedFileAccess.Read);
byte* ptr = null;
accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);

Task.Factory.StartNew(() =>
{
    long lastPos = 0;
    while (true)
    {
        int count = 0;
        while (true)
        {
            if (*(ptr + lastPos) != 137)
            {
                break;
            }
            lastPos += 4096;
            count++;
        }
        Console.WriteLine();
        Console.WriteLine("Verified {0} MB", count * 4 / 1024);
        Console.WriteLine();
        Thread.Sleep(2000);
    }
});

for (int i = 0; i < 1024 * 64; i++)
{
    var pos = i * nNumberOfBytesToWrite;
    if (i % 100 == 0)
        Console.Write("\r{0,10:#,#} kb", pos / 1024);
    var nativeOverlapped = new NativeOverlapped
    {
        OffsetHigh = 0,
        OffsetLow = (int)pos
    };

    uint written;
    if (WriteFile(safeHandle, buffer, nNumberOfBytesToWrite, out written, &nativeOverlapped) == false)
        throw new Win32Exception();

    for (int j = 0; j < 3; j++)
    {
        if (*(ptr + pos) != 137)
        {
            throw new Exception("WTF?!");
        }
        pos += 4096;
    }
}
```
This code is doing the following:
- We set up a file handle using NoBuffering | Write_Through, and we also map the file using a memory map.
- We write 3 pages (12KB) at a time to the file.
- After each write, we use the memory map to verify that we actually wrote what we wanted to the file.
- _At the same time_ we are reading from the same memory in another thread.
- Occasionally, we get an error where the data we just wrote to the file cannot be read back.
Now, here is what I think is actually happening:
- When we do an unbuffered write, Windows has to mark the relevant pages as invalid.
- I _think_ that it does so before it actually performs the write.
- If you have another thread that accesses that particular range of memory at the same time, it can load the _previously_ written data.
- The WriteFile actually performs the write, but the pages that map to that portion of the file have already been marked as loaded.
- At that point, when we use the memory mapped pointer to access the data, we get the data that was there before the write.
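If that theory holds, the bytes on disk are already correct and only the cached/mapped pages are stale. Here is a minimal sketch of how one could test that, meant to slot into the gist's inner loop (it reuses its safeHandle, ptr, and pos, and assumes a ReadFile P/Invoke declared analogously to the gist's WriteFile); an unbuffered read bypasses the cache and goes straight to disk:

```
// Sketch only, not part of the gist. Assumes ReadFile is declared like the
// gist's WriteFile P/Invoke. With NO_BUFFERING, the read buffer and offset
// must be sector-aligned, which VirtualAlloc and 4096-multiples give us.
var readBuffer = (byte*)VirtualAlloc(IntPtr.Zero, new UIntPtr(4096),
    AllocationType.COMMIT, MemoryProtection.READWRITE).ToPointer();

var readOverlapped = new NativeOverlapped { OffsetHigh = 0, OffsetLow = (int)pos };
uint bytesRead;
if (ReadFile(safeHandle, readBuffer, 4096, out bytesRead, &readOverlapped) == false)
    throw new Win32Exception();

bool diskIsFresh = *readBuffer == 137;   // what is physically on disk
bool viewIsStale = *(ptr + pos) != 137;  // what the mapped view shows
// diskIsFresh && viewIsStale at the moment of the failure would support
// the "page marked as loaded before the write landed" theory.
```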
As I said, the code above can reproduce this issue (you might have to run it multiple times).
I am not sure if this is a valid issue or just me misusing the API. The docs are pretty clear about mixing regular I/O & memory-mapped I/O: the OS is responsible for keeping them coherent with respect to one another. However, that is certainly not the case here.
It might be that I am using a single handle for both, and Windows does less checking when that happens? For what it is worth, I have also tried it using different handles, and I don’t see the problem in the code above, but I have a more complex scenario where I do see the same issue.
Of course, FILE_FLAG_OVERLAPPED is not specified, so what I would actually expect here is serialization of the I/O, according to the docs. But mostly I need a sanity check to tell me if I am crazy.
Comments
Part of me wonders: if you fill up the byte* ptr with 137 without using accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr), do the Task.Factory.StartNew() task and the for-loop checker still behave the same way?
It makes me think of my old-school C++ days and overwriting pointers. I do not really see anything like that here; I am just curious whether it is more the SafeMemoryMappedViewHandle or the code that is checking the results.
Good luck; it is fun finding weird errors.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366761%28v=vs.85%29.aspx
"A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile or WriteFile function."
This is why we had to avoid using WriteFileGather in LMDB - you break coherency with the map when you do a write with NO_BUFFERING.
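For reference, the way to stay inside the documented coherence guarantees is to route the writes through the mapping itself and flush explicitly. A minimal sketch (not the post's code; the file name and sizes are made up), using the .NET wrappers where MemoryMappedViewAccessor.Flush() wraps FlushViewOfFile and FileStream.Flush(true) wraps FlushFileBuffers:

```
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class CoherentWriteSketch
{
    static void Main()
    {
        // All writes go through the mapping, so ReadFile/WriteFile coherence
        // with the map never comes into play. Durability comes from the
        // explicit flushes instead of NO_BUFFERING / WRITE_THROUGH.
        using (var fs = new FileStream("data.bin", FileMode.Create,
                                       FileAccess.ReadWrite, FileShare.Read))
        {
            fs.SetLength(4096 * 3);
            using (var mmf = MemoryMappedFile.CreateFromFile(fs, null, fs.Length,
                       MemoryMappedFileAccess.ReadWrite, null,
                       HandleInheritability.None, true))
            using (var accessor = mmf.CreateViewAccessor(0, fs.Length,
                       MemoryMappedFileAccess.ReadWrite))
            {
                for (long i = 0; i < fs.Length; i++)
                    accessor.Write(i, (byte)137);

                accessor.Flush(); // FlushViewOfFile: write dirty pages to disk
                fs.Flush(true);   // FlushFileBuffers: block until durable
            }
        }
    }
}
```

The trade-off is that you give up control over write ordering and pay for the cache copy, which is exactly what the NO_BUFFERING approach was trying to avoid.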
If you specify write-through, you circumvent the OS cache. Consistency is ensured by the cache manager by having only one memory location (page) to which any process writes for a given file position. This is especially handy for memory-mapped files, where you get direct access to the cached pages. I guess this consistency guarantee breaks down when you circumvent the cache: now you have, at one time, two different pages of memory which represent the current data of a file, and one must be invalid. The cache page will be marked as outdated, but only after the data has already been written. This would leave a small gap when you try to read the data through the memory-mapped file, which represents the current cached state. This is my current understanding of how things should work, but I could be wrong. To get a definitive answer you should ask a kernel dev.
Howard, Thanks. Now I have to go and hit my face against the wall a few times. I don't know how I missed this. I think that I read the first part of that statement and somehow missed the second one.
Howard, WriteFileGather isn't actually the issue; we aren't even using it. And the annoying thing is that it almost works: in 99% of the cases, it does. But since it isn't guaranteed, I guess that makes a lot of sense. Thanks for pointing it out. I guess the old "select() isn't broken" rule still holds.
Alois, Yes, you are correct. However, it appears that the kernel is smart enough to do the right thing (on an unbuffered write, it drops the pages from the cache, then re-reads them). Note that the docs call this out as a general warning because it is something that works most of the time, but isn't guaranteed.
From my comment on http://ayende.com/blog/163299/with-malice-aforethought-we-can-try-even-better#comments
"A mapped file and a file that is accessed by using the input and output (I/O) functions (ReadFile and WriteFile) are not necessarily coherent."
If it wasn't for the fact that "NO_BUFFERING" sort of defeats the purpose of memmapping, "WriteFileGather" would indeed have been a nice way of reducing the amount of IO requests needed for syncing the journal with the data file.
Alex, I haven't given up on this. It just means that I have to jump through a few extra hoops.
That's good to hear. I had thought about this for a while, could not find a working approach and abandoned this route (instead I implemented a "merging writer" in an attempt to reduce the number of IOs and keep them sequential).
It would be awesome if you got this working, because it would substantially improve data sync performance.
Alex, The basic idea is that I am going to do the following:
- Write the data to a memory-mapped scratch file (which should stay in memory unless there is memory pressure).
- Write the data to the journal file.
- Have a background process copy the data from the scratch file to the data file and sync it.
So, during normal ops, we never actually have a thread waiting for fsync.
Oh yeah, next to the merging writer, I am also not storing the "checkpoint" page (i.e. the metadata page that records the last synced state) in the data file, but in a separate file, to prevent data file writes from always having to include page 0/1. This too seems to improve data sync performance on average.
Oops, I only just refreshed after posting the comment above. I like the idea you outlined, although it now seems you are writing the data at least 2+ times.
Reusing the scratch file will certainly speed up writes. I think recycling the journal files when they are no longer needed will also help.
The only vulnerability to durability I see is if the underlying device does not honor "write-through" promises (which could be the case for non-SCSI devices).
Alex, I am actually writing the data to a memory-mapped scratch file (the temp flag should ensure that the data is only written to disk under memory pressure), then to the journal file. Then a background process copies the data from the scratch file to the data file. I don't think we can recycle a journal file, at least not without overwriting it completely first, because it already has valid transactions, so we need to be careful there.
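To make that flow concrete, here is a compressed sketch (hypothetical file names and sizes; this is not Voron's actual code): writes land in a memory-mapped scratch file, become durable via the write-through journal, and a background task applies them to the data file so no foreground thread waits on fsync.

```
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

class ScratchJournalSketch
{
    const int PageSize = 4096;

    static void Main()
    {
        // 1. Scratch: memory-mapped. (The comment above mentions the temp
        //    flag so pages only hit disk under memory pressure; omitted here.)
        using (var scratch = MemoryMappedFile.CreateFromFile(
                   "scratch.tmp", FileMode.Create, null, 64 * PageSize))
        // 2. Journal: write-through, so a completed Write is the commit point.
        using (var journal = new FileStream("journal.log", FileMode.Create,
                   FileAccess.Write, FileShare.None, PageSize, FileOptions.WriteThrough))
        // 3. Data file: only touched by the background sync.
        using (var dataFile = new FileStream("data.bin", FileMode.Create,
                   FileAccess.Write, FileShare.Read))
        {
            dataFile.SetLength(64 * PageSize);

            var page = new byte[PageSize];
            for (int i = 0; i < PageSize; i++) page[i] = 137;

            using (var view = scratch.CreateViewStream(0, PageSize))
                view.Write(page, 0, PageSize);   // write to scratch (in memory)

            journal.Write(page, 0, PageSize);    // durable once this returns

            Task.Run(() =>                       // background apply + fsync
            {
                dataFile.Write(page, 0, PageSize);
                dataFile.Flush(true);            // data-file fsync off the hot path
            }).Wait();                           // Wait() only so the sketch exits cleanly
        }
    }
}
```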
I think Raymond Chen would say something like "so you switched the airbag off, and are confused why it didn't deploy?"
Good article.
Just a thought that occurred: how bad is the problem you noticed here in practice?
Since the pages being synced should not be read from the data file until the sync is completed, but from the journal instead (or in the new design, from the "scratch pages buffer"), couldn't you still use this approach? I.e., do gather writes to the data file (using an overlapped, write-through, no-buffering handle), and do reads from a read-only memmap opened on a second non-overlapped handle. The memmap should never be paging in any of the pages being written to until after the data sync is completed.
Alex, The problem is that there isn't a way to guarantee that a write to the data file is actually visible to the memory-mapped view. In our tests, we have seen this happen pretty consistently, once every 50,000 ops or so. And the problem isn't with the same _pages_, it is with related pages. The memory map doesn't read one page at a time, it reads them in batches. Imagine that we write to page 65, and then another tx tries to read page 64. There is no reason to wait for the write, but the memory map decides to load 8 pages, 64-71. As far as it is concerned, it already has the latest value in memory. Because of that, when you access page 65 a while later, long after the write has actually completed, you get the old information.
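The arithmetic of that scenario, as a sketch (the 8-page read-ahead unit is an assumption; the real batch size is up to the OS):

```
using System;

class ReadAheadHazard
{
    static void Main()
    {
        const int ReadAheadPages = 8;  // assumed read-ahead unit

        long writtenPage = 65;         // unbuffered WriteFile targets page 65
        long touchedPage = 64;         // another tx faults in neighboring page 64

        // The mapper materializes the whole batch containing page 64:
        long batchStart = (touchedPage / ReadAheadPages) * ReadAheadPages; // 64
        long batchEnd   = batchStart + ReadAheadPages - 1;                 // 71

        // Page 65 falls inside that batch, so it is populated from the
        // pre-write disk image and the map considers it "already loaded".
        // Later reads of page 65 through the map return stale data, even
        // though the unbuffered write completed long ago.
        bool caughtStale = writtenPage >= batchStart && writtenPage <= batchEnd;
        Console.WriteLine("Page {0} caught in stale batch {1}-{2}: {3}",
            writtenPage, batchStart, batchEnd, caughtStale); // True
    }
}
```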
Yes, read-ahead of adjacent pages. That would indeed spoil this.