Is select() broken? Memory-mapped files with unbuffered writes == race condition?
Let me start this post by stating that I am not even sure if what I am trying to do is legal here. But from reading the docs, it does appear to be a valid use of the API, and it does work, most of the time.
The full code can be found here: https://gist.github.com/ayende/7495987
The gist of it is that I am trying to do two things:
- Write to a file opened with FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING.
- Read from this file using a memory map.
Occasionally, I get into situations where, after I have written to the file, I am not reading back what I wrote.
I have a repro, and we have reproduced this on multiple machines, on both Windows 7 and Windows 8.
Here is the relevant code (the full code is in the link); an explanation follows below:
```
const uint nNumberOfBytesToWrite = 4096 * 3;
var buffer = (byte*)(VirtualAlloc(IntPtr.Zero, new UIntPtr(nNumberOfBytesToWrite),
        AllocationType.COMMIT, MemoryProtection.READWRITE)
    .ToPointer());

for (int i = 0; i < nNumberOfBytesToWrite; i++)
{
    *(buffer + i) = 137;
}

var g = Guid.NewGuid().ToString();

var safeHandle = CreateFile(g,
    NativeFileAccess.GenericRead | NativeFileAccess.GenericWrite,
    NativeFileShare.Read, IntPtr.Zero,
    NativeFileCreationDisposition.OpenAlways,
    NativeFileAttributes.Write_Through | NativeFileAttributes.NoBuffering | NativeFileAttributes.DeleteOnClose,
    IntPtr.Zero);

var fileStream = new FileStream(safeHandle, FileAccess.ReadWrite);
fileStream.SetLength(1024 * 1024 * 1024); // 1gb

if (safeHandle.IsInvalid)
{
    throw new Win32Exception();
}

FileStream mms = fileStream;
//mms = new FileStream(g, FileMode.Open, FileAccess.Read, FileShare.ReadWrite | FileShare.Delete);
var mmf = MemoryMappedFile.CreateFromFile(mms, Guid.NewGuid().ToString(), fileStream.Length,
    MemoryMappedFileAccess.Read, null, HandleInheritability.None, true);

MemoryMappedViewAccessor accessor = mmf.CreateViewAccessor(0, fileStream.Length, MemoryMappedFileAccess.Read);
byte* ptr = null;
accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);

Task.Factory.StartNew(() =>
{
    long lastPos = 0;
    while (true)
    {
        int count = 0;
        while (true)
        {
            if (*(ptr + lastPos) != 137)
            {
                break;
            }
            lastPos += 4096;
            count++;
        }
        Console.WriteLine();
        Console.WriteLine("Verified {0} MB", count * 4 / 1024);
        Console.WriteLine();
        Thread.Sleep(2000);
    }
});

for (int i = 0; i < 1024 * 64; i++)
{
    var pos = i * nNumberOfBytesToWrite;
    if (i % 100 == 0)
        Console.Write("\r{0,10:#,#} kb", pos / 1024);
    var nativeOverlapped = new NativeOverlapped
    {
        OffsetHigh = 0,
        OffsetLow = (int)pos
    };

    uint written;
    if (WriteFile(safeHandle, buffer, nNumberOfBytesToWrite, out written, &nativeOverlapped) == false)
        throw new Win32Exception();

    for (int j = 0; j < 3; j++)
    {
        if (*(ptr + pos) != 137)
        {
            throw new Exception("WTF?!");
        }
        pos += 4096;
    }
}
```
This code is doing the following:
- We set up a file handle using NoBuffering | Write_Through, and we also map the file using a memory map.
- We write 3 pages (12KB) at a time to the file.
- After each write, we use the memory map to verify that we actually wrote what we wanted to the file.
- _At the same time_ we are reading from the same memory in another thread.
- Occasionally, we get an error where the data we just wrote to the file cannot be read back.
Now, here is what I think is actually happening:
- When we do an unbuffered write, Windows has to mark the relevant pages as invalid.
- I _think_ that it does so before it actually performs the write.
- If you have another thread that accesses that particular range of memory at the same time, it can load the _previously_ written data.
- The WriteFile actually performs the write, but the pages that map to that portion of the file have already been marked as loaded.
- At that point, when we use the memory mapped pointer to access the data, we get the data that was there before the write.
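If that theory holds, the bytes on disk are already correct and only the cached/mapped pages are stale. Here is a minimal sketch of how one could test that, meant to slot into the gist's inner loop (it reuses its safeHandle, ptr, and pos, and assumes a ReadFile P/Invoke declared analogously to the gist's WriteFile); an unbuffered read bypasses the cache and goes straight to disk:

```
// Sketch only, not part of the gist. Assumes ReadFile is declared like the
// gist's WriteFile P/Invoke. With NO_BUFFERING, the read buffer and offset
// must be sector-aligned, which VirtualAlloc and 4096-multiples give us.
var readBuffer = (byte*)VirtualAlloc(IntPtr.Zero, new UIntPtr(4096),
    AllocationType.COMMIT, MemoryProtection.READWRITE).ToPointer();

var readOverlapped = new NativeOverlapped { OffsetHigh = 0, OffsetLow = (int)pos };
uint bytesRead;
if (ReadFile(safeHandle, readBuffer, 4096, out bytesRead, &readOverlapped) == false)
    throw new Win32Exception();

bool diskIsFresh = *readBuffer == 137;   // what is physically on disk
bool viewIsStale = *(ptr + pos) != 137;  // what the mapped view shows
// diskIsFresh && viewIsStale at the moment of the failure would support
// the "page marked as loaded before the write landed" theory.
```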
As I said, the code above can reproduce this issue (you might have to run it multiple times).
I am not sure if this is a valid issue or just me misusing the API. The docs are pretty clear about mixing regular I/O & memory-mapped I/O: the OS is responsible for keeping them coherent with respect to one another. However, that is certainly not the case here.
It might be that I am using a single handle for both, and Windows does less checking when that happens? For what it is worth, I have also tried it using different handles, and I don’t see the problem in the code above, but I have a more complex scenario where I do see the same issue.
Of course, FILE_FLAG_OVERLAPPED is not specified, so what I would actually expect here is serialization of the I/O, according to the docs. But mostly I need a sanity check to tell me if I am crazy.
Comments
Part of me wonders: if you fill up the byte* ptr with 137 without using accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr), do the Task.Factory.StartNew() task and the for-loop checker still behave the same way?
It makes me think of my old-school C++ days and overwriting pointers. I do not really see anything like that here; I am just curious whether it is more the SafeMemoryMappedViewHandle or the code that is checking the results.
Good luck; it is fun finding weird errors.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366761%28v=vs.85%29.aspx
"A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile or WriteFile function."
This is why we had to avoid using WriteFileGather in LMDB - you break coherency with the map when you do a write with NO_BUFFERING.
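For reference, the way to stay inside the documented coherence guarantees is to route the writes through the mapping itself and flush explicitly. A minimal sketch (not the post's code; the file name and sizes are made up), using the .NET wrappers where MemoryMappedViewAccessor.Flush() wraps FlushViewOfFile and FileStream.Flush(true) wraps FlushFileBuffers:

```
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class CoherentWriteSketch
{
    static void Main()
    {
        // All writes go through the mapping, so ReadFile/WriteFile coherence
        // with the map never comes into play. Durability comes from the
        // explicit flushes instead of NO_BUFFERING / WRITE_THROUGH.
        using (var fs = new FileStream("data.bin", FileMode.Create,
                                       FileAccess.ReadWrite, FileShare.Read))
        {
            fs.SetLength(4096 * 3);
            using (var mmf = MemoryMappedFile.CreateFromFile(fs, null, fs.Length,
                       MemoryMappedFileAccess.ReadWrite, null,
                       HandleInheritability.None, true))
            using (var accessor = mmf.CreateViewAccessor(0, fs.Length,
                       MemoryMappedFileAccess.ReadWrite))
            {
                for (long i = 0; i < fs.Length; i++)
                    accessor.Write(i, (byte)137);

                accessor.Flush(); // FlushViewOfFile: write dirty pages to disk
                fs.Flush(true);   // FlushFileBuffers: block until durable
            }
        }
    }
}
```

The trade-off is that you give up control over write ordering and pay for the cache copy, which is exactly what the NO_BUFFERING approach was trying to avoid.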
If you specify write-through, you circumvent the OS cache. Consistency is ensured by the cache manager by having only one memory location (page) to which any process writes for a given file position. This is especially handy for memory-mapped files, where you get direct access to the cached pages. I guess this consistency guarantee breaks down when you circumvent the cache: now you have, at one time, two different pages of memory which represent the current data of a file, and one must be invalid. The cache page will be marked as outdated, but only after the data has already been written. This would leave a small gap when you try to read the data through the memory-mapped file, which represents the current cached state. This is my current understanding of how things should work, but I could be wrong. To get a definitive answer you should ask a kernel dev.
Howard, Thanks. Now I have to go and hit my face against the wall a few times. I don't know how I missed this. I think that I read the first part of that statement and somehow missed the second one.
Howard, WriteFileGather isn't actually the issue; we aren't even using it. And the annoying thing is that it almost works: in 99% of the cases, it does. But since it isn't guaranteed, I guess that makes a lot of sense. Thanks for pointing it out. I guess the old "select() isn't broken" rule still holds.
Alois, Yes, you are correct. However, it appears that the kernel is smart enough to do the right thing (on an unbuffered write, it drops the pages from the cache, then re-reads them). Note that the docs call this out as a general warning because it is something that works most of the time, but isn't guaranteed.
From my comment on http://ayende.com/blog/163299/with-malice-aforethought-we-can-try-even-better#comments
"A mapped file and a file that is accessed by using the input and output (I/O) functions (ReadFile and WriteFile) are not necessarily coherent."
If it wasn't for the fact that "NO_BUFFERING" sort of defeats the purpose of memmapping, "WriteFileGather" would indeed have been a nice way of reducing the amount of IO requests needed for syncing the journal with the data file.
Alex, I haven't given up on this. It just means that I have to jump through a few extra hoops.
That's good to hear. I had thought about this for a while, could not find a working approach and abandoned this route (instead I implemented a "merging writer" in an attempt to reduce the number of IOs and keep them sequential).
It would be awesome if you got this working, because it would substantially improve data sync performance.
Alex, The basic idea is that I am going to do the following:
- Write the data to a memory-mapped scratch file (which should stay in memory unless there is memory pressure).
- Write the data to the journal file.
- Have a background process copy the data from the scratch file to the data file and sync it.
So, during normal ops, we never actually have a thread waiting for fsync.
Oh yeah, next to the merging writer, I am also not storing the "checkpoint" page (i.e. the metadata page that records the last synced state) in the data file, but in a separate file, to prevent data file writes from always having to include page 0/1. This too seems to improve data sync performance on average.
Oops, I only just refreshed after posting the comment above. I like the idea you outlined, although it now seems you are writing the data at least 2+ times.
Reusing the scratch file will certainly speed up writes. I think recycling the journal files when they are no longer needed will also help.
The only vulnerability to durability I see is if the underlying device does not honor "write-through" promises (which could be the case for non-SCSI devices).
Alex, I am actually writing the data to a memory-mapped scratch file (the temp flag should ensure that the data is only written to disk under memory pressure), then to the journal file. Then a background process copies the data from the scratch file to the data file. I don't think we can recycle a journal file, at least not without overwriting it completely first, because it already has valid transactions, so we need to be careful there.
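To make that flow concrete, here is a compressed sketch (hypothetical file names and sizes; this is not Voron's actual code): writes land in a memory-mapped scratch file, become durable via the write-through journal, and a background task applies them to the data file so no foreground thread waits on fsync.

```
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

class ScratchJournalSketch
{
    const int PageSize = 4096;

    static void Main()
    {
        // 1. Scratch: memory-mapped. (The comment above mentions the temp
        //    flag so pages only hit disk under memory pressure; omitted here.)
        using (var scratch = MemoryMappedFile.CreateFromFile(
                   "scratch.tmp", FileMode.Create, null, 64 * PageSize))
        // 2. Journal: write-through, so a completed Write is the commit point.
        using (var journal = new FileStream("journal.log", FileMode.Create,
                   FileAccess.Write, FileShare.None, PageSize, FileOptions.WriteThrough))
        // 3. Data file: only touched by the background sync.
        using (var dataFile = new FileStream("data.bin", FileMode.Create,
                   FileAccess.Write, FileShare.Read))
        {
            dataFile.SetLength(64 * PageSize);

            var page = new byte[PageSize];
            for (int i = 0; i < PageSize; i++) page[i] = 137;

            using (var view = scratch.CreateViewStream(0, PageSize))
                view.Write(page, 0, PageSize);   // write to scratch (in memory)

            journal.Write(page, 0, PageSize);    // durable once this returns

            Task.Run(() =>                       // background apply + fsync
            {
                dataFile.Write(page, 0, PageSize);
                dataFile.Flush(true);            // data-file fsync off the hot path
            }).Wait();                           // Wait() only so the sketch exits cleanly
        }
    }
}
```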
I think Raymond Chen would say something like "so you switched the airbag off, and are confused why it didn't deploy?"
Good article.
Just a thought that occurred: how bad is the problem you noticed here in practice?
Since the pages being synced should not be read from the data file until the sync is completed, but from the journal instead (or in the new design, from the "scratch pages buffer"), couldn't you still use this approach? I.e., do gather writes to the data file (using an overlapped, write-through, no-buffering handle), and do reads from a read-only memmap opened on a second non-overlapped handle. The memmap should never be paging in any of the pages being written to until after the data sync is completed.
Alex, The problem is that there isn't a way to guarantee that a write to the data file is actually visible to the memory-mapped view. In our tests, we have seen this happen pretty consistently, once every 50,000 ops or so. And the problem isn't with the same _pages_, it is with related pages. The memory map doesn't read one page at a time, it reads them in batches. Imagine that we write to page 65, and then another tx tries to read page 64. There is no reason to wait for the write, but the memory map decides to load 8 pages, 64-71. As far as it is concerned, it already has the latest value in memory. Because of that, when you access page 65 a while later, long after the write has actually completed, you get the old information.
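The arithmetic of that scenario, as a sketch (the 8-page read-ahead unit is an assumption; the real batch size is up to the OS):

```
using System;

class ReadAheadHazard
{
    static void Main()
    {
        const int ReadAheadPages = 8;  // assumed read-ahead unit

        long writtenPage = 65;         // unbuffered WriteFile targets page 65
        long touchedPage = 64;         // another tx faults in neighboring page 64

        // The mapper materializes the whole batch containing page 64:
        long batchStart = (touchedPage / ReadAheadPages) * ReadAheadPages; // 64
        long batchEnd   = batchStart + ReadAheadPages - 1;                 // 71

        // Page 65 falls inside that batch, so it is populated from the
        // pre-write disk image and the map considers it "already loaded".
        // Later reads of page 65 through the map return stale data, even
        // though the unbuffered write completed long ago.
        bool caughtStale = writtenPage >= batchStart && writtenPage <= batchEnd;
        Console.WriteLine("Page {0} caught in stale batch {1}-{2}: {3}",
            writtenPage, batchStart, batchEnd, caughtStale); // True
    }
}
```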
Yes, read-ahead of adjacent pages. That would indeed spoil this.