Failing to map: a tale of false hopes in mmap land

Jul 08 2024

Failing to map: a tale of false hopes in mmap land

time to read 10 min | 1997 words

I usually talk about the things that I do that were successful. Today I want to discuss something that I tried but failed at. Documenting failed approaches is just as important, though less enjoyable, as documenting what we excel at.

In order to explain the failure, I need to get a bit deeper into how computers handle memory. There is physical memory, the RAM sticks that you have in your machine, and then there is how the OS and CPU present that memory to your code. Usually, the abstraction is quite seamless, and we don’t need to pay attention to it.

Occasionally, we can take advantage of this model. Consider the following memory setup, showing a single physical memory page that was mapped in two different locations:

In this case, it means that you can do things like this:

*page1 = '*';
printf("Same: %d - Val: %c\n", (page1 == page2), *page2); 
// output is:
// Same: 0 - Val: *

In other words, because the two virtual pages point to the same physical page in memory, we can modify memory in one location and see the changes in another. This isn’t spooky action at a distance, it is simply the fact that the memory addresses we use are virtual and they point to the same place.

Note that in the image above, I modified the data using the pointer to Page 1 and then read it from Page 2. The Memory Management Unit (MMU) in the CPU can do a bunch of really interesting things because of this. You’ll note that each virtual page is annotated with an access permission.

In this case, the second page is marked as Copy on Write. That means that when we read from this page, the MMU will happily read the data from the physical page it is pointed to. But when we write, the situation is different.

The MMU will raise an exception to the operating system, telling it that a write was attempted on this page, which is forbidden. At this point, the OS will allocate a new physical page, copy the data to it, and then update the virtual address to point to the new page. Here is what this looks like:

Now we have two distinct mappings. A write to either one of them will not be reflected on the other. Here is what this looks like in code:

*page1 = '1'; // now 
printf("Page1: %c, Page2: %c\n", *page1, *page2); 
// output: Page1: 1, Page2: 1
*page2 = '2'; // force the copy on write to occur
printf("Page1: %c, Page2: %c\n", *page1, *page2); 
// output: Page1: 1, Page2: 2

As long as the modifications happened through the first page address (the orange one in the image), there was no issue and any change would be reflected in both pages. When we make a modification to the second page (the green one in the image), the OS will create a new physical page and effectively split them forever.

Changes made to either page will only be reflected in that page, not both, since they aren’t sharing the same page.

Note that this behavior applies at a page boundary. What happens if I have a buffer, 1GB in size, and I use this technique on it? Let’s assume that we have a buffer that is 1GB in size and I created a copy-on-write mapping on top of it.

The amount of physical memory that I would consume is still just 1GB.

In fact, I would effectively memcpy()very quickly, since I’m not actually copying anything. And for all intents and purposes, it works. I can change the data through the second buffer, and it would not show up in the first buffer. Of particular note is that when I modify the data on the second buffer, only a single page is changed. Here is what this looks like:

So instead of having to copy 1GB all at once, we map the buffer again as copy on write, and we can get a new page whenever we actually modify our “copy” of the data.

So far, this is great, and it is heavily used for many optimizations. It is also something that I want to use to implement cheap snapshots of a potentially large data structure. The idea that I have is that I can use this technique to implement it.

Here is the kind of code that I want to write:

var list = new CopyOnWriteList();
list.Put(1);
list.Put(2);


var snapshot1 = list.CreateSnapshot();


list.Put(3)




var snapshot2 = list.CreateSnapshot();


list.Put(4);

And the idea is that I’ll have (at the same time) the following:

list	snapshot1	snapshot2
1,2,3,4	1,2	1,2,3

I want to have effectively unlimited snapshots, and the map may contain a large amount of data. In graphical form, you can see it here:

We started with Page 1, created a Copy of Write for Page 2, modified Page 2 (breaking the Copy on Write), and then attempted to create a Copy on Write for Page 2. That turns out to be a problem.

Let’s see the code that we need in order to create a copy using copy-on-write mapping on Linux:

int shm_fd = shm_open("/third", O_CREAT | O_RDWR, 0666);
ftruncate(shm_fd, 4096);
char *page1 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
page1[0] = 'A'; page1[1] = 'B';
// pages1 = 'AB'
char *page2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, shm_fd, 0);
// pages2 = 'AB'
page1[0]= 'a';
// pages1 = 'aB'
// pages2 = 'aB' (same pagee)
page2[2] = 'C'; // force a private copy creation
// pages1 = 'aB'
// pages2 = 'aBC'
page1[1] = 'b';
// pages1 = 'ab'
// pages2 = 'aBC' (no change here)

The code in Windows is pretty similar and behaves in the same manner:

HANDLE hMapFile = CreateFileMapping(INVALID_HANDLE_VALUE,
    NULL,PAGE_READWRITE,0,4096, TEXT("Local\\MySharedMemory"));
char* page1 = MapViewOfFile(hMapFile,
    FILE_MAP_READ | FILE_MAP_WRITE, 0, 0, 4096);
page1[0] = 'A'; page1[1] = 'B';
// pages1 = 'AB'
char* page2 = MapViewOfFile(hMapFile,
    FILE_MAP_COPY, 0, 0, 4096);
// pages2 = 'AB'
page1[0] = 'a';
// pages1 = 'aB'
// pages2 = 'aB' (same pagee)
page2[2] = 'C'; // force a copy on write 
// pages1 = 'aB'
// pages2 = 'aBC'
page1[1] = 'b';
// pages1 = 'ab'
// pages2 = 'aBC' (no change here)

Take a look at the API we have for creating a copy-on-write:

MapViewOfFile(hMapFile, FILE_MAP_COPY, 0, 0, 4096); // windows
mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, shm_fd, 0); // linux

A key aspect of the API is that we need to provide a source for the Copy-on-Write operation. That means that we can only create a Copy-on-Write from a single source. We cannot perform a Copy-on-Write on top of a page that was marked as copy-on-write. This is because we cannot refer to it. Basically, I don’t have a source that I can use for this sort of mapping.

I tried being clever and wrote the following code on Linux:

int selfmem = open("/proc/self/mem", O_RDWR);
char *page2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, 
                   MAP_PRIVATE, selfmem, (off_t)page1);

On Linux, you can use the special file /proc/self/mem to refer to your memory using file I/O. That means that I can get a file descriptor for my own memory, which provides a source for my copy-on-write operation.

I was really excited when I realized that this was a possibility. I spent a lot of time trying to figure out how I could do the same on Windows. However, when I actually ran the code on Linux, I realized that this doesn’t work.

The mmap() call will return ENODEV when I try that. It looks like this isn’t a supported action.

Linux has another call that looks almost right, which is mremap(), but that either zeros out or sets up a userfaulfdhandler for the region. So it can’t serve my needs.

Looking around, I’m not the first person to try this, but it doesn’t seem like there is an actual solution.

This is quite annoying since we are almost there. All the relevant pieces are available, if we had a way to tell the kernel to create the mapping, everything else should just work from there.

Anyway, this is my tale of woe, trying (and failing) to create a snapshot-based system using the Memory Manager Unit. Hopefully, you’ll either learn something from my failure or let me know that there is a way to do this…

Tweet Share Share 1 comments

Tags:

Comments

10 Jul 2024
05:49 AM

Rafal

When you fork a process, at least in Unix, the child process gets a copy-on-write clone of the parent's memory. But if the child process makes a fork again - will it create a third copy-on-write mapping?

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB