re: Are You Sure You Want to Use MMAP in Your Database Management System?

architecture (623) rss
bugs (451) rss
community (382) rss
databases (481) rss
design (899) rss
development (654) rss
hibernating-practices (73) rss
miscellaneous (592) rss
performance (397) rss
programming (1104) rss
raven (1471) rss
ravendb.net (558) rss
reviews (184) rss

2025
- October (4)
- September (10)
- August (6)
- July (7)
- June (7)
- May (10)
- April (10)
- March (10)
- February (7)
- January (12)
2024
- December (3)
- November (2)
- October (1)
- September (3)
- August (5)
- July (10)
- June (4)
- May (6)
- April (2)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- July (5)
- June (15)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- April (14)
- March (15)
- February (13)
- January (16)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- August (16)
- July (11)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
2020
- December (10)
- November (13)
- October (15)
- September (6)
- August (9)
- July (9)
- June (17)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- August (8)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- February (19)
- January (23)
2018
- December (15)
- November (14)
- October (19)
- September (18)
- August (23)
- July (20)
- June (20)
- May (23)
- April (15)
- March (23)
- February (19)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- August (23)
- July (21)
- June (24)
- May (21)
- April (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- October (22)
- September (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- February (33)
- January (36)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- August (96)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- May (115)
- April (73)
- March (124)
- February (102)
- January (68)
2006
- December (95)
- November (53)
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- March (143)
- February (78)
- January (64)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- February (84)
- January (31)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

RavenDB - High-Performance NoSQL Document Database

Jan 14 2022

reAre You Sure You Want to Use MMAP in Your Database Management System?

time to read 12 min | 2246 words

I was pointed to this paper on twitter: Are You Sure You Want to Use MMAP in Your Database Management System?

As you can imagine, this is a topic near and dear to my heart. This is especially the case since I am currently writing the Implementing a file pager in Zig posts series. I implemented the same low level mechanics using mmap, using mmap, I have < 100 lines of code and can start building higher level concepts almost immediately. Writing my own pager is currently a 10 posts series and the end doesn’t seem to be in sight.

I’m going to use this post to respond to the article. As a reminder, I’m the founder of RavenDB and I wrote Voron, a mmap based storage engine, and has been running that across hundreds of clients and literally tens of millions of instances in production. I am also writing a book about building a storage engine that uses mmap internally.

The paper itself does a great job of outlining the issue of using mmap as the buffer pool in DBMS. What it doesn’t cover, however, is the alternative. I will touch on specific points from the paper shortly, but I want to point out that the article compares apples to camels in the benchmarks and conclusions. Note that I don’t necessarily disagree with some of the statements, mmap certainly has challenges that you need to deal with, but if you avoid that, you can’t have wave everything that it brings to the table.

When building a database, using mmap has the following advantages, the OS will take care of:

Reading the data from disk
Concurrency between different threads reading the same data
Caching and buffer management
Eviction of pages from memory
Playing nice with other processes in the machine
Tracking dirty pages and writing to disk*

I put an asterisk on the last one because it probably requires your attention as well.

If you aren’t using mmap, on the other hand, you still need to handle all those issues. That is a key point that I believe isn’t addressed in the paper. Solving those issues properly (and efficiently) is a seriously challenging task. Given that you are building a specialized solution, you can probably do better than the generic mmap, but it will absolutely have a cost. That cost is both in terms of runtime overhead as well as increased development time.

The comparison that was made by the paper was done using fio benchmark tool, which is great if you want to test your storage system, but is pretty much irrelevant if you are trying to benchmark a buffer pool. Consider the following:

For the mmap version, we need to compute the address of the page and that is pretty much it. For the manual buffer pool, the list of tasks that we need to handle is long. And some of them require us to be thread safe. For example, if we handed a page to a transaction, we need to keep track of that page status as being in use. We cannot evict this page until the transaction is done with it. That means that we probably need to do atomic reference counting, which can have very high costs. There are other alternatives, of course, but they all have even higher degrees of complexity.

In practice, data access within a database isn’t actually random, even if you are doing random reads. There are pages that are going to almost always be referenced. The root page in the B+Tree is a good example. It is always going to be used. Under atomic reference counting, that page is going to be a bottleneck.

Ignoring such overhead of the buffer pool management means that you aren’t actually comparing equivalent structures. I should also point out that I’m probably forgetting a few other tasks that the buffer pool needs to manage as well, which complicate its life significantly. Here is an example of such a buffer pool implementation from what is effectively a random GitHub repository. You can see what the code is trying to do here. The reason I point to this is that there is a mutex there (and I/O under the lock), which is fairly typical for many buffer pools. And not accounting for the overhead of buffer pool management is seriously skewing the results of the paper.

All of this said, I absolutely agree that mmap can be challenging. The paper outlines 4 different problems, which I want to address.

Problem #1 – Transactional safety

A database needs to know when the data is persisted to disk. When using mmap, we explicitly give up that knowledge. That can be a challenge, but I don’t see that as a seriously different one from not using mmap. Let’s consider the manner in which Postgres is working. It has its own buffer pool, and may modify the pages as a result of a write. Postgres may need to evict modified pages to disk before the transaction that modified them is committed. The overhead of managing that is just… part of the challenge that we need to deal with.

For RavenDB, as the paper points out, we modify the pages outside of the mmap memory. This is actually not done for the reason the paper describes. I don’t actually care if the data is written to memory behind my back. What I care about is MVCC (a totally separate concern than buffer management). The fact that I’m copying the modified data to the side means that I Can support concurrent transactions with far greater ease. In a similar fashion, Postgres handles MVCC using multiple entries for the same row in the same page.

When the transaction commits and older transactions no longer need the old version of the data, I can push the data from the modified buffers to the mmap region. That tends to be fairly fast (given that I’m basically doing memcpy(), which runs at memory speed) unless I have to page data in, more on that later.

The paper mentions the issue of single writer in LMDB, and I wanted to point out that a single writer model is actually far more common (and again, not really related to the buffer pool issue). Off the top of my head, most embedded databases implement a single writer model.

LMDB
Voron (RavenDB’s storage engine)
LevelDB
Lucene

The one that I can think that doesn’t have a single writer is RocksDB(where allow_concurrent_memtable_write is for writes to the memtable, not related to file I/O).

The reason this matters is that embedded systems can typically assume that all operations in a transaction will complete as a unit. Compare to Postgres, where we may have a transaction spanning multiple network calls, interleaving writes is a must. If we could avoid such concurrency, that would be far preferable. You can get additional concurrency by having sharding writes, but that is usually not needed.

Problem #2 – I/O Stalls

The paper points out, quite correctly, that not having control over the I/O means that you may incur a page fault at any time. In particular, you may end up blocked on I/O without really noticing. This can be a killer especially if you are currently holding a lock and blocked on page fault. Indeed, I consider this to be the most serious issue that you have to deal with mmap based systems.

In practice, however, the situation isn’t so clear cut. Until quite recently, the state of asynchronous I/O on Linux was quite iffy. Until the arrival of io_uring, certain operations that you expected to be async would block occasionally, ruining your day. The paper mentions that you can use async I/O to issue I/O requests to load the next pages (non sequentially) from the disk when you are performing certain operations. You can do the same with mmap as well, and RavenDB does just that. When you start a scan on a B+Tree, RavenDB will ask the OS to ensure that the memory we are interested in is in memory before we actually get to it. On Linux, this is done with madvise(WILL_NEED) call. That call may be blocking, so we actually have a dedicated thread that is meant to handle such a scenario. In practice, this isn’t really that different from how you’ll handle it with async I/O.

Another consideration to deal with is the cost of mapping at the kernel level. I’m not talking about the I/O cost, but if you have many threads that are faulting pages, you’ll run into problems with the page table lock. We have run into that before, this is considered an OS level bug, but it obviously has an impact on the database. In practice, however, the overhead of memory management is the same in most cases. If you are reading via mmap or allocating directly, you’ll need to orchestrate things. Note that the same page table lock is also in effect if you are heavily allocating / freeing, since you’re also modifying the process page table.

Problem #3 – Error Handling

Error handling is a serious concern for a database. The paper points out that databases such as SQL Server may run a checksum when reading data from disk. When you use a buffer pool, the boundary of reading from the disk is obvious and you can easily validate the read from the disk. Voron is using mmap exclusively, and we do have checksums. We validate the page from the disk the first time that we access it (there is an internal bitmap that is used for that). There isn’t a big difference between the two anyway. We only check a given page once per run, because to do otherwise is meaningless. When you use read() to get data from the disk, you have no guarantees that the data wasn’t fetched from a cache along the way. So you may validate the data you read is “correct”, while the on disk representation is broken. For that reason, we only do the check once, instead of each time.

A far greater issue to deal with is I/O errors. What do you do when a read or a write fails? If you are using system calls to manage that, you get a return code and can react accordingly. If you are using mmap, the system will generate a SIGBUS that you’ll have to (somehow) handle.

For a database, dealing with I/O errors has a single correct answer. Crash and then run recovery from scratch. If the I/O system has returned an error, there is no longer any way to know what the state of that is. See: fsync-gate. The only way to recover is to stop, reload everything (apply the WAL, run recovery, etc) and get back into a stable state. SIGBUS isn’t my cup of tea with regards to handling this, but error handling for I/O error isn’t actually something that you do, so just restarting the process ends up more acceptable than you might initially think.

Problem #4 – Performance issues

The paper points out three common reasons for performance issues with mmap usage:

page table contention
single threaded page eviction
TLB shootdowns

The first issue is something that I have run into in the past. It was a bug in the operating system which was fixed. There is no longer a single page table in both Windows and Linux.

The single threaded eviction, on the other hand, is something that we never run into. When using Voron, we map the memory using MAP_SHARED, and most of the time, the memory isn’t dirty. If the system needs memory, it can do that when it assigns a page by just discarding the memory of an unmodified shared page. In this model, we typically see most of the memory as shared, clean. So there isn’t a lot of pressure to evict things, and it can be done on as needed basis.

The TLB shootdown issue is not something that we ever run into as a problem. We have run TB range databases on Raspberry PI with 4GB of RAM and hammered that in benchmarks (far exceeding the memory capacity). The interesting thing here is that the B+Tree nature means that the upper tiers of the tree were already in memory, so we mostly ended up with a single page fault per request. In order to actually observe the cost of TLS Shootdown in a significant manner, you need to have:

really fast I/O
working set that significantly exceeds memory
no other work that needs to be done for processing a request

In practice, if you have really fast I/O, you spent money on that, you’ll more likely get more RAM as well. And you typically need to do something with the data you read, which means that you won’t notice the TLB shootdown as much.

Finally, going back to how I started this post. This assumes that there are no other costs of not using mmap and using direct IO. The benchmark doesn’t account for those extra costs. For example, without mmap, who is doing evictions? In practice, that will lead to the same sort of considerations that you’ll have when dealing with mmap. This is especially the case with TLS shootdown when we start talking about high memory traffic (which likely modifies page allocations for the process, leading to the same scenario).

The paper has been quite interesting to read and it has presented a number of real problems that occur with mmap based systems, but I’m afraid that it doesn’t present the alternatives properly and vastly underestimates both costs and complexity of not using mmap and writing your own buffer pool.

Tweet Share Share 9 comments

Tags:

Comments

15 Jan 2022
04:25 AM

Milosz

Thank you for this whole series, it is very educational :)

The one that I can think that don’t have a single writer is RocksDB(where allow_concurrent_memtable_write is for writes to the memtable, not related to file I/O).

So despite increased complexity of multi writer model I assume there must be some are advantages causing RocksDB to select it?

Do you know maybe which model WiredTiger uses?

15 Jan 2022
11:52 AM

Adam

I don't understand why the paper have you decide to choose either mmap() or async api. My preferred solution would be a hybrid design, using readable mmap() and then write() calls to persist data. And mmap() for reading, as long as the workload is less than the amount of cpu cores or while reading mlocked() regions of memory. And then switch to async buffered io(ex io_uring) when workload exceeds that.

16 Jan 2022
08:24 AM

Oren Eini

Milosz,

The RocksDB model is very different. They are actually writing to a concurrent in-memory data structure, and then writing to WAL separately.

It makes it far easier to do concurrency, since they don't care about concurrency at the file system level.The WiredTiger model is similar to a traditional RDMBS, with page level locks.

16 Jan 2022
08:25 AM

Oren Eini

Adam,

The problem with mmap() is around reading, actually. Because a page fault will block your thread, and you can't _tell_, predict or react to those. Using mlock is not a solution, it is really small allowance, usually. And working with that gives you all the downsides of manual buffer managements, with none on the upsides.

16 Jan 2022
22:58 PM

Claudio Rodrigues

Are you finished Gavran book?

17 Jan 2022
03:44 AM

Milosz

I can totally imagine someone in future benchmarking Voron vs Gavran vs Gavran with Zig file pager

17 Jan 2022
07:21 AM

Oren Eini

Claudio,

Not yet. I'm actually thinking about re-doing that in Zig, since a lot of the facilities I had to write are already there and it is much nicer :-)

17 Jan 2022
09:04 AM

Adam

Oren: Agree, page faults are unpredictable, why I wrote "And then switch to async buffered io(ex io_uring) when workload exceeds that."

Another solution for caching would be using MADV_FREE after reading data, that way data would not evict indexes.

MADV_FREE: The  application no longer requires the pages in the range specified by addr and len.  The kernel can thus free these pages, but the freeing
              could be delayed until memory pressure occurs.

17 Jan 2022
09:26 AM

Oren Eini

Adam,

Syscalls are costly, and the fun part here is that if you are mapping a file, the OS can already freely discard the physical memory cheaply (since it can read it later).I don't think that this would be beneficial, leaving aside the issue that this means that you have to manually manage the memory again at this point. Also, check this out: https://news.ycombinator.com/item?id=23216590

This is interesting as well, but I don't think it actually got merged:https://lwn.net/Articles/590991/

The problem is fairly simple, however. If you are using mmap, you can push a lot of complexity to the kernel, and let it deal with that. If you _don't_, you have to deal with the whole mess as a unit. There isn't really a good middle ground here. There is now the IORING_OP_MADVISE option, which actually gives you the option to do async loads via mmap, which is a really exciting development.However, you may still have accidental page faults and no good way to make sure that everything is in memory.On the other hand, of course, the OS is also free to swap your heap memory as well, so that is probably roughly the same spot.

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB

reAre You Sure You Want to Use MMAP in Your Database Management System?

More posts in "re" series:

Comments

Comment preview

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed

Oren Eini

CEO of RavenDB

Related posts that you may find interesting:

More posts in "re" series:

Comments

Comment preview

Markdown formatting

Phrase Emphasis

Links

Images

Headers

Lists

Blockquotes

Horizontal Rules

Manual Line Breaks

Fenced Code Blocks

Header IDs

Tables

Definition Lists

Footnotes

Abbreviations

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication