Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 2 min | 325 words

Here is something that suddenly dawned on me as I was going through the LMDB codebase: LMDB is actually how Esent works. In fact, having gone through the LMDB codebase, I am in a much better position to understand how Esent works.

For example, Esent is limited to 16TB per database. I did some math to figure out the max size of an LMDB db that uses 32 bits for the page number, and I got 2 ** 32 times 4,096 == 16 TB; that isn’t a coincidence. The same goes for all the background information in the Esent API that talks about large values, for example. That is how LMDB does overflow, and tables are implemented as additional B-Trees (what LMDB calls dbs).

At some point I was going through this in my head and realized that indexes work in exactly the same way, and so does pretty much everything else. What Esent adds on top is better concurrency (LMDB allows only a single writer) and a whole lot of additional features (schemas, secondary indexes, etc.).

But I think that the major difference, from my point of view, is that I used to have only a theoretical idea about how Esent really worked under the covers. Now I feel like I actually grok things at a much better level. For instance, Esent will not give space back to the operating system, even if you deleted stuff. I knew that, and just accepted it as how it worked. But now I can understand why it doesn’t do that. Or the MAX_VER_PAGES setting: I knew what its purpose was, and how it impacted operations (it is the amount of changes that can be pending). But after reading about LMDB’s dirty_pages and its limit, I understand the why and the how at a much deeper level.

To be honest, I kept reading the LMDB codebase purely because I wasn’t going to let that code beat me, but it is very educational.

time to read 1 min | 189 words

MVCC stands for Multi-Version Concurrency Control. This is how you can have both readers & a writer at the same time without having to arrange locks between the two. With LMDB, MVCC is implemented by always creating a new page, never modifying a page in place. That also means that when we “free” a page, we need to make sure not to actually reuse it until all the transactions that could see it have completed.

To be honest, I can’t follow the code. It is somewhat related to me_pghead, but I just can’t follow what is going on. I think that this is related to the way it manages multiple transactions, but I am just unable to follow the code. Maybe it is just that I have overloaded my senses with too much C code; I have been diving into this code for a while, and sometimes it feels overwhelming.

That said, I understand how it has to work, so that should be enough for now. Next, I want to see how to do it myself. :)

time to read 6 min | 1063 words

Okay, so in my last post (and last night, from my perspective), I wrote about the process in which LMDB maintains the B-Tree when it grows. But what about when it shrinks? And how does LMDB handle things like reusing space?

For that matter, we need to discuss another thing. As I understand it right now, LMDB is going to write a page once and only once. And when that is done, that page is now immutable. Updates to this page will result in a new page, and the old one will be marked for reuse. So you don’t actually have to worry about updating the page while others may be reading it. In other words, I would expect this code to use a lot of pages:

[code screenshot]
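Roughly, I mean something along these lines (a minimal sketch using the public LMDB C API; the path, key name and loop count are made up, and error handling is elided):

/* Overwrite the same key, once per transaction. Naively, every commit would
   need to touch a "new" page. Assumes the ./testdb directory already exists. */
#include <string.h>
#include <lmdb.h>

int main(void)
{
    MDB_env *env;
    MDB_dbi dbi;
    MDB_txn *txn;
    MDB_val key, data;
    int i;

    mdb_env_create(&env);
    mdb_env_open(env, "./testdb", 0, 0664);

    for (i = 0; i < 1000; i++) {
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);

        key.mv_size = strlen("counter");
        key.mv_data = "counter";
        data.mv_size = sizeof(i);
        data.mv_data = &i;

        mdb_put(txn, dbi, &key, &data, 0);   /* overwrite the same value          */
        mdb_txn_commit(txn);                 /* one commit (and one copy) per loop */
    }

    mdb_env_close(env);
    return 0;
}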

Actually, I would expect it to use two pages all the time, alternating between the two after every transaction.

Indeed, following the code, it is easy to see that this magic happens during mdb_page_touch. We find (somehow, not sure how yet) a free page (or create a new one), and we mark the old one for reuse. That means that when you actually write the data, you aren’t really writing to the old page, you are creating a new one. This removes a lot of complexity. However, it does mean that the db will use more space, since every write transaction will use a minimum of one new page (and probably more, considering that there is the B-Tree to maintain). In fact, on average-sized dbs, I would be surprised if a single transaction could write less than 3 – 4 pages as a minimum. I’ll test the minimum number of pages per transaction in a bit; right now I want to focus on the implications of that decision.

Well, to start with, concurrency is easier. We aren’t writing to old data, so we have MVCC. The only thing we need to ensure is that we aren’t reusing a page before all of the transactions that can see it are complete. But it also has other implications. I haven’t confirmed my suspicions about transaction commits yet, but because we are writing to new locations, not modifying existing ones, a crash midway would simply leave us at the previous state and would not corrupt any data.

Since we don’t modify pages, free pages aren’t just something that happens when we delete data; they happen all the time, so it would be interesting to see how LMDB handles them. Here is the relevant piece of the code:

[code screenshot]

So: either find me a free page, or allocate a new one. Then add the existing page to the free list for the current transaction. A bit lower in the code we copy the data from the old page to the new one, and we are ready to go.
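In outline, the copy-on-write step looks something like this (a toy illustration with made-up types and helpers, not the actual mdb_page_touch code):

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct page { unsigned long pgno; char data[PAGE_SIZE - sizeof(unsigned long)]; } page;
typedef struct txn  { unsigned long next_pgno; /* dirty list and free list elided */ } txn;

/* Copy-on-write "touch": never modify the old page, work on a fresh copy. */
static page *touch_page(txn *t, page *old)
{
    page *copy = malloc(sizeof(page));  /* stand-in for "reuse a free page, or
                                           take a new one at the end of the file" */
    if (!copy)
        return NULL;

    memcpy(copy, old, sizeof(page));    /* start from the old content             */
    copy->pgno = t->next_pgno++;        /* the copy gets a new page number        */

    /* The old page would now go on the transaction's free list, to be handed
       out again only once no reader can still see it. */
    return copy;
}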

So, when we make updates, at least in this scenario, we are cycling between two pages, always allocating one and freeing the other, and on the next transaction we go the other way. This is really quite interesting from the perspective of comparing it to other storage engines. CouchDB works in pretty much the same way: a B-Tree with all the associated benefits. But CouchDB’s model is append-only, and it cannot reuse the space in the file, which requires compactions. LMDB’s model is much nicer in that regard. Since in both cases in-place updates are not allowed, there is a lot of space wasted just on those updates. In LMDB’s case, the waste per update is bigger, because it works in whole pages, but it can reuse them. In CouchDB’s case, space is never reused, but it also doesn’t work in whole pages, so each individual write is more space efficient.

Side note: C’s lack of native data structures beyond arrays is really annoying. You realize that in C, a Stack is a design pattern?!

And about the minimum number of pages per transaction, let us see what is going on there. So I wrote the following code:

[code screenshot]

This should generate a deep enough B-Tree to show what I want. And indeed, the action happens here:

[code screenshot]

Both the root node and the data node are modified, because of the cursor stack. It isn’t really a big deal, to be honest, since usually you would only need to update H pages (where H is the height of the tree) and H is very small for B-Trees.

But this leads me to consider something else. Usually when talking about on-disk structures, one of the more important things to keep in mind is reducing seeks. But the B-Tree model that we have here would result in pages scattered pretty much all over the file. Now, in something like CouchDB, it doesn’t actually matter, because the file is append-only, so you always end up scanning from the end of the file backward. But because LMDB reuses pages, it is entirely possible for the root page to be in the middle of the file, the next level to be at the start, and the actual data at the end.

Now, I am guessing that this isn’t actually an issue, because LMDB relies on the OS page cache to handle it. In practice, I expect the branches of the tree to always be in memory, since they are the most recently written and the most accessed.

And that is enough for now. Next, I want to figure out how we return free pages to the system. In order to do that, I think that I’ll need to look at how transactions are committed.

time to read 2 min | 386 words

I thought that I would step back a bit from focusing on what the LMDB code is doing, in favor of some observations about the code itself. Going into this codebase is like getting hit in the face with a shovel. Now, this might be my personal experience, as someone who has done a lot of managed code work in the past. But I used to be a pretty horrible C/C++ guy (the fact that I say C/C++ should tell you exactly what my level was).

But I don’t think that it is just that. Even beyond the fact that the code is C, and not C++ (which I am much more used to), there is a problem that only became clear to me well after I had read the code for the millionth time. It grew. Looking at the way the code is structured, it looks like it started out about as nice a C codebase as you can get (don’t get excited, that isn’t saying much). But over time features were added, and the general structure of the codebase wasn’t adjusted to account for that.

I am talking about things like this:

[code screenshots showing the repeated if(IS_LEAF(mp)) branching]

There are actually 22 (!) ‘if(IS_LEAF(mp))’ references in the codebase.

Or what about this?

[code screenshots]

It looks like certain features (duplicate key support, for example) were added that had a lot of implications for the code, but it wasn’t refactored accordingly. That makes it very hard to go through.

time to read 4 min | 787 words


I don’t know if you noticed, but the LMDB codebase is giving me serious headaches. The bloody thing is a very dense piece of code, but at the same time, it is also quite interesting. In particular, B-Trees are pretty much The Answer for a lot of on-disk data structures, but they tend to be quite hard to handle properly. So I am looking at this piece of code to see whether I can actually figure it out. Just to note, mdb_page_split is another 400-line method with gotos sprinkled all over.

In order to do that, I wrote the following code (in a single transaction):

[code screenshot]
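It was roughly along these lines (a sketch using the public LMDB C API rather than the exact code; the path and key format are made up, and error handling is elided):

/* Enough inserts inside a single transaction to force the lone leaf page to
   split, several times over. Assumes the ./testdb directory already exists. */
#include <stdio.h>
#include <string.h>
#include <lmdb.h>

int main(void)
{
    MDB_env *env;
    MDB_dbi dbi;
    MDB_txn *txn;
    MDB_val key, data;
    char buf[32];
    int i;

    mdb_env_create(&env);
    mdb_env_open(env, "./testdb", 0, 0664);

    mdb_txn_begin(env, NULL, 0, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);

    for (i = 0; i < 1000; i++) {         /* far more than fits in one 4KB page */
        sprintf(buf, "key-%08d", i);
        key.mv_size = strlen(buf);
        key.mv_data = buf;
        data.mv_size = sizeof(i);
        data.mv_data = &i;
        mdb_put(txn, dbi, &key, &data, 0);
    }

    mdb_txn_commit(txn);                 /* one commit, multiple page splits    */
    mdb_env_close(env);
    return 0;
}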

And then I spent another few minutes trying to get it to compile. Mostly because I started out with “for (int i=0; …” and C doesn’t allow that (blah!).

Anyway, I got the page to split, and now I want to see how it actually behaves. I am currently at the first part, where we take the root (currently a leaf) page and turn it into a branch. This is an interesting case of a root split.

We start with:

[diagram of the tree before the split]

And the first thing we do is to go to this mode:

[diagram of the tree after the root becomes a branch]

We add an empty implicit pointer to the previous root. But this is really just the beginning; now we need to divide the entries that used to be in the root between the left & right sides. This is made complex by the fact that you might have a mix of large & small values, so just cutting them in the middle would produce a result that is too big. At least, that is what a comment says. I am not sure how that can be. We are going from one page to two pages, so I don’t see how you can get into a situation where that would happen. I guess it is time to slog through more code.

Okay, I got it. The issue is that we do a page split because we need to add a new item. So that is the reason we have to jump through those hoops: we are adding a new item that doesn’t fit in the original page, which is why we are splitting it in the first place.

Another complex issue with page splits is that they might be recursive. If we need to split a page, we need to add an entry to the parent page, which might cause that page to split, etc…

An interesting observation on the sizes involved: an LMDB page has 4,084 bytes available for storage. The page number takes 8 bytes (it uses a pointer-sized page number), and assuming 16-byte keys (and about 8 bytes for the node header), we get about 128 keys per branch page. Even considering that B-Trees are specifically designed to be efficient in the presence of memory hierarchies, that is quite impressive.

Consider, assuming a full tree, that if we hold just the root and the first level in memory, we would consume about 512KB. And that would give us just one seek to get to any of ~2 million items. As an aside, one reason that I like reading this code is that for once, this is a B-Tree implementation that isn’t covered in locking code, which is how this usually works in an RDBMS.
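The back-of-the-envelope math behind that, using the numbers above (the assumption that a leaf also holds roughly 128 items is mine, just to make the arithmetic line up):

#include <stdio.h>

int main(void)
{
    const long page_size       = 4096;
    const long keys_per_branch = 128;                       /* ~32 bytes per branch entry    */

    const long cached_pages = 1 + keys_per_branch;           /* the root plus the first level */
    const long cached_bytes = cached_pages * page_size;      /* ~516 KB                       */

    const long leaf_pages = keys_per_branch * keys_per_branch;  /* leaves reachable below     */
    const long items      = leaf_pages * keys_per_branch;       /* ~2 million items           */

    printf("cached: %ld KB, items one seek away: %ld\n", cached_bytes / 1024, items);
    return 0;
}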

Another aspect that is quite interesting is that this code really shows important aspects of working with relational databases. It is all about keeping things inside the same page, with spillover to overflow pages slowing things down because you need to do another seek. Or a really obvious explanation of why page splits are such a bad thing, and a lot of other details that you learn when you go a bit deeper into relational databases but that (at least for me) had never been real before I started dealing with building databases myself.

time to read 7 min | 1203 words

As mentioned, the data in LMDB is actually stored in pages. And I am currently tracking through the process by which we add a new item to the database. The first thing that happens is that we allocate a leaf page. I think that I’ll have to go over how pages are allocated now, which should explain a lot about how LMDB reuses disk space.

A new page is allocated by first finding the oldest transaction that is still active. This is how LMDB implements MVCC. Basically, any free page that is older than any existing transaction is eligible for reuse. The first db (although a better term would be btree, or maybe table) contains the free pages, and we first search there for an available page. If we can’t find such a page, we use the end of the file instead. This happens in mdb_page_alloc.
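In outline, the reuse rule looks something like this (a toy illustration with made-up structures, not LMDB’s actual code):

#include <stddef.h>

#define MAX_READERS 16
#define MAX_FREE    64

typedef unsigned long txnid_t;
typedef unsigned long pgno_t;

struct free_entry { pgno_t pgno; txnid_t freed_by; };

struct env {
    txnid_t           readers[MAX_READERS];  /* txn id per active reader, 0 = unused */
    struct free_entry free_list[MAX_FREE];   /* pages waiting to be reused           */
    size_t            free_count;
    pgno_t            last_pgno;             /* next page at the end of the file     */
};

static txnid_t oldest_reader(const struct env *env, txnid_t current)
{
    txnid_t oldest = current;
    for (size_t i = 0; i < MAX_READERS; i++)
        if (env->readers[i] && env->readers[i] < oldest)
            oldest = env->readers[i];
    return oldest;
}

static pgno_t alloc_page(struct env *env, txnid_t current_txn)
{
    txnid_t oldest = oldest_reader(env, current_txn);

    for (size_t i = 0; i < env->free_count; i++) {
        if (env->free_list[i].freed_by < oldest) {      /* no reader can still see it */
            pgno_t pg = env->free_list[i].pgno;
            env->free_list[i] = env->free_list[--env->free_count];
            return pg;                                  /* reuse it                   */
        }
    }
    return env->last_pgno++;                            /* otherwise, grow the file   */
}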

A very interesting aspect is the fact that LMDB allows direct memory writes (with the additional risk of corrupting the db if you mess up the data). Interestingly, this is done in the following code:

[code screenshot]

If the options allow direct memory writes, you get a pointer into the mmap’ed file. Otherwise, LMDB will allocate a page (or re-use one that has already been allocated).

I am not really sure what is going on with the insert yet. This is a function pointer (a delegate, for the C# folks), and it selects which behavior will happen later on. I am just not sure what those two different functions do yet.

I think I got it, the key is here:

[code screenshot]

You can see that we use the insert method to add the mid variable to the dirty list. So if we give you a direct pointer, we just append it to the list. But if we allocate a page for you, we need to add it in sorted order.

The mdb_mid2l_insert function will add an item to the list in the appropriate location to keep it sorted. I think that this is done so that when we actually write the dirty pages to disk, if we do that using file I/O, we will do it in ascending file order, so we can get good seek times from the disk. A new page has 4,084 bytes available for data (12 bytes are taken by header data).
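The idea is simple enough that a toy version fits in a few lines (the concept, not the actual mdb_mid2l_insert code):

#include <stddef.h>

typedef unsigned long pgno_t;

struct dirty { pgno_t pgno; void *page; };

/* Keep the dirty list sorted by page number, so that flushing it with regular
   file I/O writes the pages in ascending file order. */
static void dirty_insert_sorted(struct dirty *list, size_t *count, pgno_t pgno, void *page)
{
    size_t i = *count;
    while (i > 0 && list[i - 1].pgno > pgno) {   /* shift larger entries to the right */
        list[i] = list[i - 1];
        i--;
    }
    list[i].pgno = pgno;                         /* drop the new entry into place     */
    list[i].page = page;
    (*count)++;
}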

A database is actually created the first time data is written to it. The root page is allocated and recorded. And then the data is added to the page.

As you might recall, data inside a page is kept in a sorted list. But remember that the data is also stored on the page, and there is the whole notion of overflow pages, so I think that I am going to have to draw a picture to figure out what is going on.

[diagram of the page layout]

This is more or less how it works, but I don’t think that I really get it yet. Note that there are two collections here. The first is the positions of the nodes in the page, and the second is the nodes themselves. The reason we have this complexity is that we need to maintain both of them in an efficient manner, and we need to do that without too much copying. Therefore, the list of node positions is at the beginning of the page, growing downward, and the actual nodes are at the end, growing upward. The page is full when you can’t fit anything between those two lists.
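My rough mental picture of a page, then (the layout is as described above; the exact offsets are not taken from the code):

+------------------------------------------+
| page header                              |
+------------------------------------------+
| ptr[0] | ptr[1] | ptr[2] | ...           |   <- node offsets, growing downward
|                                          |
|               (free space)               |
|                                          |
|           ... | node 2 | node 1 | node 0 |   <- the nodes themselves, growing upward
+------------------------------------------+

ptr[i] is the offset of node i within the page; the nodes (a key, and possibly data) are packed from the end of the page backwards, and the page is full when the two regions meet.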

A node, by the way, is the LMDB internal name for a key (and maybe data). But I think that I might have an error here, because I don’t see the code that actually adds nodes to the page in a sorted fashion, so it might be doing a linear search over the nodes in a page. I am now tracking through the code for adding a second value to the database, and I think that I have to be wrong about the linear node search. The code I was looking at was invoked in the new db creation section, so it might be taking shortcuts that aren’t generally available.

At any rate, the logic for adding an item goes like this: first we need to find the appropriate page, and we do this by starting from the root and asking for the actual page. That gets us to mdb_page_get, where we have the following:

[code screenshot]

The really interesting part here is that each transaction has a dirty list of pages, so if you modified a page, you’ll get your own version, even before the data is committed. Otherwise, we just hand you the current version. This is quite nice.

And then we get to the already familiar mdb_page_search_root function, and I can confirm that the nodes in a page are actually sorted, which only makes sense. I am not sure where that happens yet, but I have got the feeling that I am going to see it in a bit.

Okay, I am going to go on a bit of a rant here, mostly against C, to be honest.

[code screenshot]

Take a look at this code. mdb_page_search does something, but what it does is mutate the state of the cursor (mc), and you then access mc->mc_pg[mc->mc_top] to get the actual result you wanted. To make things more interesting, there is also mc->mc_ki, which is the index of the node inside the page. And it drove me crazy, because I just couldn’t see the association between those three values, especially because I am so not used to this type of code that I never even considered this as a possibility.

At any rate, now I know how it gets to insert things in the right order. When doing mdb_page_search, it calls mdb_node_search, which does the appropriate setup to tell the next call where it needs to actually do the insert to keep things in sorted order. I am currently in mdb_cursor_put, which is thrice cursed and really hard to follow. A 400+ line method with gotos galore.

But I am in the section where we are actually going to be writing a new value (the new_sub: label, if you care). And the code is pretty straightforward from there.

Next up, I want to see how it handles a page split, but that is a subject to another post.

time to read 4 min | 624 words

Okay, I know that I have been critical of the LMDB codebase so far. But one thing that I really want to point out is that it was pretty easy to actually get things working on Windows. It wasn’t smooth, in the sense that I had to muck around with the source a bit (hack endianness, remove a bunch of unix-specific header files, etc.). But that took less than an hour, and that was pretty much it. Since I am by no means an experienced C developer, I consider this a major win. Compare that to leveldb, which flat out wouldn’t run on Windows no matter how much time I spent trying, and this is a pleasure.

Also, stepping through the code, I am starting to get a sense of how it works that is much different from the one I had when I just read the code. It is like one of those 3D images: you suddenly see something.

The first thing that became obvious is that I totally missed the significance of the lock file. LMDB actually creates two files:

  • lock.mdb
  • data.mdb

Lock.mdb is used to synchronize access between different readers. It seems to mostly be there if you want to have multiple writers coming from different processes. That is a very interesting model for an embedded database, I have to admit. Not something that I think other embedded databases are offering. In order to do that, it creates two named mutexes (one for read and one for write).

A side note on Windows support:

LMDB supports Windows, but it is very much a 2nd class citizen. You can see it in things like a path-not-found error turning into a no-such-process error (because it tries to use GetLastError() codes as C error codes), or in the fact that it doesn’t create the directory, even though not creating it will cause a failure.

I am currently debugging through the code and fixing such issues as I go along (but these are heavy-handed magic fixes, just to get past this stage to the next one; otherwise I would have sent a pull request).

Here is one such example. Here is the original code:

[code screenshot]

But ReadFile in Win32 will return false if the file is empty, so you actually need to write something like this to make the code work:

[code screenshot]
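From memory, the shape of the change is something like this (not the exact patch; the wrapper function is just for illustration):

#include <windows.h>
#include <errno.h>

/* A brand new, empty data file must be treated as "no header yet", whether
   ReadFile reports a failure or simply returns zero bytes. */
static int read_header(HANDLE fd, void *buf, DWORD len)
{
    DWORD bytes_read = 0;

    if (!ReadFile(fd, buf, len, &bytes_read, NULL))
        return (int)GetLastError();   /* a real I/O error                   */

    if (bytes_read == 0)
        return ENOENT;                /* empty file: no header written yet  */

    return 0;
}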

Past that hurdle, I think that I understand a lot more about how LMDB works than I did before.

Let us start with the way data.mdb works. It is important to note that for pretty much everything in LMDB we use the system page size. By default, that is 4KB.

The data file starts with 2 pages allocated. Those pages contain the following information:

[screenshot of the first two pages]

Looking back at how CouchDB did things, I am pretty sure that those two pages are going to be pretty important. I am guessing that they will always contain the root of the data in the file. There is also the last transaction on them, which is what I imagine determines how something gets committed. I don’t know yet; as I said, I am guessing based on how CouchDB works.

I’ll continue this review in another time. Next time, transactions…

time to read 5 min | 842 words

Okay, I now have a pretty rough idea about how the codebase actually works. I still think that the codebase is quite ugly. For example, take a look at this:

[code screenshot]

The len parameter for CreateFile is reused to say whether to open-or-create the file or to just open it (read only). But why is it in a parameter called len?

Probably because it was just there, and it would be a sin to create another local variable just for this, I guess (in a codebase where a single method can have > 18 local variables!). To make things more interesting, in the rest of this method, this is actually used as a string length variable. Sigh.

At any rate, let us actually dig deeper now. The following structure holds the data about a db.

[code screenshot]

This is actually somewhat misleading, at least with regards to how I would think about a db. This is the entry point for all the pages that belong to a specific db. But a db in LMDB is not really the same thing as a db in SQL Server or RavenDB. They all reside in the same file, and you always have at least two. The first one is the free db, which is used to track all the free pages. The second one is the main db. Then you have additional, named databases.

This is used here:

[code screenshot]

This defines the metadata for the entire environment. Note that we have the two dbs there in mm_dbs. The mm_txnid denotes the last committed transaction id; this value is what gives LMDB its MVCC support. The mm_last_pg value is the last used page; any transaction that wants to write will start writing at that value.

Let us see how we deal with pages here, shall we?

[code screenshot]

The first part tries to find a dirty page, if we are in a read/write transaction and we haven’t specified that we can write directly to memory. This is done by doing a binary search on the list of dirty pages.

Otherwise, we can just hand the user the actual page by accessing it directly.
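In outline, the lookup order is something like this (a toy version of the idea, with made-up structures, not the real mdb_page_get):

#include <stddef.h>

typedef unsigned long pgno_t;

struct dirty { pgno_t pgno; void *page; };

struct txn {
    int           writable;
    struct dirty *dirty;        /* this transaction's modified pages, sorted by pgno */
    size_t        dirty_count;
    char         *map;          /* base address of the mmap'ed data file             */
    size_t        page_size;
};

static void *get_page(struct txn *txn, pgno_t pgno)
{
    if (txn->writable) {
        /* binary search the sorted dirty list for our own modified copy */
        size_t lo = 0, hi = txn->dirty_count;
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (txn->dirty[mid].pgno == pgno)
                return txn->dirty[mid].page;
            if (txn->dirty[mid].pgno < pgno)
                lo = mid + 1;
            else
                hi = mid;
        }
    }
    /* not dirty (or a read-only transaction): hand back the mapped page directly */
    return txn->map + pgno * txn->page_size;
}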

Next, let us look at where this is actually used: searching for a page with a specific key in it. This is done mostly in mdb_node_search.

[code screenshot]

This seems to be doing a binary search for the keys inside a specific page (in this case, the page that is at the top of the stack on the cursor). That leads to the conclusion that pages internally store their data as sorted arrays.

And this leads me to another pet peeve with this code base. Take a look at this line:

[code screenshot]

Sure, this is a well known trick to cheaply divide a number by two, but are you seriously telling me that the compiler isn’t going to optimize (low + high) / 2? To my knowledge, no C compiler updated in the last 10 – 15 years has missed this optimization. So why write code that is harder to read?

Okay, so now we know how we search for a specific key inside a page, but how do we get to the actual page that we want to search in? This happens in mdb_page_search_root. Let me see if I can take it apart.

When this method is called, the cursor is set up so that the first page on the page stack is the root page.

And… that is enough for me. Up until now, I have been trying to just read the code. Not execute it, not debug through it, nothing. Just going over the code one line at a time and figuring out what is going on. I actually think that I have a really good grasp of what is going on in there, but I think that this is pretty much all I can do at this point from just reading the code. So I am going to stop now, set up a debug environment so I can work with it, and report my findings from stepping through the code.

time to read 3 min | 547 words

As it stands, the World’s Smallest NoSQL Database will last only as long as you actually have power. The moment that you have a restart, all the data is gone. That is actually how several databases run, but for now, we are going to assume that this is not desirable. The question now becomes: how do you actually store the data on disk?

This turns out to be a pretty complex question, because you need to store the data on disk in a way that is crash safe, allows updates, and doesn’t take all the disk space in the world. Before we get to the actual on-disk data structures, we need to discuss how we implement persistent logs. Persistent logs are the key way that databases get Durability. And as it turns out, there are just two ways of doing that that I am aware of:

  • Append only
  • Transaction log

Append-only models rely on the fact that you only ever append, to create a safe way to ensure that the data is either all in or all out. When we write, we don’t overwrite values; we create new values, and then we write out where the last log entry is located. In CouchDB, for example, this is done by modifying the header portion of the file to point to the new data. In LMDB this is done by updating the page listing with the new pages. In both cases, you actually commit a transaction only when the on-disk pointer is changed to point to the new data. If you crash midway, nothing bad happened: you didn’t update the on-disk pointer, and it is still pointing at the old location. Worst case scenario, you wasted some disk space, and you probably have a way to reclaim that anyway.
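As a sketch (not taken from either codebase, just the general shape of the rule), an append-only commit looks something like this:

#include <unistd.h>
#include <sys/types.h>

struct header { off_t root_offset; };    /* lives at a fixed position in the file */

static int commit(int fd, const void *new_data, size_t len, off_t data_offset)
{
    struct header h;

    /* 1. Append the new data; it is unreachable until the header points to it. */
    if (pwrite(fd, new_data, len, data_offset) != (ssize_t)len)
        return -1;
    if (fsync(fd) != 0)                   /* make the new data durable first */
        return -1;

    /* 2. Only now update the on-disk pointer. This single write is the commit;
          a crash before it leaves the old pointer, and the old state, intact. */
    h.root_offset = data_offset;
    if (pwrite(fd, &h, sizeof(h), 0) != (ssize_t)sizeof(h))
        return -1;
    return fsync(fd);
}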

Transaction logs use a different way to handle this. They are also usually implemented as append-only files, into which we write the new data. Afterward, we can apply those changes in memory / on disk safely. A crash would simply mean having to replay the log. This is how leveldb, for example, works. This is also the essential idea behind how SQL Server and Oracle work, although in their case I believe that the transaction log usually contains both the previous and the new state of the pages changed by a specific transaction.

One thing that you have to be aware of, either way, is that you should be prepared to handle scenarios where your process crashed midway, or when your entire machine had the plug pulled out. That means using fsync, sure, but it also means that you might have to do log replay, or be ready to see if you can recover something from the append only model.

The good thing about the transaction log approach is that after you have committed the changes to the standard persistent format, you can clear the log (usually by just creating a new file and deleting the old one). With the append-only model, you usually have to run some sort of compaction to clear things out. Note that the choice between the transaction log model and the append-only model doesn’t really say anything about how the rest of the persistent data is actually stored. We will touch on that in the next post.

time to read 8 min | 1539 words

Tobi had a few questions about memory mapped files. And it is quite an interesting topic.

A memory mapped file is a feature of all modern operating systems. It requires coordination between the memory manager and the I/O subsystem. Basically, you can tell the OS that some file is the backing store for a certain portion of the process memory. In order to understand that, we have to understand virtual memory.

Here is your physical memory, using 32 bits for ease of use:

[diagram: physical memory and two process address spaces]

Now, you have two processes that are running. Each of them gets its own 4 GB address space (actually, only 2 GB is available to the process in 32 bits, but that is good enough). What happens when both of those processes obtain a pointer to 0x0232194?

Well, what actually happens is that the pointer that looks like a pointer to physical memory is actually a virtual pointer, which the CPU and the OS work together to map to physical memory. It is obvious from the image that there is a problem here: what happens if two processes use 4GB each? There isn’t enough physical memory to handle that. This is the point where the page file gets involved. So far, this is pretty basic OS 101 stuff.

The next part is where it gets interesting. The OS already knows how to evict pages from memory and store them on the file system, because it needs to do that for the page file. The next step is to make use of that for more than just the page file. So you can map any file into your memory space. Once that is done, you can access the part of the memory you have mapped, and the OS will load the relevant parts of the file into memory. Again, pretty basic stuff so far.

You can read about this more in this article.

The reason you want to do this sort of thing is that it gives you the ability to work with files as if they were memory. And you don’t have to worry about all that pesky file I/O stuff. The OS will take care of that for you.

And now, to Tobi’s questions:

  • What are the most important advantages and disadvantages?

Probably the most important advantage is that you don’t have to do manual file I/O. That can drastically reduce the amount of work you have to do, and it can also give you a pretty significant perf boost. Because the I/O is actually being managed by the operating system, you benefit from a lot of accumulated experience in optimizing things, and the OS will do things like give you a page buffer, caching, preloading, etc. It also makes it drastically easier to do parallel I/O safely, since you can read/write from “memory” concurrently without having to deal with a complex API.

The disadvantages: you need to be aware that things like the base address will change whenever you re-open the file, and that data structures that are good in memory might not perform well if they are stored on disk as-is.

Another problem relates to how the changes actually are saved to the file. It is hard to make sure that the writes you do are coherently stored in the file system. For example, let us say that you made changes to two different memory locations, which reside on two different pages. The OS can decide, at any time, to take one of those pages away from you because it needs that memory for other things. When that happens, it will write that page to disk for you. So when you ask for it the next time, it can load it up again and the application won’t notice.

However, what would happen during a crash? You might have partially written data in that scenario. Usually, when writing to files using file I/O routines, you have a very predictable write pattern. When you use memory mapped files for writes, you don’t know for sure in what order the writes are going to happen; the OS is free to choose that. There are ways you can control this. For example, you might use FILE_MAP_COPY to avoid the OS writing stuff for you, but then you would have to handle the writes yourself, and that is decidedly not trivial.

You can use something like FlushViewOfFile() to make sure that a specific range is flushed, but that still doesn’t mean the pages will be written in any order that you might expect. As a db writer, I care, quite deeply, about the exact way the data is written to the file, because otherwise it is really hard to reason about how to read it back, especially when you have to consider failures and corruptions.

Personally, I would be wary of using memory mapped files for writing data that I consider critical.

  • Why are well known products like SQL Server using custom caches instead of memory mappings?

SQL Server is actually its own operating system. It manages memory, locks, threads, I/O and a lot more. That comes from the fact that for a long time, it had to do those sorts of things itself. But note that SQL Server probably uses memory mapped files quite a lot. It also uses its own custom caches because they make sense for it; the query plan cache, for example. Even when you do have memory mapped files, you usually have multiple layers of caching.

In RavenDB, for example, we have the native page cache (managed by Esent), the documents cache (managed by RavenDB) and the client cache. The reason that we have multiple layers is that we cache different things. The client cache avoids having to call the server, the documents cache avoids having to parse documents, and the native page cache avoids having to go to disk.

  • Why are other products like LevelDB using mmap instead of custom caches?

Because they are drastically simpler products. They really want to just give you the data as quickly as possible, and they don’t need to do additional processing of the data, so they can lean on the OS page cache to a much larger extent. Note that when used in real products, there are often higher level caches in place anyway, used for storing processed / parsed information.

  • Are they well suited for managed code?

Memory Mapped Files are usually harder to use from managed code, because we don’t do our own memory management. It does mean that you lose the benefit of just treating this as part of your memory space, because there is a very clear line between managed memory and native memory, and memory mapped files are on the other side of the hatch. That said, you can easily use them via UnmanagedMemoryStream, or read from them directly via a memory accessor. The managed implementation of leveldb makes heavy use of memory mapped files.

  • How does performance compare with other techniques?

It is generally better, mostly because the OS is really good at managing paging, and because you rely directly on the low level I/O routines. Another thing to remember is that if you use file I/O, you need to create a buffer and copy to/from that buffer, etc. Using memory mapped files saves all of that.

  • Is it possible to use them with mutable data structures and retain crash recoverability?

Yes, but… is probably the best answer. Yes, you can use them for mutable data, but you really want to be careful about how you do it. In particular, you need to make sure that you write in such a way that you can survive partial writes. A good way of doing that is to make the writes to specific pages, and to “commit” by recording that those pages are now available on a metadata page, or something like that. This requires a lot of really careful work, to be honest. Probably more work than you would like it to be. LMDB does it this way, and even if the code weren’t an eye-tearing pain, what it is doing is quite hard.

Note that in order to actually be sure that you are flushing to disk, you have to call both FlushViewOfFile and FlushFileBuffers on Windows.
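Something along these lines (the wrapper is only for illustration; FlushViewOfFile and FlushFileBuffers are the real Win32 calls):

#include <windows.h>

/* FlushViewOfFile pushes the dirty mapped pages to the file system;
   FlushFileBuffers then pushes them from the file system cache to the device. */
static BOOL flush_mapped_range(HANDLE file, void *base, SIZE_T length)
{
    if (!FlushViewOfFile(base, length))
        return FALSE;
    return FlushFileBuffers(file);
}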

  • What guarantees does the OS make regarding ordering and durability?

Pretty much none regarding ordering, as noted. Windows will guarantee that if both FlushViewOfFile and FlushFileBuffers  have been called and successfully completed, you are golden. But there aren’t any promises in the API about what will happen for partway failures, or in what order this happens.

To summarize:

Memory mapped files are a great tool, and for reads, they are excellent. For writes, they are awesome too, but since there is no way to ensure in what order dirty pages get written to disk, it is hard to build a reliable system using them.

I would rather use standard file I/O for writes, since that is far more predictable.
