Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 4 min | 688 words

The posts about our interviews seems to have strike a nerve in some people. So I thought I would respond here to all the comments at once.

Some stats about our hiring process:

  • We worked with 2 recruitment agencies, looking for a 5+ years experience in C#/.NET, Server Side work, OSS experience / NoSQL experience, multi threading, TCP/IP, REST…
  • We got 23 CVs to review, and did phone interviews for most of them.
  • We actually interviewed 9 candidates, had 3 potential hires and made an offer to 2 of them.

Unprofessional – I don’t know about you, but a large part of running a business is finding good people. And having people that are severely under qualified show up is a problem. In the same way that I talk about problems in code or silly patterns or horrible processes, the part about hiring people is just as important. I am quite careful about not giving away any details, but that is about it. I’ve had people comment that I am gloating. But nothing could be further from the truth, if anything, those are mournful posts. Because if that is the state of the industry…

Those are insulting questions – a few people said that they thought that the questions were insulting. Look at the numbers, we had two thirds of the people fail to answer them. I had a candidate with 13+ years of experience who looked quite promising get up and leave mid interview because the candidate couldn’t solve a string sort problem with full internet access. I had people submit code that made my eyes bleed.  And I think that people don’t understand something about those questions. The coding questions won’t get you hired. The coding questions are there to keep you from getting hired. The assumption is that you should be able to solve those questions blindfolded while dancing the cha-cha and typing with your left earlobe.

Due to the large number of people who can talk the talk but not walk the walk, I am assuming all candidates incompetent until proven otherwise. And yes, I absolutely agree that this is a sad state of affairs.

Those are not relevant questions – “In these days on the Internet, I'm not at all sure I do care someone knows how to sort a string.” Well, we are building databases, we actually care quite a lot about things like sorting. But leaving that aside, we don’t actually care how you do the sorting. Most candidate uses List<T>.Sort, and that was fine. But I think that the key point is, if you have internet access and an hour, and you can’t implement a sort algorithm, you don’t get to work here.

You should be mentoring them, not discouraging them – I think that someone just missed a turn in the road to reality. I didn’t show up at a Teach Kids to Program and laughed at their inability to do three stars code. I was conducting interviews for a senior position. That meant that I was limiting candidates to 5+ years experience. Actually going through the interview process cost us. A lot. It effectively shut us down for a day or two, and took about three full days just to do the interviews.  There is a serious cost here.

And having people that are drastically unqualified show up and take our time is a very costly waste. Look at the number above. Out of 23 people, I posted about 3, because they crossed all the limits that I can think of. In contrast, we had a bunch of other people we talked to or even interviewed that didn’t make the cut for being potential hires. They weren’t bad by any means, but they just weren’t what we were looking for.

I am not sorry, if you don’t show up ready to answer stuff that a first year student should be able to, you are cheating. Certainly me, and probably yourself. At that point, I’m going to stop wasting my time and move on to more productive avenues.

time to read 7 min | 1203 words

As mentioned, the data in LMDB is actually stored in page. And I am currently tracking through the process in which we add a new item to the database. The first thing that happen is that we allocate a leaf page. I think that I’ll have to go over how pages are allocated now, which should explain a lot about how LMDB reuses disk space.

A new page is allocated by first finding the oldest transaction that is still active. This is how LMDB implements MVCC. Basically, any free page that is older than any existing transaction is eligible for reuse. The first db (although a better term would be to call it btree, or maybe a table) contains the free pages, and at first we search there for an available page. If we can’t find such a page, we use the end of the file for that. This happens in mdb_page_alloc.

An very interesting aspect is the fact that LMDB allows direct memory writes (with the additional risk of corrupting the db if you messed the data), interestingly, this is done in the following code:

image

If the options allows directly memory writes, you get a point to the mmap file. Otherwise, LMDB will allocate a page (or re-use one that has already been allocated).

I am not really sure what is going on with the insert yet. This is a function pointer (delegate for C#). And it select which behavior will happen later on. I am just not sure what those two different function do yet.

I think I got it, the key is here:

image

You can see that we use the insert method to add the mid variable to the dirty list. So if we give you a direct pointer, we append it to the list. But if we give you allocate a page, we need to add it in order.

The mdb_mid2l_insert will add an item to the list in the appropriate location to ensure sorting. I think that this is done so when we actually write the dirty page to disk, if we do that using file I/O, we will do that in ascending file order, so we can get good seek times from the disk. A new page has 4,084 bytes available for it (12 bytes are taken by header data).

A database is actually created the first time data is written to it. The root page is allocated and recorded. And then the data is added to the page.

As you might recall, data inside a page is kept in a sorted list. But remember that the data is also stored on the page, and there is the whole notion of overflow pages, so I think that I am going to have to draw a picture to figure out what is going on.

image

This is more or less how it works. But I don’t think that I really get it. Note that there are two collections here. First is the positions of nodes in the page, and the second is the node themselves. The reason we have this complexity is that we need to maintain both of them in an efficient manner. And we need to do that without too much copying. Therefor, the list of nodes in the page is in the beginning, growing downward, and the actual nodes are in the end, growing upward. The page is full when you can’t fit anything between those two lists.

A node, by the way, is the LMDB internal name for a key (and maybe data). But I think that I might have an error here, because I don’t see the code here to actually add nodes to the page in a sorted fashion, so it might be doing linear search on the content of a node. I am not tracking through the code for adding a second value into a database. And I think that I have to be wrong about linear node search. The code I was looking at was invoked at the the new db creation section, so it might be taking shortcuts that aren’t generally available.

At any rate, the logic for adding an item goes like this… first we need to find the appropriate page. And we do this by starting from the root and asking for the actual page. Which gets us to mdb_page_get, where we have the following:

image

The really interesting part here is that each transaction have a dirty list of pages, so if you modified a page, you’ll get your own version, even before the data was committed. Otherwise, we will just hand you the current version. This is quite nice.

And then we get to the already familiar mdb_page_search_root function, and I can confirm that the nodes in a page are actually sorted, which only make sense. I am not sure where that happens yet, but I have got the feeling that I am going to see that in a bit.

Okay, I am going to go on a bit of a rant here, mostly against C, to be honest.

image

Take a look at this code. mdb_page_search does something, but what it does it mutate the state for the cursor (mc), and you are then going to access the mc->mc_pg[mc->mc_top] to get the actual result you wanted. To make things more interesting, the is also a mc->mc_ki, which is the index of the node inside the node. And it drove me crazy, because I just couldn’t see the association between those three values, especially because I am so not used to this type of code that I never even considered this as a possibility.

At any rate, now I know how it gets to insert things in the right order. When doing mdb_page_search, it calls to mdb_node_search, which does the appropriate setup to tell the next call where it need to actually do the insert to get things in a sorted fashion. I am currently in the mdb_cursor_put, which is trice cursed and really hard to follow. A 400+ lines method with gotos galore.

But I am in the section where we are actually going to be writing a new value. (new_sub: goto header, if you care). And the code is pretty straight forward from there.

Next up, I want to see how it handles a page split, but that is a subject to another post.

time to read 3 min | 413 words

So, I just finished interviewing a candidate. His CV states that he has been working professionally for about 6 years or so. The initial interview was pretty well, and the candidate was able to talk well about his past experience. I tend to do a generic “who are you?” section, then give them a couple of questions to solve in front of Visual Studio, an architecture question and then a set of technical questions that test how much the candidate knows.

Mostly, I am looking to get an impression about the candidate, since that is all I usually have a chance to do in the span of the interview. The following is a section from the code exercise that this candidate has completed:

for (int i = 0; i < sortedArrLst.Count; i++)
{
    if (sortedArrLst[i].Contains(escapeSrt[0]))
    {
        if (sortedArrLst[i].IndexOf(escapeSrt[0]) == 0)
        {
            sortedArrLst[i] = sortedArrLst[i].Remove(0, escapeSrt[0].Length+1);
            escapeStrDic.Add(sortedArrLst[i], escapeSrt[0]);
        }
        
    }
    if (sortedArrLst[i].Contains(escapeSrt[1]))
    {
        if (sortedArrLst[i].IndexOf(escapeSrt[1]) == 0)
        {
            sortedArrLst[i] = sortedArrLst[i].Remove(0, escapeSrt[1].Length+1);
            escapeStrDic.Add(sortedArrLst[i], escapeSrt[1]);
        }
    }
}

Thank you, failure to use loops will get your disqualified from working at us.

Then there were the gems such as “mutex is a kind of state machine” and “binary search trees are about recursion” or the “I’ll use perfmon to solve a high CPU usage problem in production”.

Then again, the next candidate after that was quite good. Only 4 – 6 to go now.

time to read 4 min | 747 words

It is that time again, we are looking for more developers. And this time I ended up so pissed after an interview I had to call a sick colleague just to vent.

One candidate I ruled out early during the interview process. It was a somewhat sinking sensation in the pit of my stomach as I spoke with the candidate, and I couldn’t get a single actually technical description about what the candidate is actually doing now. A lot of broad descriptions, and a lot of sweeping statements, but no real technical details. But the candidate did know jQuery mobile back & forth, it appears.

The decision was made final when I asked the candidate what web framework they were using. I asked whatever they were using ASP.NET WebForms, ASP.NET MVC or ASP.Net Web API. Note that from my perspective, it is a list in a ascending worth order, and you you are using something not on it, that is a plus (it means you aren’t just using whatever is available, which is nice). So I was quite excited when the candidate said (confusedly) “none of them”. Then it took me putting on my investigative hat and asking a lot of questions about how they are actually doing things before it finally came out that they were doing ASPX.

Not knowing the name of the environment in which you are working with for the past several years… I am not sure what to call it.

At least the candidate was able to let me know how they were using that in great detail. We do stuff in Page_Init, then we have  a method that load the data from the database and put it in the ViewState or the Session, then we bind it to a grid, and most of the code is in the grid event handlers. I am sure that the candidate is a great web developer, but I would rather that this particular candidate be great at another location.

The second candidate actually passed our phone screen and was invited to an interview. We have a fairly basic interview process. Some background information for both sides, then a few questions that you need to solve in Visual Studio (and yes, you have full MSDN & Google access) and then a technical portion of the interview that include a system design and a more detailed set of technical knowledge questions.

Now, I am sure that you have heard about interesting interview questions like sort a 100 GB file in a machine 32 bits with 512MB of memory. I admit that something like that would be challenging and interesting. I would probably quite enjoy seeing how people deal with that.

That is not the type of questions that we ask. I asked for a “sort these strings” and a “calculate this tax” programs. I gave the candidate about an hour and a half, alone in a room with VS and internet connection. I am not even asking to implement your own sort, just customize the comparison function and run the standard .NET sort. And do some basic math. The candidate was unable to finish either problem on the time allotted. Now, to be fair for the candidate, the way he solved the first problem was correct and much better than many other attempts that I have seen. Note that the issue was an IComparable<string> instead of IComparable<Item> that caused the issue. It is subtle and something that I would expect an newbie to catch. But this candidate came with full 6 years of experience.

Okay, I said to myself, it is natural to be nervous when doing interviews, let us see what the candidate knows beyond that. The CV mentioned that work with mutli threading. So I began with some questions about that. But it appears that “I only know Thread and BackgroundWorker”. But the absolute clincher was when I asked the candidate about what would cause a high CPU problem and how to diagnose that: “Well, I think that there are special tools that will tell you which process is using the CPU…”

Special tools? Well, I guess Task Manager can be called special, but I am not really sure that I would call it that .And if your experience in troubleshooting stuff never go to the point where you actually look at Task Manager to see what is going on… it probably means that you don’t have meaningful experience in actually troubleshooting stuff.

time to read 4 min | 624 words

Okay, I know that I have been critical about the LMDB codebase so far. But one thing that I really want to point out for it is that it was pretty easy to actually get things working on Windows. It wasn’t smooth, in the sense that I had to muck around with the source a bit (hack endianess, remove a bunch of unix specific header files, etc). But that took less than an hour, and it was pretty much it. Since I am by no means an experienced C developer, I consider this a major win. Compare that to leveldb, which flat out won’t run on Windows no matter how much time I spent trying, and it is a pleasure.

Also, stepping through the code I am starting to get a sense of how it works that is much different than the one I had when I just read the code. It is like one of those 3D images, you suddenly see something.

The first thing that became obvious is that I totally missed the significance of the lock file. LMDB actually create two files:

  • lock.mdb
  • data.mdb

Lock.mdb is used to synchronized data between different readers. It seems to mostly be there if you want to have multiple writers using different processes. That is a very interesting model for an embedded database, I’ve to admit. Not something that I think other embedded databases are offering. In order to do that, it create two named mutexes (one for read and one for write).

A side note on Windows support:

LMDB supports Windows, but it is very much a 2nd class citizen. You can see it in things like path not found error turning into a no such process error (because it try to use GetLastError() codes as C codes), or when it doesn’t create a directory even though not creating it would fail.

I am currently debugging through the code and fixing such issues as I go along (but no, I am doing heavy handed magic fixes, just to get past this stage to the next one, otherwise I would have sent a pull request).

Here is one such example. Here is the original code:

image

But ReadFile in Win32 will return false if the file is empty, so you actually need to write something like this to make the code work:

image

Past that hurdle, I think that I get a lot more about what is going on with the way LMDB works than before.

Let us start with the way data.mdb works. It is important to note that for pretty much everything in LMDB we use the system page size. By default, that is 4KB.

The data file starts with 2 pages allocated. Those page contain the following information:

image

Looking back at how CouchDB did things, I am pretty sure that those two pages are going to be pretty important. I am guess that they would always contain the root of the data in the file. There is also the last transaction on them, which is what I imagine determine how something gets committed. I don’t know yet, as I said, guessing based on how CouchDB works.

I’ll continue this review in another time. Next time, transactions…

time to read 5 min | 842 words

Okay, I now have a pretty rough idea about how the codebase actually works. I still think that the codebase is quite ugly. For example, take a look at this:image

The len parameter for CreateFile is whatever to open or create or just open (read only). But why is it in a parameter called len?

Probably because it was just there, and it would be a sin to create another local variable just for this, I guess (in a codebase where a method had > 18 local variables!).  To make things more interesting, in the rest of this method, this is actually a string len variable, sigh.

At any rate, let us actually dig deeper now. The following structure is holding data about a db.

image

This is actually somewhat misleading, at least with regards to how I would think about a db. This is the entry point for all the pages that belong to a specific db. But a db in LMDB is not really the same thing as a db in SQL Server or RavenDB. It all reside in the same file, and you always have at least two. The first one is the free db, which is used to track all the free pages. The second one is the main db. Then you have additional, named databases.

This is used here:

image

This define the metadata for the entire environment. Note that we have the two dbs there in mm_dbs. The mm_txnid denotes the last committed transaction id. This value is what gives LMDB its MVCC support.  The mm_last_pg value is the last used page, any transaction that wants to write will start writing at that value.

Let us see how we deal with pages here, shall we?

image

The first part find try to find a dirty page if we are in a read/write transaction and we haven’t specify that we can write directly to memory. This is done by doing a binary search on the list of dirty pages.

Otherwise, we can just hand the user the actual page by accessing it directly.

Next, let us look where this is actually used. Searching for a page with a specific key in it. This is done mostly in mdb_node_search.

image

This seems to be doing a binary search for the keys inside a specific page (in this case, the page that is at the top of the stack on the cursor). That leads to the conclusion that pages internally have data internally stored as sorted arrays.

And this leads me to another pet peeve with this code base. Take a look at this line:

image

Sure, this is a well known trick to cheaply divide  a number by half, but are you seriously telling me that the compiler isn’t going to optimize  (low + high) / 2 ? To my knowledge, no C compiler updated in the last 10 – 15 years managed to miss this optimization. So why write code that is going to be harder to read?

Okay, so now we know how we search for a specific key inside a page, but how do we get to the actual page that we want to search on? This happens on mdb_page_search_root. Let me see if I can take it apart.

When this method is called, the cursor is setup so the first page on the pages stack is the root page. 

And… that is enough for me. Up until now, I have been trying to just read the code. Not execute it, not debug through it, nothing .Just go over the code one line at a time and figure out what is going on. I actually think that I have a really good grasp about what is going on in there, but I think that this is pretty much all I can do at that point from just reading the code. So I am going to stop now and setup an debug environment so I can work with it, and report my finding from stepping through the code.

time to read 1 min | 157 words

So, our managed implementation of leveldb is just about ready to go out and socialize. Note that those are just for a relatively short duration, but they give us good indicator of where we are. We are still running longer term perf tests now. Also note that they are early numbers. We did performance work, but it is still not done.

The following tests were done on a HDD, and all include writing a million records (16 bytes key, 100 bytes values) to storage.

  • Writing 1 million sequential keys - 52,152 op/s
  • Writing 1 million random keys -  11,986 op/s
  • Writing 1 million sequential keys with fsync - 17,225 op/s

And now, for the reads portion:

  • Sequential reads - 104,620 op/s
  • Reverse sequential reads - 57,932 op/s
  • Random reads - 3,191 op/s

Note that I am pretty sure that the reason for the later performance is that it is using an HDD, instead of SSD.

time to read 3 min | 547 words

As it stand the World’s Smallest No  SQL Database will last only as long as you actually have power. The moment that you have a restart, all the data is gone. The is actually how several databases are running, but for now, we are going to assume that this is not desirable. The question now becomes, how do you actually store the data on disk?

This really becomes a pretty complex question, because you need to store the data on disk in a way that is crash safe, allow updates, and doesn’t take all the disk space in the world. Before we will get to the actual on disk data structures, we need to discuss how we implement persistent logs. Persistent logs are the key way that databases gets Durability. And as it turned out, there are just two ways of doing that that I am aware of:

  • Append only
  • Transaction log

Append only models rely on the fact that you only append to create a safe way to ensure that the data is either all in or all out. When we write, we don’t overwrite values, we are creating new values, and then we write where the last log entry is located. In CouchDB, for example, this is done by modifying the header portion of the file to point to the new data. In LMDB this is done by updating the page listing with the new pages. In both cases, you actually commit a transaction when the on disk pointer is changed to point to the new data. If you crash midway, nothing bad happened, you didn’t update the on disk pointer, it is still pointing at the old location. Worst case scenario, you wasted some disk space, and you probably have a way to reclaim that anyway.

Transaction logs use a different way to handle this. They are also usually implemented as append only files, into which we write the new data. Afterward, we can apply those changes in memory / on disk safely. A crash would simply mean having to replay the log. This is how leveldb, for example, works. This is also the essential idea behind how SQL Server Oracle works. Although in their case I believe that the transaction log usually contain both prev/new state of the pages they changed for a specific transaction.

One thing that you have to be aware of, either way, is that you should be prepared to handle scenarios where your process crashed midway, or when your entire machine had the plug pulled out. That means using fsync, sure, but it also means that you might have to do log replay, or be ready to see if you can recover something from the append only model.

The good thing about the transaction log approach is that after you have committed the changes to the standard persistent format, you can clear it (usually by just creating a new file and deleting the old one). With the append only model, you usually have to run some sort of compaction to clear things out. Note that the transaction log model vs append only model doesn’t really mean about how the rest of the persistent data is actually stored. We will touch on that on the next post.

time to read 8 min | 1539 words

Tobi had a few questions about memory mapped files. And it is quite an interesting topic.

A memory mapped file is a feature for all modern operating system. It require coordination between the memory manager and the I/O subsystem. Basically, you can tell the OS that some file is the backing store for a certain portion of the process memory. In order to understand that, we have to understand virtual memory.

Here is your physical memory, using 32 bits for ease of use:

image

Now, you have two processes that are running. Each of them get their own 4 GB address space (actually, only 2 GB is available to the process in 32 bits, but that is good enough). What happen when both of those processes obtain a pointer to 0x0232194?

Well, what actually happens is that the pointer that looks like a pointer to physical memory is actually a virtual pointer, which the CPU and the OS work together to map to physical memory. It is obvious from the image that there is a problem here, what happens if two processes uses 4GB each? There isn’t enough physical memory to handle that. This is the point were the page file gets involved. So far, this is pretty basic OS 101 stuff.

The next stuff is where it gets interesting. So the OS already knows how to evict pages from memory and store them on the file system, because it needs to do that for the page file. The next step is to make use of that for more than just the page file. So you can map any file into your memory space. Once that is done, you can access the part of the memory you have mapped and the OS will load the relevant parts of the file to memory. Again, pretty basic stuff so far.

You can read about this more in this article.

The reason you want to do this sort of thing is that it gives you the ability to work with files as if it was memory. And you don’t have to worry about all that pesky file I/O stuff. The OS will take care of that for you.

And now, to Tobi’s questions:

  • What are the most important advantages and disadvantages?

Probably the most important is that you don’t have to do manual file I/O. That can drastically reduce the amount of work you have to do, and it can also give you a pretty significant perf boost. Because the I/O is actually being managed by the operation system, you gain a lot of experience in optimizing things. And the OS will do things like give you a page buffer, caching, preloading, etc.  It also make it drastically easier to do parallel I/O safely, since you can read/write from “memory” concurrently without having to deal with complex API.

The disadvantages you need to be aware that things like the base addresses would change whenever you re-open the file, and that data structures that are good in memory might not result in good performance if they are stored on disk.

Another problem relates to how the changes actually are saved to the file. It is hard to make sure that the writes you do are coherently stored in the file system. For example, let us say that you made changes to two different memory locations, which reside on two different pages. The OS can decide, at any time, to take one of those pages away from you because it needs that memory for other things. When that happens, it will write that page to disk for you. So when you ask for it the next time, it can load it up again and the application won’t notice.

However, what would happen during a crash? You might have partially written data in that scenario. Usually, when writing to files using file I/O routines, you have very predictable write pattern. When you use memory mapped files for writes, you don’t know for sure in what order that is going to happen. The OS is free to choose that. There are ways you can control that. For example, you might use FILE_MAP_COPY to avoid the OS writing stuff for you, but you would have to handle writes yourself now, and that is decidedly not trivial.

You can use something like FlushViewOfFilew() to make sure that a specific range is flushed, but that still doesn’t mean that they will be written in any order that you might expect. As a db writer, I care, quite deeply, about the exact way the data is written to file, because otherwise it is really hard to reason about how to read it back. Especially when you have to consider failures and corruptions.

Personally, I would be writing stuff using memory mapped file for data that I needed critical stuff for.

  • Why are using well known products like SQL Server custom caches instead of memory mappings?

SQL Server is actually its own operating system. It managed memory, locks, threads, I/O and a lot more. That comes from the fact that for a long time, it had to do those sort of things. But note that SQL Server probably use memory mapped file quite a lot. But they are also using their own custom caches because it make sense to them. For example, query plan cache. Even when you do have memory mapped files, you usually have multiple layers of caching.

In RavenDB, for example, we have the native page cache (managed by Esent), the documents cache (managed by RavenDB) and the client cache. The reason that we have multiple layers is that we cache different things. The client cache avoid having to call the server. The documents cache avoid having to parse documents and the native page cache avoid having to go to disk.

  • Why are other products like LevelDB using mmap instead of custom caches?

Because they are drastically simpler products. They really want to just give you the data as quickly as possible, and they don’t need to do additional processing of the data. They can lean on the OS page cache to a much larger extent. Note that when use in real products, there are often higher level caches that they will use anyway, used for storing processed / parsed information.

  • Are they well suited for managed code?

Memory Mapped Files are usually harder to use from managed code, because we don’t do our own memory management. It does meant that you lose the benefit of just treating this as part of your memory space, because there is a very clear line between managed memory and native memory, and memory mapped files are on the other side of the hatch. That said, you can easily use them via UnmanagedMemoryStream, or read from them directly via Memory Accessor. The managed implementation for leveldb make heavy use of memory mapped files.

  • How does performance compare with other techniques?

It is generally better, mostly because the OS is really good in managing paging, and that you rely directly on the low level I/O routines. Another thing that you have to remember that if you use file I/O, you need to create a buffer, copy to/from that buffer, etc. Using memory mapped files saves all of that.

  • Is it possible to use them with mutable data structures and retain crash recoverability?

Yes, but… is probably the best answer. Yes, you can use them for mutable data, but you really want to be careful about how you do it. In particular, you need to make sure that you write in such a way that you can survive partial writes. A good way of doing that is to make writes to specific pages, and you “commit” by recording that those pages are now available on a metadata page, or something like that. This require a lot of really careful work, to be honest. Probably more work than you would like it to be. LMDB does it this way, and even if the code wasn’t a eye tearing pain, what it is doing is quite hard.

Note that in order to actually be sure that you a flushing to disk, you have to call both FlushViewOfFile and FlushFileBuffers on Windows.

  • What guarantees does the OS make regarding ordering and durability?

Pretty much none regarding ordering, as noted. Windows will guarantee that if both FlushViewOfFile and FlushFileBuffers  have been called and successfully completed, you are golden. But there aren’t any promises in the API about what will happen for partway failures, or in what order this happens.

To summarize:

Memory mapped files are a great tool, and for reads, they are excellent. For writes, they are awesome, but since there is no way to ensure in what order dirty pages gets written to disk, it make it hard to generate reliable system using them.

I rather use standard file I/O for writes, since that is far more predictable.

time to read 3 min | 488 words

Even for a first attempt, World’s Smallest No  SQL Database is actually pretty good. We got 3 of the 4.

The DB is:

  • Atomic, since change will either go in or be rejected.
  • Consistent, you can’t see changes half way.
  • Isolated, one change can’t modify/view another.

It isn’t durable, because everything is in memory.

It even has transactions, in the sense that a change goes in or not in an atomic fashion.

But, I think you’ll agree, this is very much a poor man’s solution. And those only apply just as long as you need to work with just a single item. In practice, you would usually want to make a lot more than that.

The basic properties you want is the ability to do multi item transactions, so you can modify two items at the same time, and they either go in or not at all. And that is where it gets really  complex, because at that point we need to build our own way to handle atomic updates to multiple values, how to keep them isolated and how to handle this consistently.

There aren’t really simple way to handle that, especially if we want to handle concurrent transactions. You would have to make a lot of decisions all of a sudden. For example, what happens if you have two concurrent transactions trying to update the same value. Is one of them rejected? And if so, at what point? Initial write? Commit time?

And that is just the start.

  • Do you allow a transaction to live across multiple requests?
  • How do you handle transaction expiration?
  • Do you need to have a transaction span more than one node? How do you handle that?

And we haven’t even talked about making transactions durable. There is no point, since we don’t do any persistence. But we will touch on persistence in a future post, so let talk about durability right now.

In general, there are two major ways to handle transactions. Either an append only model or a transaction log. In both cases, concurrent transactions make it… interesting to figure out how to properly write to the log. And the moment that you have either option, you have to build in some form of a recovery tool for system or db crashes. You have better also love fsync, since it is a very important tool in making sure that you can actually get durable transactions.

Oh, and of course, I forgot, if we allow long running transactions…

  • Who can read the yet uncommitted value? Everyone? Just the current transaction? Can you explicitly ask to see uncommitted value?
  • How do you allow it to be read?
  • Is it available for queries?

This is usually the most touchy part in the system, since is so critical. It is also something that you have to take into account in pretty much the entire codebase.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}