Ayende @ Rahien

It's a girl

About the interview room series

The posts about our interviews seems to have strike a nerve in some people. So I thought I would respond here to all the comments at once.

Some stats about our hiring process:

  • We worked with 2 recruitment agencies, looking for a 5+ years experience in C#/.NET, Server Side work, OSS experience / NoSQL experience, multi threading, TCP/IP, REST…
  • We got 23 CVs to review, and did phone interviews for most of them.
  • We actually interviewed 9 candidates, had 3 potential hires and made an offer to 2 of them.

Unprofessional – I don’t know about you, but a large part of running a business is finding good people. And having people that are severely under qualified show up is a problem. In the same way that I talk about problems in code or silly patterns or horrible processes, the part about hiring people is just as important. I am quite careful about not giving away any details, but that is about it. I’ve had people comment that I am gloating. But nothing could be further from the truth, if anything, those are mournful posts. Because if that is the state of the industry…

Those are insulting questions – a few people said that they thought that the questions were insulting. Look at the numbers, we had two thirds of the people fail to answer them. I had a candidate with 13+ years of experience who looked quite promising get up and leave mid interview because the candidate couldn’t solve a string sort problem with full internet access. I had people submit code that made my eyes bleed.  And I think that people don’t understand something about those questions. The coding questions won’t get you hired. The coding questions are there to keep you from getting hired. The assumption is that you should be able to solve those questions blindfolded while dancing the cha-cha and typing with your left earlobe.

Due to the large number of people who can talk the talk but not walk the walk, I am assuming all candidates incompetent until proven otherwise. And yes, I absolutely agree that this is a sad state of affairs.

Those are not relevant questions – “In these days on the Internet, I'm not at all sure I do care someone knows how to sort a string.” Well, we are building databases, we actually care quite a lot about things like sorting. But leaving that aside, we don’t actually care how you do the sorting. Most candidate uses List<T>.Sort, and that was fine. But I think that the key point is, if you have internet access and an hour, and you can’t implement a sort algorithm, you don’t get to work here.

You should be mentoring them, not discouraging them – I think that someone just missed a turn in the road to reality. I didn’t show up at a Teach Kids to Program and laughed at their inability to do three stars code. I was conducting interviews for a senior position. That meant that I was limiting candidates to 5+ years experience. Actually going through the interview process cost us. A lot. It effectively shut us down for a day or two, and took about three full days just to do the interviews.  There is a serious cost here.

And having people that are drastically unqualified show up and take our time is a very costly waste. Look at the number above. Out of 23 people, I posted about 3, because they crossed all the limits that I can think of. In contrast, we had a bunch of other people we talked to or even interviewed that didn’t make the cut for being potential hires. They weren’t bad by any means, but they just weren’t what we were looking for.

I am not sorry, if you don’t show up ready to answer stuff that a first year student should be able to, you are cheating. Certainly me, and probably yourself. At that point, I’m going to stop wasting my time and move on to more productive avenues.


Published at

Originally posted at

Comments (76)

Reviewing Lightning memory-mapped database library: On disk data

As mentioned, the data in LMDB is actually stored in page. And I am currently tracking through the process in which we add a new item to the database. The first thing that happen is that we allocate a leaf page. I think that I’ll have to go over how pages are allocated now, which should explain a lot about how LMDB reuses disk space.

A new page is allocated by first finding the oldest transaction that is still active. This is how LMDB implements MVCC. Basically, any free page that is older than any existing transaction is eligible for reuse. The first db (although a better term would be to call it btree, or maybe a table) contains the free pages, and at first we search there for an available page. If we can’t find such a page, we use the end of the file for that. This happens in mdb_page_alloc.

An very interesting aspect is the fact that LMDB allows direct memory writes (with the additional risk of corrupting the db if you messed the data), interestingly, this is done in the following code:


If the options allows directly memory writes, you get a point to the mmap file. Otherwise, LMDB will allocate a page (or re-use one that has already been allocated).

I am not really sure what is going on with the insert yet. This is a function pointer (delegate for C#). And it select which behavior will happen later on. I am just not sure what those two different function do yet.

I think I got it, the key is here:


You can see that we use the insert method to add the mid variable to the dirty list. So if we give you a direct pointer, we append it to the list. But if we give you allocate a page, we need to add it in order.

The mdb_mid2l_insert will add an item to the list in the appropriate location to ensure sorting. I think that this is done so when we actually write the dirty page to disk, if we do that using file I/O, we will do that in ascending file order, so we can get good seek times from the disk. A new page has 4,084 bytes available for it (12 bytes are taken by header data).

A database is actually created the first time data is written to it. The root page is allocated and recorded. And then the data is added to the page.

As you might recall, data inside a page is kept in a sorted list. But remember that the data is also stored on the page, and there is the whole notion of overflow pages, so I think that I am going to have to draw a picture to figure out what is going on.


This is more or less how it works. But I don’t think that I really get it. Note that there are two collections here. First is the positions of nodes in the page, and the second is the node themselves. The reason we have this complexity is that we need to maintain both of them in an efficient manner. And we need to do that without too much copying. Therefor, the list of nodes in the page is in the beginning, growing downward, and the actual nodes are in the end, growing upward. The page is full when you can’t fit anything between those two lists.

A node, by the way, is the LMDB internal name for a key (and maybe data). But I think that I might have an error here, because I don’t see the code here to actually add nodes to the page in a sorted fashion, so it might be doing linear search on the content of a node. I am not tracking through the code for adding a second value into a database. And I think that I have to be wrong about linear node search. The code I was looking at was invoked at the the new db creation section, so it might be taking shortcuts that aren’t generally available.

At any rate, the logic for adding an item goes like this… first we need to find the appropriate page. And we do this by starting from the root and asking for the actual page. Which gets us to mdb_page_get, where we have the following:


The really interesting part here is that each transaction have a dirty list of pages, so if you modified a page, you’ll get your own version, even before the data was committed. Otherwise, we will just hand you the current version. This is quite nice.

And then we get to the already familiar mdb_page_search_root function, and I can confirm that the nodes in a page are actually sorted, which only make sense. I am not sure where that happens yet, but I have got the feeling that I am going to see that in a bit.

Okay, I am going to go on a bit of a rant here, mostly against C, to be honest.


Take a look at this code. mdb_page_search does something, but what it does it mutate the state for the cursor (mc), and you are then going to access the mc->mc_pg[mc->mc_top] to get the actual result you wanted. To make things more interesting, the is also a mc->mc_ki, which is the index of the node inside the node. And it drove me crazy, because I just couldn’t see the association between those three values, especially because I am so not used to this type of code that I never even considered this as a possibility.

At any rate, now I know how it gets to insert things in the right order. When doing mdb_page_search, it calls to mdb_node_search, which does the appropriate setup to tell the next call where it need to actually do the insert to get things in a sorted fashion. I am currently in the mdb_cursor_put, which is trice cursed and really hard to follow. A 400+ lines method with gotos galore.

But I am in the section where we are actually going to be writing a new value. (new_sub: goto header, if you care). And the code is pretty straight forward from there.

Next up, I want to see how it handles a page split, but that is a subject to another post.


Published at

Originally posted at

Comments (10)

Stories from the interview room, part II

So, I just finished interviewing a candidate. His CV states that he has been working professionally for about 6 years or so. The initial interview was pretty well, and the candidate was able to talk well about his past experience. I tend to do a generic “who are you?” section, then give them a couple of questions to solve in front of Visual Studio, an architecture question and then a set of technical questions that test how much the candidate knows.

Mostly, I am looking to get an impression about the candidate, since that is all I usually have a chance to do in the span of the interview. The following is a section from the code exercise that this candidate has completed:

for (int i = 0; i < sortedArrLst.Count; i++)
    if (sortedArrLst[i].Contains(escapeSrt[0]))
        if (sortedArrLst[i].IndexOf(escapeSrt[0]) == 0)
            sortedArrLst[i] = sortedArrLst[i].Remove(0, escapeSrt[0].Length+1);
            escapeStrDic.Add(sortedArrLst[i], escapeSrt[0]);
    if (sortedArrLst[i].Contains(escapeSrt[1]))
        if (sortedArrLst[i].IndexOf(escapeSrt[1]) == 0)
            sortedArrLst[i] = sortedArrLst[i].Remove(0, escapeSrt[1].Length+1);
            escapeStrDic.Add(sortedArrLst[i], escapeSrt[1]);

Thank you, failure to use loops will get your disqualified from working at us.

Then there were the gems such as “mutex is a kind of state machine” and “binary search trees are about recursion” or the “I’ll use perfmon to solve a high CPU usage problem in production”.

Then again, the next candidate after that was quite good. Only 4 – 6 to go now.

Stories from the interview room

It is that time again, we are looking for more developers. And this time I ended up so pissed after an interview I had to call a sick colleague just to vent.

One candidate I ruled out early during the interview process. It was a somewhat sinking sensation in the pit of my stomach as I spoke with the candidate, and I couldn’t get a single actually technical description about what the candidate is actually doing now. A lot of broad descriptions, and a lot of sweeping statements, but no real technical details. But the candidate did know jQuery mobile back & forth, it appears.

The decision was made final when I asked the candidate what web framework they were using. I asked whatever they were using ASP.NET WebForms, ASP.NET MVC or ASP.Net Web API. Note that from my perspective, it is a list in a ascending worth order, and you you are using something not on it, that is a plus (it means you aren’t just using whatever is available, which is nice). So I was quite excited when the candidate said (confusedly) “none of them”. Then it took me putting on my investigative hat and asking a lot of questions about how they are actually doing things before it finally came out that they were doing ASPX.

Not knowing the name of the environment in which you are working with for the past several years… I am not sure what to call it.

At least the candidate was able to let me know how they were using that in great detail. We do stuff in Page_Init, then we have  a method that load the data from the database and put it in the ViewState or the Session, then we bind it to a grid, and most of the code is in the grid event handlers. I am sure that the candidate is a great web developer, but I would rather that this particular candidate be great at another location.

The second candidate actually passed our phone screen and was invited to an interview. We have a fairly basic interview process. Some background information for both sides, then a few questions that you need to solve in Visual Studio (and yes, you have full MSDN & Google access) and then a technical portion of the interview that include a system design and a more detailed set of technical knowledge questions.

Now, I am sure that you have heard about interesting interview questions like sort a 100 GB file in a machine 32 bits with 512MB of memory. I admit that something like that would be challenging and interesting. I would probably quite enjoy seeing how people deal with that.

That is not the type of questions that we ask. I asked for a “sort these strings” and a “calculate this tax” programs. I gave the candidate about an hour and a half, alone in a room with VS and internet connection. I am not even asking to implement your own sort, just customize the comparison function and run the standard .NET sort. And do some basic math. The candidate was unable to finish either problem on the time allotted. Now, to be fair for the candidate, the way he solved the first problem was correct and much better than many other attempts that I have seen. Note that the issue was an IComparable<string> instead of IComparable<Item> that caused the issue. It is subtle and something that I would expect an newbie to catch. But this candidate came with full 6 years of experience.

Okay, I said to myself, it is natural to be nervous when doing interviews, let us see what the candidate knows beyond that. The CV mentioned that work with mutli threading. So I began with some questions about that. But it appears that “I only know Thread and BackgroundWorker”. But the absolute clincher was when I asked the candidate about what would cause a high CPU problem and how to diagnose that: “Well, I think that there are special tools that will tell you which process is using the CPU…”

Special tools? Well, I guess Task Manager can be called special, but I am not really sure that I would call it that .And if your experience in troubleshooting stuff never go to the point where you actually look at Task Manager to see what is going on… it probably means that you don’t have meaningful experience in actually troubleshooting stuff.

Reviewing Lightning memory-mapped database library: Stepping through make everything easier

Okay, I know that I have been critical about the LMDB codebase so far. But one thing that I really want to point out for it is that it was pretty easy to actually get things working on Windows. It wasn’t smooth, in the sense that I had to muck around with the source a bit (hack endianess, remove a bunch of unix specific header files, etc). But that took less than an hour, and it was pretty much it. Since I am by no means an experienced C developer, I consider this a major win. Compare that to leveldb, which flat out won’t run on Windows no matter how much time I spent trying, and it is a pleasure.

Also, stepping through the code I am starting to get a sense of how it works that is much different than the one I had when I just read the code. It is like one of those 3D images, you suddenly see something.

The first thing that became obvious is that I totally missed the significance of the lock file. LMDB actually create two files:

  • lock.mdb
  • data.mdb

Lock.mdb is used to synchronized data between different readers. It seems to mostly be there if you want to have multiple writers using different processes. That is a very interesting model for an embedded database, I’ve to admit. Not something that I think other embedded databases are offering. In order to do that, it create two named mutexes (one for read and one for write).

A side note on Windows support:

LMDB supports Windows, but it is very much a 2nd class citizen. You can see it in things like path not found error turning into a no such process error (because it try to use GetLastError() codes as C codes), or when it doesn’t create a directory even though not creating it would fail.

I am currently debugging through the code and fixing such issues as I go along (but no, I am doing heavy handed magic fixes, just to get past this stage to the next one, otherwise I would have sent a pull request).

Here is one such example. Here is the original code:


But ReadFile in Win32 will return false if the file is empty, so you actually need to write something like this to make the code work:


Past that hurdle, I think that I get a lot more about what is going on with the way LMDB works than before.

Let us start with the way data.mdb works. It is important to note that for pretty much everything in LMDB we use the system page size. By default, that is 4KB.

The data file starts with 2 pages allocated. Those page contain the following information:


Looking back at how CouchDB did things, I am pretty sure that those two pages are going to be pretty important. I am guess that they would always contain the root of the data in the file. There is also the last transaction on them, which is what I imagine determine how something gets committed. I don’t know yet, as I said, guessing based on how CouchDB works.

I’ll continue this review in another time. Next time, transactions…


Published at

Originally posted at

Comments (17)

Reviewing Lightning memory-mapped database library: going deeper

Okay, I now have a pretty rough idea about how the codebase actually works. I still think that the codebase is quite ugly. For example, take a look at this:image

The len parameter for CreateFile is whatever to open or create or just open (read only). But why is it in a parameter called len?

Probably because it was just there, and it would be a sin to create another local variable just for this, I guess (in a codebase where a method had > 18 local variables!).  To make things more interesting, in the rest of this method, this is actually a string len variable, sigh.

At any rate, let us actually dig deeper now. The following structure is holding data about a db.


This is actually somewhat misleading, at least with regards to how I would think about a db. This is the entry point for all the pages that belong to a specific db. But a db in LMDB is not really the same thing as a db in SQL Server or RavenDB. It all reside in the same file, and you always have at least two. The first one is the free db, which is used to track all the free pages. The second one is the main db. Then you have additional, named databases.

This is used here:


This define the metadata for the entire environment. Note that we have the two dbs there in mm_dbs. The mm_txnid denotes the last committed transaction id. This value is what gives LMDB its MVCC support.  The mm_last_pg value is the last used page, any transaction that wants to write will start writing at that value.

Let us see how we deal with pages here, shall we?


The first part find try to find a dirty page if we are in a read/write transaction and we haven’t specify that we can write directly to memory. This is done by doing a binary search on the list of dirty pages.

Otherwise, we can just hand the user the actual page by accessing it directly.

Next, let us look where this is actually used. Searching for a page with a specific key in it. This is done mostly in mdb_node_search.


This seems to be doing a binary search for the keys inside a specific page (in this case, the page that is at the top of the stack on the cursor). That leads to the conclusion that pages internally have data internally stored as sorted arrays.

And this leads me to another pet peeve with this code base. Take a look at this line:


Sure, this is a well known trick to cheaply divide  a number by half, but are you seriously telling me that the compiler isn’t going to optimize  (low + high) / 2 ? To my knowledge, no C compiler updated in the last 10 – 15 years managed to miss this optimization. So why write code that is going to be harder to read?

Okay, so now we know how we search for a specific key inside a page, but how do we get to the actual page that we want to search on? This happens on mdb_page_search_root. Let me see if I can take it apart.

When this method is called, the cursor is setup so the first page on the pages stack is the root page. 

And… that is enough for me. Up until now, I have been trying to just read the code. Not execute it, not debug through it, nothing .Just go over the code one line at a time and figure out what is going on. I actually think that I have a really good grasp about what is going on in there, but I think that this is pretty much all I can do at that point from just reading the code. So I am going to stop now and setup an debug environment so I can work with it, and report my finding from stepping through the code.


Published at

Originally posted at

Comments (13)

Raven Storage, early perf numbers

So, our managed implementation of leveldb is just about ready to go out and socialize. Note that those are just for a relatively short duration, but they give us good indicator of where we are. We are still running longer term perf tests now. Also note that they are early numbers. We did performance work, but it is still not done.

The following tests were done on a HDD, and all include writing a million records (16 bytes key, 100 bytes values) to storage.

  • Writing 1 million sequential keys - 52,152 op/s
  • Writing 1 million random keys -  11,986 op/s
  • Writing 1 million sequential keys with fsync - 17,225 op/s

And now, for the reads portion:

  • Sequential reads - 104,620 op/s
  • Reverse sequential reads - 57,932 op/s
  • Random reads - 3,191 op/s

Note that I am pretty sure that the reason for the later performance is that it is using an HDD, instead of SSD.

World’s Smallest No SQL Database: Persistent transaction logs

As it stand the World’s Smallest No  SQL Database will last only as long as you actually have power. The moment that you have a restart, all the data is gone. The is actually how several databases are running, but for now, we are going to assume that this is not desirable. The question now becomes, how do you actually store the data on disk?

This really becomes a pretty complex question, because you need to store the data on disk in a way that is crash safe, allow updates, and doesn’t take all the disk space in the world. Before we will get to the actual on disk data structures, we need to discuss how we implement persistent logs. Persistent logs are the key way that databases gets Durability. And as it turned out, there are just two ways of doing that that I am aware of:

  • Append only
  • Transaction log

Append only models rely on the fact that you only append to create a safe way to ensure that the data is either all in or all out. When we write, we don’t overwrite values, we are creating new values, and then we write where the last log entry is located. In CouchDB, for example, this is done by modifying the header portion of the file to point to the new data. In LMDB this is done by updating the page listing with the new pages. In both cases, you actually commit a transaction when the on disk pointer is changed to point to the new data. If you crash midway, nothing bad happened, you didn’t update the on disk pointer, it is still pointing at the old location. Worst case scenario, you wasted some disk space, and you probably have a way to reclaim that anyway.

Transaction logs use a different way to handle this. They are also usually implemented as append only files, into which we write the new data. Afterward, we can apply those changes in memory / on disk safely. A crash would simply mean having to replay the log. This is how leveldb, for example, works. This is also the essential idea behind how SQL Server Oracle works. Although in their case I believe that the transaction log usually contain both prev/new state of the pages they changed for a specific transaction.

One thing that you have to be aware of, either way, is that you should be prepared to handle scenarios where your process crashed midway, or when your entire machine had the plug pulled out. That means using fsync, sure, but it also means that you might have to do log replay, or be ready to see if you can recover something from the append only model.

The good thing about the transaction log approach is that after you have committed the changes to the standard persistent format, you can clear it (usually by just creating a new file and deleting the old one). With the append only model, you usually have to run some sort of compaction to clear things out. Note that the transaction log model vs append only model doesn’t really mean about how the rest of the persistent data is actually stored. We will touch on that on the next post.

On Memory Mapped Files

Tobi had a few questions about memory mapped files. And it is quite an interesting topic.

A memory mapped file is a feature for all modern operating system. It require coordination between the memory manager and the I/O subsystem. Basically, you can tell the OS that some file is the backing store for a certain portion of the process memory. In order to understand that, we have to understand virtual memory.

Here is your physical memory, using 32 bits for ease of use:


Now, you have two processes that are running. Each of them get their own 4 GB address space (actually, only 2 GB is available to the process in 32 bits, but that is good enough). What happen when both of those processes obtain a pointer to 0x0232194?

Well, what actually happens is that the pointer that looks like a pointer to physical memory is actually a virtual pointer, which the CPU and the OS work together to map to physical memory. It is obvious from the image that there is a problem here, what happens if two processes uses 4GB each? There isn’t enough physical memory to handle that. This is the point were the page file gets involved. So far, this is pretty basic OS 101 stuff.

The next stuff is where it gets interesting. So the OS already knows how to evict pages from memory and store them on the file system, because it needs to do that for the page file. The next step is to make use of that for more than just the page file. So you can map any file into your memory space. Once that is done, you can access the part of the memory you have mapped and the OS will load the relevant parts of the file to memory. Again, pretty basic stuff so far.

You can read about this more in this article.

The reason you want to do this sort of thing is that it gives you the ability to work with files as if it was memory. And you don’t have to worry about all that pesky file I/O stuff. The OS will take care of that for you.

And now, to Tobi’s questions:

  • What are the most important advantages and disadvantages?

Probably the most important is that you don’t have to do manual file I/O. That can drastically reduce the amount of work you have to do, and it can also give you a pretty significant perf boost. Because the I/O is actually being managed by the operation system, you gain a lot of experience in optimizing things. And the OS will do things like give you a page buffer, caching, preloading, etc.  It also make it drastically easier to do parallel I/O safely, since you can read/write from “memory” concurrently without having to deal with complex API.

The disadvantages you need to be aware that things like the base addresses would change whenever you re-open the file, and that data structures that are good in memory might not result in good performance if they are stored on disk.

Another problem relates to how the changes actually are saved to the file. It is hard to make sure that the writes you do are coherently stored in the file system. For example, let us say that you made changes to two different memory locations, which reside on two different pages. The OS can decide, at any time, to take one of those pages away from you because it needs that memory for other things. When that happens, it will write that page to disk for you. So when you ask for it the next time, it can load it up again and the application won’t notice.

However, what would happen during a crash? You might have partially written data in that scenario. Usually, when writing to files using file I/O routines, you have very predictable write pattern. When you use memory mapped files for writes, you don’t know for sure in what order that is going to happen. The OS is free to choose that. There are ways you can control that. For example, you might use FILE_MAP_COPY to avoid the OS writing stuff for you, but you would have to handle writes yourself now, and that is decidedly not trivial.

You can use something like FlushViewOfFilew() to make sure that a specific range is flushed, but that still doesn’t mean that they will be written in any order that you might expect. As a db writer, I care, quite deeply, about the exact way the data is written to file, because otherwise it is really hard to reason about how to read it back. Especially when you have to consider failures and corruptions.

Personally, I would be writing stuff using memory mapped file for data that I needed critical stuff for.

  • Why are using well known products like SQL Server custom caches instead of memory mappings?

SQL Server is actually its own operating system. It managed memory, locks, threads, I/O and a lot more. That comes from the fact that for a long time, it had to do those sort of things. But note that SQL Server probably use memory mapped file quite a lot. But they are also using their own custom caches because it make sense to them. For example, query plan cache. Even when you do have memory mapped files, you usually have multiple layers of caching.

In RavenDB, for example, we have the native page cache (managed by Esent), the documents cache (managed by RavenDB) and the client cache. The reason that we have multiple layers is that we cache different things. The client cache avoid having to call the server. The documents cache avoid having to parse documents and the native page cache avoid having to go to disk.

  • Why are other products like LevelDB using mmap instead of custom caches?

Because they are drastically simpler products. They really want to just give you the data as quickly as possible, and they don’t need to do additional processing of the data. They can lean on the OS page cache to a much larger extent. Note that when use in real products, there are often higher level caches that they will use anyway, used for storing processed / parsed information.

  • Are they well suited for managed code?

Memory Mapped Files are usually harder to use from managed code, because we don’t do our own memory management. It does meant that you lose the benefit of just treating this as part of your memory space, because there is a very clear line between managed memory and native memory, and memory mapped files are on the other side of the hatch. That said, you can easily use them via UnmanagedMemoryStream, or read from them directly via Memory Accessor. The managed implementation for leveldb make heavy use of memory mapped files.

  • How does performance compare with other techniques?

It is generally better, mostly because the OS is really good in managing paging, and that you rely directly on the low level I/O routines. Another thing that you have to remember that if you use file I/O, you need to create a buffer, copy to/from that buffer, etc. Using memory mapped files saves all of that.

  • Is it possible to use them with mutable data structures and retain crash recoverability?

Yes, but… is probably the best answer. Yes, you can use them for mutable data, but you really want to be careful about how you do it. In particular, you need to make sure that you write in such a way that you can survive partial writes. A good way of doing that is to make writes to specific pages, and you “commit” by recording that those pages are now available on a metadata page, or something like that. This require a lot of really careful work, to be honest. Probably more work than you would like it to be. LMDB does it this way, and even if the code wasn’t a eye tearing pain, what it is doing is quite hard.

Note that in order to actually be sure that you a flushing to disk, you have to call both FlushViewOfFile and FlushFileBuffers on Windows.

  • What guarantees does the OS make regarding ordering and durability?

Pretty much none regarding ordering, as noted. Windows will guarantee that if both FlushViewOfFile and FlushFileBuffers  have been called and successfully completed, you are golden. But there aren’t any promises in the API about what will happen for partway failures, or in what order this happens.

To summarize:

Memory mapped files are a great tool, and for reads, they are excellent. For writes, they are awesome, but since there is no way to ensure in what order dirty pages gets written to disk, it make it hard to generate reliable system using them.

I rather use standard file I/O for writes, since that is far more predictable.

World’s smallest No SQL Database: ACID & Transactions

Even for a first attempt, World’s Smallest No  SQL Database is actually pretty good. We got 3 of the 4.

The DB is:

  • Atomic, since change will either go in or be rejected.
  • Consistent, you can’t see changes half way.
  • Isolated, one change can’t modify/view another.

It isn’t durable, because everything is in memory.

It even has transactions, in the sense that a change goes in or not in an atomic fashion.

But, I think you’ll agree, this is very much a poor man’s solution. And those only apply just as long as you need to work with just a single item. In practice, you would usually want to make a lot more than that.

The basic properties you want is the ability to do multi item transactions, so you can modify two items at the same time, and they either go in or not at all. And that is where it gets really  complex, because at that point we need to build our own way to handle atomic updates to multiple values, how to keep them isolated and how to handle this consistently.

There aren’t really simple way to handle that, especially if we want to handle concurrent transactions. You would have to make a lot of decisions all of a sudden. For example, what happens if you have two concurrent transactions trying to update the same value. Is one of them rejected? And if so, at what point? Initial write? Commit time?

And that is just the start.

  • Do you allow a transaction to live across multiple requests?
  • How do you handle transaction expiration?
  • Do you need to have a transaction span more than one node? How do you handle that?

And we haven’t even talked about making transactions durable. There is no point, since we don’t do any persistence. But we will touch on persistence in a future post, so let talk about durability right now.

In general, there are two major ways to handle transactions. Either an append only model or a transaction log. In both cases, concurrent transactions make it… interesting to figure out how to properly write to the log. And the moment that you have either option, you have to build in some form of a recovery tool for system or db crashes. You have better also love fsync, since it is a very important tool in making sure that you can actually get durable transactions.

Oh, and of course, I forgot, if we allow long running transactions…

  • Who can read the yet uncommitted value? Everyone? Just the current transaction? Can you explicitly ask to see uncommitted value?
  • How do you allow it to be read?
  • Is it available for queries?

This is usually the most touchy part in the system, since is so critical. It is also something that you have to take into account in pretty much the entire codebase.


Published at

Originally posted at

Comments (6)

World smallest’s No SQL Database: Talk

As I mentioned earlier, I actually gave that talk in USI Events in Paris a short while ago. The recording for this talk is now available, you can find it here:


I am afraid that I had a very short time to go through the details, which is why I am also doing the blog post series.


Published at

Originally posted at

Comments (2)

From the mailing list: Concurrency Checks for Related Unchanged Documents

We got the following question on the mailing list:

…say you have reference data in one document like prices, formulas, etc.

When the user is editing or creating a document you would like to ensure that their edits are based on the current info at the time of edit.

So the document being edited or created is dependent on the other reference document and the commit should raise a concurrency error if that reference document has changed in between the time the edit started and it is committed.

Of course the reference document is not changed with the document being edited/created.

This question is interesting enough to be worth a blog post. Mostly because this was my reaction when I read the post.

Basically, the premise is wrong. The user have a document like “config/tax-rates”, and when they create an order, they need to make sure that the tax rate hasn’t changed. I am assuming here that “time between edit started & committed” is relatively low, counted in milliseconds, and that there isn’t any human involvement in this process.

The reason this is wrong is the following sequence of events (T+1 is time + 1 ms):

Scenario #1 Scenario #2
  • T +0 – read referenced doc
  • T +1 – create order
  • T +2 – config/tax-rates changed
  • T +3 – save order
  • T +4 – concurrency error because the config/tax-rates changed
  • T +0 – read referenced doc
  • T +1 – create order
  • T +2 – save order
  • T +3 – save order completed
  • T +4 – config/tax-rates changed

According to the question above, both scenarios are absolutely correct ways for the system to behave. In practice, scenario #2 is wrong (probably).

Udi Dahan has wrote about this a lot, but I can’t find the appropriate article at this time. Update: http://www.udidahan.com/2010/08/31/race-conditions-dont-exist/

If it actually matter enough for scenario #1 to require an error, scenario #2 is also invalid. This isn’t a technical issue, it is a business issue. If you failed to update the tax rate after you were notified, you are responsible for it, right? Except that there is no way that this works. Tax changes usually take place at midnight, and are announced several days / weeks in advanced, at a minimum. And even if you were in a total news blackout, you are still responsible for the rate change, even if you never heard about it.

The way this is setup is simply wrong. You don’t try to stop orders from being processed, because you are going to miss the just completed order. Instead, you are going to do compensation logic to fix any changes that apply.

You can also tell that this is the case by the fact that this is actually pretty hard to do. Documents are independent from one another. The best way to work with such a scenario is to record the time (or the etag) when you made a decision based on a related document, then at a later point in time, you can make a decision based on whatever it changed, how old it is, etc.

That gives you proper frame of mind, not trying to count milliseconds.


Published at

Originally posted at

Comments (6)

Reviewing Lightning memory-mapped database library: Because, damn it!

I decided to resume my attempt to figure out the LMDB codebase. Okay, I did some more figuring out and I think that I know where I went wrong. The codebase appears to be trying small enough to fit into the CPU caches, in order to do that, they decided to make it hard enough to make sure it won’t fit into my cache. But I’m trying to work through that.

One of the things that I figured out was that I made a mistake, I kept thinking about databases in the same way you’ll see in RavenDB or SQL Server, as complete separate items. Instead, LMDB does something quite different. It basically uses a single file for everything. This is called the environment, and that can contains named dbs. If you have multiple databases, all of them are going to reside in the same file. The way it works, it creates a file and then map it into memory. On Windows, it actually allocate the space for the file upfront (which doesn’t appears to be necessary).

But the basic idea is that you need to specify upfront the size of the file you want, and that is the maximum data size you have. This can change, but only if you close the environment and start it up again with a higher value. This also explains why you have to specify the number of databases you want to have per environment.  Once you created the environment, everything else become an issue of just managing the pages. In particular, because LMDB only allows a single writer, it doesn’t really have to worry about concurrency issues. It has mapped all the data into memory, and then it is just a matter of creating the appropriate data structure.

A lot of the code is dedicated to managing pages of data, and now that I have gone through enough of the codebase to be sure that I sort of have a vague idea about what is going on I can still say that I think that this is way denser than it should. And I shudder to think what it would take to make any sort of change to this codebase.


Published at

Originally posted at

Comments (10)

Reviewing Lightning memory-mapped database library: tries++

I decided to resume my attempt to figure out the LMDB codebase. The previous attempt ended when my eyes started bleeding a little, but I think that I can get a bit further now.

One thing that I figured out already is that because this is C code, memory management is really painful. I noted previous how good it was from leveldb to use std::string for most allocations, since it frees both caller and lib from having to worry about the memory. LMDB only shows how important that was. Take a look at the following method:


Quick, what do you think it does. As it turn out, what is does is update the key to point to the key that the cursor is currently pointing at. The reason for that is that the library doesn’t want to own the memory for key (since then it would have to provide a way to free it).

Now, to be fair, this appears to be an internal method, only used by the lib itself, but such things are pretty pervasive in the codebase, and it is really annoying that the code is all in one spot, with no real structure, since I am still trying to figure out some sort of hierarchy for the code so I could get a grip on what is actually going on.

I decided to just go to the start and figure out how it opens a database:


It is interesting that you need a transaction to open the database. However, the opened database survives the transaction closing, and from the docs, if you open a db in a transaction, that is the only thing you can do in that transaction.

I think that I just figured out something, the code in mdb.c is actually written in reverse depth order. So the closer to the end of the file a function is, the more visible it is.

Okay, it appears that there is the notion of multiple dbs, but there is also the notion of the main db. It looks like the main db is also used for tracking things for child dbs. In particular, when opening a named db, we have the following:image

We search for the db information in the main db. As you can see, mdb_cursor_set is actually setting the data on the cursor to the value we passed in. More interesting is what happens if the db is not there. It appears to merely create the db data and set it in the main db. Okay, looking at this function docs, it appears that this is just about allocating a database handle, not about opening the actual database.

I am not really sure that I understand the reasoning behind it, but a new database is actually created in mdb_cursor_put. So only after the first time you create a value, will the db actually be created.


I guess that explains why this function goes on for over 400 lines of code and has no less than 10 goto(!) and 7 goto sections!


I get that this is C, but come on, seriously!


Published at

Originally posted at

Comments (7)

On the death of Google Reader, blogging & content

On the 30 June, I had just about 30K subscribers to this blog. With the death of Google Reader, I dropped down to less than 10% of that.

This sucks, but it also means that this is a much smaller audience. Which means that it is easier to interact with. In particular, I would like to know what sort of blog posts do you, as a reader, like.

  • Features, like “see how I can do this cool thing in RavenDB”?
  • Reviews for applications, like “cringe at how horrible the code is”?
  • Challenges, like “can you figure out what is wrong with this code”?
  • Mystery codebase, like “let us read a codebase in a language I don’t know and try to figure it out”?
  • Architecture, like “let us see how we should resolve this problem”?

Or, you know, something else. I would appreciate your feedback.

World’s Smallest No SQL Database: Concurrency

I am pretty sure that it would surprise you, but the World’s Smallest No SQL Database has a well defined concurrency model. Basically, it is using Last Write Wins. And we are safe from any concurrency issues. That is pretty much it, right?

Well, not really.  In a real world system, you actually need to do a lot more with concurrency. Some obvious examples:

  • Create this value only if it doesn’t exists already.
  • Update this value only if it didn’t change since I last saw it.

Implementing those is actually going to be pretty simple. All you need to do is to have a metadata field, version, that is incremented on every change. Here is the change that we need to make:

   1: public class Data
   2: {
   3:     public byte[] Value;
   4:     public int Version;
   5: }
   7: static readonly ConcurrentDictionary<string, Data> data = 
   8:    new ConcurrentDictionary<string, Data>(StringComparer.InvariantCultureIgnoreCase); 
  10:  public HttpResponseMessage Get(string key)
  11:  {
  12:      Data value;
  13:      if(data.TryGetValue(key, out value) == false)
  14:          return new HttpResponseMessage(HttpStatusCode.NotFound);
  16:      return new HttpResponseMessage
  17:          {
  18:              Headers = { "Version", value.Version },
  19:             Content = new ByteArrayContent(value.Value)
  20:          };
  21:  }
  23: public void Put(string key, [FromBody]byte[] value, int version)
  24: {
  25:     data.AddOrUpdate(key, () => 
  26:     { // create
  27:        if(version != 0)
  28:            throw new ConcurrencyException();
  29:        return new Data{ Value = value, Version = 1 };
  30:     }, (_, prev) => 
  31:     { // update
  32:         if(prev.Version != version)
  33:           throw new ConcurrencyException();
  34:         return new Data{ Value = value, Version = prev.Version +1 };
  35:     });
  36: }

As you can see, it merely doubled the amount of code that we had to write, but it is pretty obvious how it works. RavenDB actually uses something very similar to that for concurrency control for writes, although the RavenDB ETag mechanism is alos doing a lot more.

But the version system that we have above is actually not enough, it only handle concurrency control for updates. What about concurrency controls for reads?

In particular, how are we going to handle non repeatable reads or phantom reads?

  • Non repeatable reads happen when you are reading a value, it is then deleted, and when you try to read it again, it is gone.
  • Phantom read is the other way around, first you tried, but didn’t find anything, then it was created, and you read it again and find it.

This is actually interesting, because you only care about those for the duration of a single operation / transaction / session. As it stand now, we actually have no way to handle either issue. This can lead to… interesting bugs that only happen under very specific scenarios.

With RavenDB, we actually handle both cases. In a session lifetime, you are guaranteed that if you saw a document, you’ll continue to see this document until the end of the session, which deals with the issue of non repeatable read. Conversely, if you didn’t see a document, you will continue to not see it until the session is closed. This is done for Load, queries are a little bit different.

Another aspect of concurrency that we need to deal with is Locking. Sometimes a user has a really good reason why they want to lock a record for a period of time. This is pretty much the only way to handle “checking out” of a record in a scenario where you have to multiple users wanting to make changes to a record concurrently. Locks can be Write Or ReadWrite locks. A Write lock allows users to read the data, but prevent them from changing that. When used in practice, this is usually going to immediately fail an operation, rather than make you wait for it.

The reasoning behind immediate fail for write is that if you encountered a record with a write lock, it means that it was either already written to or is about to be written to. At that case, your write is going to be operating on stale data, so we might was well fail you immediately. For ReadWrite locks, the situation is a bit different. In this case, we want to also prevent readers from moving on. This is usually done to ensure consistent state system wise, and basically, any operation on the record would have to wait until the lock is removed.

In practice,ReadWrite locks can cause a lot of issues. The moment that you have people start placing locks, you have to deal with lock expiration, manual unlocking, abandoned lock detection, lock maintenance, etc. About the only thing that they are good for is to allow the user to make a set of changes and present them as one unit, if we don’t have better transaction support. But I’ll discuss that in another post. In the meantime, just keep in mind that from my point of view, ReadWrite locks are pretty useless all around.


Published at

Originally posted at

Comments (16)

World’s smallest No SQL Database: Memory

The first item to go over with the World’s smallest No SQL database is how it uses memory. The codebase is really simple, which is pretty much the point, but can you think about the implications of the current codebase with regards to memory?

All the data is kept in a ConcurrentDictionary<string,byte[]>, and there is an unspoken assumption. Both keys and values are assumed to be small. What happens if that isn’t the case? Well, in .NET, you are going to get memory fragmentation pretty quickly. That means that on 32 bits, you are probably going to be able to store a lot less than the 2 GB of virtual memory that you get from the OS.

We will ignore that as well and move to other aspects of the code. What about how the system actually work in production? There is a lot of work that goes into ensuring that the CPU will be able to access memory efficiently. In particular, most dbs uses some form of an in memory data structure that take advantage of the L1, L2, L3 cache of the CPU. Usually something like a B+Tree. How does the dictionary compare here?

For that matter, what happens when we grow really big? What about when we are paging? How would that work? Can we just go over the keys in the dictionary without having to page through the values as well?


I think that you get my drift by now. And this is literally just me throwing stuff off the top of my head. There are a lot of other things that we need to worry about. For example, if we implement the sensible notion of a buffer pool, we might actually have a buffer that was released by a write but is currently being sent to the client, so we can’t just use buffers, we have to use ref counting as well.


Published at

Originally posted at

Comments (2)

Reviewing Lightning memory-mapped database library: Partial

Continuing in my quest to learn more, I decided to go over the LMDB codebase.

LMDB is:

LMDB is an ultra-fast, ultra-compact key-value data store developed by Symas for the OpenLDAP Project. It uses memory-mapped files, so it has the read performance of a pure in-memory database while still offering the persistence of standard disk-based databases.

It has some interesting feature set, and a really small codebase. I am anxious to see how they managed to do so much.

Interestingly, the data model for LMDB is quite different from the usual append-only / transaction log. Instead, it allows only a single concurrent writer, and modify the data in place. There also appears to be a lot of dire warnings regarding usage of long transactions, since they would result in increased file size, presumably because the db couldn’t find the pages the scavenge because they are locked by ongoing transactions.

One thing that I should note already, the code (I am currently about 1/3 of the way of lmdb.h file) is very well commented, and it explains a lot about what is going on. If the rest of the code is like that, this is going to be really nice to read. Okay, I take it back. The main files seems to be lmdb.h and mdb.c. The first one is 1,300 lines and the second is over 7,500 lines. Admittedly, this is impressive in the sense that pretty much everything is done there, and there are a lot of docs. But damn, I wish this was better organized. Right now my head feels like I need to pop up and take a breath.

I read ~2,000 lines of code so far, and I haven’t found anything that does something. It is all headers or comments or macros up until now. I skipped over to the code, and it is really hard to understand what is going on.

Take this code snippet:


It shows a lot of the things that make it hard to work with. effectively random naming convention (under_score, Pascal_naming, SHOUT_NAMING, etc), the goto trick is used WAY too much, lovely variable names such as n2. This is part of a method that goes on for something like 150 – 200 lines. And it include the following code:


Quick, can you tell me how many variables are declared here? And note that they are all local variables. I counted it twice, once getting to 17 and once getting 18.

That is just too much, I am not going to go any deeper. The leveldb codebase was easy to follow, it had structure. This codebase is just a code dump. It might be a really good codebase, for what it needs to do, but I literally can’t follow it.


Published at

Originally posted at

Comments (36)

World’s smallest No SQL Database: The Devil Is In The Details

So, what is the point of the code that I have shown here: http://ayende.com/blog/162691/worlds-smallest-no-sql-database

The point was to show where you start, and how easy it is, but all the details that you need to handle along the way. I decided that it would probably make for an interesting blog series about the sort of things that this example exposes.

In particular, I want to talk about:

  • I/O
  • Memory
  • Concurrency
  • ACID
  • Transactions
  • Searching
  • Scale up
  • Scale out
  • Aggregation
  • High availability
  • Backups
  • Monitoring

I might have a few more items along the way, but those are probably the most important ones.


Published at

Originally posted at

Comments (10)

Optimized, but way out of line

Yesterday I showed the following profiler results.


The first place someone would look at optimizing is the CaseInsensitiveComparator.Compare call, right? That is clearly taking a really long time.

In fact, we re-wrote that method to use pointer magic in order to get it to work faster. But that isn’t the issue. Look at the numbers. Compare is called 68 million times. That means that under the profiler, it is still completing an operation in 0.000073 ms. I am not even sure what time unit that is.

However, we can see that the major caller for Compare was FindGreaterOrEqual, right? And that is called only 11 thousand times. What does this tells you?

That tells me we have a Schlemiel the Painter's algorithm. So I double checked, and SkipList insert perf should be O(logN) on average. But what we were seeing is O(N) costs .Indeed, it appeared that we had a bug in which our skip list implementation would never do skips, essentially turning it into a sorted linked list, with insert perf of O(N). Once we fixed that bug… well, things got better, a lot better.


As you can see, the cost of this went down really hard. And now we can focus on other stuff.


Published at

Originally posted at

Comments (6)

Optimize this, dude!

So, we have a perf problem in the managed leveldb implementation. I pulled out dotTrace and looked at the numbers:


Do you see where the problem is?


Published at

Originally posted at

Comments (29)

World’s Smallest No SQL database

I used the following in a lecture called “Why you should never write your own database”. It has never been run, tested, or anything, but it serves as a good way to discuss the challenges involved in building real world  databases.

Here is the server side code:

   1: public class NoSqlDbController : ApiController
   2: {
   3:     static readonly ConcurrentDictionary<string, byte[]> data = 
   4:         new ConcurrentDictionary<string, byte[]>(StringComparer.InvariantCultureIgnoreCase); 
   6:     public HttpResponseMessage Get(string key)
   7:     {
   8:         byte[] value;
   9:         if(data.TryGetValue(key, out value) == false)
  10:             return new HttpResponseMessage(HttpStatusCode.NotFound);
  12:         return new HttpResponseMessage
  13:             {
  14:                 Content = new ByteArrayContent(value)
  15:             };
  16:     }
  18:     public void Put(string key, [FromBody]byte[] value)
  19:     {
  20:         data.AddOrUpdate(key, value, (_, __) => value);
  21:     }
  23:     public void Delete(string key)
  24:     {
  25:         byte[] value;
  26:         data.TryRemove(key, out value);
  27:     }
  28: }

And the client side code:

   1: public class NoSqlDbClient
   2: {
   3:     private readonly HttpClient[] clients;
   5:     public NoSqlDbClient(string[] urls)
   6:     {
   7:         clients = new HttpClient[urls.Length];
   8:         for (var i = 0; i < urls.Length; i++)
   9:         {
  10:             clients[i] = new HttpClient { BaseAddress = new Uri(urls[i]) };
  11:         }
  12:     }
  14:     public Task PutAsync(string key, byte[] data)
  15:     {
  16:         var client = clients[key.GetHashCode()%clients.Length];
  17:         return client.PutAsync("?key=" + key, new ByteArrayContent(data));
  18:     }
  20:     public Task DeleteAsync(string key, byte[] data)
  21:     {
  22:         var client = clients[key.GetHashCode() % clients.Length];
  23:         return client.DeleteAsync("?key=" + key);
  24:     }
  26:     public async Task<byte[]> GetAsync(string key)
  27:     {
  28:         var client = clients[key.GetHashCode() % clients.Length];
  29:         var r = await client.GetAsync("?key=" + key);
  30:         return await r.Content.ReadAsByteArrayAsync();
  32:     }
  33: }

And yes, that is a fully functional, scale out capable, sharding enabled No SQL Key/Value store in less than 60 lines of code.


Published at

Originally posted at

Comments (22)

Spot the bug…


Do you see it? It took me ages to find it.


Published at

Originally posted at

Comments (11)