Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 6 min | 1012 words

One of the most important aspects of a database engine is that it needs to support durability. Now, that is a personal opinion as a database author, but I consider this to be a pretty important consideration when selecting a database. Other database engines disagree, from pure in-memory databases that lose all their data on restart to databases that make a “best effort” and will work as long as they don’t crash in the wrong place.

Because there are several different metrics for what durability means, I’ll describe several levels of possible durability requirements.

  • None – if the application / server restarts, all the data is lost. Typically used for in-memory databases, which explicitly give up durability for performance.
  • Try – the data is written to disk, but no attempt is made to make sure that it is coherent / up to date. Typically used to start up an in-memory / near in-memory database from a cold state.
  • Crash – if the database has confirmed a write and then immediately crashed, the data is still there. Most databases try to get to this level.
  • Power loss – if the database has confirmed a write, even complete power loss of the entire machine will still keep the written data. This is where you want to be.

Note that in all those cases, I’m talking about single node databases; distributed systems are a lot more complex, so I’ll not touch them here.

That covers durability, but there is also the notion of atomicity. In other words, a transaction that is composed of multiple operations should either be there in its entirety (and certainly if confirmed to the client) or not be there at all (rollback). There should never be a situation where some of the records went in and some didn’t.

Finally, there is the paranoid mode for durability, in which you don’t trust the hardware, so you write to multiple locations, hash the data and validate it. Personally, I think that at that level it is the purpose of the file system to verify that, and this is where the responsibility of the database ends. But if you are stuck with a poor file system choice (like FAT a decade or two ago), that is certainly something you’d have to consider.

At any rate, one of the major problems with gaining durability is that you have to pay so much for it. In order to actually be durable, you have to write your changes to the actual disk platter, and that might require physical parts to move, which is very costly. How costly? A high end (15,000 RPM) hard disk can do a theoretical maximum of 250 such writes per second, and that is an extremely theoretical number. In most cases, even on high end hard disks, you’ll see a maximum of 100 – 150 per second. You can probably double or triple that for a high end SSD drive, but those are still very poor numbers compared to the number of operations you can do in memory and in the CPU in the same time frame.

That puts a pretty hard limit on the number of times you can hit the disk, and that is not something that we can tolerate, so the process of making a write in most operating systems looks like this:


Notice the number of buffers in the middle? Each of them is going to gather as much I/O as it can before issuing a (large) write to the layer below it, the idea being that this way, we can amortize the cost of going to the physical disk among many different writes. It works quite well, in fact, to the point where most of the time you don’t really notice how slow the hardware really is.

But for databases, this is a horrendous problem. To start with, if I’m issuing a write to the disk, I have no idea when it will actually hit the platter. For extra fun, all through this chain (which is actually simplified), each component is free to re-order the writes in any way it feels like, so we can’t even rely on the order of the writes to ensure consistency. No, in order to protect itself from data loss / corruption in the presence of power loss, the database has to tell the operating system to actually skip all those layers and flush directly to disk.

As I said, this is expensive to do. The normal way is to make your writes and then call fsync(fd) to flush those changes down the chain, all the way to the disk. This has a few issues, however. In particular, note that in the gap between the file system and the disk driver, we lose the correlation between writes made to a particular file and any other writes made to that device. The end result is that the fsync command forces us to flush the entire disk driver buffer (and the disk buffers, etc.). In production systems, that can be hundreds of MB that were innocently sitting there, slowly being written to disk, and suddenly you have this disruptive fsync that needs to be serviced, so everything is flushed to disk, and the fsync (which you expected to be short, because you wrote only a few dozen KB) now takes a minute, because it is actually flushing 100 MB of writes from totally different processes.
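
To make the write-then-fsync pattern concrete, here is a minimal sketch in Python (the blog’s own code is C#, so treat this as an illustration, not the actual RavenDB code). The fsync call is the point where we pay the price described above:

```python
import os
import tempfile

# Append a record and force it to stable storage. Without the fsync,
# the bytes may sit in the OS buffers in the diagram above and be
# lost on power failure even though the write() call succeeded.
def durable_append(path, data: bytes):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # flush through all the caching layers to the disk
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "journal.log")
durable_append(path, b"record-1\n")
durable_append(path, b"record-2\n")
```

Note that on a real system this single call can stall for as long as it takes to drain everyone else’s pending writes on the same device.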

This post is getting long enough, so I’ll defer the actual discussion on how databases achieve durability to the next one. In the meantime, just consider the complexities involved in making sure that the data is on the disk, and how much of the design of modern databases is spent on optimizing just this part.

time to read 6 min | 1046 words

This is an interesting post, because I’m going to lay down some of the options that we have for concurrency inside the database engine, but not really discuss them in depth. In particular, even with concurrency control, you get into… strange situations with databases (see transaction isolation levels and what problems each is trying to solve) because of the nature of the beast.

The ideal is, of course, that each client feels like they are the only client touching the database, and no one else can touch it while they are working. The easiest way to do that is to allow just a single client at a time to enter the database. That actually happens frequently, but in practice it isn’t something that you really want to do. Even embedded databases allow at least multiple readers, so that isn’t really something that we’ll deal with.

Before we get into the actual concurrency discussion, we first need to figure out what we are talking about with regards to concurrency inside the database engine. The holy grail is writers that don’t hold up readers, and readers that don’t hold up writers, allowing both reads and writes to proceed without blocking.

Before we get to the actual complexities involved in implementing it, let us see what kind of problems we have after we have solved it. In particular, the issue is when we have multiple clients reading / writing the same value concurrently. What are we supposed to do then? If we have W1 and W2 both trying to mutate the same record, which one will win? Do we serialize the accesses? What happens if we have W1 and R2 both modifying and reading the same record? Until the write transaction is completed, do we give the reader the old value, the new value, or make it wait?

The classic model in databases used to be that the database would take locks. You can think of them as reader/writer locks taken whenever it read / wrote a particular value. The release schedule for those locks would impact the transaction isolation level, but that isn’t really important for this discussion. Note that a lock per value is pretty expensive, so one of the things that a database engine would do is escalate the lock: if it noted that there were too many locks in a particular page, it would escalate to a page lock (and onward, until the entire tree / table was locked). That exposed a pretty nasty issue to users: if you had a hotspot in your database (recently modified records), it was easy to get into a situation where all the clients were waiting for the same lock, effectively causing a convoy. Note that in this case, both readers and writers are waiting for each other, and the idea is that we gain concurrency by spreading the locks around in the hope that they won’t contend so much.

Another way to handle this is called MVCC (Multi-Version Concurrency Control). In this manner, instead of overwriting a record immediately on change, we keep the old value, so readers don’t have to wait for the end of the transaction to get the value; they can get the old value (which we can give without waiting). Writers still need to wait for each other if they modify the same record, but we ensure that writers and readers never need to wait for one another. MVCC is a bit complex, because you need to maintain multiple concurrent versions, but it is a very common choice today.
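
A toy sketch of the MVCC idea (in Python, purely illustrative; the class and method names are my own, not from any real engine): every commit bumps a version counter, old versions are kept, and a reader pins the committed version that existed when it started, so it never blocks on a writer.

```python
import threading

# Toy MVCC store: a value is a list of (version, data) pairs. Readers
# take a snapshot (the latest committed version) and read against it;
# only writers serialize on the lock.
class MvccStore:
    def __init__(self):
        self._lock = threading.Lock()   # writers wait here, readers never do
        self._versions = {}             # key -> [(version, value), ...]
        self._committed = 0

    def snapshot(self):
        return self._committed

    def read(self, key, snapshot):
        # newest value whose version is visible to this snapshot
        for version, value in reversed(self._versions.get(key, [])):
            if version <= snapshot:
                return value
        return None

    def write(self, key, value):
        with self._lock:
            self._committed += 1
            self._versions.setdefault(key, []).append((self._committed, value))

store = MvccStore()
store.write("users/1", "foo@example.com")
snap = store.snapshot()                    # a reader starts here
store.write("users/1", "bar@example.com")  # a writer commits concurrently
old = store.read("users/1", snap)          # reader still sees the old value
new = store.read("users/1", store.snapshot())
```

A real implementation also has to garbage-collect versions no active snapshot can see, which is a large part of the complexity mentioned above.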

But a lot of the complexity around reader and writer locks is actually embedded in the notion of having a conversation with the database. The ability to issue multiple statements to the database on the same connection, with potentially human reaction times behind them, makes for a much more complex system, because you have to hold all the locks for the duration of the connection. In most newer databases, there is no such concept; a write transaction is held for the duration of a single command (or a batch of commands), which is processed independently and then committed immediately.

Under that scenario, it actually makes a lot more sense to skip the notion of concurrent writers and move to a concurrent readers / single writer mode. In that mode, there can be only a single write transaction, and the only concurrency that you have to deal with is among the readers (which can be handled efficiently with MVCC), which makes for a much simpler database design. Combine that with the serial nature of the I/O that databases depend on for durability (more on that in a future post), and this makes a lot of sense, since it removes a lot of the complexity from the database code.

RavenDB, LMDB, MongoDB and CouchDB are all built with a single concurrent writer in mind. In fact, even databases such as LevelDB or RocksDB are effectively single writer (they just do concurrent transaction merges).

Let us talk about transaction merging for a while, shall we? LevelDB in particular is an interesting case, because you can use the notion of a WriteBatch to write to it, and multiple threads can submit WriteBatch instances at the same time, giving us concurrent writes. The way it is implemented, though, is quite interesting. All those threads submitting all those WriteBatch instances will add their batch to a queue, then compete on a lock. The first one that wins will run through all the WriteBatch instances in the queue and commit them all.
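
Here is a rough sketch of that queue-and-leader pattern in Python (LevelDB itself is C++; this just shows the shape of the scheme, and all names are mine): each thread enqueues its batch and then competes on the lock, and whoever wins commits everything queued, including batches submitted by losing threads.

```python
import threading
from collections import deque

# Toy transaction merging, LevelDB-style: the winning thread drains the
# queue and commits everyone's batches; the others find theirs already done.
class Merger:
    def __init__(self):
        self.queue = deque()
        self.lock = threading.Lock()
        self.log = []        # stands in for the on-disk write-ahead log
        self.done = set()

    def submit(self, batch_id, ops):
        self.queue.append((batch_id, ops))
        with self.lock:
            if batch_id in self.done:
                return       # a previous lock holder already committed us
            while self.queue:
                bid, bops = self.queue.popleft()
                self.log.extend(bops)
                self.done.add(bid)

merger = Merger()
threads = [threading.Thread(target=merger.submit, args=(i, [f"op-{i}"]))
           for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
```

Note that nothing in this scheme tells you whether two queued batches touched the same record, which is exactly the objection raised below.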

It is a simple and attractive model, but I seriously dislike it. The problem is that you might have two WriteBatch instances that modify the same record, and you don’t really have a good way of knowing about that, so you can’t really reason about it. Since a single lock is taken anyway, I much prefer the single writer model, in which the write transaction knows that it is the only one that can modify things, so it doesn’t need to worry about concurrency with other transactions that might have different ideas about what is supposed to be in the database.

When we implemented transaction merging, we did it with explicit optimistic concurrency in mind, and it worked, but it was a complex model, and switching to a single writer made the whole thing much simpler.

time to read 7 min | 1249 words

One of the things that a database has to deal with is the fact that the actual physical data is stored on something like this:

Well, that isn’t strictly true today (HDD, SSD, NVM, etc.), but all of them have roughly the same external interface. You have a series of blocks of various sizes (from 512 bytes to 4KB, typically) that you can fill with some data, and that is pretty much it. That doesn’t sound too bad, right?

Well, consider the case of a simple text file, and the need to add a line at the beginning of the file. You can’t ask the hard disk or the file system to do that; you have to rewrite the entire file, first writing the new line, then writing the previous content of the file. This is obvious when you look at the basic file system API:
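
The point is easy to demonstrate in a few lines (Python here, though the limitation is the same in every language): there is no “insert at offset” primitive, so prepending means reading everything and writing the file over again.

```python
import os
import tempfile

# There is no "insert at position" in the file API: to put a line at
# the start of a file we must read the old content and rewrite it all.
def prepend_line(path, line: str):
    with open(path, "r") as f:
        old = f.read()
    with open(path, "w") as f:  # rewrites the whole file from scratch
        f.write(line + "\n" + old)

path = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(path, "w") as f:
    f.write("second line\n")
prepend_line(path, "first line")
```

For a small text file this is fine; for a multi-GB database file it is exactly the cost we cannot afford, which is what the rest of this post is about.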


You can open the file, seek to a particular location, and write there. There is no way for you to insert data at a particular position, no fappend() for the middle of a file, or anything like that. Note that I’m using the C API definition because this limitation is pretty much shared by any API in any language that you care to name. Oh, you can fake it, which is what all IDEs pretty much do, but that breaks down when you start working on large files.

Databases routinely deal with large files; a small database today is at least a GB or so, and typical database sizes are usually in the hundreds of GB or multiple TB. If we need to update an entry, we can’t just start moving things around. That is why a pretty large part of managing a database is how you structure the data on disk so you don’t have to move it around.

There are typically two ways to do that. Append only and pages.

We’ll start with the append only model, because it is conceptually simpler. Instead of worrying about allocating data on disk, and how to manage that, we simply always write at the end of the file. That way, we can let the file system worry about where to find new space for our data. This has the advantage of being pretty simple, and it is also a great help when you are worrying about consistency and durability (a subject for another post). The problem is that, obviously, you end up with a lot of wasted space, especially if you have a commonly modified record. Each time it changes, you write it to the end of the file, but the older version still exists. This requires you to do merges occasionally, to get rid of the older data.

It should be noted that the choice of the algorithm for storing the data and the choice of how to actually manage the data are typically pretty closely tied together. Most LSM solutions will use some form of append only mode (indeed, that is practically a requirement), while most databases using trees have a tendency to use paging (more on that later).

A good example of why mixing and matching the two doesn’t really work is the CouchDB file format. CouchDB uses append only mode for crash recovery, but it uses B+Trees as its data structure. That leads to some interesting problems. To start with, every change must rewrite the full height of the tree. As the database grows, the amount of data that you need to write on every change also grows. Another issue that database authors need to think about is the hot path for the file system cache. Modern databases will try to lean on what the operating system can give them, and by always writing to the end of the file, you keep filling the file system cache with new data, so a lot of the old data (which might be useful) ends up being kicked out, eventually forcing you to do more (expensive) I/O.

The other alternative is that instead of writing at the end of the file, we divide the file into evenly sized pieces. Those are typically called pages, and they are 4KB to 4MB in size on most databases. Note that the actual size matters quite a lot in how you work with them, but for now we’ll deal strictly with 4KB pages, because that is easiest. In this mode, whenever we need to modify some piece of data, we can modify its page as a single operation (seek to that page’s position in the file and overwrite the whole page). Since B+Trees are naturally paged, this makes them a very easy structure to work with. Except if you have a record whose size exceeds the size of the page.
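
The “seek and overwrite a page” operation can be sketched directly on top of the OS file API (Python on POSIX here; `os.pwrite` / `os.pread` write and read at an explicit offset without moving a shared file cursor):

```python
import os
import tempfile

PAGE_SIZE = 4096

# Overwrite page n in place: position n * PAGE_SIZE, write exactly one
# page. No other page in the file is touched by the update.
def write_page(fd, n, data: bytes):
    assert len(data) == PAGE_SIZE
    os.pwrite(fd, data, n * PAGE_SIZE)

def read_page(fd, n):
    return os.pread(fd, PAGE_SIZE, n * PAGE_SIZE)

path = os.path.join(tempfile.mkdtemp(), "data.db")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
write_page(fd, 0, b"a" * PAGE_SIZE)
write_page(fd, 1, b"b" * PAGE_SIZE)
write_page(fd, 0, b"c" * PAGE_SIZE)  # modify page 0; page 1 is untouched
page0, page1 = read_page(fd, 0), read_page(fd, 1)
os.close(fd)
```

This in-place update is exactly what the append only model avoids, and it is why paging needs the durability machinery (journals, etc.) discussed elsewhere in this series.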

In this case, you allocate as many pages as you need to fit the record; this is typically called overflow. In relational databases, those would be TEXT and BLOB columns, for example. They aren’t stored directly inside the B+Tree; instead, they are stored on the side. That additional hop is also why they tend to be more expensive than a value embedded inside the record.

The good thing about pages is that, assuming you use the same pages all the time, you teach the file system cache where your hot spots are, and it will try to keep them in memory as much as possible. The problem with managing pages yourself is that you also need to manage free space. If someone went ahead and deleted a bunch of records and freed a whole bunch of pages, you need to be aware that those are free, and when you next need to allocate a page, you want to allocate it from the free list, not from the end of the file.
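
A minimal sketch of such a page allocator (illustrative Python; the class name and shape are my own): freed page numbers go on a free list, and allocation prefers the free list over growing the file.

```python
# Toy page allocator with a free list: reuse freed pages before
# allocating new ones from the end of the file.
class PageAllocator:
    def __init__(self):
        self.next_page = 0   # next never-used page at the end of the file
        self.free_list = []  # page numbers that were freed

    def allocate(self):
        if self.free_list:
            return self.free_list.pop()  # reuse a freed page first
        page = self.next_page
        self.next_page += 1
        return page

    def free(self, page):
        self.free_list.append(page)

alloc = PageAllocator()
a, b, c = alloc.allocate(), alloc.allocate(), alloc.allocate()
alloc.free(b)
reused = alloc.allocate()  # comes from the free list, not the end
```

A real engine also has to persist the free list itself (often in pages of its own), so that free space survives a restart.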

Using pages also allows you to do some interesting tricks. For example, you can pre-allocate the file, which gives the file system a better opportunity to give you contiguous segments of space on the disk, reducing file system fragmentation. Note that so far I’m ignoring concurrency and durability entirely; those are topics for another post.

But there is something that we need to talk about with pages, and that is what happens when you have large pages. If the typical page size is 4KB, then just writing the whole page out whenever any value in it changes is reasonable. But if your page size is much larger (for example, 2MB), that becomes quite expensive, and not something that you want to do. Instead, what you typically do is write the changes to the page to a journal file (another topic that I’ll cover in the future) and keep the changes to that page in memory. Whenever a certain threshold is reached, you rearrange the whole page with all of the changes that are required (amortizing the cost of writing among many changes) and then write it out once.
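
That buffer-in-memory, journal-first scheme can be sketched like so (illustrative Python; the threshold, names and in-memory stand-ins are my own, not any real engine’s):

```python
# Toy write path for large pages: each change is recorded in a journal
# and buffered in memory; only when enough changes accumulate do we pay
# for rewriting the whole page, amortizing the cost.
class LargePage:
    THRESHOLD = 4  # arbitrary for the sketch

    def __init__(self, size):
        self.on_disk = bytearray(size)  # stands in for the page on disk
        self.journal = []               # stands in for the journal file
        self.pending = []               # changes not yet applied to the page

    def modify(self, offset, data: bytes):
        self.journal.append((offset, data))  # durable record of the change
        self.pending.append((offset, data))
        if len(self.pending) >= self.THRESHOLD:
            self.flush()

    def flush(self):
        # apply all buffered changes, then one big page write
        for offset, data in self.pending:
            self.on_disk[offset:offset + len(data)] = data
        self.pending.clear()

page = LargePage(2 * 1024 * 1024)
for i in range(4):
    page.modify(i * 10, b"x" * 10)  # the fourth change triggers the flush
```

On crash recovery, the journal is replayed to reconstruct the pending changes that never made it into the page itself.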

As you can imagine, this sort of thing has a major impact on the design of the database. If your page sizes are small (4 KB – 32 KB, let us say), you will handle things very differently than if your page size is much larger.

time to read 5 min | 845 words

I have talked before about the roles of logs (for debug statements) in RavenDB. In this post, I want to start talking about the actual concrete design we have of them.

Overall guidelines (in no particular order):

  • As little configuration as possible.
  • Development / Operational – the only two levels we support.
  • Idiomatic – it should be easy to read the logging code and see if it is good or not.
  • Builtin understanding of RavenDB mechanics – scoped to a database, etc.
  • As fast & cheap as possible (especially if disabled).
  • Memory efficient.

The idea that log configuration should be flexible is a great one, if you are talking about a generic logging framework. We, however, have just one configuration option: writing to a size-limited text file, that is it. (This isn’t strictly true, we also have HTTP endpoints that we need to take care of, but that is close enough to true.)

So the log file name format is going to be fixed at:

  • 2016-06-26.001.log
  • 2016-06-26.002.log

Each of which is going to be roughly 256MB in size.

How is this done? We spin off a single background logging task that has a list of per-thread queues, and it runs over all of those queues. Each time we log, we do all of the work of preparing the log entry on the sending thread, and just send the finished buffer to the background logging thread. The idea is that the background logging thread should be focused on just writing to the log file, nothing else. Technically speaking, it would be much better to have a log file per thread, but that would require us to merge them at a later point in time using dedicated tools, and it isn’t really worth it at this point.

Each thread actually has two queues: one that is used to communicate with the background logging thread, and the other to get back the buffers it allocated, for reuse. This allows us to drastically reduce the amount of memory held by logging, and it also gives us a good way to bound the amount of in-flight logging information.
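
The two-queue scheme can be sketched as follows (Python stand-in for the C# implementation; the class names and buffer count are mine, and real code would use lock-free queues rather than `queue.Queue`):

```python
import threading
import queue

# Toy version of one thread's logging channel: a fixed pool of buffers
# cycles between the sending thread and the background writer, so memory
# use is bounded by the number of in-flight buffers.
class ThreadLogChannel:
    def __init__(self, buffers=4):
        self.to_writer = queue.Queue()   # filled buffers to the writer
        self.for_reuse = queue.Queue()   # empty buffers coming back
        for _ in range(buffers):
            self.for_reuse.put(bytearray(256))

    def log(self, message: str):
        buf = self.for_reuse.get()       # blocks if all buffers are in flight
        buf[:] = message.encode()        # format on the sending thread
        self.to_writer.put(buf)

output = []
channel = ThreadLogChannel()

def background_writer():
    while True:
        buf = channel.to_writer.get()
        if buf is None:                  # sentinel: shut down
            break
        output.append(bytes(buf).decode())  # stands in for the file write
        channel.for_reuse.put(buf)          # hand the buffer back for reuse

writer = threading.Thread(target=background_writer)
writer.start()
for i in range(10):
    channel.log(f"message {i}")
channel.to_writer.put(None)
writer.join()
```

The real system has one such channel per thread, with the single background task draining all of them.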

We can process over 400MB/second of logs using this approach, on a laptop, so I’m pretty happy with the performance.

This has some particular behaviors, though. Because we are processing things on a per-thread basis, we’ll group operations from the same thread together in the log for a certain duration (by default, 16 messages), even if in practice they were interleaved. This actually makes it easier to read the log, but it makes it harder when you want to see the interleaving of different operations at the same time (a key aspect of the log in a multi-threaded environment). You can get the correlation from the log timestamps, of course, but that isn’t the same.

The limit of at most 16 operations per thread in the log should bound that, and the amount of cross-thread coordination we want to observe is likely to be lower in the RavenDB 4.0 codebase, so that is an acceptable compromise.

In terms of the API we have, we define the following:


The idea is that when you create a logger, you specify what your logger is for (the BulkInsertHandler, in this case), as well as which database this logger belongs to. This makes it easier to read the logs after the fact.

Then there is the usage:


Note that this looks like a regular log usage, but this API is actually very constrained. We have only two log levels:

  • Information – The full log
  • Operations – Stuff that admins need to look at

In particular, we don’t have Warn / Error levels, because they are incredibly hard to get right. What does it mean to log an error, if it is handled?

Another thing to note is that this API is actually really primitive: it accepts just a string (relying on C#’s string interpolation to make it nicer looking), and it assumes that it is always going to be called behind a check that logging is enabled. This lets us reduce costs when actually logging, because we can skip all those checks and the additional API surface that exists just to save performance when you don’t have the right log level enabled.

In this case, we are forcing the caller to take care of that, making our life much simpler.

Other things that we don’t allow are filtering by logger, hierarchies, etc. We don’t need any of those things, and they have a cost.

So this is how things stand right now with our logging system.

time to read 1 min | 64 words

The following code has just been written (never run, never tested).

Its purpose is to serve as a high speed, no-locking transport between two threads, one of them producing information, the other consuming it, in a bounded, non-blocking manner.

Note that this is done because the default usage of BlockingCollection&lt;T&gt; here generated roughly 80% of the load, which is not ideal.
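
Since the code itself isn’t shown here, the following is my own sketch of one common shape for such a transport, a bounded single-producer / single-consumer ring buffer (in Python; the original would be C# using volatile fields / Interlocked for visibility, where here the GIL covers us):

```python
# Sketch of a bounded, lock-free SPSC queue: only the producer writes
# `head` and only the consumer writes `tail`, so no lock is shared.
# Both operations fail fast instead of blocking.
class SpscQueue:
    def __init__(self, capacity=1024):
        self.buffer = [None] * capacity
        self.capacity = capacity
        self.head = 0  # written only by the producer thread
        self.tail = 0  # written only by the consumer thread

    def try_enqueue(self, item) -> bool:
        if self.head - self.tail == self.capacity:
            return False                 # full: caller decides what to do
        self.buffer[self.head % self.capacity] = item
        self.head += 1                   # publish after the slot is written
        return True

    def try_dequeue(self):
        if self.tail == self.head:
            return None                  # empty: fail instead of blocking
        item = self.buffer[self.tail % self.capacity]
        self.tail += 1
        return item

q = SpscQueue(capacity=2)
q.try_enqueue("a")
q.try_enqueue("b")
full = q.try_enqueue("c")   # rejected, the queue is bounded
first = q.try_dequeue()
```

The single-producer / single-consumer restriction is what makes the lock unnecessary; with multiple producers or consumers this design is no longer safe.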

time to read 4 min | 664 words

Greg Young has an interesting reply to my post here. I’m going to reply to it here.

Neither RavenDB nor EventStore should be written in C#.

That may be true for EventStore, but it isn’t true for RavenDB. Being able to work with the .NET framework makes a lot of things a lot simpler. Let us take a simple example: every day, I have to clear the unused auto indexes, but only if there have been queries to the collection in question. In C#, here is the code to give me the last query time for all collections in the database:


Now, if we wanted to do this in C, as the go-to example for system level programming, I would need to select which hash table implementation to use. RavenDB being a non-trivial project, I’ll probably have one already; for simplicity’s sake, I selected this one, which gives us the following:


I’ve elided all error handling code, and I’ve introduced a silent potential “use after free” memory issue.

This is something that runs once a day, but the amount of code, effort and potential things to manage is quite a lot, compared to the simplicity we get in C#.

Now, this is somewhat of a straw man argument, because the code would be much simpler in C++ or in Go (no idea about Rust, I’m afraid), but it demonstrates the disparity between the two options.

But Greg isn’t actually talking about code.

A different environment such as C/C++/Go/Rust would be far superior to having been written in C#. Cross compiling C# is a pain in the ass to say the least whether we talk about supporting linux via mono or .netcore, both are very immature solutions compared to the alternatives. The amount of time saved by writing C# code is lost on the debugging of issues and having unusual viewpoints of them. The amount of time saved by writing in C# is completely lost from an operational perspective of having something that is not-so-standard. We have not yet talked about the amount of time spent in dealing with things like packaging for 5+ package managers and making things idiomatic for varying distributions.

I have one statement in response to this. Mono sucks, a lot.

I suspect that a lot of Greg’s pain has involved Mono directly. And while I have some frustrations with .NET Core, I don’t agree with lumping it together with Mono. And Greg’s primary market was Linux admins. While I would love for that to be a major market for us, we see a lot more usage in Windows and Enterprise shops. The rules seem to be simpler there. I can deploy .NET Core along with my application, no separate install necessary, so that makes things easier. What we have found customers want is docker instances, so they can just throw RavenDB up there and be happy.

A lot of the pain Greg has experienced isn’t relevant any longer.

What is more, he keeps considering just the core product. There are quite a lot of additional things that need to happen to take a project to a product. Some of them are external (documentation, for example) and some are internal (a high quality management studio, good monitoring and debugging, etc.). I have seen what “system level guys” typically consider sufficient in and of itself in those regards. Thank you, but no.

A lot of our time is actually invested in making sure the whole thing is better, spreading oil over the machinery, and the more lightweight and easy we can make it, the better. 

time to read 4 min | 651 words

In my previous post, I talked about B+Trees and how they work. Unfortunately, just having a B+Tree isn’t enough. A B+Tree allows you to do queries (including range queries) over a single dimension. In other words, in the case of our users’ data, we can easily use the B+Tree to find a particular entry by the user’s id.


However, if we wanted to find a user by email, or by name, a B+Tree of the users’ ids isn’t very helpful. That means that if we wanted to do a query based on the user’s email, we would need to do a table scan, run over all the records in the database and match each of them to the email we are looking for. That has a pretty significant cost to it, and it is generally something that you want to avoid at all costs.

This is where we introduce another B+Tree, this one using the users’ email as the key, and whose value is the users’ id. It looks something like this:

  • bar@funyun.org –>  users/12
  • foo@example.com –> users/1

So finding a user by their email is going to involve:

  1. Searching for the relevant email in the emails B+Tree.
  2. Once found, searching for the user’s id in the documents B+Tree.

In other words, we have to do 2 B+Tree lookups (at a cost of O(logN) each) to do this lookup.
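The two-lookup pattern can be sketched with sorted arrays and binary search standing in for the B+Trees (each `lookup` is O(logN), mirroring the tree search; the data values here are made up for the example):

```python
import bisect

# Stand-in for a B+Tree lookup: binary search over a sorted key array,
# returning the associated value, or None when the key is absent.
def lookup(keys, values, key):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

# the "emails" index: email -> user id
email_keys = ["bar@funyun.org", "foo@example.com"]
email_vals = ["users/12", "users/1"]
# the "documents" tree: user id -> record
doc_keys = ["users/1", "users/12"]
doc_vals = [{"name": "Foo"}, {"name": "Bar"}]

# finding a user by email: first the index, then the data tree
user_id = lookup(email_keys, email_vals, "bar@funyun.org")
record = lookup(doc_keys, doc_vals, user_id)
```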

Congratulations, you just used an index. You have probably heard of indexes, and how they can dramatically improve performance (or kill it). At the most basic level, an index is just another B+Tree in the system that has some of the records’ data indexed. The email B+Tree is what happens when you index an email column, for example.

And it is also the reason why indexes are expensive. Consider what happens when you insert new users into such a system. The database is going to update the two B+Trees: one for the actual data, and the other for the emails, whose values are the keys of the data. And while we can arrange for the users’ ids to be sequential, it is very unlikely that the emails will be sequential, so the index is likely to be a bit more fragmented. That said, the index isn’t going to be as big as the actual data, so more entries are going to fit per page. In particular, while we can fit 5 entries per leaf page in the data B+Tree, the email index B+Tree is going to hold around 70 entries per leaf page. That should help keep it shallower than the data B+Tree (and thus cheaper).

Note that in database terms, both B+Trees are actually indexes. The data B+Tree is a clustered index on the user id, and the email B+Tree is a non clustered index on the email. What is the difference between the two? A non clustered index contains as its value the key to find the relevant record. A clustered index contains the actual record data. In other words, after looking up a non clustered index, we need to go to the clustered index to find the actual data, while if we are using a clustered index, we are done in just one lookup.

This is also the reason that you often have to do a balancing act. You have to choose which fields you’ll index, because the more fields you have indexed, the faster your queries can be (O(logN) vs. O(N)), but on the other hand, every write needs to update more B+Trees.

Finally, I’ll invite you to think about what happens if we have an index (a B+Tree) on a date field, for example a LastLogin field, and we want to say “give me all the users who haven’t logged in for the past 30 days”. How would you do that, and what would the cost be, using the system I outlined in this post?

time to read 3 min | 526 words

This post is in reply to Rob’s post, go ahead and read it, I’ll wait.

My answer to Rob’s post can be summarized in a single word:

In particular, this statement:

it is impossible to be a good .NET developer. To work in a development shop with a team is to continually cater for the lowest common denominator of that team and the vast majority of software shops using .NET have a whole lot of lowest common denominator to choose their bad development decisions for.

Um, nope. That only applies to places that are going for the lowest common denominator. To go from there to all .NET shops is quite misleading. I’ll give our own example of building a high performance database in managed code, which has very little lowest common denominator anything anywhere, but that would actually be too easy.

Looking at the landscape, I can see quite a lot of people doing quite a lot of interesting things at the bleeding edge. Now, it may be that this blog has a self selecting crowd, but when you issue statements such as “you can’t be a good .NET developer”, that is a pretty big statement to stand behind.

Personally, I think that I’m a pretty good developer, and while I dislike the term “XYZ developer”, I do 99% of my coding in C#.

Now, some shops have different metrics; they care about predictability of results, so they will be extremely conservative in their tooling and language usage, the old “they can’t handle that, so we can’t use it” approach. This has nothing to do with the platform you are using, and everything to do with the culture of the company you are at.

I can certainly find good reasons for that behavior, by the way. When your typical product lifespan is measured in 5 – 10 years, you have a very different mindset than if you aim, at most, a year or two ahead. Making decisions based on brand new stuff is dangerous; we lost a lot when we decided to use Silverlight, for example. And the decision to go with CoreCLR for RavenDB was made with an explicit back-off strategy in case that sank too.

Looking at the kinds of places people leave .NET for, it traditionally has been the green, green hills of Rails, then it was Node.JS, and now I think it is Elixir, although I’m not really paying attention. That means that in the time a .NET developer (assuming that they are investing in themselves and continuously learning) invested in their platform, learning how to make it work properly, the person who left for greener pastures has had to learn multiple new frameworks and platforms. If you think that this doesn’t have an impact on productivity, you are kidding yourself.

The reason you see backlash against certain changes (project.json coming, going and then doing disappearing acts worthy of Houdini) is that there is value in all of that experience.

Sure, sometimes change is worth it, but it needs to be measured against its costs. And sometimes those costs are non-trivial.

time to read 11 min | 2098 words

The absolute worst thing about B+Trees is their name. It is like someone maliciously considered all the possible names and then selected the most confusing one.

A B+Tree is a very different animal from a binary tree, in particular. In the previous post, we looked at how we can implement a binary tree using raw file I/O. The problem is that the cost of doing that is incredibly high. The cost of searching a binary tree is O(log N), which is considered good, but as it turns out, the actual cost of doing that against a file is really high. The seek time on a high end HDD is about 12ms (!), and the seek time on a high end SSD is 0.15ms (!!) *.

* If you ever wondered why pretty much every database vendor jumped on the SSD bandwagon, that kind of difference in performance is a very large part of it.

So, let us assume that we have 100 records in our database. log2(100) rounds up to 7, so on an HDD it will take ~84ms just to do seeks (and just over 1ms on an SSD). Remember, this is when we have a mere hundred records. What happens if we have ten thousand? log2(10000) would be 14 (13.29, actually, but we need to round up), which would cost us 168ms on an HDD and over 2ms on an SSD.
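The arithmetic above can be checked mechanically. Here is a quick sketch in Python; the 12ms and 0.15ms seek times are the figures from above:

```python
import math

HDD_SEEK_MS = 12.0   # high end HDD seek time, from above
SSD_SEEK_MS = 0.15   # high end SSD seek time, from above

def seek_cost_ms(records, seek_ms):
    # A balanced binary tree needs ceil(log2(N)) lookups,
    # each of which is a separate disk seek.
    seeks = math.ceil(math.log2(records))
    return seeks, seeks * seek_ms

print(seek_cost_ms(100, HDD_SEEK_MS))     # (7, 84.0)
print(seek_cost_ms(10_000, HDD_SEEK_MS))  # (14, 168.0)
print(seek_cost_ms(10_000, SSD_SEEK_MS))  # just over 2ms on SSD
```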

Consider the fact that those data sizes don’t even count as toy databases; they are at the stage where you throw everything into a file and do sequential scans over the whole dataset to find anything, because that is faster (in dev time and in what the user will feel) than any other solution.

So something must have been done about it, because databases with more than a million records exist with response times of under a quarter of a second. The answer to that is the B+Tree. Instead of tracking each value individually, we’ll define pages. A page has a fixed size, typically 4KB, and can contain a number of values. Instead of working with individual values, we will work with pages. Let us consider the users example, and see how this works.


Now, all 3 entries are stored in the same page, and we store them sorted inside the page. Assuming a reasonable size of data per record, we can store multiple records in this single page. This allows us to take advantage of an interesting property of I/O: the costly part is going to the hardware, much less so how much work you do there. So the cost of reading 400 bytes and the cost of reading 4KB are basically the same*.

* This is actually a lie, on several levels. Reading 32KB or reading 64KB from disk has roughly the same performance. But reading 400 bytes and reading 4KB really is the same cost, because the OS will never issue a 400 byte read; it will read 4KB at once, and then serve the next reads from the cache (and frequently it will do read-ahead, too).

Once we have the page in memory, we do a binary search inside the page to find that particular value. So for the records in the image above, the binary tree method would result in 3 disk accesses, while the page method needs just one. Clearly, the page method is better, but eventually you are going to run out of space in the page. What do you do then?
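A minimal sketch of that in-page search, assuming the page is modeled as two parallel sorted Python lists rather than a real 4KB buffer:

```python
from bisect import bisect_left

def search_page(keys, values, key):
    # keys is kept sorted, so a binary search finds the slot directly.
    # This is in-memory work, which is effectively free next to the
    # single disk read that loaded the page.
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None

page_keys = ["users/1", "users/2", "users/3"]
page_vals = ["alice", "bob", "carol"]
print(search_page(page_keys, page_vals, "users/2"))  # bob
```

Note that the comparison is lexical on the key strings, which is also why “users/10” sorts before “users/2”, a point that matters later.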

That is when you introduce a new level, and actually get a tree.


Now we have a much more interesting setup. We have two leaf pages, each of which holds the actual data for the users, and we have a branch (and root) page that holds the lowest key of each page it points to. The logic for finding a particular value in this setup goes like this:

  1. Read the root page
  2. Do a binary search on the page to find the first key that is larger than the key you are searching for
  3. Load the previous key’s page.
  4. Search for the value in this page.

In other words, if we are searching for “users/4”, we first load the root page, and we find that users/5 is the first key bigger than users/4. So we go to the previous entry’s page, load it, and do a binary search there to find users/4. The action that happens when a page is full and needs more space is called a Page Split. In the case above, we had a page split at exactly users/10, which is a random write as far as the B+Tree is concerned (we are doing lexical comparisons). Let us see how it would split in the case of sequential inserts.
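The lookup steps above can be sketched like so. This is a toy model: the branch page is a plain dict of sorted keys and child pages, not a real disk page, and it assumes the search key is not below the first key in the root:

```python
from bisect import bisect_right

def find(root, key):
    # Steps 1-2: binary search the root's keys; bisect_right - 1 lands
    # on the last entry whose lowest key is <= our search key, i.e.
    # the "previous" entry from the first key larger than ours.
    i = bisect_right(root["keys"], key) - 1
    leaf = root["children"][i]   # step 3: load that page
    return leaf.get(key)         # step 4: search inside the leaf

root = {
    "keys": ["users/1", "users/5"],          # lowest key of each child
    "children": [
        {"users/1": "a", "users/4": "b"},    # first leaf page
        {"users/5": "c", "users/9": "d"},    # second leaf page
    ],
}
print(find(root, "users/4"))  # b
```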


In this case, when we hit a full page, we did it with a value that was supposed to go at the end of the page, so it is treated as a sequential insert. Instead of splitting the page, we can just allocate a new page, put the new value there, and then wire it to the parent. This allows us to avoid wasting space when we are doing sequential writes. Note that this particular behavior is a (pretty common) optimization in the implementation that I’m using.
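A sketch of that split decision; the middle-split ratio and tuple-based pages are my own simplifications for illustration, not how a real implementation lays out a page:

```python
def split_for_insert(page, new_key, new_value):
    # page is a sorted list of (key, value) tuples; we are called
    # because the page is full and new_key still needs a home.
    if new_key > page[-1][0]:
        # Sequential insert: the new value belongs past the end of a
        # full page, so allocate a fresh page instead of splitting;
        # the old page stays fully utilized.
        return page, [(new_key, new_value)]
    # Random insert: split down the middle, then place the new entry
    # in whichever half covers its key range.
    mid = len(page) // 2
    left, right = page[:mid], page[mid:]
    target = left if new_key < right[0][0] else right
    target.append((new_key, new_value))
    target.sort()
    return left, right
```

The sequential fast path is what keeps ordered inserts at ~100% page utilization, while the middle split leaves two half-full pages behind.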

There are a few important things to remember here. The number of entries that can fit into a leaf page is primarily limited by the size of the values. In the pictures above, each value is exactly 409 bytes in size, and we can fit 9 of them into a single page. But the size of a branch page entry is limited by the size of the key alone. So let us see what happens when we start writing a lot of new records. We wrote a total of 1,300 records, and we ended up with a root page that points to about 150 leaf pages.
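Rough capacity math makes the leaf/branch difference concrete. The per-page and per-entry overheads below are assumptions for illustration, not Voron’s actual layout, but with them the 409-byte values come out to 9 per page, matching the figure above:

```python
PAGE_SIZE = 4096
PAGE_HEADER = 64      # assumed per-page header size
ENTRY_OVERHEAD = 8    # assumed per-entry bookkeeping (offset, flags)

def entries_per_page(value_size):
    return (PAGE_SIZE - PAGE_HEADER) // (value_size + ENTRY_OVERHEAD)

print(entries_per_page(409))  # 9 entries of 409-byte values
print(entries_per_page(16))   # a branch page of ~16-byte keys: 168
```

Because a branch page stores only keys, its fanout is an order of magnitude higher than the leaf fanout, which is what keeps the tree shallow.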


If I want to find the record for users/1025, I need to first read the root page, and then do a binary search there. I find that the entry users/1030 is the first entry that is larger than it, so I go to page 54, and do a binary search there, finding the right value.


All I had to do was two disk operations. If I were using a binary tree, I would have to do 11 disk requests; at 12ms per seek on an HDD, that is 24ms for the B+Tree versus 132ms for the binary tree. Note that in terms of comparisons, I have to make 12 comparisons in the B+Tree, and 11 in the binary tree. I don’t count that time, since in-memory work is effectively free in the context of any I/O work.

But our tree is now in a precarious situation. What happens when we continue to add records to it? The root page is going to be full at some point, right? Here is what happens after we continue to add more items:


The depth of the tree has increased. We now have a depth of 3, and all queries will need 3 disk operations to complete. The cost has increased, but we are still an order of magnitude cheaper than the binary tree work.

Note that in this case, we have a total of 590KB of data, but the total size on disk is 736KB. This is because we can only fit 9 entries in a page, leaving about 400 bytes per page unused. We also inserted effectively random data, so we have page splits in the middle, which may leave pages with even fewer records than that.


In this particular case, we have a total of 31 leaf pages with only 5 entries each (out of 176 leaf pages). So the total wasted space is about 50KB, lost because of uneven page splits.
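That 50KB figure checks out: each underfull page is missing 4 of its 9 entries, at 409 bytes per entry:

```python
ENTRY_SIZE = 409       # bytes per value, from the example above
FULL_ENTRIES = 9       # entries in a full leaf page
ENTRIES_PRESENT = 5    # entries actually in the underfull pages
UNDERFULL_PAGES = 31

wasted = UNDERFULL_PAGES * (FULL_ENTRIES - ENTRIES_PRESENT) * ENTRY_SIZE
print(wasted, round(wasted / 1024, 1))  # 50716 bytes, ~49.5 KB
```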

Note that a B+Tree with values of this size is extremely strange. Typically the size of an entry is a few tens of bytes (we’ll talk about larger values later), and a single page contains tens of records or more. If we consider 50 entries per page, then the same number of pages would allow us to hold more than five times as much data. The key part, regardless of the number of records, is that this allows us to do so in very few disk accesses.
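To see why the fanout matters so much, here is a sketch of how tree depth, and therefore disk accesses per lookup, grows with the record count. The 50 entries per leaf and 150 keys per branch page are assumed figures for illustration:

```python
import math

def tree_depth(records, leaf_fanout, branch_fanout):
    # Count the leaf pages needed, then the levels of branch
    # pages stacked above them until a single root remains.
    pages = math.ceil(records / leaf_fanout)
    depth = 1
    while pages > 1:
        pages = math.ceil(pages / branch_fanout)
        depth += 1
    return depth  # = disk accesses per lookup

for n in (10_000, 1_000_000, 100_000_000):
    print(n, tree_depth(n, 50, 150))
```

Under these assumptions, even a hundred million records need only 4 disk accesses per lookup, versus ~27 for an on-disk binary tree.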

Notice that so far I didn’t really talk about balancing, which is a crucial part of how you deal with in-memory trees. A B+Tree doesn’t need rebalancing all the time, but it does have a pretty crucial behavior when we reach the limit of a page. To make things simple, we’ll talk about a leaf page, but the same holds true for a branch page as well. Because the page is full, we need to split it. If your data is inserted in a sequential manner, it is easy to see that we’ll only split when a page is as full as it can be, and new inserts will go to new pages. It also means that we have the minimal possible depth of the tree, because the number of splits we make is so much lower.

On the other hand, in the case above we inserted 1,475 random entries and created a non-optimal B+Tree. It is important to observe that a non-optimal B+Tree doesn’t mean something catastrophic. In the case of keys like the ones we see here, the data is actually going to become sorted (once we have enough digits in the number, the prefix won’t change often), so the issue will sort itself out, if you’ll pardon the pun.

In the case of truly random data (GUIDs, names, URLs, etc.), there is no real way to have it sort itself out, and we just have to live with the tree fragmentation. This fragmentation leads to deeper trees, which means more disk seeks. That is why all databases really like it when you feed them sequential data; that is naturally the least work they have to do.

In practice, databases will do things like steal entries from nearby pages to ensure some balance, run cleanup procedures to see if they can trim the tree, and in general try to fix this issue. The problem is that all of those behaviors are based on heuristics, and they can result in pretty bad performance. The good thing about a fragmented tree is that once you reach a certain fragmentation level, you can usually add new data to a page (which is typically not full) without forcing additional splits. The bad part is that you are using more disk space and more disk operations. But sometimes that is the cost you have to pay.

This post is getting long enough as it is, so I’ll continue the discussion of the usage of B+Trees in databases in my next post. In closing, I’ll note that the pretty pictures that you saw in the post are actual output generated from a live RavenDB system. You can go to any Voron-based database and ask it to print out its internal structure, and it will do as you command. The URL to use is: http://[your-ravendb-server]/databases/[your-ravendb-database]/admin/voron/tree?name=documents

This is a debug level command, meant for us to inspect the internal state of the database; don’t use it in production, etc.

time to read 1 min | 145 words

The title of this post reminds me of a Bones episode. This blog post is actually from an internal mail by Federico Lois, as part of bigger perf work.


Here is the calling code; note that this is typically called many times, and _pageStates is a HashSet&lt;PagerState&gt;.


The following change was made:


And here are the results:


That is a 230% improvement! I can identify with the sentiment.

