Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,502
Comments: 51,090
Privacy Policy · Terms
filter by tags archive
time to read 10 min | 1920 words

I just read this (excellent) introduction to the Secure Drop protocol, meant for journalists to have a way to safely accept anonymous information from sources.

Using Tor and something like Tails (with a pretty good guide on how to connect to it from an unusual location) you can probably get to the right website in an anonymous manner. With the current version of Secure Drop, you go to a website, such as TechCrunch, and get a .onion address that you can put into the Tor browser. You then get a codename and can upload files and a message.

You are trusting the server to a very large extent, as far as I can see. If the server you are reaching has been compromised, it can just send anything you add directly to the Bad Guys. I mean, the website recommends using GPG to encrypt the messages locally, but then it gives you the GPG key.

For fun, I tested this with the Washington Post. They have a couple of ways to provide information anonymously. One is via Secure Drop and the other is via GPG email. The keys for both are different. That doesn’t inspire confidence.

For that matter, the TechCrunch instance is running on version 2.5.2, whereas the current version is 2.8.0. I assume that those instances are basically set up once and then never maintained. For context, the TechCrunch current version of Secure Drop was released in Feb 2023, quite a while ago.

Let’s assume that there is a security vulnerability (I know of none, just to be clear) in an old version that hasn’t been updated. What happens if it is attacked and taken over by an adversary? Since the current model is to trust the server, you get access to everything.

I think that the new Secure Drop protocol's whole point is to get end-to-end encryption and break the requirement to trust the server. And more to the point, I think that the way they do that is really interesting.

Here are their stated requirements:

  1. No accounts, and therefore no user authentication.
  2. No message flow metadata, meaning messages can’t be linked together, and different types of messages are indistinguishable from one another.
  3. No changes in server state are observable externally.
  4. No ciphertext collection or information leaks via trial decryption – a given recipient receives pertinent ciphertext only.

Basically, there are three methods:

  • Send
  • Fetch
  • Download

A source needs to send information to a journalist. The journalist publishes their public key somewhere. For example, as a QR code in a physical magazine, etc. Using the journalist's public key, we can start the ball rolling.

They encrypt the data locally and then Send the encrypted file to the server, where it is stored. The next step is for the journalist to Fetch and Download the data.

Each time the source sends information to the server, it is considered a separate action, and there is no way to link them together. To reply back, the sender needs to include their public key in their message.

But there is a problem here. You don’t want the server to know who is the recipient of the messages. Note that this holds both ways, you should not be able to see that a particular journalist got a message or trace a reply to the source.

In other words, the server should have as little information as possible about what is going on. And the way they implemented that feature is absolutely beautiful. Remember, the source will encrypt the message locally using the journalist’s public key.

That is done using an ephemeral key pair, used only for this message. Along with the encrypted message, they send the (ephemeral) public key for the message and also compute a Diffie-Hellman key from the ephemeral private key and the journalist public key.

This is what it would look like:

def send(msg, journalist_public_key):
   tmp_key_pair = generate_key_pair()
   enc_msg= encrypt(msg, journalist_public_key)
   msg_gdh = scalar_mult(tmp_key_pair.private, journalist_public_key)
   return enc_msg, msg_gdh, tmp_key_pair.public

The Gdh here refers to Group Diffie-Hellman. Note that this isn’t actually used for anything, we just send that to the server.

When a journalist tries to fetch messages, they aren’t providing any information to the server. The server has no way of knowing who it is talking to. And yet the idea is to send information that only the journalist can read. In order to do that, the server will generate an ephemeral key pair (only for that request) and Diffie-Hellman key from the current request’s private key and Gdh value provided by the source.

The server will compute another Gdh value. This time between the message’s public key and the private key of the request. It will send the encrypted message, the public key of the request and the server computed Gdh value to the journalist. Here is what the code looks like:

def fetch(msg_id, msg_gdh, msg_public_key):
   tmp_key_pair= generate_key_pair()
   enc_key = scalar_mult(tmp_key_pair.private, msg_gdh)
   msg_srv_gdh = scalar_mult(tmp_key_pair.private, msg_public_key)
   enc_msg_id = encrypt(msg_id, enc_key)
   return enc_msg_id, msg_srv_gdh, tmp_key_pair.public

That looks like utter nonsense, right? What is going on here? Let’s go back to first principles. Classic Diffie-Hellman is based on multiplication on a finite field. Let’s assume that we agreed on Base = 5, and Modulus = 23 (values taken from Wikipedia).

The journalist generates a private key: 13 and computes 513 mod 23 = 21. The journalist thus publishes 21 as its public key.

Journalist key pair
Private: 13Public: 21

The source generates a private key (for a single message): 3 and computes 53 mod 23 = 10. The source also computes a Diffie-Hellman key exchange from the message private key and the journalist public key using:  213 mod 23 = 15. That is marked as the message GDH (Group Diffie-Hellman).

JournalistSource (per message, ephemeral)Adversary knows…
Private: 1Public: 5

Private: 3Public: 10

Message GDH: 15

Journalist public key: 5

Note that usually, the GDH value (15 in our case) would be used as the agreed-upon encryption key between the source and the journalist. Instead of doing that, we are jumping through a few more hoops.

When the source sends a message to the server, it sends it its public key (10) and the GDH (15). This is stored internally and isn’t really doing anything interesting until a journalist needs to fetch messages from the server.

This is where the magic happens. During the processing of the fetch request, the server doesn’t have any idea who the journalist is, there are no keys or authentication happening. A key requirement is that the journalist is able to get new messages without the server knowing what messages they are able to read.

Let’s consider Lois and Clark as two journalists, and assume for simplicity that there is only a single message involved. Both Lois and Clark send a request to the server to get new messages. Since the server cannot tell them apart, it will respond in the same way.

The first step in the server processing the request is to generate an ephemeral private key: 9 and compute 59 mod 23 = 11. It will then compute a Group Diffie-Hellman, between its private key and the message public key: 109 mod 23 = 20, this is called the request GDH. It also computes another Group Diffie-Hellman, this time with the message GDH and its private key: 159 mod 23 = 14. That serves as the encryption key for the message, and the server uses that to send an encrypted message ID to the journalist.

Journalist (Lois)Journalist (Clark)Server
Private: 13Public: 21

Private: 17Public: 15

Request Private: 9Request Public: 11

Request GDH: 20

Encryption key: 14

--Server reply:
--Request GDH: 20encrypted(msg_id, 14)

Note that in the table above, the reply from the server is computed in the same manner to both journalists (the server cannot tell which is which). However, for each request, the server will generate a new request key pair. So on each request, you’ll get a different result, even for the same data.

Now, what can the journalists do with this information? We have the message public key and the request GDH. Let’s have Lois do Group Diffie-Hellman on them: 2013 mod 23 = 14. That gives us the encryption key that was used by the server. We can use that to decrypt the message the server sent (which by itself is a reference to the actual encrypted message held by the oblivious server).

What about Clark? Using his own key, he can compute 2017 mod 23 = 7. And that means that he cannot get the encryption key to figure out what the server sent.

How does this work? Let’s break it down.

Message GDH213 mod 23 = 15        
Encryption key for server159 mod 23 = 14
Request GDH109 mod 23 = 20
Encryption key for journalist2013 mod 23 = 14

The reason we get the same encryption key is that we actually end up computing:

enc_key_server = pow( 
   pow(journalist_public_key, message_private_key),
) % 23 # ((21^3)^9) % 23 = 14

enc_key_journalist = pow( 
   pow(message_public_key, request_private_key),
) % 23 # (((10^9)^13) % 23 = 14

In other words, when the server computes the encryption key, it involves:

  • Message public key
  • Journalist public key ^ message private key
  • Request private key

And when the journalist computes it, it involves:

  • Request public key
  • Journalist public key ^ message private key
  • Journalist private key

That is a really cool result. It makes sense, but it was non-obvious. Hence why I wrote it out in detail to understand it properly.

In terms of cryptographic theory, I’m not aware of any other use of something like that (note, not a cryptographer), and it would probably require a thorough review by actual cryptographers to verify.

In practical terms, actually making use of this is pretty hard. It uses a pretty low-level cryptographic primitive that is usually not exposed by modern encryption API. More to the point, I think that it isn’t a good direction to go toward, and I have in mind a much simpler solution instead.

time to read 4 min | 698 words

I read this blog post from Discord, which presents a really interesting approach to a problem that they ran into. Basically, in the cloud you have the choice between fast and ephemeral disks and persistent (and much slower) disks.

Technically speaking you can try to get really fast persistent disks, but those are freaking expensive. Ultra disks on Azure and io2 disks on AWS and Extreme persistent disks for GCP. The cost for a single 1TB disk with 25K IOPS would be around 1,800 USD / month, for example. That is just for the disk!

To give some context, an r6gd.16xlarge instance with 64 cores and 512 GB (!) of RAM as well as 3.8 TB of NVMe drive will cost you less than that. That machine can also do 80K IOPS, for context.

At Discord scale, they ran into the limitations of I/O in the cloud with their database and had to come up with a solution for that.

My approach would be to… not do anything, to be honest. They are using a database (ScyllaDB) that is meant to run on commodity hardware. That means that losing a node is an expected scenario. The rest of the nodes in the cluster will be able to pitch in and recover the data when the node comes back or is replaced. That looks like the perfect scenario for running with fast ephemeral disks, no?

There are two problems with ephemeral disks. First, they are ephemeral Smile, wich means that a hardware problem can make the whole disk go poof. Not an issue, that is why you are using a replicated database, no? We’ll get to that.

Another issue that is raised by Discord is that they rely on disk snapshots for backups and other important workflows. The nice thing about snapshotting a cloud disk is that it is typically fast and cheap to do. Otherwise you need to deal with higher level backup systems at the database layer. The issue is that you probably do want that, because restoring a system from a set of disk snapshots that were taken at different times and at various states is likely to be… challenging.

Discord's solution to this is to create a set of ephemeral disks that are mirrored into persistent disks. Reads are going to the fast disks and writes are spread around. A failure on the ephemeral disks will lead to a recovery of the data from the network disk.

Their post has the gory details and a lot more explanation aside. I wanted to write this post for a simple reason. As a database guy, this system scares me to pieces.

The issue is that I don’t know what kind of data consistency we can expect in such an environment. Do I get any guarantees about the equivalence of data between the fast disks and the network disk? Can they diverge from one another? What happens when you have partial errors?

Consider a transaction that modifies three different pages that end up going to three different fast disks + the network disk. If one of the fast disks has an error during write, what is written to the persistent disk?

Can I get an out-of-date read from this system if we read from the network disk for some reason? That may mean that two pages that were written in one transaction are coming back as different versions. That will likely violate assumptions and invariants and can lead to all sorts of… interesting problems.

Given that this system is meant to handle the failure modes, it sounds really scary because it is an additional layer of complexity (and one that the database is unaware of) to deal with.

Aside from the snapshots, I assume that the reason for this behavior is to avoid the cost of rebuilding a node when there is a failure. I don’t have enough information to say what is the failure rate and the impact on the overall system, but the solution provided is elegant, beautiful, and quite frankly, pretty scary to me.

There have been quite a few unknowns that we had to deal with in the realm of storage. But this adds a whole new layer of things that can break.

time to read 6 min | 1060 words

I run into this fascinating blog post discussing the performance of BonsaiDb. To summarize the post, the author built a database and benchmarked it, but wasn’t actually calling fsync() to ensure durability. When they added the fsync() call at the right time, the performance didn’t degrade as expected. It turned out that they were running benchmarks on tmpfs, which is running in memory and where fsync() has no impact. Once they tested on a real system, there was a much higher cost, to the point where the author is questioning whether to continue writing the database library he has developed.

Forgetting to call fsync() is an understandable issue (they called the flush() method, but that didn’t translate to an fsync() call). I recall that at one point the C#’s API once had a bug where a Flush() would not call fsync() if you were making writes with 4KB alignment (that is… super bad).

After figuring out the issue, the author has out to figure exactly how to ensure durability. That is a journey that is fraught with peril, and he has some really interesting tidbits there. Including sync_file_range(), fdatasync() and how you cannot write a durable database for Mac or iOS.

From my perspective, you cannot really trust anything beyond O_DIRECT | O_DSYNC or fdatasync() for durability. Almost a decade ago I wrote about performance testing that I did for various databases. My code was the 2nd fastest around for the tested scenarios. It was able to achieve almost 23,000 writes, almost 25% of the next slowest database. However, the fastest database around was Esent, which clocked at 786,782 writes.

I dug deep into how this is done and I realized that there is a fundamental difference between how all other databases were working and how Esent was handling things. All other databases issued fsync() calls (or fdatasync()). While Esent skipped that altogether. Instead, it opened a file with FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_DIRECT (the Unix version is O_DIRECT | O_DSYNC). That change alone was responsible for a major performance difference. When using O_DIRECT | O_DSYNC, the write is sent directly to persistent medium, skipping all buffers. That means that you don’t have to flush anything else that is waiting to be written.

If you are interested, I wrote a whole chapter on the topic of durable writes. It is a big topic.

The other thing that has a huge impact on performance is whether you are doing transaction merging or not. If you have multiple operations running at roughly the same time, are you going to do a separate disk write for each one of them, or will you be able to do that in a single write. The best example that I can think of is the notion of taking the bus. If you send a whole bus for each passenger, you’ll waste a lot of time and fuel. If you pack the bus as much as possible, for almost the same cost, you’ll get a far better bang.

In other words, your design has to include a way for the database to coalesce such operations into a single write.

Yesterday there was an update to this task, which more or less followed that direction. The blog post covers quite a lot of ground and is going in the right direction, in my opinion. However, there are a few things there that I want to comment upon.

First, pre-allocation of disk space can make a huge impact on the overall performance of the system. Voron does that by allocating up to 1GB of disk space at a time, which dramatically reduces the amount of I/O work you have to do. Just to give some context, that turns a single disk write to multiple fsyncs that you have to do, on both your file and the parent directory, on each write. That is insanely expensive. The storage engine discussed here used append only mode, which makes this a bit of a problem, but not that much. You can still preallocate the disk space. You have to scan the file from the end on startup anyway, and the only risk here is the latency for reading the whole pre-allocation size on startup if we shut down immediately after the preallocation happened. It’s not ideal, but it is good enough.

Second, the way you manage writes shouldn’t rely on fsync and friends. That is why we have the log for, and you can get away with a lot by letting just the log handle the durability issue. The log is pre-allocated to a certain size (Voron uses dynamic sizes, with a max of 256MB) and written to using O_DIRECT | O_

O_DSYNC each time. But because this is expensive, we have something like this (Python code, no error handling, demo code, etc):

The idea is that you can call writeToLog() each time and you’ll get a future on the write to the log file. You can continue with your transaction when the log holds the data. Note that in this model, if you have concurrent writes, they will be merged into a single disk write. You can also benefit significantly from reduced amount we write to disk by applying compression.

Third, something that has been raised as an option here is a new storage format. I’m not sure that I 100% get what is the intention, but… what I understand I don’t like. I think that looking at how LMDB does things would help a lot here. It is a COW model (which the append only is very similar to). The key difference is that the new model is going to store every header twice. Probably with a CURRENT and NEXT model, where you can switch between the two configurations. That… works, but it is pretty complex. Given that you have a log involved, there is no need for any of this. You can just store the most recent changes in the log and use that to find what the current version is of the data is.

I don’t like append only models, since they require you to do compaction at some point. A better model is what LMDB does, where it re-used the same pages (with copy on write). It requires you to manage a free list of pages, of course, but that isn’t that big a task.

time to read 12 min | 2246 words

I was pointed to this paper on twitter: Are You Sure You Want to Use MMAP in Your Database Management System?

As you can imagine, this is a topic near and dear to my heart. This is especially the case since I am currently writing the Implementing a file pager in Zig posts series. I implemented the same low level mechanics using mmap, using mmap, I have < 100 lines of code and can start building higher level concepts almost immediately. Writing my own pager is currently a 10 posts series and the end doesn’t seem to be in sight.

I’m going to use this post to respond to the article. As a reminder, I’m the founder of RavenDB and I wrote Voron, a mmap based storage engine, and has been running that across hundreds of clients and literally tens of millions of instances in production. I am also writing a book about building a storage engine that uses mmap internally.

The paper itself does a great job of outlining the issue of using mmap as the buffer pool in DBMS. What it doesn’t cover, however, is the alternative. I will touch on specific points from the paper shortly, but I want to point out that the article compares apples to camels in the benchmarks and conclusions. Note that I don’t necessarily disagree with some of the statements, mmap certainly has challenges that you need to deal with, but if you avoid that, you can’t have wave everything that it brings to the table.

When building a database, using mmap has the following advantages, the OS will take care of:

  • Reading the data from disk
  • Concurrency between different threads reading the same data
  • Caching and buffer management
  • Eviction of pages from memory
  • Playing nice with other processes in the machine
  • Tracking dirty pages and writing to disk*

I put an asterisk on the last one because it probably requires your attention as well.

If you aren’t using mmap, on the other hand, you still need to handle all those issues. That is a key point that I believe isn’t addressed in the paper. Solving those issues properly (and efficiently) is a seriously challenging task. Given that you are building a specialized solution, you can probably do better than the generic mmap, but it will absolutely have a cost. That cost is both in terms of runtime overhead as well as increased development time.

The comparison that was made by the paper was done using fio benchmark tool, which is great if you want to test your storage system, but is pretty much irrelevant if you are trying to benchmark a buffer pool. Consider the following:

For the mmap version, we need to compute the address of the page and that is pretty much it. For the manual buffer pool, the list of tasks that we need to handle is long. And some of them require us to be thread safe. For example, if we handed a page to a transaction, we need to keep track of that page status as being in use. We cannot evict this page until the transaction is done with it. That means that we probably need to do atomic reference counting, which can have very high costs. There are other alternatives, of course, but they all have even higher degrees of complexity.

In practice, data access within a database isn’t actually random, even if you are doing random reads. There are pages that are going to almost always be referenced. The root page in the B+Tree is a good example. It is always going to be used. Under atomic reference counting, that page is going to be a bottleneck.

Ignoring such overhead of the buffer pool management means that you aren’t actually comparing equivalent structures. I should also point out that I’m probably forgetting a few other tasks that the buffer pool needs to manage as well, which complicate its life significantly. Here is an example of such a buffer pool implementation from what is effectively a random GitHub repository. You can see what the code is trying to do here. The reason I point to this is that there is a mutex there (and I/O under the lock), which is fairly typical for many buffer pools. And not accounting for the overhead of buffer pool management is seriously skewing the results of the paper.

All of this said, I absolutely agree that mmap can be challenging. The paper outlines 4 different problems, which I want to address.

Problem #1 – Transactional safety

A database needs to know when the data is persisted to disk. When using mmap, we explicitly give up that knowledge. That can be a challenge, but I don’t see that as a seriously different one from not using mmap. Let’s consider the manner in which Postgres is working. It has its own buffer pool, and may modify the pages as a result of a write. Postgres may need to evict modified pages to disk before the transaction that modified them is committed. The overhead of managing that is just… part of the challenge that we need to deal with.

For RavenDB, as the paper points out, we modify the pages outside of the mmap memory. This is actually not done for the reason the paper describes. I don’t actually care if the data is written to memory behind my back. What I care about is MVCC (a totally separate concern than buffer management). The fact that I’m copying the modified data to the side means that I Can support concurrent transactions with far greater ease. In a similar fashion, Postgres handles MVCC using multiple entries for the same row in the same page.

When the transaction commits and older transactions no longer need the old version of the data, I can push the data from the modified buffers to the mmap region. That tends to be fairly fast (given that I’m basically doing memcpy(), which runs at memory speed) unless I have to page data in, more on that later.

The paper mentions the issue of single writer in LMDB, and I wanted to point out that a single writer model is actually far more common (and again, not really related to the buffer pool issue). Off the top of my head, most embedded databases implement a single writer model.

  • LMDB
  • Voron (RavenDB’s storage engine)
  • LevelDB
  • Lucene

The one that I can think that doesn’t have a single writer is RocksDB(where allow_concurrent_memtable_write is for writes to the memtable, not related to file I/O).

The reason this matters is that embedded systems can typically assume that all operations in a transaction will complete as a unit. Compare to Postgres, where we may have a transaction spanning multiple network calls, interleaving writes is a must. If we could avoid such concurrency, that would be far preferable. You can get additional concurrency by having sharding writes, but that is usually not needed.

Problem #2 – I/O Stalls

The paper points out, quite correctly, that not having control over the I/O means that you may incur a page fault at any time. In particular, you may end up blocked on I/O without really noticing. This can be a killer especially if you are currently holding a lock and blocked on page fault. Indeed, I consider this to be the most serious issue that you have to deal with mmap based systems.

In practice, however, the situation isn’t so clear cut. Until quite recently, the state of asynchronous I/O on Linux was quite iffy. Until the arrival of io_uring, certain operations that you expected to be async would block occasionally, ruining your day. The paper mentions that you can use async I/O to issue I/O requests to load the next pages (non sequentially) from the disk when you are performing certain operations. You can do the same with mmap as well, and RavenDB does just that. When you start a scan on a B+Tree, RavenDB will ask the OS to ensure that the memory we are interested in is in memory before we actually get to it. On Linux, this is done with madvise(WILL_NEED) call. That call may be blocking, so we actually have a dedicated thread that is meant to handle such a scenario.  In practice, this isn’t really that different from how you’ll handle it with async I/O.

Another consideration to deal with is the cost of mapping at the kernel level. I’m not talking about the I/O cost, but if you have many threads that are faulting pages, you’ll run into problems with the page table lock. We have run into that before, this is considered an OS level bug, but it obviously has an impact on the database. In practice, however, the overhead of memory management is the same in most cases. If you are reading via mmap or allocating directly, you’ll need to orchestrate things. Note that the same page table lock is also in effect if you are heavily allocating / freeing, since you’re also modifying the process page table.

Problem #3 – Error Handling

Error handling is a serious concern for a database. The paper points out that databases such as SQL Server may run a checksum when reading data from disk. When you use a buffer pool, the boundary of reading from the disk is obvious and you can easily validate the read from the disk. Voron is using mmap exclusively, and we do have checksums. We validate the page from the disk the first time that we access it (there is an internal bitmap that is used for that).  There isn’t a big difference between the two anyway. We only check a given page once per run, because to do otherwise is meaningless. When you use read() to get data from the disk, you have no guarantees that the data wasn’t fetched from a cache along the way. So you may validate the data you read is “correct”, while the on disk representation is broken. For that reason, we only do the check once, instead of each time.

A far greater issue to deal with is I/O errors. What do you do when a read or a write fails? If you are using system calls to manage that, you get a return code and can react accordingly. If you are using mmap, the system will generate a SIGBUS that you’ll have to (somehow) handle.

For a database, dealing with I/O errors has a single correct answer. Crash and then run recovery from scratch. If the I/O system has returned an error, there is no longer any way to know what the state of that is. See: fsync-gate. The only way to recover is to stop, reload everything (apply the WAL, run recovery, etc) and get back into a stable state. SIGBUS isn’t my cup of tea with regards to handling this, but error handling for I/O error isn’t actually something that you do, so just restarting the process ends up more acceptable than you might initially think.

Problem #4 – Performance issues

The paper points out three common reasons for performance issues with mmap usage:

  1. page table contention
  2. single threaded page eviction
  3. TLB shootdowns

The first issue is something that I have run into in the past. It was a bug in the operating system which was fixed. There is no longer a single page table in both Windows and Linux.

The single threaded eviction, on the other hand, is something that we never run into. When using Voron, we map the memory using MAP_SHARED, and most of the time, the memory isn’t dirty. If the system needs memory, it can do that when it assigns a page by just discarding the memory of an unmodified shared page. In this model, we typically see most of the memory as shared, clean. So there isn’t a lot of pressure to evict things, and it can be done on as needed basis.

The TLB shootdown issue is not something that we ever run into as a problem. We have run TB range databases on Raspberry PI with 4GB of RAM and hammered that in benchmarks (far exceeding the memory capacity). The interesting thing here is that the B+Tree nature means that the upper tiers of the tree were already in memory, so we mostly ended up with a single page fault per request. In order to actually observe the cost of TLS Shootdown in a significant manner, you need to have:

  • really fast I/O
  • working set that significantly exceeds memory
  • no other work that needs to be done for processing a request

In practice, if you have really fast I/O, you spent money on that, you’ll more likely get more RAM as well. And you typically need to do something with the data you read, which means that you won’t notice the TLB shootdown as much.

Finally, going back to how I started this post. This assumes that there are no other costs of not using mmap and using direct IO. The benchmark doesn’t account for those extra costs. For example, without mmap, who is doing evictions? In practice, that will lead to the same sort of considerations that you’ll have when dealing with mmap. This is especially the case with TLS shootdown when we start talking about high memory traffic (which likely modifies page allocations for the process, leading to the same scenario).

The paper has been quite interesting to read and it has presented a number of real problems that occur with mmap based systems, but I’m afraid that it doesn’t present the alternatives properly and vastly underestimates both costs and complexity of not using mmap and writing your own buffer pool.

time to read 2 min | 322 words

I ran into this post and I thought that I would share my thinking on the matter. I don’t actually know anything about IndexedDB, but I have been working with databases for a sufficiently long time to understand what the root problem here is.

The issue the post is talking about is that IndexedDB is slow, to the rate of inserting hundreds of documents takes multiple seconds. When you go a bit deeper into the post, you realize that the trouble is not with the number of documents that are being saved, but the number of transactions.

IndexedDB has the concept of transactions, and I’m going to err on the safe side and assume that it has durability. I dug a bit and found that IndexedDB is based on LevelDB (and I do know that one quite a bit). The underlying issue is that each transaction commit needs to issue an fsync(), and fsync is slow!

Pretty much all database engines have to deal with that, the usual answer is to allow transaction merging of some sort. Many relational databases will write to the log and commit multiple transactions in a single fsync(), RavenDB will do explicit transaction merging in order to batch writes to disk.

However, for pretty much every database out there with durability guarantees, the worst thing you can do is write a lot of data one transaction at a time. Server databases still benefit from being able to amortize multiple operations across different requests, but a client side database is effectively forced to do serial work with no chance for optimization.

From the client perspective, the only viable option here is to batch writes to get better performance. Then again, if you are working on the client side, you are either responding to a user action (in which case you have plenty of time to commit) or running in the background, which means that you have the ability to batch.

time to read 1 min | 88 words

I posted a few weeks ago about a performance regression in our metrics that we tracked down the to the disk being exhausted.

We replaced the hard disk to a new one, can you see what the results were?

image (1)

This is mostly because we were pretty sure that this is the problem, but couldn’t rule out that this was something else. Good to know that we were on track.

time to read 3 min | 533 words

At the beginning of the year, we run into a problematic query. The issue was the use of an in clause vs. a series of OR. You can see the previous investigation results here. We were able to pinpoint the issue pretty well, very deep in the guts of Lucene, our query engine.

Fast Query Slow Query
image image
Time: 1 – 2 ms Time: 60 – 90 ms
image image

The key issue for this query was simple. There are over 600,000 orders with the relevant statuses, but there are no orders for CustomerId “customers/100”. In the OR case, we would evaluate the query lazily. First checking the CustomerId, and given that there have been no results, short circuiting the process and doing no real work for the rest of the query. The IN query, on the other hand, would do things eagerly. That would mean that it would build a data structure that would hold all 600K+ documents that match the query, and then would throw that all away because no one actually needed that.

In order to resolve that, I have to explain a bit about the internals of Lucene. As its core, you can think of Lucene in terms of sorted lists inside dictionaries. I wrote a series of posts on the topic, but the gist of it is:


Note that the ids for documents containing a particular term are sorted. That is important for a lot of optimizations in Lucene, which is also a major problem for the in query. The problem is that each component in the query pipeline needs to maintain this invariant. But when we use an IN query, we need to go over potentially many terms. And then we need to get the results in the proper order to the calling code. I implemented a tiered approach. If we are using an IN clause with a small number of terms in it (under 128), we will use a heap to manage all the terms and effectively do a merge sort on the results.

When we have more than 128 terms, that stops being very useful, however. Instead, we’ll create a bitmap for the possible results and scan through all the terms, filling the bitmap. That can be expensive, of course, so I made sure that this is done lazily by RavenDB.

The results are in:

  OR Query IN Query
Invalid CustomerId 1.39 – 1.5 ms 1.33 – 1.44 ms
Valid CustomerId 17.5 ms 12.3 ms

For the first case, this is now pretty much a wash. The numbers are slightly in favor of the IN query, but it is within the measurement fluctuations.

For the second case, however, there is a huge performance improvement for the IN query. For that matter, the cost is going to be more noticeable the more terms you have in the IN query.

I’m really happy about this optimization, it ended up being quite elegant.

time to read 3 min | 415 words

imageI run into this article that talks about building a cache service in Go to handle millions of entries. Go ahead and read the article, there is also an associated project on GitHub.

I don’t get it. Rather, I don’t get the need here.

The authors seem to want to have a way to store a lot of data (for a given value of lots) that is accessible over REST.  The need to be able to run 5,000 – 10,000 requests per second over this. And also be able to expire things.

I decided to take a look into what it would take to run this in RavenDB. It is pretty late here, so I was lazy. I run the following command against our live-test instance:


This say to create 1,024 connections and get the same document. On the right you can see the live-test machine stats while this was running. It peaked at about 80% CPU. I should note that the live-test instance is pretty much the cheapest one that we could get away with, and it is far from me.

Ping time from my laptop to the live-test is around 230 – 250 ms. Right around the numbers that wrk is reporting. I’m using 1,024 connections here to compensate for the distance. What happens when I’m running this locally, without the huge distance?


So I can do more than 22,000 requests per second (on a 2016 era laptop, mind) with max latency of 5.5 ms (which the original article called for average time). Granted, I’m simplifying things here, because I’m checking a single document and not including writes. But 5,000 – 10,000 requests per second are small numbers for RavenDB. Very easily achievable.

RavenDB even has the @expires feature, which allows you to specify a time a document will automatically be removed.

The nice thing about using RavenDB for this sort of feature is that millions of objects and gigabytes of data are not something that are of particular concern for us. Raise that by an orders of magnitude, and that is our standard benchmark. You’ll need to raise it by a few more orders of magnitudes before we start taking things seriously.

time to read 5 min | 822 words

This post asked an interesting question, why are hash table so prevalent for in memory usage and (relatively) rare in the case of databases. There is some good points in the post, as well as in the Hacker News thread.

Given that I just did a spike of persistent hash table and have been working on database engines for the past decade, I thought that I might throw my own two cents into the ring.

B+Tree is a profoundly simple concept. You can explain it in 30 minutes, and it make sense. There are some tricky bits to a proper implementation, for sure, but they are more related to performance than correctness.

Hash tables sounds simple, but the moment you have to handle collisions gracefully, you are going to run into real challenges. It is easy to get into nasty bugs with hash tables, the kind that silently corrupt your state without you realizing it.

For example, consider the following code:

This is a hash table using linear addressing. Collisions are handled by adding them to the next available node. And in this case, we have a problem. We want to put “ghi” in position zero, but we can’t, because it is already full. We move it to the first available location. That is well understood and easy. But when we delete “def”, we remove the entry from the array, but we forgot to do fixups for the relocated “ghi”, that value is now gone from the table, effectively. This is the kind of bug you need the moon to be in a certain position while a cat sneeze to figure out.

A B+Tree also maps very nicely to persistent model, but it is entirely non obvious how you can go from the notion of a hash table in memory to one on disk. Extendible hashing exists, and has for a very long time. Literally for more time than I’m alive, but it is not very well known / generically used. It is a beautiful algorithm, mind you. But just mapping the concept to a persistence model isn’t enough, typically, you also had a bunch of additional requirements from disk data structure. In particular, concurrency in database systems is frequently tied closely to the structure of the tree (page level locks).

There is also the cost issue. When talking about disk based data access, we are rarely interested in the actual O(N) complexity, we are far more interested in the number of disk seeks that are involved. Using extendible hashing, you’ll typically get 1 – 2 disk seeks. If the directory is in memory, you have only one, which is great. But with a B+Tree, you can easily make sure that the top levels of the tree will also be memory resident (similar to the extendible hash directory), that leads to typical 1 disk access to read the data, so in many cases, they are roughly the same performance for either option.

Related to the cost issue, you have to also consider security risks. There have been a number of attacks against hash tables that relied on generating hash collisions. The typical in memory fix is to randomize the hash to avoid this, but if you are persistent, you have to use the same hash function forever. That means that an attacker can very easily kill your database server, by generating bad keys.

But these are all relatively minor concerns. The key issue is that B+Tree is just so much more useful. A B+Tree can allow me to:

  • Store / retrieve my data by key
  • Perform range queries
  • Index using a single column
  • Index using multiple columns (and then search based on full / partial key)
  • Iterate over the data in specified order

Hashes allow me to:

  • Store / retrieve my data by key

And that is pretty much it. So B+Tree can do everything that Hashes can, but also so much more. They are typically as fast where it matters (disk reads) and more than sufficiently fast regardless.

Hashes are only good for that one particular scenario of doing lookup by exact key. That is actually a lot more limited than what you’ll consider.

Finally, and quite important, you have to consider the fact that B+Tree has certain access patterns that they excel at. For example, inserting sorted data into a B+Tree is going to be a joy. Scanning the B+Tree in order is also trivial and highly performant.

With hashes? There isn’t an optimal access pattern for inserting data into a hash. And while you can scan a hash at roughly the same cost as you would a B+Tree, you are going to get the data out of order. That means that it is a lot less useful than it would appear to upfront.

All of that said, hashes are still widely used in databases. But they tend to be used as specialty tools. Deployed carefully and for very specific tasks. This isn’t the first thing that you’ll reach to, you need to justify its use.

time to read 2 min | 313 words

I run into this blog post talking about how to handle optimistic concurrency in MongoDB and it brought to mind a very fundamental difference in the design philosophy between RavenDB and MongoDB.

If you’ll read the associated blog post, you’ll see guidance on how to build a simple optimistic concurrency using the MongoDB API. It looks like a relatively straightforward thing, but there is a lot of complexity going on here.

With RavenDB, we have decided that the responsibility of such tasks is on us, and not our users. Here is how you’ll write the same thing in RavenDB:

session.Advanced.OptimisticConcurrency = true;

And you are done. There are also options to set it globally (for all actions), for a particular session, as shown above or for a particular document or documents in a bigger transaction. About the only thing that we don’t handle is retries if the update failed, to allow you to re-run your business logic.

The reason I’m writing this is actually at the very end of the post:

This works just fine if I "remember" to include that Where clause correctly, but there's a better way if we want a general solution. For that, I'd do pretty much what I would have in the Life Beyond Distributed Transactions series - introduce a Repository, Unit of Work, and Identity Map.

This is exactly right. It looks trivial to do something like that when you are looking into a trivial scenario, but put it in a real application and the complexity sprouts. For example, try doing the same thing with multiple documents that need to change together. You have to implement quite a lot of code to do so (identity map, unit of work, hopefully not a repository Smile).

With RavenDB, all of that is just there and available for you. No need to do anything, It Just Works.


No future posts left, oh my!


  1. re (33):
    28 May 2024 - Secure Drop protocol
  2. Recording (13):
    05 Mar 2024 - Technology & Friends - Oren Eini on the Corax Search Engine
  3. Meta Blog (2):
    23 Jan 2024 - I'm a JS Developer now
  4. Production postmortem (51):
    12 Dec 2023 - The Spawn of Denial of Service
  5. Challenge (74):
    13 Oct 2023 - Fastest node selection metastable error state–answer
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats