Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,462 | Comments: 50,985

Privacy Policy Terms
filter by tags archive
time to read 4 min | 698 words

I read this blog post from Discord, which presents a really interesting approach to a problem that they ran into. Basically, in the cloud you have the choice between fast and ephemeral disks and persistent (and much slower) disks.

Technically speaking you can try to get really fast persistent disks, but those are freaking expensive. Ultra disks on Azure and io2 disks on AWS and Extreme persistent disks for GCP. The cost for a single 1TB disk with 25K IOPS would be around 1,800 USD / month, for example. That is just for the disk!

To give some context, an r6gd.16xlarge instance with 64 cores and 512 GB (!) of RAM as well as 3.8 TB of NVMe drive will cost you less than that. That machine can also do 80K IOPS, for context.

At Discord scale, they ran into the limitations of I/O in the cloud with their database and had to come up with a solution for that.

My approach would be to… not do anything, to be honest. They are using a database (ScyllaDB) that is meant to run on commodity hardware. That means that losing a node is an expected scenario. The rest of the nodes in the cluster will be able to pitch in and recover the data when the node comes back or is replaced. That looks like the perfect scenario for running with fast ephemeral disks, no?

There are two problems with ephemeral disks. First, they are ephemeral Smile, wich means that a hardware problem can make the whole disk go poof. Not an issue, that is why you are using a replicated database, no? We’ll get to that.

Another issue that is raised by Discord is that they rely on disk snapshots for backups and other important workflows. The nice thing about snapshotting a cloud disk is that it is typically fast and cheap to do. Otherwise you need to deal with higher level backup systems at the database layer. The issue is that you probably do want that, because restoring a system from a set of disk snapshots that were taken at different times and at various states is likely to be… challenging.

Discord's solution to this is to create a set of ephemeral disks that are mirrored into persistent disks. Reads are going to the fast disks and writes are spread around. A failure on the ephemeral disks will lead to a recovery of the data from the network disk.

Their post has the gory details and a lot more explanation aside. I wanted to write this post for a simple reason. As a database guy, this system scares me to pieces.

The issue is that I don’t know what kind of data consistency we can expect in such an environment. Do I get any guarantees about the equivalence of data between the fast disks and the network disk? Can they diverge from one another? What happens when you have partial errors?

Consider a transaction that modifies three different pages that end up going to three different fast disks + the network disk. If one of the fast disks has an error during write, what is written to the persistent disk?

Can I get an out-of-date read from this system if we read from the network disk for some reason? That may mean that two pages that were written in one transaction are coming back as different versions. That will likely violate assumptions and invariants and can lead to all sorts of… interesting problems.

Given that this system is meant to handle the failure modes, it sounds really scary because it is an additional layer of complexity (and one that the database is unaware of) to deal with.

Aside from the snapshots, I assume that the reason for this behavior is to avoid the cost of rebuilding a node when there is a failure. I don’t have enough information to say what is the failure rate and the impact on the overall system, but the solution provided is elegant, beautiful, and quite frankly, pretty scary to me.

There have been quite a few unknowns that we had to deal with in the realm of storage. But this adds a whole new layer of things that can break.

time to read 6 min | 1060 words

I run into this fascinating blog post discussing the performance of BonsaiDb. To summarize the post, the author built a database and benchmarked it, but wasn’t actually calling fsync() to ensure durability. When they added the fsync() call at the right time, the performance didn’t degrade as expected. It turned out that they were running benchmarks on tmpfs, which is running in memory and where fsync() has no impact. Once they tested on a real system, there was a much higher cost, to the point where the author is questioning whether to continue writing the database library he has developed.

Forgetting to call fsync() is an understandable issue (they called the flush() method, but that didn’t translate to an fsync() call). I recall that at one point the C#’s API once had a bug where a Flush() would not call fsync() if you were making writes with 4KB alignment (that is… super bad).

After figuring out the issue, the author has out to figure exactly how to ensure durability. That is a journey that is fraught with peril, and he has some really interesting tidbits there. Including sync_file_range(), fdatasync() and how you cannot write a durable database for Mac or iOS.

From my perspective, you cannot really trust anything beyond O_DIRECT | O_DSYNC or fdatasync() for durability. Almost a decade ago I wrote about performance testing that I did for various databases. My code was the 2nd fastest around for the tested scenarios. It was able to achieve almost 23,000 writes, almost 25% of the next slowest database. However, the fastest database around was Esent, which clocked at 786,782 writes.

I dug deep into how this is done and I realized that there is a fundamental difference between how all other databases were working and how Esent was handling things. All other databases issued fsync() calls (or fdatasync()). While Esent skipped that altogether. Instead, it opened a file with FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_DIRECT (the Unix version is O_DIRECT | O_DSYNC). That change alone was responsible for a major performance difference. When using O_DIRECT | O_DSYNC, the write is sent directly to persistent medium, skipping all buffers. That means that you don’t have to flush anything else that is waiting to be written.

If you are interested, I wrote a whole chapter on the topic of durable writes. It is a big topic.

The other thing that has a huge impact on performance is whether you are doing transaction merging or not. If you have multiple operations running at roughly the same time, are you going to do a separate disk write for each one of them, or will you be able to do that in a single write. The best example that I can think of is the notion of taking the bus. If you send a whole bus for each passenger, you’ll waste a lot of time and fuel. If you pack the bus as much as possible, for almost the same cost, you’ll get a far better bang.

In other words, your design has to include a way for the database to coalesce such operations into a single write.

Yesterday there was an update to this task, which more or less followed that direction. The blog post covers quite a lot of ground and is going in the right direction, in my opinion. However, there are a few things there that I want to comment upon.

First, pre-allocation of disk space can make a huge impact on the overall performance of the system. Voron does that by allocating up to 1GB of disk space at a time, which dramatically reduces the amount of I/O work you have to do. Just to give some context, that turns a single disk write to multiple fsyncs that you have to do, on both your file and the parent directory, on each write. That is insanely expensive. The storage engine discussed here used append only mode, which makes this a bit of a problem, but not that much. You can still preallocate the disk space. You have to scan the file from the end on startup anyway, and the only risk here is the latency for reading the whole pre-allocation size on startup if we shut down immediately after the preallocation happened. It’s not ideal, but it is good enough.

Second, the way you manage writes shouldn’t rely on fsync and friends. That is why we have the log for, and you can get away with a lot by letting just the log handle the durability issue. The log is pre-allocated to a certain size (Voron uses dynamic sizes, with a max of 256MB) and written to using O_DIRECT | O_

O_DSYNC each time. But because this is expensive, we have something like this (Python code, no error handling, demo code, etc):

The idea is that you can call writeToLog() each time and you’ll get a future on the write to the log file. You can continue with your transaction when the log holds the data. Note that in this model, if you have concurrent writes, they will be merged into a single disk write. You can also benefit significantly from reduced amount we write to disk by applying compression.

Third, something that has been raised as an option here is a new storage format. I’m not sure that I 100% get what is the intention, but… what I understand I don’t like. I think that looking at how LMDB does things would help a lot here. It is a COW model (which the append only is very similar to). The key difference is that the new model is going to store every header twice. Probably with a CURRENT and NEXT model, where you can switch between the two configurations. That… works, but it is pretty complex. Given that you have a log involved, there is no need for any of this. You can just store the most recent changes in the log and use that to find what the current version is of the data is.

I don’t like append only models, since they require you to do compaction at some point. A better model is what LMDB does, where it re-used the same pages (with copy on write). It requires you to manage a free list of pages, of course, but that isn’t that big a task.

time to read 12 min | 2246 words

I was pointed to this paper on twitter: Are You Sure You Want to Use MMAP in Your Database Management System?

As you can imagine, this is a topic near and dear to my heart. This is especially the case since I am currently writing the Implementing a file pager in Zig posts series. I implemented the same low level mechanics using mmap, using mmap, I have < 100 lines of code and can start building higher level concepts almost immediately. Writing my own pager is currently a 10 posts series and the end doesn’t seem to be in sight.

I’m going to use this post to respond to the article. As a reminder, I’m the founder of RavenDB and I wrote Voron, a mmap based storage engine, and has been running that across hundreds of clients and literally tens of millions of instances in production. I am also writing a book about building a storage engine that uses mmap internally.

The paper itself does a great job of outlining the issue of using mmap as the buffer pool in DBMS. What it doesn’t cover, however, is the alternative. I will touch on specific points from the paper shortly, but I want to point out that the article compares apples to camels in the benchmarks and conclusions. Note that I don’t necessarily disagree with some of the statements, mmap certainly has challenges that you need to deal with, but if you avoid that, you can’t have wave everything that it brings to the table.

When building a database, using mmap has the following advantages, the OS will take care of:

  • Reading the data from disk
  • Concurrency between different threads reading the same data
  • Caching and buffer management
  • Eviction of pages from memory
  • Playing nice with other processes in the machine
  • Tracking dirty pages and writing to disk*

I put an asterisk on the last one because it probably requires your attention as well.

If you aren’t using mmap, on the other hand, you still need to handle all those issues. That is a key point that I believe isn’t addressed in the paper. Solving those issues properly (and efficiently) is a seriously challenging task. Given that you are building a specialized solution, you can probably do better than the generic mmap, but it will absolutely have a cost. That cost is both in terms of runtime overhead as well as increased development time.

The comparison that was made by the paper was done using fio benchmark tool, which is great if you want to test your storage system, but is pretty much irrelevant if you are trying to benchmark a buffer pool. Consider the following:

For the mmap version, we need to compute the address of the page and that is pretty much it. For the manual buffer pool, the list of tasks that we need to handle is long. And some of them require us to be thread safe. For example, if we handed a page to a transaction, we need to keep track of that page status as being in use. We cannot evict this page until the transaction is done with it. That means that we probably need to do atomic reference counting, which can have very high costs. There are other alternatives, of course, but they all have even higher degrees of complexity.

In practice, data access within a database isn’t actually random, even if you are doing random reads. There are pages that are going to almost always be referenced. The root page in the B+Tree is a good example. It is always going to be used. Under atomic reference counting, that page is going to be a bottleneck.

Ignoring such overhead of the buffer pool management means that you aren’t actually comparing equivalent structures. I should also point out that I’m probably forgetting a few other tasks that the buffer pool needs to manage as well, which complicate its life significantly. Here is an example of such a buffer pool implementation from what is effectively a random GitHub repository. You can see what the code is trying to do here. The reason I point to this is that there is a mutex there (and I/O under the lock), which is fairly typical for many buffer pools. And not accounting for the overhead of buffer pool management is seriously skewing the results of the paper.

All of this said, I absolutely agree that mmap can be challenging. The paper outlines 4 different problems, which I want to address.

Problem #1 – Transactional safety

A database needs to know when the data is persisted to disk. When using mmap, we explicitly give up that knowledge. That can be a challenge, but I don’t see that as a seriously different one from not using mmap. Let’s consider the manner in which Postgres is working. It has its own buffer pool, and may modify the pages as a result of a write. Postgres may need to evict modified pages to disk before the transaction that modified them is committed. The overhead of managing that is just… part of the challenge that we need to deal with.

For RavenDB, as the paper points out, we modify the pages outside of the mmap memory. This is actually not done for the reason the paper describes. I don’t actually care if the data is written to memory behind my back. What I care about is MVCC (a totally separate concern than buffer management). The fact that I’m copying the modified data to the side means that I Can support concurrent transactions with far greater ease. In a similar fashion, Postgres handles MVCC using multiple entries for the same row in the same page.

When the transaction commits and older transactions no longer need the old version of the data, I can push the data from the modified buffers to the mmap region. That tends to be fairly fast (given that I’m basically doing memcpy(), which runs at memory speed) unless I have to page data in, more on that later.

The paper mentions the issue of single writer in LMDB, and I wanted to point out that a single writer model is actually far more common (and again, not really related to the buffer pool issue). Off the top of my head, most embedded databases implement a single writer model.

  • LMDB
  • Voron (RavenDB’s storage engine)
  • LevelDB
  • Lucene

The one that I can think that doesn’t have a single writer is RocksDB(where allow_concurrent_memtable_write is for writes to the memtable, not related to file I/O).

The reason this matters is that embedded systems can typically assume that all operations in a transaction will complete as a unit. Compare to Postgres, where we may have a transaction spanning multiple network calls, interleaving writes is a must. If we could avoid such concurrency, that would be far preferable. You can get additional concurrency by having sharding writes, but that is usually not needed.

Problem #2 – I/O Stalls

The paper points out, quite correctly, that not having control over the I/O means that you may incur a page fault at any time. In particular, you may end up blocked on I/O without really noticing. This can be a killer especially if you are currently holding a lock and blocked on page fault. Indeed, I consider this to be the most serious issue that you have to deal with mmap based systems.

In practice, however, the situation isn’t so clear cut. Until quite recently, the state of asynchronous I/O on Linux was quite iffy. Until the arrival of io_uring, certain operations that you expected to be async would block occasionally, ruining your day. The paper mentions that you can use async I/O to issue I/O requests to load the next pages (non sequentially) from the disk when you are performing certain operations. You can do the same with mmap as well, and RavenDB does just that. When you start a scan on a B+Tree, RavenDB will ask the OS to ensure that the memory we are interested in is in memory before we actually get to it. On Linux, this is done with madvise(WILL_NEED) call. That call may be blocking, so we actually have a dedicated thread that is meant to handle such a scenario.  In practice, this isn’t really that different from how you’ll handle it with async I/O.

Another consideration to deal with is the cost of mapping at the kernel level. I’m not talking about the I/O cost, but if you have many threads that are faulting pages, you’ll run into problems with the page table lock. We have run into that before, this is considered an OS level bug, but it obviously has an impact on the database. In practice, however, the overhead of memory management is the same in most cases. If you are reading via mmap or allocating directly, you’ll need to orchestrate things. Note that the same page table lock is also in effect if you are heavily allocating / freeing, since you’re also modifying the process page table.

Problem #3 – Error Handling

Error handling is a serious concern for a database. The paper points out that databases such as SQL Server may run a checksum when reading data from disk. When you use a buffer pool, the boundary of reading from the disk is obvious and you can easily validate the read from the disk. Voron is using mmap exclusively, and we do have checksums. We validate the page from the disk the first time that we access it (there is an internal bitmap that is used for that).  There isn’t a big difference between the two anyway. We only check a given page once per run, because to do otherwise is meaningless. When you use read() to get data from the disk, you have no guarantees that the data wasn’t fetched from a cache along the way. So you may validate the data you read is “correct”, while the on disk representation is broken. For that reason, we only do the check once, instead of each time.

A far greater issue to deal with is I/O errors. What do you do when a read or a write fails? If you are using system calls to manage that, you get a return code and can react accordingly. If you are using mmap, the system will generate a SIGBUS that you’ll have to (somehow) handle.

For a database, dealing with I/O errors has a single correct answer. Crash and then run recovery from scratch. If the I/O system has returned an error, there is no longer any way to know what the state of that is. See: fsync-gate. The only way to recover is to stop, reload everything (apply the WAL, run recovery, etc) and get back into a stable state. SIGBUS isn’t my cup of tea with regards to handling this, but error handling for I/O error isn’t actually something that you do, so just restarting the process ends up more acceptable than you might initially think.

Problem #4 – Performance issues

The paper points out three common reasons for performance issues with mmap usage:

  1. page table contention
  2. single threaded page eviction
  3. TLB shootdowns

The first issue is something that I have run into in the past. It was a bug in the operating system which was fixed. There is no longer a single page table in both Windows and Linux.

The single threaded eviction, on the other hand, is something that we never run into. When using Voron, we map the memory using MAP_SHARED, and most of the time, the memory isn’t dirty. If the system needs memory, it can do that when it assigns a page by just discarding the memory of an unmodified shared page. In this model, we typically see most of the memory as shared, clean. So there isn’t a lot of pressure to evict things, and it can be done on as needed basis.

The TLB shootdown issue is not something that we ever run into as a problem. We have run TB range databases on Raspberry PI with 4GB of RAM and hammered that in benchmarks (far exceeding the memory capacity). The interesting thing here is that the B+Tree nature means that the upper tiers of the tree were already in memory, so we mostly ended up with a single page fault per request. In order to actually observe the cost of TLS Shootdown in a significant manner, you need to have:

  • really fast I/O
  • working set that significantly exceeds memory
  • no other work that needs to be done for processing a request

In practice, if you have really fast I/O, you spent money on that, you’ll more likely get more RAM as well. And you typically need to do something with the data you read, which means that you won’t notice the TLB shootdown as much.

Finally, going back to how I started this post. This assumes that there are no other costs of not using mmap and using direct IO. The benchmark doesn’t account for those extra costs. For example, without mmap, who is doing evictions? In practice, that will lead to the same sort of considerations that you’ll have when dealing with mmap. This is especially the case with TLS shootdown when we start talking about high memory traffic (which likely modifies page allocations for the process, leading to the same scenario).

The paper has been quite interesting to read and it has presented a number of real problems that occur with mmap based systems, but I’m afraid that it doesn’t present the alternatives properly and vastly underestimates both costs and complexity of not using mmap and writing your own buffer pool.

time to read 2 min | 322 words

I ran into this post and I thought that I would share my thinking on the matter. I don’t actually know anything about IndexedDB, but I have been working with databases for a sufficiently long time to understand what the root problem here is.

The issue the post is talking about is that IndexedDB is slow, to the rate of inserting hundreds of documents takes multiple seconds. When you go a bit deeper into the post, you realize that the trouble is not with the number of documents that are being saved, but the number of transactions.

IndexedDB has the concept of transactions, and I’m going to err on the safe side and assume that it has durability. I dug a bit and found that IndexedDB is based on LevelDB (and I do know that one quite a bit). The underlying issue is that each transaction commit needs to issue an fsync(), and fsync is slow!

Pretty much all database engines have to deal with that, the usual answer is to allow transaction merging of some sort. Many relational databases will write to the log and commit multiple transactions in a single fsync(), RavenDB will do explicit transaction merging in order to batch writes to disk.

However, for pretty much every database out there with durability guarantees, the worst thing you can do is write a lot of data one transaction at a time. Server databases still benefit from being able to amortize multiple operations across different requests, but a client side database is effectively forced to do serial work with no chance for optimization.

From the client perspective, the only viable option here is to batch writes to get better performance. Then again, if you are working on the client side, you are either responding to a user action (in which case you have plenty of time to commit) or running in the background, which means that you have the ability to batch.

time to read 1 min | 88 words

I posted a few weeks ago about a performance regression in our metrics that we tracked down the to the disk being exhausted.

We replaced the hard disk to a new one, can you see what the results were?

image (1)

This is mostly because we were pretty sure that this is the problem, but couldn’t rule out that this was something else. Good to know that we were on track.

time to read 3 min | 533 words

At the beginning of the year, we run into a problematic query. The issue was the use of an in clause vs. a series of OR. You can see the previous investigation results here. We were able to pinpoint the issue pretty well, very deep in the guts of Lucene, our query engine.

Fast Query Slow Query
image image
Time: 1 – 2 ms Time: 60 – 90 ms
image image

The key issue for this query was simple. There are over 600,000 orders with the relevant statuses, but there are no orders for CustomerId “customers/100”. In the OR case, we would evaluate the query lazily. First checking the CustomerId, and given that there have been no results, short circuiting the process and doing no real work for the rest of the query. The IN query, on the other hand, would do things eagerly. That would mean that it would build a data structure that would hold all 600K+ documents that match the query, and then would throw that all away because no one actually needed that.

In order to resolve that, I have to explain a bit about the internals of Lucene. As its core, you can think of Lucene in terms of sorted lists inside dictionaries. I wrote a series of posts on the topic, but the gist of it is:

 

Note that the ids for documents containing a particular term are sorted. That is important for a lot of optimizations in Lucene, which is also a major problem for the in query. The problem is that each component in the query pipeline needs to maintain this invariant. But when we use an IN query, we need to go over potentially many terms. And then we need to get the results in the proper order to the calling code. I implemented a tiered approach. If we are using an IN clause with a small number of terms in it (under 128), we will use a heap to manage all the terms and effectively do a merge sort on the results.

When we have more than 128 terms, that stops being very useful, however. Instead, we’ll create a bitmap for the possible results and scan through all the terms, filling the bitmap. That can be expensive, of course, so I made sure that this is done lazily by RavenDB.

The results are in:

  OR Query IN Query
Invalid CustomerId 1.39 – 1.5 ms 1.33 – 1.44 ms
Valid CustomerId 17.5 ms 12.3 ms

For the first case, this is now pretty much a wash. The numbers are slightly in favor of the IN query, but it is within the measurement fluctuations.

For the second case, however, there is a huge performance improvement for the IN query. For that matter, the cost is going to be more noticeable the more terms you have in the IN query.

I’m really happy about this optimization, it ended up being quite elegant.

time to read 3 min | 415 words

imageI run into this article that talks about building a cache service in Go to handle millions of entries. Go ahead and read the article, there is also an associated project on GitHub.

I don’t get it. Rather, I don’t get the need here.

The authors seem to want to have a way to store a lot of data (for a given value of lots) that is accessible over REST.  The need to be able to run 5,000 – 10,000 requests per second over this. And also be able to expire things.

I decided to take a look into what it would take to run this in RavenDB. It is pretty late here, so I was lazy. I run the following command against our live-test instance:

image

This say to create 1,024 connections and get the same document. On the right you can see the live-test machine stats while this was running. It peaked at about 80% CPU. I should note that the live-test instance is pretty much the cheapest one that we could get away with, and it is far from me.

Ping time from my laptop to the live-test is around 230 – 250 ms. Right around the numbers that wrk is reporting. I’m using 1,024 connections here to compensate for the distance. What happens when I’m running this locally, without the huge distance?

image

So I can do more than 22,000 requests per second (on a 2016 era laptop, mind) with max latency of 5.5 ms (which the original article called for average time). Granted, I’m simplifying things here, because I’m checking a single document and not including writes. But 5,000 – 10,000 requests per second are small numbers for RavenDB. Very easily achievable.

RavenDB even has the @expires feature, which allows you to specify a time a document will automatically be removed.

The nice thing about using RavenDB for this sort of feature is that millions of objects and gigabytes of data are not something that are of particular concern for us. Raise that by an orders of magnitude, and that is our standard benchmark. You’ll need to raise it by a few more orders of magnitudes before we start taking things seriously.

time to read 5 min | 822 words

This post asked an interesting question, why are hash table so prevalent for in memory usage and (relatively) rare in the case of databases. There is some good points in the post, as well as in the Hacker News thread.

Given that I just did a spike of persistent hash table and have been working on database engines for the past decade, I thought that I might throw my own two cents into the ring.

B+Tree is a profoundly simple concept. You can explain it in 30 minutes, and it make sense. There are some tricky bits to a proper implementation, for sure, but they are more related to performance than correctness.

Hash tables sounds simple, but the moment you have to handle collisions gracefully, you are going to run into real challenges. It is easy to get into nasty bugs with hash tables, the kind that silently corrupt your state without you realizing it.

For example, consider the following code:

This is a hash table using linear addressing. Collisions are handled by adding them to the next available node. And in this case, we have a problem. We want to put “ghi” in position zero, but we can’t, because it is already full. We move it to the first available location. That is well understood and easy. But when we delete “def”, we remove the entry from the array, but we forgot to do fixups for the relocated “ghi”, that value is now gone from the table, effectively. This is the kind of bug you need the moon to be in a certain position while a cat sneeze to figure out.

A B+Tree also maps very nicely to persistent model, but it is entirely non obvious how you can go from the notion of a hash table in memory to one on disk. Extendible hashing exists, and has for a very long time. Literally for more time than I’m alive, but it is not very well known / generically used. It is a beautiful algorithm, mind you. But just mapping the concept to a persistence model isn’t enough, typically, you also had a bunch of additional requirements from disk data structure. In particular, concurrency in database systems is frequently tied closely to the structure of the tree (page level locks).

There is also the cost issue. When talking about disk based data access, we are rarely interested in the actual O(N) complexity, we are far more interested in the number of disk seeks that are involved. Using extendible hashing, you’ll typically get 1 – 2 disk seeks. If the directory is in memory, you have only one, which is great. But with a B+Tree, you can easily make sure that the top levels of the tree will also be memory resident (similar to the extendible hash directory), that leads to typical 1 disk access to read the data, so in many cases, they are roughly the same performance for either option.

Related to the cost issue, you have to also consider security risks. There have been a number of attacks against hash tables that relied on generating hash collisions. The typical in memory fix is to randomize the hash to avoid this, but if you are persistent, you have to use the same hash function forever. That means that an attacker can very easily kill your database server, by generating bad keys.

But these are all relatively minor concerns. The key issue is that B+Tree is just so much more useful. A B+Tree can allow me to:

  • Store / retrieve my data by key
  • Perform range queries
  • Index using a single column
  • Index using multiple columns (and then search based on full / partial key)
  • Iterate over the data in specified order

Hashes allow me to:

  • Store / retrieve my data by key

And that is pretty much it. So B+Tree can do everything that Hashes can, but also so much more. They are typically as fast where it matters (disk reads) and more than sufficiently fast regardless.

Hashes are only good for that one particular scenario of doing lookup by exact key. That is actually a lot more limited than what you’ll consider.

Finally, and quite important, you have to consider the fact that B+Tree has certain access patterns that they excel at. For example, inserting sorted data into a B+Tree is going to be a joy. Scanning the B+Tree in order is also trivial and highly performant.

With hashes? There isn’t an optimal access pattern for inserting data into a hash. And while you can scan a hash at roughly the same cost as you would a B+Tree, you are going to get the data out of order. That means that it is a lot less useful than it would appear to upfront.

All of that said, hashes are still widely used in databases. But they tend to be used as specialty tools. Deployed carefully and for very specific tasks. This isn’t the first thing that you’ll reach to, you need to justify its use.

time to read 2 min | 313 words

I run into this blog post talking about how to handle optimistic concurrency in MongoDB and it brought to mind a very fundamental difference in the design philosophy between RavenDB and MongoDB.

If you’ll read the associated blog post, you’ll see guidance on how to build a simple optimistic concurrency using the MongoDB API. It looks like a relatively straightforward thing, but there is a lot of complexity going on here.

With RavenDB, we have decided that the responsibility of such tasks is on us, and not our users. Here is how you’ll write the same thing in RavenDB:

session.Advanced.OptimisticConcurrency = true;

And you are done. There are also options to set it globally (for all actions), for a particular session, as shown above or for a particular document or documents in a bigger transaction. About the only thing that we don’t handle is retries if the update failed, to allow you to re-run your business logic.

The reason I’m writing this is actually at the very end of the post:

This works just fine if I "remember" to include that Where clause correctly, but there's a better way if we want a general solution. For that, I'd do pretty much what I would have in the Life Beyond Distributed Transactions series - introduce a Repository, Unit of Work, and Identity Map.

This is exactly right. It looks trivial to do something like that when you are looking into a trivial scenario, but put it in a real application and the complexity sprouts. For example, try doing the same thing with multiple documents that need to change together. You have to implement quite a lot of code to do so (identity map, unit of work, hopefully not a repository Smile).

With RavenDB, all of that is just there and available for you. No need to do anything, It Just Works.

time to read 11 min | 2033 words

I was pointed to this blog post which talks about the experience of using RavenDB in production. I want to start by saying that I love getting such feedback from our users, for a whole lot of reasons, not the least of which is that it is great to hear what people are doing with our database.

Alex has been using RavenDB for a while, so he had the chance to use RavenDB 3.5 and 4.2, that is a good from my perspective, because means that he had the chance to see what the changes were and see how they impacted his routine usage of RavenDB. I’m going to call out (and discuss) some of the points that Alex raise in the post.

Speaking about .NET integration:

Raven’s .NET client API trumps MongoDB .NET Driver, CosmosDB + Cosmonaut bundle and leaves smaller players like Cassandra (with DataStax C# Driver), CouchDB (with MyCouch) completely out of the competition.

When I wrote RavenDB 0.x, way before the 1.0 release, it took two months to build the core engine, and another three months to build the .NET client. Most of that time went on Linq integration, by the way. Yes, it literally took more time to build the client than the database core. We put a lot of effort into that. I was involved for years in the NHibernate project and I took a lot of lessons from there. I’m very happy that it shows.

Speaking about technical support:

RavenDB has a great technical support on Google Groups for no costs. All questions, regardless of the obtained license, get meaningful answers within 24 hours and quite often Oren Eini responds personally.

Contrary to the Google Groups, questions on StackOverflow are often neglected. It’s a mystery why Raven sticks to a such archaic style of tech support and hasn’t migrated to StackOverflow or GitHub.

I care quite deeply about the quality of our support, to the point where I’ll often field questions directly, as Alex notes. I have an article on Linked In that talks about my philosophy in that regard which may be of interest.

As to Alex’s point about Stack Overflow vs Google Groups, the key difference is the way we can discuss things. In Stack Overflow, the focus is on an answer, but that isn’t our usual workflow when providing support. Here is a question that would fit Stack Overflow very well, there is a well defined problem with all the details and we are able to provide an acceptable answer in a single exchange. That kind of interaction, on the other hand, is quite rare. It is a lot more common to have to have a lot more back and forth and we tend to try to give a complete solution, not just answer the initial question.

Another issue is that Stack Overflow isn’t moderated by us, which means that we would be subject to rules that we don’t necessarily want to adhere to. For example, we get asked similar questions all the time, which are marked as duplicated and closed on Stack Overflow, but we want to actually answer people. 

GitHub issues are a good option for this kind of discussion, but they tend to cause people to raise issues, and one of the reasons that we have the google group is to create discussion. I guess it comes down to the different community that spring up from the communication medium.

Speaking about documentation:

RavenDB does have the official docs, which are easily navigable and searchable. Works well for beginners and provides good samples to start with, but there are gaps here and there, and it has far less coverage of the functionality than many popular free and open source projects.

Ultimately, as a developer, I want to google my question and it’s acceptable to have the answer on a non-official website. But if you’re having a deep dive with RavenDB, it’s unlikely to find it in the official docs, nor StackOverflow, nor GitHub.

Documentation has always been a chore for me. The problem is that I know what the software does, so it can be hard to even figure out what we need to explain. This year we have hired a couple more technical writers specifically to address the missing pieces in our documentation. I think we are doing quite well at this point.

What wasn’t available at the time of this post and is available now is the book. All of the details about RavenDB that you could care too and more are detailed there are are available. It is also available to Google, so in many cases your Google search will point you to the right section in the book that may answer your question.

image

I hope that these actions covered the gaps that Alex noted in our documentation. And there is also this blog, of course Smile.

Speaking about issues that he had run into:

It’s reliable and does work well. Unless it doesn’t. And then a fix gets promptly released (a nightly build could be available within 24 hours after reporting the issue). And it works well again.

All these bugs (and others I found) have been promptly fixed.

… stability of the server and the database integrity are sacred. Even a slight risk of losing it can keep you awake at night. So no, it’s a biggy, unless the RavenDB team convinces me otherwise.

I didn’t include the list of issues that Alex pointed to on purpose. The actual issues don’t matter that much, because he is correct, from Alex’s perspective, RavenDB aught to Just Work, and anything else is our problem.

We spend a lot of time on ensuring a high quality for RavenDB. I had a two parts with Jeffery Palermo about just that, and you might be interested in this keynote that goes into some of the challenges that are involved in making RavenDB.

One of the issues that he raised was RavenDB crashing (or causing Windows to crash) because of a bug in the Windows Kernel that was deployed in a hotfix. The hotfix was quietly patched some time later by Microsoft, but in the meantime, RavenDB would crash.  And a user would blame us, because we crashed.

Another issue (RavenDB upgrade failing) was an intentional choice by us in the upgrade, however. We had a bug that can cause data corruption in some cases, we fixed it, but we had to deal with potentially problematic state of existing databases. We chose to be conservative and ask the user to take an explicit action in this case, to prevent data loss. It isn’t ideal, I’m afraid, but I believe that we have done the best that we could after fixing the underlying issue. In doubt, we pretty much always have to fall on the prevent data loss vs. availability side.

Speaking about Linq & JS support:

let me give you a sense of how often you’ll see LINQ queries throwing NotSupportedException in runtime.

But in front of us a special case — a database written in the .NET! There is no need in converting a query to SQL or JavaScript.

I believe that I mentioned already that the initial Linq support for RavenDB literally took more time than building RavenDB itself, right? Linq is an awesome feature, for the consumer. For the provider, it is a mess. I’m going to quote Frans Bouma on this:

Something every developer of an ORM with a LINQ provider has found out: with a LINQ provider you're never done. There are always issues popping up due to e.g. unexpected expressions in the tree.

Now, as Alex points out. RavenDB is written in .NET, so we could technically use something like Serialize.Linq and support any arbitrary expression easily, right?

Not really, I’m afraid, and for quite a few reasons:

  • Security – you are effectively allowing a user to send arbitrary code to be executed on the server. That is never going to end up well.
  • Compatibility – we want to make sure that we are able to change our internals freely. If we are forced to accept (and then execute) code from the client, that freedom is limited.
  • Performance – issuing a query in this manners means that we’ll have to evaluate the query on each document in the relevant collection. A full table scan. That is not a feature that RavenDB even has, and for a very good reason.
  • Limited to .NET only – we currently have client for .NET, JVM, Go, Python, C++ and Node.JS. Having features just for one client is not something that we want, it really complicates our lives.

We think about queries using RQL, which are abstract in nature and don’t tie us down with regards to how we implement them. That means that we can use features such as automatic indexes, build fast queries, etc.

Speaking about RQL:

Alex points out some issues with RQL as well. The first issue relates to the difference between a field existing and having a null value. RavenDB make a distinction between these state. A field can have a null value or it can have a missing value. In a similar way to the behavior of NULL in SQL, which can often create similar confusion. The problem with RavenDB is that the schema itself isn’t required, so different documents can have different fields, so in our case, there is an additional level. A field can have a value, be null or not exist. And we reflect that in our queries. Unfortunately, while the behavior is well defined and documented, just like NULL behavior in SQL, it can be surprising to users. 

Another issue that Alex brings up is that negation queries aren’t supported directly. This is because of the way we process queries and one of the ways we ensure that users are aware of the impact of the query. With negation query, we have to first match all documents the exclude all those that match the negation. For large number of documents, that can be expensive. Ideally, the user have a way to limit the scope of the matches that are being negated, which can really help performance.

Speaking about safe by default:

RavenDB is a lot less opinionated than it used to be. Alex rightfully points that out. As we got more users, we had to expand what you could do with RavenDB.  It still pains me to see people do things that are going to be problematic in the end (extremely large page sizes are one good example), but our users demanded that. To quote Alex:

I’d preferred a slowed performance in production and a warning in the server logs rather than a runtime exception.

Our issue with this approach is that no one looks at the logs and that this usually come to a head at 2 AM, resulting in a support call from the ops team about a piece of software that broke. Because of this, we have removed many such features, while turning them to alerts, and the very few that remained (mostly just the number of requests per session) can be controlled globally by the admin directly from the Studio. This ensures that the ops team can do something if you hit the wall, and of course, you can also configure this from the client side globally easily enough.

As a craftsman, it pains me to remove those limits, but I have to admit that it significantly reduced the number of support calls that we had to deal with.

Conclusion:

Overall, I can say RavenDB is a very good NoSQL database for .NET developers, but the ”good” is coming with a bunch of caveats. I’m confident in my ability to develop any enterprise application with RavenDB applying the Domain-driven design (DDD) philosophy and practices.

I think that this is really the best one could hope for. I think that Alex’s review that is honest and to the point. Moreover, it is focused and detailed. That make very valuable. Because he got to his conclusions not out of brief tour of RavenDB but actually holding it up in the trenches.

Thanks, Alex.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Challenge (72):
    19 Sep 2023 - Spot the bug
  2. Filtering negative numbers, fast (4):
    15 Sep 2023 - Beating memcpy()
  3. Recording (9):
    28 Aug 2023 - RavenDB and High Performance with Oren Eini
  4. Production postmortem (50):
    24 Jul 2023 - The dog ate my request
  5. Podcast (4):
    21 Jul 2023 - Hansleminutes - All the Performance with RavenDB's Oren Eini
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}