Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 7 min | 1214 words

After describing in detail the major refactoring that RavenDB (via Voron, its storage engine) has gone through, there is one question remaining. What’s the point? The code is a lot simpler, of course, but the whole point of this much effort is to allow us to do interesting things.

There is performance, of course, but we haven’t gotten around to testing that yet because something that is a lot more interesting came up: Disk space management.

Voron allocates disk space from the operating system in batches of up to 1GB at a time. This is done to reduce file fragmentation and allow the file system to optimize the disk layout. It used to be something critical, but SSDs and NVMe made that a lot less important (but still a factor).

What happens if we have a very large database, but we delete a big collection of documents? This is a case where the user’s expectations and Voron’s behavior diverge. A user who just deleted a few million documents would expect to see a reduction in the size of the database. But Voron will mark the area on the disk as “to-be-used-later” and won’t free the disk space back to the operating system.

There were two reasons for this behavior:

  • We designed Voron in an era where it was far more common for systems to have hard disks, where fragmentation was a very serious problem.
  • It is really complicated to actually release disk space back to the operating system.

The first reason is no longer that relevant, since most database servers can reasonably expect to run on SSD or NVMe these days, significantly reducing the cost of fragmentation. The second reason deserves a more in-depth answer.

In order to release disk space back to the operating system, you have to do one of three things:

  • Store the data across multiple files and delete a file where it is no longer in use.
  • Run compaction, basically re-build the database from scratch in a compact form.
  • Use advanced features such as sparse files (hole punching) to return space to the file system without changing the file size.

The first option, using multiple files, is possible but pretty complex. Mostly because of the details of how you split the data across multiple files, the fact that a single entry in an otherwise empty file will prevent its deletion, etc. There are also practical issues, such as the number of open file handles that are allowed, internal costs at the operating system level, etc.

Compaction, on the other hand, requires that you have enough free space available to run the compaction in the first place. In other words, if your disk is 85% full and you delete 30% of the data, you don’t have enough free space to run a compaction.

Another consideration for compaction is that it can be really expensive. Running compaction on a 100GB database, for example, can easily take hours and in the cloud will very quickly exhaust your I/O credits.

RavenDB & Voron have supported compaction for over a decade, but it was always something that you did on very rare occasions. A user had to manually trigger it, and the downsides are pretty heavy, as you can see.

In most cases, I have to say, returning disk space back to the operating system is not something that is that interesting. That free space is handled by RavenDB and will be reused before we’ll allocate any additional new space from the OS. However, this is one of those features that keep coming up, because we go against users’ expectations.

The final option I discussed is using hole punching or sparse files (the two are pretty much equivalent - different terms between operating systems). The idea is that we can go to the operating system and tell it that a certain range in the file is not used, and that it can make use of that disk space again. Any future read from that range will return zeroes. If you write to this region, the file system will allocate additional space for those writes.
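To make that concrete, here is a minimal sketch of what hole punching looks like at the API level, assuming Linux and glibc; the file name and offsets are made up for illustration. The Windows equivalent (marking the file as sparse and zeroing a range) appears in a later post on this page.

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <linux/falloc.h>

#define MB (1024 * 1024)

int main(void)
{
    int fd = open("datafile.bin", O_RDWR);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    // Punch a 1 MB hole at offset 4 MB. FALLOC_FL_PUNCH_HOLE must be combined
    // with FALLOC_FL_KEEP_SIZE: the logical file size stays the same, only the
    // backing blocks are released, and reads from the range return zeroes.
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 4 * MB, 1 * MB) == -1) {
        perror("fallocate");
        return 2;
    }

    close(fd);
    return 0;
}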

That sparse-file behavior is problematic for RavenDB, because we used to use memory-mapped I/O to write to the disk. If there isn’t sufficient space to complete a write, memory-mapped I/O is going to generate a segmentation fault / access violation. Getting an access violation because the disk is full is not okay by us, so we couldn’t use sparse files. The only option we were able to offer to reduce disk space was full compaction.

You might have noticed that I used the past tense in the last paragraph. That is because I am no longer limited to using just memory-mapped I/O. Using normal I/O for this purpose works even if we run out of disk space: we get the usual disk-full error (which we are already handling anyway).

Yes, that means that starting with RavenDB 7.1, we’ll automatically release free disk space directly back to the operating system, matching your likely expectations about the behavior. This is done in increments of 1MB, since we still want to reduce fragmentation and the amount of file metadata that the file system needs to manage.

The 1 MB trigger

RavenDB will punch a hole in the file whenever there is a consecutive 1MB of free space. This is important to understand because of fragmentation. If you wrote 100 million documents, each 2 KB in size, and then deleted every second document, what do you think will happen? There won’t be any consecutive 1MB range for us to free.

Luckily, that sort of scenario tends to be pretty rare, and it is far more common to have clustering of writes and deletes, which allow us to take advantage of locality and free the disk space back to the OS automatically.

RavenDB will first use all the free space inside the file, reclaiming sparse regions as needed, before it will request additional disk space from the OS. When we do request additional space, we’ll still get it in large chunks (and without using sparse files). That is because it is far more likely to be immediately used, and we want to avoid giving the file system too much work.

Note that the overall file size is going to stay the same, but the actually used disk space is going to be reduced. We updated the RavenDB Studio to report both numbers, but when browsing the files manually, you need to keep that in mind.

I expect that this will be most noticeable for users who are running on cloud instances, where it is common to size the disks to be just sufficiently big enough for actual usage.

It Just Works

There is no action that you need to take to enable this behavior, and on first start of RavenDB 7.1, it will immediately release any free space already in the data files.

The work was actually completed and merged in August 2024, but it is going to be released sometime in Q2/Q3 of 2025. You might have noticed that there have been a lot of low-level changes targeted at RavenDB 7.1. We need to run them through the wringer to make sure that everything works as it should.

I’m looking forward to seeing this in action; there are some really nice indications about the sort of results we can expect. I’ll talk about that in more detail in another post, since this one is getting long enough.

time to read 8 min | 1581 words

In the previous post, I talked about a massive amount of effort (2+ months of work) and about 25,000 lines of code changes. The only purpose of this task was to remove two locks from the system. During high load, we spent huge amounts of time contending for these locks, so removing them was well worth the effort.

During this work, I essentially found myself in the guts of Voron (RavenDB’s storage engine) and mostly dealing with old code. I’m talking about code that was written between 10 and 15 years ago. I wrote a blog post about it at the time. Working with old code is an interesting experience, especially since most of this code was written by me. I can remember some of my thoughts from the time I wrote it.

Old code is working code, and old code is also something that was built upon. Other parts of the codebase are making assumptions about the way the system behaves. And the longer a piece of code goes without changing, the more likely its behavior is to ossify. Changing old code is hard because of the many ways that such dependencies can express themselves.

I dug through all of this decade-plus old code and I realized something pretty horrible.

It turns out that I made a mistake in understanding how Windows implements buffering for memory-mapped files. I realized my mistake around mid-2024; see the related post for the actual details.

The TLDR summary, however, is that when using unbuffered file I/O with memory-mapped files on Windows, you cannot expect the mapped memory to reflect the data written using the file I/O API. Windows calls it coherence, and it was quite confusing when I first realized what the problem was. It turns out that this applies only to unbuffered I/O and there is no such problem with buffered I/O.
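To illustrate the buffered case, here is a minimal sketch of the kind of experiment that shows it, assuming Windows; the file name and sizes are made up. The mapping is created first, new bytes are then written through the buffered file handle, and the mapped view reflects them.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hFile = CreateFileA("coherence_test.bin",
        GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) {
        printf("Error creating file: %d\n", GetLastError());
        return 1;
    }

    // give the file a size so that it can be mapped
    char zeros[4096] = { 0 };
    DWORD written;
    if (!WriteFile(hFile, zeros, sizeof(zeros), &written, NULL))
        return 2;

    HANDLE hMap = CreateFileMapping(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
    if (hMap == NULL)
        return 3;

    const char* view = MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
    if (view == NULL)
        return 4;

    // write new data through the buffered file handle, back at offset 0
    SetFilePointer(hFile, 0, NULL, FILE_BEGIN);
    const char msg[] = "written via WriteFile";
    if (!WriteFile(hFile, msg, sizeof(msg), &written, NULL))
        return 5;

    // with buffered I/O, the mapped view reflects the write above
    printf("mapped view reads: %s\n", view);

    UnmapViewOfFile(view);
    CloseHandle(hMap);
    CloseHandle(hFile);
    return 0;
}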

The scenario I needed to work with can use buffered I/O, however, which has been a profound shock to me. Large portions of the architecture of Voron are actually shaped by this limitation.

Because I thought that you couldn’t use both file I/O and memory-mapped files at the same time in Windows and get a consistent view of the data (the documentation literally says that, I have to add), RavenDB used memory-mapped I/O to write to the data file. That is a choice, certainly, but not one that I particularly liked. It was just that I was able to make things work and move on to other things.

This is another tale of accidental complexity, actually. I had a problem and found a solution to it, which at the time I considered quite clever. Because I had a solution, I never tried to dig any deeper into it and figure out whether this is the only solution.

This choice of using only memory-mapped I/O to write to the data file had consequences. In particular, it meant that:

  • We had to map the data using read-write mode.
  • There was no simple way to get an error if a write failed - since we just copied the data to memory, there was no actual write to fail. An error writing to disk would show up as a memory access violation (segmentation fault!) or not show up at all.
  • Writing to a page that isn’t in memory may require us to read it first (even if we are overwriting all of it).

I accepted those limitations because I thought that this was the only way to do things. When I realized that I was wrong, that opened up so many possibilities. As part of the refactoring work, the way Voron does things changed significantly. We are now mapping the data file as read-only and writing to it using file I/O.

That means we have a known point of failure if we fail to write. That probably deserves some explanation. Failure to write to the disk can come in a bunch of ways. In particular, successfully writing to a file is not enough to safely store data; you also need to sync the file before you can be assured that the data is safe. The key here is that write + sync ensures that you’ll know whether the operation succeeded or failed.

Here is the old way we were writing to the data file. Conceptually, this looks like this:


auto mem = EnsureFileSize(pagesToWrite[pagesToWriteLength - 1].EndPosition);
for(auto i = 0; i < pagesToWriteLength; i++)
{
    auto page = pagesToWrite[i];
    memcpy(mem + page.Number * 8192, page.Buffer, page.Length);
}

// some time later
if (FAILED(msync(mem)))
   return SYNC_FAILURE;

And here is the first iteration of using the file I/O API for writes.


fallocate_if_needed(pagesToWrite[pagesToWriteLength - 1].EndPosition);
for(auto i = 0; i < pagesToWriteLength; i++)
{
    auto page = pagesToWrite[i];
    if (FAILED(pwrite(page.Number * 8192, page.Buffer, page.Length)))
         return WRITE_FAILURE;
}

// some time later
if (FAILED(fdatasync(file)))
   return SYNC_FAILURE;

Conceptually, this is just the same, but notice that we respond immediately to write failures here.

When we started testing this feature, we realized something really interesting. The new version was much slower than the previous one, and it also generated a lot more disk writes.

I moved mountains for this?

Sometimes you get a deep sense of frustration when you look at benchmark results. The amount of work invested in this change is… pretty high. And from an architectural point of view, I’m loving it. The code is simpler, more robust, and allows us to cleanly do a lot more than we used to be able to.

The code also should be much faster, but it wasn’t. And given that performance is a critical aspect of RavenDB, that may cause us to scrap the whole thing.

Looking more deeply into the issue, it turned out that my statement about old code and the state of the system was spot on. Take a look at the two code snippets above and consider how they look from the point of view of the operating system. In the case of the memcpy() version, there is a strong likelihood that the kernel isn’t even involved (the pages are already paged in), and the only work done here is marking them as dirty (done by the CPU directly).

That means that the OS will figure out that it has stuff to write to the disk either when we call msync() or when its own timer is called. On the other hand, when we call pwrite(), we involve the OS at every stage of the process, making it far more likely that it will start the actual write to the disk earlier. That means that we are wasting batching opportunities.

In other words, because we used memory-mapped writes, we (accidentally, I might add) created a situation where we tried very hard to batch those writes in memory as much as possible. Another aspect here is that we are issuing a separate system call for each page. That means we are paying another high price.

The good thing about this is that we now have a good way to deal with those issues. The pwrite() code above was simply the first version used to test things out. Since we now have the freedom to do so, we can use whatever file I/O approach we want.

In particular, RavenDB 7.1 now supports the notion of write modes, with the following possible options:

  • mmap - exactly like previous versions, uses a writable memory map and memcpy() to write the values to the data file.
  • file_io - uses pwrite() to write the data, one page at a time, as shown above.
  • vectored_file_io - uses pwritev() to write the data, merging adjacent writes to reduce the number of system calls we use (POSIX only, since Windows has strict limits on this capability).
  • io_ring - uses HIORING (Windows) / IO_Uring (Linux) to submit the whole set of work to the kernel as a single batch of operations.

RavenDB will select the appropriate mode for the system on its own, usually selecting io_ring for modern Linux and Windows machines, and vectored_file_io for Mac. You can control that using the RAVEN_VORON_WRITER_MODE environment variable, but that is provided only because we want to have an escape hatch, not something that you are meant to configure.
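To give a sense of what the vectored_file_io mode does, here is a minimal sketch, not the actual RavenDB code: the page_to_write struct and VORON_PAGE_SIZE mirror the native helpers shown elsewhere on this page, and the exact shapes here are an assumption. Runs of pages that are adjacent on disk get merged into a single pwritev() call.

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

#define VORON_PAGE_SIZE 8192
#define MAX_IOV 64 /* keep well under the IOV_MAX limit */

struct page_to_write {
    void   *ptr;
    int64_t page_num;
    int32_t count_of_pages;
};

// Assumes the buffers are sorted by page_num (as they are in Voron).
// Returns 0 on success, -1 on failure (errno is set by pwritev).
int write_pages_vectored(int fd, struct page_to_write *buffers, int count)
{
    int i = 0;
    while (i < count) {
        struct iovec iov[MAX_IOV];
        off_t offset = (off_t)buffers[i].page_num * VORON_PAGE_SIZE;
        int64_t next_page = buffers[i].page_num;
        int n = 0;

        // collect a run of buffers that are contiguous on disk
        while (i < count && n < MAX_IOV && buffers[i].page_num == next_page) {
            iov[n].iov_base = buffers[i].ptr;
            iov[n].iov_len = (size_t)buffers[i].count_of_pages * VORON_PAGE_SIZE;
            next_page += buffers[i].count_of_pages;
            n++;
            i++;
        }

        ssize_t written = pwritev(fd, iov, n, offset);
        if (written < 0)
            return -1;
        // a production version also has to handle partial writes here,
        // just like the io_uring code shown elsewhere on this page
    }
    return 0;
}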

With those changes, we are on a much better footing for overall performance, but we aren’t done yet! I would love to give you the performance numbers, but we didn’t actually run the full test suite with just these changes, because there is more to come. I’ll cover that in the next post.

time to read 10 min | 1875 words

Even though RavenDB 7.0 isn’t quite out of the gate yet (expect the release very soon), I want to start talking about the RavenDB 7.1 release. This release is going to represent the biggest change in RavenDB in about 7 years, and at the same time, it is expected to be a complete non-event for most people.

We care enough about performance that we have both a dedicated team to verify it as well as really talented people whose sole job is to run benchmarks and optimize RavenDB. This year, the goal was to test RavenDB’s performance when running on large machines (many hundreds of GBs of RAM, dozens of cores, etc.). The idea was to find bottlenecks and remove them, and we quickly found a few.

Performance optimization is often a very repetitive process. Find a problem, figure out the root cause, and move on. It is a process full of small changes and 0.1% improvements. The key is that if you practice this routinely, you end up with real measurable results over time, since your changes compound.

Most of the time, these are the changes you celebrate:

Here you can see a 6.6% increase in the number of operations per millisecond. That is a pretty big change, especially since it affects reading documents from storage.

We were roughly two months into this process when we ran into the end of the line. The scenario in question was 95% reads / 5% writes, with the idea of seeing how quickly the system responds under load, but without stressing it to the limits. Take a look at our findings:

What you can see here is that creating a transaction is costly, and over 50% of that cost is due to contention over a lock. We didn’t notice that until we had a lot of cores to go around, since there wasn’t a sufficient number of threads actively contending for this lock. Once we started looking for it, the problem was very apparent and visible on smaller machines as well.

RavenDB creates a single transaction per request, so that means we are spending a lot of time doing essentially nothing useful. What is worse, this is actively spinning. So the more cores we have, the more CPU we’ll use and the less useful work we’ll do. Yuck!

When we looked at the root cause, the problem became pretty apparent. Take a look at the following snippet:


_txCreation.EnterReadLock(); // creating read transaction
try
{
    _cancellationTokenSource.Token.ThrowIfCancellationRequested();


    tx = new LowLevelTransaction(previous.LowLevelTransaction, transactionPersistentContext, context);


    ActiveTransactions.Add(tx);
}
finally
{
    _txCreation.ExitReadLock();
}






using (PreventNewTransactions()) // during commit of write transaction
{
    if (tx.Committed && tx.FlushedToJournal)
        Interlocked.Exchange(ref _transactionsCounter, tx.Id);


    State = tx.State;


    Journal.Applicator.OnTransactionCommitted(tx);
    ScratchBufferPool.UpdateCacheForPagerStatesOfAllScratches();
    Journal.UpdateCacheForJournalSnapshots();


    tx.OnAfterCommitWhenNewTransactionsPrevented();
}

What happens is that during the commit of a write transaction we need to make updates to a bunch of places. We also need all new read transactions to see all our changes at once. So during commit, we take a short lock to update a couple of locations. And when we create a read transaction, we hold a read lock to be able to properly read all those values safely.

This particular code (and the approach in general) dates back to the very early days of Voron:

On one hand, I look at this code and cringe internally. On the other hand, it has served us really well for a long time. Of course, on the gripping hand, that code has been around for long enough to have grown some barnacles.

Notice the OnAfterCommitWhenNewTransactionsPrevented() call we have there? It is meant to allow other operations that need to run while no read transactions are allowed, in order to maintain in-memory consistency.

Fixing this issue was both straightforward and quite complex at the same time. Instead of storing all the data in various places, I introduced the following record:


record EnvironmentStateRecord(
    Pager.State DataPagerState, 
    long TransactionId,
    ImmutableDictionary<long, PageFromScratchBuffer> ScratchPagesTable,
    TreeRootHeader Root,
    long NextPageNumber,
    (long Number, long Last4KWritePosition) Journal,
    object ClientState);

The actual details of what it does aren’t really that interesting. What is important is that we have an immutable record that is being generated by the write transaction. That record contains all the database state that we carry over between transactions. When we want to publish this state, we can just make a single atomic write to a pointer, no need for a lock.
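The pattern itself is language-agnostic. Here is a minimal sketch of it in C (C11 atomics, with made-up fields rather than the actual record): the writer builds a fresh, immutable state record and publishes it with a single atomic pointer store, while readers grab a consistent snapshot with a single atomic load, no lock involved.

#include <stdatomic.h>
#include <stdlib.h>

struct env_state_record {
    long transaction_id;
    long next_page_number;
    // ... the rest of the immutable database state ...
};

static _Atomic(struct env_state_record *) current_state;

// writer side (commit of a write transaction): never mutates a published record
void publish_state(long tx_id, long next_page)
{
    struct env_state_record *next = malloc(sizeof *next);
    if (next == NULL)
        abort();
    next->transaction_id = tx_id;
    next->next_page_number = next_page;

    // release ordering: everything written to *next is visible to any reader
    // that observes the new pointer. The previous record is not freed here;
    // reclaiming it safely needs reference tracking, which the managed code
    // gets for free from the GC.
    atomic_store_explicit(&current_state, next, memory_order_release);
}

// reader side (creating a read transaction): one atomic load, no lock
const struct env_state_record *acquire_state(void)
{
    return atomic_load_explicit(&current_state, memory_order_acquire);
}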

Of course, it can’t be that simple. We needed to essentially convert large portions of the code to take their state from the environment record and provide a way for higher-level code to add their own state to the record on write.

That took about two months of hard work, I think. Not complicated work, just grinding through all of those places, figuring out the best way to represent that, and making all the necessary changes.

Unfortunately, that wasn’t the only lock convoy we found. Files inside RavenDB are held by a class called Pager (since it serves pages of data). A file (Pager) may hold multiple States, which is a fancy way of referring to a memory-mapped range we use. The Pager may hold multiple such mappings concurrently. For example, if we increase the file size, we need to map the memory again - so for a certain period of time, we have two such mappings.

A transaction may create a reference to such a state (and thus may access the mapping) for its lifetime. At the end of the transaction, we can release the state, and if it is the last reference, we’ll close the mapping.

This is implemented using the following code:

This runs for multiple Pagers for each transaction, mind you. The actual AddRef() call is very cheap, but the lock…


public void AddRef() => Interlocked.Increment(ref _refs);

As you can see from the image, this is also pretty old code. The problem was how to deal with the issue. I have to be able to keep track of the mapped memory if a transaction is using it, but at the same time, I want to dispose of it quickly if it isn’t being used.

The answer to this question was to lean on the GC. Instead of implementing AddRef() / Release(), I hold the Pager.State object and handle disposal in the finalizer. If there is a reference to it, the finalizer isn’t called, so this gives me the exact behavior I want, for free.

However… I also need to be able to dispose of this explicitly when closing the pager. A common scenario is that I’m closing the Pager and deleting the directory. If I have an outstanding reference to the file, that will fail.

Here is how I’m dealing with the issue. The Pager holds a Weak-Reference to the State and may explicitly dispose of it:


public void Dispose()
{
    foreach (WeakReference<State> state in _states)
    {
        if (state.TryGetTarget(out var v))
        {
            v.Dispose();
        }
    }
}

That also allowed me to refactor the way we deal with files in general inside Voron. The first version of Voron did everything in managed code (including significant P/Invoke and cross-platform work). At some point, we introduced a native DLL (PAL - Platform Abstraction Layer - written in C) that allowed us to have a more consistent API.

The issue was that there was no clean division of labor between the native C code and the managed code, and some responsibilities were split between them in a way that wasn’t very clean.

RavenDB is using Zig

To be rather more exact, I’m using Zig as a build system to make cross-compilation a lot easier. I initially moved everything to use Zig, but it turns out that Zig isn’t able to produce DLLs that would run properly on older versions of Windows.

At least, nothing that I did convinced Zig to give me something that wouldn’t fail to load. So we are still using MSVC to compile for Windows, and Zig for both Linux & Mac (x64 & ARM). If you have good ideas on how to solve that, I would really appreciate it.

As a final thought, this work took several months, and when I got to the end, we got a pull request with some interesting numbers (+11.3K LOC, -11.2K LOC).

I have to admit that I feel a bit bad about that for the reviewers 🙂. I was really hoping that I would get a net negative result in the number of lines changed, but that couldn’t be helped. The end result is that all of our state is now far simpler, but we didn’t actually change anything in the way we behave. We removed two frequently hit lock convoys and allowed RavenDB to be far more efficient.

How efficient? Well… that is a bit complex to answer, because we didn’t stop here. There are more things that happened in RavenDB 7.1 that I need to talk to you about. We’ll cover them in my next post.

time to read 4 min | 749 words

I’m trying to reason about the behavior of this code, and I can’t decide if this is a stroke of genius or if I’m suffering from a stroke. Take a look at the code, and then I’ll discuss what I’m trying to do below:


HANDLE hFile = CreateFileA("R:/original_file.bin",
    GENERIC_READ | GENERIC_WRITE,
    FILE_SHARE_READ | FILE_SHARE_WRITE,
    NULL,
    OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL,
    NULL);
if (hFile == INVALID_HANDLE_VALUE) {
    printf("Error creating file: %d\n", GetLastError());
    exit(__LINE__);
}

HANDLE hMapFile = CreateFileMapping(hFile, NULL,
    PAGE_READWRITE, 0, 0, NULL);
if (hMapFile == NULL) {
    fprintf(stderr, "Could not create file mapping object: %x\n", GetLastError());
    exit(__LINE__);
}

char* lpMapAddress = MapViewOfFile(hMapFile, FILE_MAP_WRITE, 0, 0, 0);
if (lpMapAddress == NULL) {
    fprintf(stderr, "Could not map view of file: %x\n", GetLastError());
    exit(__LINE__);
}

// modify the 2 MB .. 4 MB range through the memory map
for (size_t i = 2 * MB; i < 4 * MB; i++)
{
    lpMapAddress[i]++;
}

// open the same file a second time, to write to it via the file I/O API
HANDLE hDirect = CreateFileA("R:/original_file.bin",
    GENERIC_READ | GENERIC_WRITE,
    FILE_SHARE_READ | FILE_SHARE_WRITE,
    NULL,
    OPEN_ALWAYS,
    FILE_ATTRIBUTE_NORMAL,
    NULL);

DWORD bytesWritten;
LARGE_INTEGER fileSize;

// write the 6 MB .. 10 MB range of the mapping back to the file, at the same offset
SetFilePointerEx(hDirect, (LARGE_INTEGER){ .QuadPart = 6 * MB }, &fileSize, FILE_BEGIN);
for (int i = 6; i < 10; i++) {
    if (!WriteFile(hDirect, lpMapAddress + i * MB, MB, &bytesWritten, NULL)) {
        fprintf(stderr, "WriteFile direct failed on iteration %d: %x\n", i, GetLastError());
        exit(__LINE__);
    }
}

The idea is pretty simple: I’m opening the same file twice, the first time in buffered mode, and mapping that memory for both reads & writes. The problem is that to flush the data to disk, I have to either wait for the OS or call FlushViewOfFile() and FlushFileBuffers() to actually flush it to disk explicitly.

The problem with this approach is that FlushFileBuffers() has undesirable side effects. So I’m opening the file again, this time for unbuffered I/O. I’m writing to the memory map and then using the same mapping to write to the file itself. On Windows, that goes through a separate path (and may lose coherence with the memory map).

The idea here is that since I’m writing from the same location, I can’t lose coherence. I either get the value from the file or from the memory map, and they are both the same. At least, that is what I hope will happen.

For the purpose of discussion, I can ensure that there is no one else writing to this file while I’m abusing the system in this manner. What do you think Windows will do in this case?

I believe that when I’m writing using unbuffered I/O in this manner, I’m forcing the OS to drop the mapping and refresh from the disk. That is likely the reason why it may lose coherence, because there may be already reads that aren’t served from main memory, or something like that.

This isn’t an approach that I would actually take for production usage, but it is a damn interesting thing to speculate on. If you have any idea what will actually happen, I would love to have your input.

time to read 5 min | 915 words

I would really love to have a better understanding of what is going on here!

If you format a 32 MB disk using NTFS, you’ll get the following result:

So about 10 MB are taken for NTFS metadata. I guess that makes sense, and giving up 10 MB isn’t generally a big deal these days, so I wouldn’t worry about it.

I write a 20 MB file and punch a hole in it between 6 MB and 18 MB (12 MB in total), so we have:

And in terms of disk space, we have:

The numbers match, awesome! Let’s create a new 12 MB file, like so:

And the disk is:

And now I’m running the following code, which maps the first file (with the hole punched in it) and writes 4 MB to it using memory-mapped I/O:


HANDLE hMapFile = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
if (hMapFile == NULL) {
    fprintf(stderr, "Could not create file mapping object: %x\n", GetLastError());
    exit(__LINE__);
}

char* lpMapAddress = MapViewOfFile(hMapFile, FILE_MAP_WRITE, 0, 0, 0);
if (lpMapAddress == NULL) {
    fprintf(stderr, "Could not map view of file: %x\n", GetLastError());
    exit(__LINE__);
}

// write 4 MB into the hole (which starts at 6 MB) through the memory map
for (size_t i = 6 * MB; i < 10 * MB; i++) {
    ((char*)lpMapAddress)[i]++;
}

if (!FlushViewOfFile(lpMapAddress, 0)) {
    fprintf(stderr, "Could not flush view of file: %x\n", GetLastError());
    exit(__LINE__);
}

if (!FlushFileBuffers(hFile)) {
    fprintf(stderr, "Could not flush file buffers: %x\n", GetLastError());
    exit(__LINE__);
}

The end result for this file is:

So with the other file, we have a total of 24 MB in use on a 32 MB disk. And here is the state of the disk itself:

The problem is that there used to be 9.78 MB that were busy when we had a newly formatted disk. And now we are using at least some of that disk space for storing file data somehow.

I’m getting the same behavior when I use normal file I/O:


moveAmount.QuadPart = 6 * MB;
SetFilePointerEx(hFile, moveAmount, NULL, FILE_BEGIN);


for (i = 6 ; i < 10 ; i++) {
    if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
        fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
        exit(__LINE__);
    }
}

So somehow in this sequence of operations, we get more disk space. On the other hand, if I try to write just 22 MB into a single file, it fails. See:


hFile = CreateFileA("R:/original_file.bin", GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
if (hFile == INVALID_HANDLE_VALUE) {
    printf("Error creating file: %d\n", GetLastError());
    exit(__LINE__);
}


for (int i = 0; i < 22; i++) {
    if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
        fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
        exit(__LINE__);
    }
}

You can find the full source here. I would love to understand what exactly is happening and how we suddenly get more disk space usage in this scenario.

time to read 18 min | 3531 words

Today I set out to figure out an answer to a very specific question. What happens at the OS level when you try to allocate disk space for a sparse file and there is no additional disk space?

Sparse files are a fairly advanced feature of file systems. They allow you to define a file whose size is 10GB, but that only takes 2GB of actual disk space. The rest is sparse (takes no disk space and on read will return just zeroes). The OS will automatically allocate additional disk space for you if you write to the sparse ranges.
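A quick way to see this in action is to create a large sparse file and compare its logical size with what is actually allocated. A minimal sketch, assuming Linux; the path and sizes are made up:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

#define GB (1024LL * 1024 * 1024)

int main(void)
{
    int fd = open("sparse.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    // extend the file to 10 GB without writing anything - the whole range is sparse
    if (ftruncate(fd, 10 * GB) == -1) {
        perror("ftruncate");
        return 2;
    }

    struct stat st;
    if (fstat(fd, &st) == -1) {
        perror("fstat");
        return 3;
    }

    // st_size is the logical size; st_blocks counts 512-byte units that are
    // actually allocated, which will be (close to) zero here
    printf("logical size: %lld bytes, allocated: %lld bytes\n",
           (long long)st.st_size, (long long)st.st_blocks * 512);

    close(fd);
    return 0;
}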

This leads to an interesting question: what happens when you write to a sparse file if there is no additional disk space?

Let’s look at the problem on Linux first. We define a RAM disk with 32MB, like so:


sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=32M tmpfs /mnt/ramdisk

And then we write the following code, which does the following (on a disk with just 32MB):

  • Create a file - write 32 MB to it
  • Punch a hole of 8 MB in the file (range is 12MB - 20MB)
  • Create another file - write 4 MB to it (there is now only 4MB available)
  • Open the original file and try to write to the range with the hole in it (requiring additional disk space allocation)


#define _GNU_SOURCE


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <errno.h>
#include <string.h>
#include <sys/random.h>


#define MB (1024 * 1024)


void write_all(int fd, const void *buf, size_t count)
{
    size_t bytes_written = 0;
    const char *ptr = (const char *)buf;


    while (bytes_written < count)
    {
        ssize_t result = write(fd, ptr + bytes_written, count - bytes_written);


        if (result < 0)
        {
            if (errno == EINTR)
                continue;


            fprintf(stderr, "Write error: errno = %d (%s)\n", errno, strerror(errno));
            exit(EXIT_FAILURE);
        }


        if (result == 0)
        {


            fprintf(stderr, "Zero len write is bad: errno = %d (%s)\n", errno, strerror(errno));
            exit(EXIT_FAILURE);
        }


        bytes_written += result;
    }
}


int main()
{
    int fd;
    char buffer[MB];


    unlink("/mnt/ramdisk/fullfile");
    unlink("/mnt/ramdisk/anotherfile");


    getrandom(buffer, MB, 0);


    ssize_t bytes_written;


    fd = open("/mnt/ramdisk/fullfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
    {
        fprintf(stderr, "open full file: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < 32; i++)
    {
        write_all(fd, buffer, MB);
    }
    close(fd);


    fd = open("/mnt/ramdisk/fullfile", O_RDWR);
    if (fd == -1)
    {
        fprintf(stderr, "reopen full file: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }


    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 12 * MB, 8 * MB) == -1)
    {
        fprintf(stderr, "fallocate failure: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }
    close(fd);


    fd = open("/mnt/ramdisk/anotherfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
    {
        fprintf(stderr, "open another file: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }


    for (int i = 0; i < 4; i++)
    {
        write_all(fd, buffer, MB);
    }
    close(fd);


    // Write 8 MB to the hole in the first file
    fd = open("/mnt/ramdisk/fullfile", O_RDWR);
    if (fd == -1)
    {
        fprintf(stderr, "reopen full file 2: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }


    // Seek to the start of the hole
    if (lseek(fd, 12 * MB, SEEK_SET) == -1)
    {
        fprintf(stderr, "seek full file: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < 8; i++)
    {
        write_all(fd, buffer, MB);
    }
    close(fd);


    printf("Operations completed successfully.\n");
    return 0;
}

As expected, this code will fail on the 5th write (since there is no disk space left to allocate on the disk). The error would be:


Write error: errno = 28 (No space left on device)

Here is what the file system reports:


$ du -h /mnt/ramdisk/*
4.0M    /mnt/ramdisk/anotherfile
28M     /mnt/ramdisk/fullfile


$ ll -h /mnt/ramdisk/
total 33M
drwxrwxrwt 2 root   root     80 Jan  9 10:43 ./
drwxr-xr-x 6 root   root   4.0K Jan  9 10:30 ../
-rw-r--r-- 1 ayende ayende 4.0M Jan  9 10:43 anotherfile
-rw-r--r-- 1 ayende ayende  32M Jan  9 10:43 fullfile

As you can see, we have a total of 32 MB of actual size reported, but ll is reporting that we actually have files bigger than that (because we have hole punching).

What would happen if we were to run this using memory-mapped I/O? Here is the code:


fd = open("/mnt/ramdisk/fullfile", O_RDWR);


char *mapped_memory = mmap(NULL, 32 * MB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (mapped_memory == MAP_FAILED)
{
    fprintf(stderr, "fail mmap: errno = %d (%s)\n", errno, strerror(errno));
    exit(EXIT_FAILURE);
}


for (size_t i = (12 * MB); i < (20 * MB); i++)
{
    mapped_memory[i] = 1;
}
munmap(mapped_memory, 32 * MB);
close(fd);

This will lead to an interesting scenario. The file system needs to allocate disk space for the pages we are dirtying (note that we are writing into the hole), and this code will fail with a segmentation fault.

It will fail in the loop, by the way: as part of the page fault that brings the memory in, the file system needs to allocate the disk space. If there is no such disk space, it will fail. The only way for the OS to behave in this case is to fail the write, which leads to a segmentation fault.

I also tried that on Windows. I defined a virtual disk like so:


$ diskpart
create vdisk file="D:\ramdisk.vhd" maximum=32
select vdisk file="D:\ramdisk.vhd"
attach vdisk
create partition primary
format fs=NTFS quick label=RAMDISK
assign letter=R
exit

This creates a 32MB disk and assigns it the letter R. Note that we are using NTFS, which has its own metadata, so we have roughly 21MB or so of usable disk space to play with here.

Here is the Windows code that simulates the same behavior as the Linux code above:


#include <stdio.h>
#include <windows.h>


#define MB (1024 * 1024)


int main() {
    HANDLE hFile, hFile2;
    DWORD bytesWritten;
    LARGE_INTEGER fileSize, moveAmount;
    char* buffer = malloc(MB);
    
    int i;


        DeleteFileA("R:\\original_file.bin");
        DeleteFileA("R:\\another_file.bin");


    hFile = CreateFileA("R:/original_file.bin", GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) {
        printf("Error creating file: %d\n", GetLastError());
        exit(__LINE__);
    }


    for (int i = 0; i < 20; i++) {
        if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
            fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
            exit(__LINE__);
        }
        if (bytesWritten != MB) {
            fprintf(stderr, "Failed to write full buffer on iteration %d\n", i);
            exit(__LINE__);
        }
    }


    FILE_ZERO_DATA_INFORMATION zeroDataInfo;
    zeroDataInfo.FileOffset.QuadPart = 6 * MB; 
    zeroDataInfo.BeyondFinalZero.QuadPart = 18 * MB; 


    if (!DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, NULL, NULL) ||
        !DeviceIoControl(hFile, FSCTL_SET_ZERO_DATA, &zeroDataInfo, sizeof(zeroDataInfo), NULL, 0, NULL, NULL)) {
        printf("Error setting zero data: %d\n", GetLastError());
        exit(__LINE__);
    }




    // Create another file of size 4 MB
    hFile2 = CreateFileA("R:/another_file.bin", GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile2 == INVALID_HANDLE_VALUE) {
        printf("Error creating second file: %d\n", GetLastError());
        exit(__LINE__);
    }




    for (int i = 0; i < 4; i++) {
        if (!WriteFile(hFile2, buffer, MB, &bytesWritten, NULL)) {
            fprintf(stderr, "WriteFile 2 failed on iteration %d: %x\n", i, GetLastError());
            exit(__LINE__);
        }
        if (bytesWritten != MB) {
            fprintf(stderr, "Failed to write full buffer 2 on iteration %d\n", i);
            exit(__LINE__);
        }
    }


    moveAmount.QuadPart = 12 * MB;
    SetFilePointerEx(hFile, moveAmount, NULL, FILE_BEGIN);
    for (i = 0; i < 8; i++) {
        if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
            printf("Error writing to file: %d\n", GetLastError());
            exit(__LINE__);
        }
    }


    return 0;
}

And that gives us the exact same behavior as in Linux. One of these writes will fail because there is no more disk space for it. What about when we use memory-mapped I/O?


HANDLE hMapFile = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
if (hMapFile == NULL) {
    fprintf(stderr, "Could not create file mapping object: %x\n", GetLastError());
    exit(__LINE__);
}


char* lpMapAddress = MapViewOfFile(hMapFile, FILE_MAP_WRITE, 0, 0, 0);
if (lpMapAddress == NULL) {
    fprintf(stderr, "Could not map view of file: %x\n", GetLastError());
    exit(__LINE__);
}


for (i = 0; i < 20 * MB; i++) {
    ((char*)lpMapAddress)[i]++;
}

That results in the expected access violation:

I didn’t bother checking Mac or BSD, but I’m assuming that they behave in the same manner. I can’t conceive of anything else that they could reasonably do.

You can find my full source here.

time to read 2 min | 394 words

RavenDB is meant to be a self-managing database, one that is able to take care of itself without constant hand-holding from the database administrator. That has been one of our core tenets from the get-go. Today I checked the current state of the codebase and we have roughly 500 configuration options that are available to control various aspects of RavenDB’s behavior.

These two statements are seemingly contradictory, because if we have so many configuration options, how can we even try to be self-managing? And how can a database administrator expect to juggle all of those options?

Database configuration is a really finicky topic. For example, RocksDB’s authors flat-out admit that out loud:

Even we as RocksDB developers don't fully understand the effect of each configuration change. If you want to fully optimize RocksDB for your workload, we recommend experiments and benchmarking.

And indeed, efforts were made to tune RocksDB using deep-learning models because it is that complex.

RavenDB doesn’t take that approach; tuning is something that should work out of the box, managed directly by RavenDB itself. Much of that is achieved by not doing things and carefully arranging that the environment will balance itself out in an optimal fashion. But I’ll talk about the Zen of RavenDB another time.

Today, I want to talk about why we have so many configuration options, the vast majority of which you, as a user, should neither use, care about, nor even know of.

The idea is very simple: deploying a database engine is a Big Deal, and as such, something that users are quite reluctant to do. When we hit a problem and a support call is raised, we need to provide some mechanism for the user to fix things until we can ensure that this behavior is accounted for by RavenDB’s defaults.

I treat the configuration options more as escape hatches that allow me to muddle through stuff than as explicit options that an administrator is expected to monitor and manage. Some of those configuration options control whether RavenDB will utilize vectored instructions, or which compression algorithm to use over the wire. If you need to touch them, it is amazing that they exist. If you have to deal with them on a regular basis, we need to go back to the drawing board.

time to read 14 min | 2742 words

In my previous post, I discussed how Linux will silently truncate a big write (> 2 GB) for you. That is expected by the interface of write(). The problem is that this behavior also applies when you use IO_Uring.

Take a look at the following code:


struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
if (!sqe) {
    return 1;
}
io_uring_prep_write(sqe, fd, buffer, BUFFER_SIZE, 0);
io_uring_submit(&ring);


struct io_uring_cqe *cqe;
ret = io_uring_wait_cqe(&ring, &cqe);
if (ret < 0) {
    return 2;
}

If BUFFER_SIZE is 3 GB, then this will write about 2 GB to the file. The number of bytes written is correctly reported, but the complexity this generates is huge. Consider the following function:


int32_t rvn_write_io_ring(
    void *handle,
    int32_t count,
    struct page_to_write *buffers,
    int32_t *detailed_error_code);

There is a set of buffers that I want to write, and the natural way to do that is:


int32_t rvn_write_io_ring(
    void *handle,
    int32_t count,
    struct page_to_write *buffers,
    int32_t *detailed_error_code)
{
    struct handle *handle_ptr = handle;
    for (size_t i = 0; i < count; i++)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(
            &handle_ptr->global_state->ring);
        io_uring_prep_write(sqe,
            handle_ptr->file_fd, 
            buffers[i].ptr, 
            buffers[i].count_of_pages * VORON_PAGE_SIZE, 
            buffers[i].page_num * VORON_PAGE_SIZE
        );
    }
    return _submit_and_wait(&handle_ptr->global_state->ring,
                            count, detailed_error_code);
}


int32_t _submit_and_wait(
    struct io_uring* ring,
    int32_t count,
    int32_t* detailed_error_code)
{
    int32_t rc = io_uring_submit_and_wait(ring, count);
    if(rc < 0)
    {
        *detailed_error_code = -rc;
        return FAIL_IO_RING_SUBMIT;
    }


    struct io_uring_cqe* cqe;
    for(int i = 0; i < count; i++)
    {
        rc = io_uring_wait_cqe(ring, &cqe);
        if (rc < 0)
        {
            *detailed_error_code = -rc;
            return FAIL_IO_RING_NO_RESULT;
        }
        if(cqe->res < 0)
        {
            *detailed_error_code = -cqe->res;
            return FAIL_IO_RING_WRITE_RESULT;
        }
        io_uring_cqe_seen(ring, cqe);
    }
    return SUCCESS;
}

In other words, send all the data to the IO Ring, then wait for all those operations to complete. We verify complete success and can then move on. However, because we may have a write that is greater than 2 GB, and because the interface allows the IO Uring to write less than we thought it would, we need to handle that with retries.

After thinking about this for a while, I came up with the following implementation:


int32_t _submit_writes_to_ring(
    struct handle *handle,
    int32_t count,
    struct page_to_write *buffers,
    int32_t* detailed_error_code)
{
    struct io_uring *ring = &handle->global_state->ring;
    off_t *offsets = handle->global_state->offsets;
    memset(offsets, 0, count * sizeof(off_t));   


    while(true)
    {
        int32_t submitted = 0;


        for (size_t i = 0; i < count; i++)
        {
            off_t offset =  offsets[i];
            if(offset == buffers[i].count_of_pages * VORON_PAGE_SIZE)
                continue;


            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            if (sqe == NULL) // the ring is full, flush it...
                break;


            io_uring_sqe_set_data64(sqe, i);
            io_uring_prep_write(sqe, handle->file_fd, 
                buffers[i].ptr + offset,
                buffers[i].count_of_pages * VORON_PAGE_SIZE - offset,
                buffers[i].page_num * VORON_PAGE_SIZE + offset);


            submitted++;
        }


        if(submitted == 0)
            return SUCCESS;
        
        int32_t rc = io_uring_submit_and_wait(ring, submitted);
        if(rc < 0)
        {
            *detailed_error_code = -rc;
            return FAIL_IO_RING_SUBMIT;
        }


        struct io_uring_cqe *cqe;
        uint32_t head = 0;
        uint32_t i = 0;
        bool has_errors = false;


        io_uring_for_each_cqe(ring, head, cqe) {
            i++;
            uint64_t index = io_uring_cqe_get_data64(cqe);
            int result = cqe->res;
            if(result < 0)
            {
                has_errors = true;
                *detailed_error_code = -result;
            }
            else
            {
                offsets[index] += result;
                if(result == 0)
                {
                    // there shouldn't be a scenario where we return 0 
                    // for a write operation, we may want to retry here
                    // but figuring out if this is a single happening, of if 
                    // we need to retry this operation (_have_ retried it?) is 
                    // complex enough to treat this as an error for now.
                    has_errors = true;
                    *detailed_error_code = EIO;   
                }
            }
        }
        io_uring_cq_advance(ring, i);


        if(has_errors)
            return FAIL_IO_RING_WRITE_RESULT;
    }
}

That is a lot of code, but it is mostly because of how C works. What we do here is scan through the buffers we need to write, as well as scan through an array of offsets that store additional information for the operation.

If the offset to write doesn’t indicate that we’ve written the whole thing, we’ll submit it to the ring and keep going until we either fill the entire ring or run out of buffers to work with. The next step is to submit the work and wait for it to complete, then run through the results, check for errors, and update the offset that we wrote for the relevant buffer.

Then, we scan the buffers array again to find either partial writes that we have to complete (we didn’t write the whole buffer) or buffers that we didn’t write at all because we filled the ring. In either case, we submit the new batch of work to the ring and repeat until we run out of work.

This code assumes that we cannot have a non-error state where we write 0 bytes to the file and treats that as an error. We also assume that an error in writing to the disk is fatal, and the higher-level code will discard the entire IO_Uring if that happens.

The Windows version, by the way, is somewhat simpler. Windows explicitly limits the size of the buffer you can pass to the write() call (and its IO Ring equivalent). It also ensures that it will write the whole thing, so partial writes are not an issue there.

It is interesting to note that the code above will effectively stripe writes if you send very large buffers. Let’s assume that we send it two very large buffers, like so:

  • Buffer 1: offset 1 GB, size 4 GB
  • Buffer 2: offset 10 GB, size 6 GB

The patterns of writes that will actually be executed are:

  1. 1GB .. 3 GB, 10 GB .. 12 GB
  2. 3 GB .. 5 GB, 12 GB .. 14 GB
  3. 14 GB .. 16 GB

I can “fix” that by never issuing writes that are larger than 2 GB and issuing separate writes for each 2 GB range, but that leads to other complexities (e.g., tracking state if I split a write and hit the full ring status, etc.). At those sizes, it doesn’t actually matter in terms of efficiency or performance.

Partial writes are almost always a sign of either very large writes that were broken up or some underlying issue that is already problematic, so I don’t care that much about that scenario in general. For the vast majority of cases, this will always issue exactly one write for each buffer.

What is really interesting from my point of view, however, is how even a pretty self-contained feature can get pretty complex internally. On the other hand, this behavior allows me to push a whole bunch of work directly to the OS and have it send all of that to the disk as fast as possible.

In our scenarios, under load, we may call that with thousands to tens of thousands of pages (each 8 KB in size) spread all over the file. The buffers are actually sorted, so ideally, the kernel will be able to take advantage of that, but even if not, just reducing the number of syscalls will result in performance savings.

time to read 4 min | 764 words

I previously asked what the code below does, and mentioned that it should give interesting insight into the kind of mindset and knowledge a candidate has. Take a look at the code again:


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/stat.h>


#define BUFFER_SIZE (3ULL * 1024 * 1024 * 1024) // 3GB in bytes


int main() {
    int fd;
    char *buffer;
    struct stat st;


    buffer = (char *)malloc(BUFFER_SIZE);
    if (buffer == NULL) {
        return 1;
    }


    fd = open("large_file.bin", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        return 2;
    }


    if (write(fd, buffer, BUFFER_SIZE) == -1) {
        return 3;
    }


    if (fsync(fd) == -1) {
        return 4;
    }


    if (close(fd) == -1) {
        return 5;
    }


    if (stat("large_file.bin", &st) == -1) {
        return 6;
    }


    printf("File size: %.2f GB\n", (double)st.st_size / (1024 * 1024 * 1024));


    free(buffer);
    return 0;
}

This program will output: File size: 2.00 GB

And it will write 2 GB of zeros to the file:


~$ head  large_file.bin  | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
7ffff000

The question is why? And the answer is quite simple. Linux limits a single write() call to just under 2 GB (0x7ffff000 bytes, to be exact). Any write call that attempts to write more than that will only write that much, and you’ll have to call the system again for the rest. This is not an error, mind. The write call is free to write less than the size of the buffer you passed to it.
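The usual way to deal with this is a retry loop around write(), essentially the write_all() helper from the sparse file test program earlier on this page. A minimal sketch:

#include <unistd.h>
#include <errno.h>
#include <stddef.h>

// keep calling write() until the whole buffer has been accepted, handling both
// partial writes and EINTR; returns 0 on success, -1 on error (errno is set)
int write_all(int fd, const void *buf, size_t count)
{
    const char *ptr = buf;
    size_t written = 0;

    while (written < count) {
        ssize_t n = write(fd, ptr + written, count - written);
        if (n < 0) {
            if (errno == EINTR)
                continue;   // interrupted by a signal, just retry
            return -1;      // real error
        }
        if (n == 0)
            return -1;      // should not happen for regular files
        written += (size_t)n;
    }
    return 0;
}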

Windows has the same limit, but it is honest about it

In Windows, all write calls accept a 32-bit int as the size of the buffer, so this limitation is clearly communicated in the API. Windows will also ensure that for files, a WriteFile call that completes successfully writes the entire buffer to the disk.

And why am I writing 2 GB of zeros? In the code above, I’m using malloc(), not calloc(), so I wouldn’t expect the values to be zero. Because this is a large allocation, malloc() calls the OS to provide us with the buffer directly, and the OS is contractually obligated to provide us with zeroed pages.

time to read 3 min | 536 words

Here is a pretty simple C program, running on Linux. Can you tell me what you expect its output to be?


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/stat.h>


#define BUFFER_SIZE (3ULL * 1024 * 1024 * 1024) // 3GB in bytes


int main() {
    int fd;
    char *buffer;
    struct stat st;


    buffer = (char *)malloc(BUFFER_SIZE);
    if (buffer == NULL) {
        return 1;
    }


    fd = open("large_file.bin", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        return 2;
    }


    if (write(fd, buffer, BUFFER_SIZE) == -1) {
        return 3;
    }


    if (fsync(fd) == -1) {
        return 4;
    }


    if (close(fd) == -1) {
        return 5;
    }


    if (stat("large_file.bin", &st) == -1) {
        return 6;
    }


    printf("File size: %.2f GB\n", (double)st.st_size / (1024 * 1024 * 1024));


    free(buffer);
    return 0;
}

And what happens when I run:


head  large_file.bin  | hexdump -C

This shows both surprising behavior and serves as a good opening for discussion on a whole bunch of issues. In an interview setting, that can give us a lot of insight into the sort of knowledge a candidate has.
