Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database


This blog post about the color of a function is a really good explanation of the major issues with sync and async code in modern programming.

In C#, we have this lovely async/await model, which allows us to have the compiler handle all the hard work of yielding a thread while there is some sort of expensive I/O bound operation going on. Having worked with that for quite a while, I can tell you that I really agree with Bob's frustrations with the whole concept.

But from my perspective, this comes at several significant costs. The async machinery isn't free, and in some cases (discussed below), the performance overhead of using async is actually significantly higher than that of the standard blocking model. There is also the issue of the debuggability of the solution: if you have a lot of async work going on, it is very hard to see what the state of the overall system is.

In practice, I think that we'll end up with the following rules:

For requests that are common and short, where most of the work is getting the data from the client or sending the data to the client, with only a short (mostly CPU bound) computation in between, we can use async operations. They free a thread to do useful work (processing the next request) while we are spending most of our time doing I/O with the remote machine.

For high performance work, where a single request does quite a lot of stuff, or is long-lived, we typically want to go the other way. We want a dedicated thread for this operation, and we want to do blocking I/O. The logic is that this operation isn't going to be doing much while we are waiting for the I/O, so we might as well block the thread and just wait in place. We can rely on buffering to speed things up, but there is no point in giving up this thread for other work, because this is a rare operation that we want to be able to explicitly track all the way through.
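A minimal sketch of the two models, using Python's asyncio as a stand-in for C#'s async/await (all names and the stubbed I/O are illustrative, not RavenDB code):

```python
import asyncio
import threading

# Model 1: an async handler for short, common requests. While the
# request awaits I/O, the thread is free to pick up the next request.
async def handle_query(read_request, send_response):
    query = await read_request()        # I/O bound: yields the thread
    result = f"result:{query}"          # short CPU bound work
    await send_response(result)         # I/O bound: yields the thread
    return result

# Model 2: a dedicated thread doing blocking I/O for a rare,
# long-running operation (e.g. bulk insert). We block in place and
# keep the whole operation on one easy-to-track thread.
def bulk_insert(records, write, done):
    for record in records:
        write(record)                   # blocking write, one after another
    done.append(len(records))

# --- tiny demo with stubbed-out I/O ---
async def fake_read():
    return "users/1"

async def fake_send(data):
    pass

query_result = asyncio.run(handle_query(fake_read, fake_send))

done = []
worker = threading.Thread(target=bulk_insert,
                          args=(["a", "b", "c"], lambda r: None, done))
worker.start()
worker.join()
```

The point of the contrast: the async handler releases its thread at every await, while the bulk insert deliberately holds on to one thread for its entire lifetime.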

In practice, with RavenDB, this means that a request such as processing a query is going to be handled mostly async, because we have a short compute bound operation (actually running the query), then we send the data to the client, which should take most of the time. In that time frame, we can give up the request processing thread to handle another query. On the other hand, an operation like bulk insert shouldn't give up its thread, because another request coming in and interrupting us means that we will slow down the bulk insert operation.


After looking at the challenges involved in ensuring durability, let us see how database engines typically handle that. In general, I have seen three different approaches.

The append-only model, the copy-on-write model and journaling. We have discussed the append-only model a few times already; the main advantage here is that we always write to the end of the file, and we can commit (and thus make durable) by just calling fsync* on the file after we complete all the writes.
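In Python, the append-only commit could look roughly like this (the file path and record format are made up for illustration; `os.fsync` stands in for fdatasync/FlushFileBuffers):

```python
import os
import tempfile

def append_commit(path, records):
    """Append records to the end of the file, then make them durable
    with a single fsync once all the writes are done."""
    with open(path, "ab") as f:        # always write at the end
        for record in records:
            f.write(record)
        f.flush()                      # push Python's buffer to the OS
        os.fsync(f.fileno())           # ask the OS to reach stable storage

path = os.path.join(tempfile.mkdtemp(), "data.log")
append_commit(path, [b"tx1:set a=1\n", b"tx2:set b=2\n"])
```

One fsync at the end commits everything written since the last commit, which is what makes the append-only model cheap.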

* Nitpicker corner: I’m using fsync as a general term for things like FlushFileBuffers, fdatasync, fsync, etc. I’m focused on the actual mechanics, rather than the specific proper API to use here.

There are some issues here that you need to be aware of, though. A file system (and block devices in general) will freely re-order writes as they wish, so the following bit of code:

[Image: code that writes the transaction data, then the transaction header, and finally calls fsync]

May not actually do what you expect it to do. It is possible that during crash recovery, the second write was committed to disk (fully valid and functioning), but the first write was not. So if you validate just the transaction header, you’ll see that you have a valid bit of data, while the file contains some corrupted data.
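The danger can be sketched like this: if recovery validates only the transaction header, a lost data write goes unnoticed, while a checksum over the whole transaction catches it (a simplified illustration, not RavenDB's actual on-disk format):

```python
import hashlib

def make_tx(payload: bytes):
    """A toy transaction: a header with a magic marker and the payload
    length, plus a checksum covering header and payload together."""
    header = b"TX" + len(payload).to_bytes(4, "little")
    checksum = hashlib.sha256(header + payload).digest()
    return header, payload, checksum

def header_only_valid(header, payload):
    # Checks only the header: marker present, length matches.
    return (header.startswith(b"TX")
            and len(payload) == int.from_bytes(header[2:6], "little"))

def checksum_valid(header, payload, checksum):
    # Checks the whole transaction against the stored checksum.
    return hashlib.sha256(header + payload).digest() == checksum

header, payload, checksum = make_tx(b"set a=1")

# Simulate the reordering after a crash: the header write reached the
# disk, but the payload write did not (stale bytes remain in its place).
corrupted = b"\x00" * len(payload)

hdr_ok = header_only_valid(header, corrupted)       # passes: looks fine!
sum_ok = checksum_valid(header, corrupted, checksum)  # fails: caught it
```

The header-only check happily accepts the corrupted payload; only the full checksum reveals that the transaction never made it to disk intact.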

The other alternative is copy-on-write: instead of modifying the data in place, we write it (typically at the end of the file), fsync that, then point to the new location from a new file, and fsync that in turn. Breaking the commit into two fsyncs makes it much more costly, but it also forces the file system to put an explicit barrier between the operations, so it can't reorder them. Note that you can also do this within a single file, with an fsync between the two operations, but typically you use separate files.
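A sketch of the two-fsync copy-on-write commit (file names and the pointer-file scheme are invented for illustration; a real engine would also fsync the directory after the rename):

```python
import os
import tempfile

def cow_commit(data_dir, new_data: bytes, version: int):
    """Copy-on-write commit: write the new version to its own file and
    fsync it, then repoint 'current' to it and fsync that in turn.
    The first fsync is the barrier: the pointer can never reach disk
    before the data it points to."""
    data_path = os.path.join(data_dir, f"data.{version}")
    with open(data_path, "wb") as f:
        f.write(new_data)
        f.flush()
        os.fsync(f.fileno())           # barrier 1: the data is durable

    pointer_path = os.path.join(data_dir, "current")
    tmp = pointer_path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data_path.encode())
        f.flush()
        os.fsync(f.fileno())           # barrier 2: the pointer is durable
    os.replace(tmp, pointer_path)      # atomically swap in the new pointer

data_dir = tempfile.mkdtemp()
cow_commit(data_dir, b"v1 contents", 1)
cow_commit(data_dir, b"v2 contents", 2)
```

After a crash, recovery just follows `current`: it either points at the old version or at a fully fsynced new one, never at half-written data.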

Finally, we have the notion of explicit journaling. The idea is that you dedicate a specific file (or set of files), and then you can just write to them as you go along. Each transaction you write is hashed and verified, so both the header and the data can be checked at read time. And after every transaction, you’ll fsync the journal, which is how you commit the transaction.

On database startup, you read the journal file and apply all the valid transactions, until you reach the end of the file or a transaction that doesn’t match its hash, at which point you know that it wasn’t committed properly. In this case, a transaction is the set of operations that needs to be applied to the data file in order to sync it with the state it had before the restart. That can be modifying a single value, or atomically changing a whole bunch of records.
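The journal write and the startup recovery loop can be sketched like this (the record layout, length-prefix plus SHA-256 plus payload, is a made-up illustration; the journal is kept in memory here, where a real one would be fsynced after every transaction):

```python
import hashlib
import io
import struct

def write_tx(journal, payload: bytes):
    """Append one transaction: [length][sha256][payload]. The caller
    would fsync the journal after this to commit the transaction."""
    journal.write(struct.pack("<I", len(payload)))
    journal.write(hashlib.sha256(payload).digest())
    journal.write(payload)

def recover(journal_bytes: bytes):
    """Replay valid transactions from the start of the journal; stop at
    the end of the file or at the first transaction whose hash doesn't
    match, which means it was never committed properly."""
    applied, buf = [], io.BytesIO(journal_bytes)
    while True:
        head = buf.read(4)
        if len(head) < 4:
            break                      # end of journal
        (length,) = struct.unpack("<I", head)
        digest = buf.read(32)
        payload = buf.read(length)
        if len(payload) < length or hashlib.sha256(payload).digest() != digest:
            break                      # torn or unfinished transaction
        applied.append(payload)
    return applied

journal = io.BytesIO()
write_tx(journal, b"set a=1")
write_tx(journal, b"set b=2")
write_tx(journal, b"set c=3")

data = bytearray(journal.getvalue())
data[-1] ^= 0xFF                       # simulate a torn final write
recovered = recover(bytes(data))       # only the first two survive
```

The torn third transaction fails its hash check, so recovery applies the first two and stops, exactly as if the third had never been committed.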

I like journal files because they allow me to do several nice tricks. Again, we can pre-allocate them in advance, which means that we suffer much less from fragmentation. More importantly, most of the writes in journal systems are done at the same location (one after another), so we get the benefit of sequential writes, which is pretty much the best thing ever for getting good performance from the hard disk.

There are things that I'm skipping, of course. Append-only and copy-on-write typically write to the data file itself, which means that you can't do much there; you need the data available. But a journal file is rarely read, so you can do things like compress the data on the fly and reduce the I/O costs that you are going to pay. Another thing you can do is release the transaction lock before you actually write to the journal file, letting the next transaction start, but not confirming the current transaction to the user until the disk lets us know that the write has completed. That way, we can parallelize the costly part of the old transaction (I/O to disk) with the compute bound portion of the new transaction, gaining something in the meantime.
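The on-the-fly compression trick is almost free to sketch, since the journal is write-mostly; zlib here is just a stand-in for whatever codec you'd actually pick:

```python
import zlib

def compress_tx(payload: bytes) -> bytes:
    """Compress a transaction before it goes into the journal, trading
    a little CPU for less I/O. A fast compression level makes sense
    because the journal write path is I/O bound, not CPU bound."""
    return zlib.compress(payload, level=1)

def decompress_tx(data: bytes) -> bytes:
    """Only needed on the rare journal read: startup recovery."""
    return zlib.decompress(data)

payload = b"set user/1 = Oren;" * 100   # repetitive data compresses well
compressed = compress_tx(payload)
```

Since the journal is only read back during recovery, the decompression cost is paid rarely, while the smaller writes are a win on every single commit.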

This is typically called early lock release, and while we played with it, we didn't see good enough numbers to actually implement it for production.
