time to read 3 min | 519 words


In my previous post I talked about the journal and how it is used to recover the database state in case of errors.

In other words, we write everything that we are going to do into the journal, and the code is written to run through the journal and apply these changes to the database.

Ponder this for a moment, and consider the implications of applying the same concept elsewhere. Log shipping is basically taking the journal file from one server and asking another server to apply it as well. That is all.

Everything else is mere detail that isn't really interesting… Well, it is, but first you need to grasp what is so special about log shipping. If you already have a journal and recovery from it, you are probably at least 80% of the way to getting log shipping to work.

Now that you understand what log shipping is, let's talk about its implications. Depending on the way your log is structured, you might be able to accept logs from multiple servers and apply all of them (accepting the fact that they might conflict), but in practice it is almost never done this way. Log shipping typically mandates that one server be designated the master (statically or dynamically), and the other(s) be designated as secondaries (typically read-only copies).

The process by which the logs are shipped is interesting. You can do it on a time basis (every 5 – 15 minutes), every time you close a journal file (64MB – 256MB), etc. This is typically called offline log shipping, because there is a large gap between the master and the secondary. There is also online log shipping, in which every write to the journal is also a socket write to the other server, which accepts the new journal entries, writes them to its own journal, and applies them immediately, resulting in a much narrower delay between the systems. Note that this has its own issues, because it is now a distributed system with all that this implies (what does it mean if the secondary isn't available for an hour, etc.).
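To make the flow concrete, here is a minimal sketch of both sides of online log shipping. The file names, the address, and the apply_entries hook are all invented for the example; it shows the shape of the data flow, not any particular engine.

import socket
import time

JOURNAL_PATH = "master.journal"                  # made-up names, for illustration only
SECONDARY_ADDR = ("secondary.example.net", 9911)

def ship_journal_online():
    """Master side: stream every new journal write to the secondary as it happens."""
    with open(JOURNAL_PATH, "rb") as journal, socket.create_connection(SECONDARY_ADDR) as conn:
        position = 0
        while True:
            journal.seek(position)
            chunk = journal.read(64 * 1024)
            if chunk:
                conn.sendall(chunk)              # online shipping: the gap is a socket round trip
                position += len(chunk)
            else:
                time.sleep(0.05)                 # nothing new yet; offline shipping would instead
                                                 # copy whole closed journal files every few minutes

def receive_shipped_journal(apply_entries, listen_port=9911):
    """Secondary side: append whatever arrives to the local journal, then apply it."""
    with socket.create_server(("", listen_port)) as server:
        conn, _ = server.accept()
        with conn, open("secondary.journal", "ab") as journal:
            while True:
                chunk = conn.recv(64 * 1024)
                if not chunk:
                    break                        # the master went away; this is where the
                                                 # distributed system questions start
                journal.write(chunk)
                journal.flush()
                apply_entries(chunk)             # hypothetical hook that replays the entries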

But journals also allow us to do other fun stuff. In particular, if the journal file records the timestamp of transactions (and most do), it allows us to do what is called "point in time" recovery. Starting from a particular backup, apply all the committed transactions until a certain point in time, bringing the database up to its state at 9:07 AM (one minute before someone ran an UPDATE statement without the WHERE clause).

Note that all of the code that actually implements this has already been written as part of ensuring that the database can recover from errors, and all we need to implement “point in time” recovery is to just stop at a particular point in time. That is a pretty neat thing, in my opinion.
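In sketch form (the record layout and names here are invented, nothing is engine-specific), point in time recovery is just the normal redo loop with one extra comparison:

def point_in_time_recovery(entries, stop_at, apply):
    """Replay committed transactions from a backup, stopping at `stop_at`.

    `entries` is an iterable of dicts such as:
        {"txid": 17, "type": "write",  "position": 1234, "value": b"..."}
        {"txid": 17, "type": "commit", "timestamp": <commit time>}
    `stop_at` is anything comparable to those timestamps (say, 9:07 AM).
    """
    pending = {}                                 # txid -> writes that are not yet committed
    for entry in entries:
        if entry["type"] == "write":
            pending.setdefault(entry["txid"], []).append(entry)
        elif entry["type"] == "commit":
            if entry["timestamp"] > stop_at:
                break                            # everything after 9:07 AM never happened
            for write in pending.pop(entry["txid"], []):
                apply(write["position"], write["value"])
    # whatever is left in `pending` was never committed, so it is simply dropped;
    # that is the same rollback behavior normal recovery already gives us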

time to read 8 min | 1413 words

I talk a lot about journals, and how to make sure they are durable and safe. What I haven’t talked about is what you put in the journal. Carsten reminded me of this lack, and I want to thank him for that.

The first thing we need to understand is what the journal is supposed to do. The whole point of the journal is to make sure that if you hit some sort of error (including power loss), you can use the information in the journal to recover to the same point you were at before the failure. More formally, in order to maintain ACID, we need to ensure that any transactions we committed (and that clients therefore rely on) are there, and that any ongoing transactions are rolled back properly. The journal is the key to that, but how?

A lot of this depends on the database engine itself, and that has an impact on the file format you use in the journal. The way you work with the journal also has a lot to do with the internals of the database.

Let us take a relational database as an example. We want to insert a new row into a table. The size of the row is usually pretty small, a few dozen to a few hundred bytes. So that's cool, but how does it play into the journal? Well, before the database can actually modify the data, it needs to write its intent to do so in the journal, and flush it. Only then can it proceed, knowing that there is a stable record in case of an error.

One way of doing this is to write the following record into the journal file:

{ position: 1234, value: [binary] }

After it is flushed, we can go ahead and modify the file itself. If we crash, during recovery we'll see that we have this entry in the journal, and we'll write it again to the same place in the data file. Either this will be a no-op, because it was previously applied, or we'll re-apply the change. At any rate, as an external observer, the fact that the journal entry exists ensures that on recovery, the value will be the same. Hence, the fact that the entry was written and flushed to the journal ensures that the transaction change was committed.
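The whole dance fits in a few lines. This is only a sketch, with made-up file handles and a JSON-record-per-line journal format; the part that matters is the ordering: journal write, flush, and only then the data file write.

import json
import os

def write_row(journal, data_file, position, value):
    """Write-ahead: record the intent, flush it, and only then touch the data file."""
    record = {"position": position, "value": value.hex()}
    journal.write((json.dumps(record) + "\n").encode())
    journal.flush()
    os.fsync(journal.fileno())                   # from this point on, the change is durable

    data_file.seek(position)                     # if we crash after this line, recovery will
    data_file.write(value)                       # simply redo the same write from the journal

def recover(journal_path, data_file):
    """Re-apply every journal record; applying one twice is harmless."""
    with open(journal_path, "rb") as journal:
        for line in journal:
            record = json.loads(line)
            data_file.seek(record["position"])
            data_file.write(bytes.fromhex(record["value"]))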

So this is the high level overview. The devil is in the details. The simplest journal file just records binary data and position, so we can write it out to the data file on recovery. But while that works, it limits what the database engine can do. If you have multiple concurrent transactions, all of which are generating entries for the log, you don't want to force them to be written to the journal only on transaction commit; you want them to be appended to the log as they come and written out immediately (so you can amortize the cost).

But this means that you are writing uncommitted changes to the journal file. So each of your entries also has a transaction number, and the transaction commit is just another entry in the journal, one that orders applying all the operations from that particular transaction. At this point, you usually have different actors inside the database engine. You have an actor that is responsible for writing to the journal file, and it typically merges entries from multiple concurrent transactions to allow the maximum level of performance. Consider the implications of the following journal actor implementation:
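Here is a rough sketch of such an actor (Python, with invented names; the real implementation is considerably more involved, but the shape is the same):

import os
import queue
import threading
import zlib

class JournalActor:
    """A single writer that owns the journal file and services all transactions."""

    def __init__(self, path):
        self._file = open(path, "ab", buffering=64 * 1024)  # buffered I/O: cheap writes
        self._pending = queue.Queue()
        self._running_crc = 0                    # hash of everything since the last commit
        threading.Thread(target=self._run, daemon=True).start()

    def write_entry(self, payload):
        """Fire and forget: the transaction does not wait for this."""
        self._pending.put(("entry", payload, None))

    def commit(self, payload):
        """Returns an event that is set once the commit is safely on disk."""
        done = threading.Event()
        self._pending.put(("commit", payload, done))
        return done

    def _run(self):
        while True:
            kind, payload, done = self._pending.get()
            self._file.write(payload)                        # buffered, so very fast
            self._running_crc = zlib.crc32(payload, self._running_crc)
            if kind == "commit":
                # write the hash of everything since the last commit, then flush for real
                self._file.write(self._running_crc.to_bytes(4, "little"))
                self._file.flush()
                os.fsync(self._file.fileno())                # actually persisted on spinning rust
                self._running_crc = 0
                done.set()                                   # tell the caller it is durable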

All pending entries to the journal are actually written using buffered I/O, so they are very fast. However, there are a few things to notice here.

  • We keep a record of the hash (CRC, MD5, etc) of all the writes.
  • When we get a transaction commit entry, we write it to the disk, and then write the hash for all the entries that were written since the last transaction commit.
  • We flush to disk, ensuring that we are actually persisting on spinning rust.
  • We let the caller know the transaction has been safely journaled to disk.

This code has a lot of tiny implications that are not immediately obvious. For example, whenever we send a journal entry to be written, we don't need to wait for it; we can proceed immediately to the next change in the transaction. And because the writes are buffered, even the I/O on the journal actor is very cheap.

However, when we send a commit transaction entry, we do a few more things. Writing the hash of all the entries since the last transaction commit allows us to verify that all of them were written successfully. If there was a power failure while we were writing the transaction, we would detect it during recovery, realize that this transaction isn't committed, and treat that point as the end of the log.

And while I don’t need to wait for the journal entry to be flushed to disk, I do have to wait for the transaction itself to be written to disk before I let the user know that it has been committed. This type of structure also has the advantage of transactions sharing the load. If we have a long transaction that does something, its journal entries will be flushed to disk along with the other committing transactions. When we need to commit the long transaction, the amount of work we’ll actually have to do is a lot less.

The structure of the journal entry can range from the simple mode of recording what binary data to modify where, to entries that actually have meaning for the database engine (SET [column id]=[value]). The latter is more complex, but it gives you a more compact representation of journal entries, which means that you can pack more of them into the same space, and I/O is at a premium.
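A toy comparison of the two styles, with both formats invented for this example:

import struct

# Physical entry: where to put which bytes. Trivial to apply on recovery, but if a change
# touched a page we may end up journaling the whole modified region.
page_image = bytes(4096)                         # pretend this is the modified page
physical = struct.pack("<Q", 81920) + page_image # position + raw bytes: 4104 bytes

# Logical entry: what operation to perform. The engine must know how to re-execute it,
# but it is far more compact, so more entries fit in the same amount of journal I/O.
logical = b"SET status=42 WHERE id=17"           # 25 bytes

print(len(physical), len(logical))               # 4104 25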

For the technical reasons mentioned in previous posts (alignment, performance, etc.), we typically write only on page boundaries (in multiples of 4KB), so we need to consider whether it is actually worth our time to write things now, or wait a bit and see if there is additional data coming that can complete the target boundary.
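The bookkeeping behind that decision is simple enough to sketch; the 4KB page size is from above, while the threshold is entirely made up:

PAGE = 4096

def padded_size(n):
    """Round a pending write up to the next page boundary."""
    return (n + PAGE - 1) // PAGE * PAGE

def should_write_now(pending_bytes, waiting_commits):
    """Write immediately if a commit is waiting on us; otherwise, if most of the last
    page would be padding, it may pay to wait for more entries to arrive."""
    if waiting_commits > 0:
        return True
    wasted = padded_size(pending_bytes) - pending_bytes
    return wasted < PAGE // 4                    # arbitrary threshold, purely illustrative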

What I described so far is one particular approach to handling this. It is used, with variations, by databases such as PostgreSQL, Cassandra, etc., though none of them work in exactly this manner. In the case of something like LevelDB, the journal is written one transaction at a time, under the lock, and in multiples of 32KB. The values in the journal are the operations to perform (add / delete from the database), and that is pretty much it. It can afford to do that because all of its data files are immutable. PostgreSQL uses 8KB multiples and stores in the journal only the data it needs to re-apply the transaction, probably because of its MVCC structure.

SQL Server, on the other hand, stores both redo and undo information. It makes sense if you look at the journal actor above. Whenever we write an entry to the journal file, we don't yet know whether the transaction will be committed. So we store the information we need to redo it (if it was) and undo it (if it wasn't). Recovery is a bit more complex, and you can read about the algorithm most frequently used (at least as the basis) here: ARIES.

With Voron, however, we chose to go another way. Each transaction keeps track of all the pages it modified. When the transaction commits, we take all of those pages and compress them (to reduce the amount of I/O we have to do), compute a hash of the compressed data, and write it out on a 4KB boundary. Recovery then becomes trivial. Run through the journal file, verifying each transaction's hash in turn. Once that checks out, we know that the transaction is valid, so we can decompress the dirty pages and write them directly to the data file. Instead of handling an operations log, we handle full pages. This means we don't actually care what modifications ran on a page; we just care about its final state.
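In outline it looks something like this. This is not the actual Voron code: zlib stands in for whatever compression is really used and the record layout is invented, but the commit and recovery steps follow the description above.

import os
import zlib

PAGE = 4096

def commit_transaction(journal, dirty_pages):
    """dirty_pages maps page number -> 4KB page image. Compress, hash, write page-aligned."""
    blob = b"".join(num.to_bytes(8, "little") + data
                    for num, data in sorted(dirty_pages.items()))
    payload = zlib.compress(blob)
    record = (len(payload).to_bytes(4, "little")
              + zlib.crc32(payload).to_bytes(4, "little")
              + payload)
    record += b"\x00" * (-len(record) % PAGE)    # pad out to the next 4KB boundary
    journal.write(record)
    journal.flush()
    os.fsync(journal.fileno())                   # the transaction is now committed

def recover(journal, data_file):
    """Verify each transaction's hash; the first bad one means that transaction never committed."""
    while True:
        header = journal.read(8)
        if len(header) < 8:
            break
        size = int.from_bytes(header[:4], "little")
        crc = int.from_bytes(header[4:8], "little")
        payload = journal.read(size)
        journal.read(-(8 + size) % PAGE)         # skip the alignment padding
        if len(payload) < size or zlib.crc32(payload) != crc:
            break                                # torn or partial write: the log ends here
        blob = zlib.decompress(payload)
        for i in range(0, len(blob), 8 + PAGE):  # replay full page images into the data file
            num = int.from_bytes(blob[i:i + 8], "little")
            data_file.seek(num * PAGE)
            data_file.write(blob[i + 8:i + 8 + PAGE])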

It makes adding new facilities to Voron very simple, because there is very strict layering throughout; changing the way we do something has no impact on the journaling / ACID layer.

time to read 2 min | 370 words

We recently had to deal with a stress test that was failing (very) occasionally. We looked into it and figured out that it was actually exposing a high severity bug, so we took it pretty seriously.

And we kept hitting a wall.

There are a few reasons why you would have a non-reproducible error. The most common issue is a race. For example, if two threads are modifying the same variable without proper synchronization, you get exactly these kinds of symptoms. We have a lot of experience in dealing with that, and all the signs were there. But we still hit a wall.

You see, the problem was that the entire section of the code we were looking at was protected by a lock. There was no possibility of an error because of threading, since this whole part of the code just couldn’t run concurrently anyway.

So why was it failing only occasionally? If it is single threaded, it should be predictable. In fact, the reason there was a lock there, instead of the more complex merge operations we used to have, was specifically to support reproducibility. The other kind of issue that can create this sort of behavior is I/O (which has its own timing issues), but the error happened in a part of the code that was operating on internal data structures, so it wasn't that.

Eventually we figured it out. This was a long background operation, and because we needed to have the lock in place, we had a piece of code similar to this:
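Something along these lines (a sketch, not the actual code; the names and the batch size are invented):

import threading

lock = threading.Lock()

def long_background_operation(transactions, batch_size=1024):
    """Run a long operation under the lock, but in stages, so other work can interleave."""
    for start in range(0, len(transactions), batch_size):
        with lock:                               # hold the lock for one stage only
            for tx in transactions[start:start + batch_size]:
                tx.apply()                       # hypothetical per-transaction work
        # the lock is released here; whatever runs in this gap, and exactly when it runs,
        # is what made the test results differ from one run to the next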

Because this is a long running operation under a lock, we do it in stages, and make sure that other things can run in between. But this was exactly what introduced the variability into the test results, and that is what made it so random and annoying. Once we figured out that this was the cause of the difference, all we had to do was record the full log of operations and execute it again.

The rest was just finding out which of the 200,000 transactions executed actually caused the problem. Mere detail work.
