Minimal incremental backups improvements in Voron
Voron uses a fairly standard journaling system to ensure ACID compliance. Whenever you commit a transaction, we write the changed pages to the log.
Let us take the simplest example, here is a very simple page:
Now we write the following code:
for(var i = 0; i< 3; i++ )
{
using(var tx = env.NewTransaction(TransactionFlags.ReadWrite))
{
var users = tx.ReadTree("users");
users.Add("Users/" + NextUserId(), GetUserData());
tx.Commit();
}
}
After executing this code, the page is going to look like this:
But what is going to happen in the log? We have to commit each transaction separately, so what we end up is the following journal file:
The log contains 3 entries for the same page. Which make sense, because we changed that in 3 different transactions. Now, let us say that we have a crash, and we need to apply the journal. We are going to scan through the journal, and only write Page #23 from tx #43. For the most part, we don’t care for the older version of the page. That is all well and good, until incremental backups comes into play.
The default incremental backup approach used by Voron is simplicity itself. Just create a stable copy of the journal files, and you are pretty much done, there isn’t anything else that you need to do. Restoring involves applying those transactions in the order they appear in the journal, and rainbows and unicorns and peace on Earth.
However, in many cases, the journal files contain a lot of outdated data. I mean, we really don’t care for the state of Page #23 in tx #42. This came into play when we started to consider how we’ll create Raft snapshots on an ongoing basis for large datasets. Just using incremental backup was enough to give us a lot of promise, but it also expose the issue with the size of the journal becoming an issue.
That is when we introduced the notion of a minimal incremental backup. A minimal incremental backup is a lot more complex to create, but it actually restore in the exact same way as a normal incremental backup.
Conceptually, it is a very simple idea. Read the journals, but instead of just copying them as is, scan through them and find the latest version of all the pages that appear in the journal. In this case, we’ll have a Page #23 from tx #43. And then we generate a single transaction for all the latest versions of all the pages in the transactions we read.
We tried it on the following code:
for (int xi = 0; xi < 100; xi++)
{
using (var tx = envToSnapshot.NewTransaction(TransactionFlags.ReadWrite))
{
var tree = envToSnapshot.CreateTree(tx, "test");
for (int i = 0; i < 1000; i++)
{
tree.Add("users/" + i, "john doe/" + i);
}
tx.Commit();
}
}
This is an interesting experiment, because we are making modifications to the same keys (and hence, probably the same pages), multiple times. This also reflects a common scenario in which we have a high rate of updates.
Min incremental backup created an 8Kb file to restore this. While the standard incremental backup created a 67Kb file for the purpose.
That doesn’t sounds much, until you realize that those are compressed files, and the uncompressed sizes were 80Kb for the minimal incremental backup, and 1.57Mb for the incremental backup. Restoring the min incremental backup is a lot more efficient. However, it ain’t all roses.
An incremental backup will result in the ability to replay transactions one at a time. A min incremental backup will merge transactions, so you get the same end results, but you can’t stop midway (for example, to do partial rollback). Taking a min incremental backup is also more expensive, instead of doing primarily file I/O, we have to read the journals, understand them and output the set of pages that we actually care about.
For performance and ease of use, we limit the size of a merged transaction generated by a min incremental backup to about 512 MB. So if you have made changes to over 512MB of your data since the last backup, we’ll still generate a merged view of all the pages, but we’ll actually apply that across multiple transactions. The idea is to avoid trying to apply a very big transaction and consume all resources from the system in this manner.
Note that this feature was developed specifically to enable better behavior when snapshotting state machines for use within the Raft protocol. Because Raft is using snapshots to avoid having an infinite log, and because the Voron journal is effectively the log of all the disk changes made in the system, that was something that we had to do. Otherwise we couldn’t rely on incremental backups for snapshots (we would have just switched the Raft log with the Voron journal, probably we no save in space). That would have forced us to rely on full backups, and we don’t want to take a multi GB backup very often if we can potentially avoid it.
Comments
Since I seem to recall you also wanting to have a mono version of Voron, do you intend to support a mixed windows/unix cluster? And if so, how do you plan on dealing with the endian differences in the binary snapshot files and shipped logs between systems with different host byte order?
Alex, Yes, we'll support a mixed Windows / Linux cluster. Voron is little Endian, and when we'll need to support big Endian systems, we'll deal with that. I'm fine with limiting us to little endian for now, I don't think that there are a lot of big endian machines that we'll want to target in the near future
Comment preview