When disk and hardware fall…
When your back is against the wall, and your only hope is for black magic (and alcohol).
The title of this post is taken from this song. The topic of this post is a pretty sad one, but it is a mandatory discussion when dealing with data that you don’t want to lose. We are going to discuss hard system failures.
The source can be anything from actual physical disk errors to faulty memory causing corruption. The end result is that you have a database that is corrupted in some manner. RavenDB actually has multiple levels of protection to detect such scenarios. All the data is verified with checksums on first load from the disk, and the transaction journal is verified when applying it as well. But stuff happens, and thanks to Murphy, that stuff isn’t always pleasant.
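To make the "verified with checksums" part concrete, here is a minimal sketch in Python of what per-page verification looks like. The 4-byte CRC32 at the start of the page is purely an assumption for illustration; the actual Voron page layout and hash function are different, but the principle is the same: a mismatch means the page cannot be trusted.

```python
import struct
import zlib

PAGE_SIZE = 8 * 1024  # 8KB pages, matching the page size discussed below


def verify_page(page: bytes) -> bool:
    """Compare a page's stored checksum against one computed from its bytes.

    Hypothetical layout: the first 4 bytes hold a CRC32 of the rest of the
    page. A mismatch means the page was corrupted somewhere on the way to
    or from the disk and cannot be trusted.
    """
    (stored,) = struct.unpack_from("<I", page, 0)
    computed = zlib.crc32(page[4:]) & 0xFFFFFFFF
    return stored == computed
```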
One of the hard criteria for the Release Candidate was a good story around catastrophic data recovery. What do I mean by that? I mean that something corrupted the data file in such a way that RavenDB cannot load normally. So sit tight and let me tell you this story.
We first need to define what we are trying to handle. The catastrophic data recovery feature is meant to:
- Recover user data (documents, attachments, etc) stored inside a RavenDB file.
- Recover as much data as possible, disregarding its state, letting the user verify correctness (i.e., it may recover deleted documents).
- Does not include indexed data, configuration, cluster settings, etc. This is because these can be quite easily handled by recreating indexes or setting up a new cluster.
- Does not replace high availability, backups or proper preventive maintenance.
- Does not attempt to handle malicious corruption of the data.
Basically, the idea is that when you are up shit creek, we can hand you a paddle. That said, you are still up shit creek.
I mentioned previously that RavenDB goes to quite some length to ensure that it knows when the data on disk is messed up. We also put a lot of work into making sure that when needed, we can actually do some meaningful work to extract your data out. This means that when looking at the raw file format, we actually have extra data there that isn’t used for anything in RavenDB except by the recovery tools. That reason (the change to the file format) was why this was a Stop-Ship priority issue.
Given that we are already in catastrophic data recovery mode, we can make very few assumptions about the state of the data. A database is a complex beast, involving a lot of moving parts, and the on disk format is very complex and subject to a lot of state and behavior. We are already in catastrophic territory, so we can’t just use the data as we normally would. Imagine a tree where following the pointers to the lower levels might in some cases lead to garbage data or invalid memory. We have to assume that the data has been corrupted.
Some systems handle this by having two copies of the master data records. Given that RavenDB is assumed to run on modern file systems, we don’t bother with this. ReFS on Windows and ZFS on Linux handle that task better, and we assume that production usage will use something similar. Instead, we designed the way we store the data on disk so we can read through the raw bytes and still make sense of what is going on inside it.
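One way to picture "read through the raw bytes and still make sense of it": every page is self-describing, carrying a small header with a checksum, a type marker and the payload size, so a reader that knows nothing about the tree structure can still classify it on its own. The field names and layout below are invented for this sketch; the real format carries different (and more) metadata, but the design intent is the same.

```python
import struct

# Hypothetical self-describing page header, invented for this sketch:
# checksum, page type marker, owning item id, payload length.
PAGE_HEADER = struct.Struct("<I B Q H")

PAGE_DOCUMENT = 1    # page holds a document
PAGE_ATTACHMENT = 2  # page holds (part of) an attachment
PAGE_OVERFLOW = 3    # continuation of a value that spans multiple pages


def read_header(page: bytes):
    """Decode the fixed-size header at the start of a raw page."""
    _checksum, page_type, item_id, length = PAGE_HEADER.unpack_from(page, 0)
    return page_type, item_id, length
```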
In other words, we are going to effectively read one page (8KB) at a time, verify that the checksum matches the expected value and then look at the content. If this is a document or an attachment, we can detect that and recover them, without having to understand anything else about the way the system works. In fact, the recovery tool is intentionally limited to a basic forward scan of the data, without any understanding of the actual file format.
There are some complications when we are dealing with large documents (they can span more than 8KB), and large attachments (we support attachments that are more than 2GB in size) can require us to jump around a bit, but all of this can be done with a very minimal understanding of the file format. The idea was that we can’t rely on any of the complex structures (B+Trees, internal indexes, etc) but can still recover anything that is still recoverable.
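Putting the pieces together, a minimal sketch of such a forward scan might look like this. It reuses PAGE_SIZE, verify_page, PAGE_HEADER, read_header and the PAGE_* markers from the earlier sketches, skips any page whose checksum fails, and writes out whatever looks like a document or an attachment. Handling of values that overflow a single page is omitted for brevity, and none of this is the actual recovery tool's code, just the shape of the idea.

```python
import os

# Reuses PAGE_SIZE, verify_page, PAGE_HEADER, read_header and the PAGE_*
# markers from the sketches above.


def recover(data_file: str, out_dir: str) -> None:
    """Forward scan of the raw data file, one page at a time.

    Pages whose checksum doesn't match are skipped instead of aborting the
    scan; anything that looks like a document or an attachment is written
    out. Values that overflow a single page are ignored here for brevity.
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(data_file, "rb") as f:
        while True:
            page = f.read(PAGE_SIZE)
            if len(page) < PAGE_SIZE:
                break  # end of file (or a truncated trailing page)
            if not verify_page(page):
                continue  # corrupted page: skip it and keep scanning
            page_type, item_id, length = read_header(page)
            if page_type in (PAGE_DOCUMENT, PAGE_ATTACHMENT):
                payload = page[PAGE_HEADER.size:PAGE_HEADER.size + length]
                out_path = os.path.join(out_dir, f"recovered-{item_id}.bin")
                with open(out_path, "wb") as out:
                    out.write(payload)
```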
This also led to an interesting feature. Because we are looking at the raw data, whenever we see a document, we are going to write it out. But that document might have actually been deleted. The recovery tool doesn’t have a way of checking (it is intentionally limited), so it just writes it out. This means that we can use the recovery tool to “undelete” documents. Note that this isn’t actually guaranteed, so don’t assume that you have an “undelete” feature; depending on the state of the moon and the stomach contents of the nearest duck, it may work, or it may not.
The recovery tool is important, but it isn’t magic, so some words of caution are in order. If you have to use the catastrophic data recovery tool, you are in trouble. High availability features such as replication and offsite replicas are the things you should be using, and backups are so important I can’t stress it enough.
The recommended deployment for RavenDB 4.0 is going to be in a High Availability cluster with scheduled backups. The recovery tool is important for us, but you should assume from the get go that if you need to use it, you aren’t in a good place.
Comments
But is there a way (logs, or something) to detect the deletion further on? Because I guess the presence or absence of a document might carry important business value. For example, let's say there is some data that is to be batch processed (essentially a queue flushed to a DB), and when the processing logic has finished working on it, it can be deleted. In some cases (especially money-related ones) it sounds problematic to let that data be re-processed.
I do understand that what you've outlined really is a last resort and that other measures cannot and should not be replaced with any post-catastrophe mechanism. I'm just curious whether this is just something you didn't cover in this post, or whether it really is a mechanism that can revive data that shouldn't be revived.
@Balázs This behaves similarly to how undelete works on an OS. If the OS didn't assign the physical pages to some other file, you can still manage to recover data that shouldn't be there at all. During the recovery process we have no way to know (if the underlying tree was corrupted) whether the data has been deleted or not. If the tree is corrupted you cannot really rely on it. We do have the ability to know if the content itself is corrupted, but not whether it was deleted.
Balázs,
This is an emergency operation, meant to recover data when everything else has already failed and you are at the end of your rope. We are relying on very little to do this recovery, because we can't assume anything about the data. If you have money-related data and you didn't have high availability, backups, etc., I think you have other issues.
More to the point, we considered trying to mark documents as deleted when they are, but that adds significant complexity and work for several important workloads (create / delete, queue, etc.) without any benefit except NOT recovering deleted data. On the other hand, I can most certainly envision people wanting to recover deleted data if possible, so there is no point blocking that.