A lot of the complexities involved in actually building a database engine aren’t related to the core features that you want to have. They are related to what looks like peripheral concerns. Backup / restore is an excellent example of that.
Obviously, you want your database engine to have support for backup and restore. But at the same time, actually implementing that (efficiently and easily) is not something that can be done trivially. Let us consider Redis as a good example; in order to backup its in memory state, Redis will fork a new process, and use the OS’s support for copy on write to have a stable snapshot of the in-memory state that it can then write to disk. It is a very elegant solution, with very little code, and it can be done with the assistance of the operation system (almost always a good thing).
It also exposes you to memory bloat if you are backing up to a slow disk (for example, a remote machine) at the same time that you have a lot of incoming writes. Because the OS will create a copy of every memory page that is touched as long as the backup process is running (on its own copy of the data), the amount of memory actually being used is non trivial. This can lead to swapping, and in certain cases, the OS can decide that it runs out of memory and just start killing random processes (most likely, the actual Redis server that is being used, since it is the consumer of all this memory).
Another consideration to have is exactly what kind of work do you have to do when you restore the data. Ideally, you want to be up and running as soon as possible. Given database sizes today, even reading the entire file can be prohibitively expensive, so you want to be able to read just enough to start doing the work, and then complete the process of finishing up the restore later (while being online). The admin will appreciate it much more than some sort of a spinning circle or a progress bar measuring how long the system is going to be down.
The problem with implementing such features is that you need to consider the operating environment in which you are working. The ideal case is if you can control such behaviors, for example, have dedicated backup & restore commands that the admin will use exclusively. But in many cases, you have admins that do all sorts of various things, from shutting down the database and zipping to a shared folder on a nightly basis to taking a snapshot or just running a script with copy / rsync on the files on some schedule.
Some backup products have support for taking a snapshot of the disk state at a particular point in time, but this goes back to the issues we raised in a previous post about low level hooks. You need to be aware of the relevant costs and implications of those things. In particular, most databases are pretty sensitive to the order in which you backup certain files. If you take a snapshot of the journal file at time T1, but the data file at time T2, you are likely to have data corruption when you restore (the data file contains data that isn’t in the journal, and there is no way to recover it).
The really bad thing about this is that it is pretty easy for this to mostly work, so even if you have a diligent admin who test the restore, it might actually work, except when it will fail when you really need it.
And don’t get me started on cloud providers / virtual hosts that offer snapshots. The kind of snapshotting capabilities that a database requires are very specific (all changes that were committed to disk, in the order they were sent, without any future things that might be in flight) in order to be able to successfully backup & restore. From experience, those are not the kind of promises that you get from those tools.
More posts in "The Guts n’ Glory of Database Internals" series:
- (08 Aug 2016) Early lock release
- (05 Aug 2016) Merging transactions
- (03 Aug 2016) Log shipping and point in time recovery
- (02 Aug 2016) What goes inside the transaction journal
- (18 Jul 2016) What the disk can do for you
- (15 Jul 2016) The curse of old age…
- (14 Jul 2016) Backup, restore and the environment…
- (11 Jul 2016) The communication protocol
- (08 Jul 2016) The enemy of thy database is…
- (07 Jul 2016) Writing to a data file
- (06 Jul 2016) Getting durable, faster
- (01 Jul 2016) Durability in the real world
- (30 Jun 2016) Understanding durability with hard disks
- (29 Jun 2016) Managing concurrency
- (28 Jun 2016) Managing records
- (16 Jun 2016) Seeing the forest for the trees
- (14 Jun 2016) B+Tree
- (09 Jun 2016) The LSM option
- (08 Jun 2016) Searching information and file format
- (07 Jun 2016) Persisting information