The Guts n’ Glory of Database Internals: Writing to a data file

time to read 3 min | 556 words

So, we now know how we can write to a journal file efficiently, but a large part of doing that relies on the fact that we are never actually going to read from the journal file. In other words, this is like keeping your receipts in case of an audit. You do it because you have to, but you really hope you’ll never need them, so you toss them into a drawer in the most efficient way possible and sort them out only if you ever have to.

In most database engines that implement a journal, there is a clear distinction: the journal is strictly for durability and recovery, while the data file(s) are used to actually store the data the engine operates on. In our case, we’ll assume a single journal and a single data file.

On every commit, we’ll write to the journal file, as previously discussed, and ensure that the data is safely on the disk.
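
To make that concrete, here is a minimal sketch of what that commit path might look like on a POSIX system. The journal_fd parameter is a hypothetical file descriptor assumed to be opened for appending; the names and error handling are illustrative, not taken from any particular engine.

    #include <unistd.h>

    /* Append the transaction's journal entry, then force it to stable storage.
       Only after fsync() returns do we acknowledge the commit to the client. */
    int commit_transaction(int journal_fd, const void *entry, size_t len)
    {
        ssize_t written = write(journal_fd, entry, len);
        if (written < 0 || (size_t)written != len)
            return -1;              /* short or failed write: the commit fails */

        if (fsync(journal_fd) != 0) /* durability point: the entry is on disk  */
            return -1;

        return 0;                   /* now it is safe to report success        */
    }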

But what happens to writes to the data file? Simple: absolutely nothing, at least not right away. Whenever we need to write to the data file, we issue buffered writes, which go into the big chain of buffers that merge, reorder, and try to optimize our I/O. Which is great, because we don’t really need to care about any of that; we can let the operating system handle it.
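
In code, that boils down to a write with no fsync at all. A sketch, where data_fd is just an assumed file descriptor for the data file, opened without O_DIRECT or O_DSYNC:

    #include <unistd.h>

    /* Hand a dirty page to the OS page cache. There is deliberately no fsync
       here: durability comes from the journal, so the kernel is free to merge,
       reorder, and flush these bytes whenever it sees fit. */
    int write_page(int data_fd, const void *page, size_t page_size, off_t offset)
    {
        ssize_t n = pwrite(data_fd, page, page_size, offset);
        return n == (ssize_t)page_size ? 0 : -1;
    }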

Up to a point, of course.

Every now and then, we’ll issue an fsync on the data file, forcing the file system to actually flush all those buffered writes to disk. We can do this asynchronously and continue operating on the file. By the time the fsync completes (which can take quite a while if we had a lot of writes on a busy server), we know the minimum amount of data that has already been written to the data file and persisted on disk. Since we can match that up to a position in the journal, we can safely say that on the next recovery, we can start reading the journal from that location.
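
A rough sketch of that checkpoint step, under the assumption that we track the journal offset covered by the data-file writes issued so far (last_written and checkpoint_out are made-up names for illustration):

    #include <unistd.h>

    /* last_written is the journal offset covered by the buffered data-file
       writes issued so far; *checkpoint_out is where the next recovery may
       start reading the journal. */
    void checkpoint(int data_fd, long last_written, long *checkpoint_out)
    {
        /* Snapshot the journal position before flushing: everything written
           to the data file up to this point is about to become durable. */
        long candidate = last_written;

        /* fsync can take a long time on a busy server; in a real engine this
           runs on a background thread while commits keep appending to the
           journal. */
        if (fsync(data_fd) == 0)
            *checkpoint_out = candidate;  /* recovery can skip everything
                                             before this journal offset */
    }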

If additional writes from later in the journal happened to make it physically into the data file, that doesn’t matter, because replaying the journal entries we have will simply overwrite them.
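
Recovery then becomes a simple replay loop over the journal tail. The sketch below assumes a made-up entry layout (a small header carrying the data-file offset and length, followed by the page bytes); a real journal would also carry checksums and transaction boundaries.

    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical journal entry header: where the payload goes in the data
       file, and how many bytes it is. */
    struct entry_header { off_t offset; size_t length; };

    void recover(int journal_fd, int data_fd, off_t checkpoint_pos)
    {
        struct entry_header h;

        lseek(journal_fd, checkpoint_pos, SEEK_SET);
        while (read(journal_fd, &h, sizeof h) == (ssize_t)sizeof h) {
            void *payload = malloc(h.length);
            if (payload == NULL ||
                read(journal_fd, payload, h.length) != (ssize_t)h.length) {
                free(payload);
                break;                      /* torn tail: stop replaying */
            }
            /* Re-apply the write; if it already reached the data file before
               the crash, we just overwrite it with the same bytes. */
            pwrite(data_fd, payload, h.length, h.offset);
            free(payload);
        }
        fsync(data_fd);  /* make the replayed state durable before serving */
    }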

Doing it this way allows us to generate large batches of I/O and, in most cases, give the operating system the freedom to flush things from its buffers on its own timeline. We just make sure this doesn’t turn into a degenerate case (where we’d need to read tens of GB of journal files on startup) by forcing a flush every now and then, so recovery is fast in nearly all cases.
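
How often is “every now and then”? That is a policy knob. One simple approach (with thresholds that are purely illustrative) is to checkpoint once enough journal has accumulated, or enough time has passed, since the last data-file fsync:

    #include <stdbool.h>

    #define CHECKPOINT_BYTES   (256L * 1024 * 1024)  /* flush after ~256 MB... */
    #define CHECKPOINT_SECONDS 60                    /* ...or once a minute    */

    /* Either trigger keeps the journal tail that recovery has to replay small,
       without paying for an fsync on every write. */
    bool should_checkpoint(long bytes_since_flush, long seconds_since_flush)
    {
        return bytes_since_flush >= CHECKPOINT_BYTES
            || seconds_since_flush >= CHECKPOINT_SECONDS;
    }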

All of this I/O tends to happen on an async thread, and typical deployments put the journal and the data files on separate volumes, so the two streams of I/O run in parallel instead of competing with one another.

By the way, I’m running this series based on my own experience building databases, and I’m trying to simplify things as much as such a complex topic allows. If you have specific questions or topics you’d like me to cover, I’ll be happy to take them.

More posts in "The Guts n’ Glory of Database Internals" series:

  1. (08 Aug 2016) Early lock release
  2. (05 Aug 2016) Merging transactions
  3. (03 Aug 2016) Log shipping and point in time recovery
  4. (02 Aug 2016) What goes inside the transaction journal
  5. (18 Jul 2016) What the disk can do for you
  6. (15 Jul 2016) The curse of old age…
  7. (14 Jul 2016) Backup, restore and the environment…
  8. (11 Jul 2016) The communication protocol
  9. (08 Jul 2016) The enemy of thy database is…
  10. (07 Jul 2016) Writing to a data file
  11. (06 Jul 2016) Getting durable, faster
  12. (01 Jul 2016) Durability in the real world
  13. (30 Jun 2016) Understanding durability with hard disks
  14. (29 Jun 2016) Managing concurrency
  15. (28 Jun 2016) Managing records
  16. (16 Jun 2016) Seeing the forest for the trees
  17. (14 Jun 2016) B+Tree
  18. (09 Jun 2016) The LSM option
  19. (08 Jun 2016) Searching information and file format
  20. (07 Jun 2016) Persisting information