Ayende @ Rahien

It's a girl

Building data stores – Append Only

One of the interesting aspects in building a data store is that you run head on into things that you would generally leave to the infrastructure. By far, most developers deal with concurrency by relegating that responsibility to a database.

When you write your own database, you have to build this sort of thing. In essence, we have two separate issues here:

  • Maximizing Concurrency – does readers wait for writers? does writers wait for readers? does writers wait for writers?
  • Ensuring Consistency – can I read uncommitted data? can I read partially written data?

As I mentioned in my previous post, there are two major options when building a data store, Transaction Log & Append Only. There are probably a better name for each, but that is how I know them.

This post is going to focus on append only. An append only store is very simple idea in both concept and implementation. It requires that you will always append to the file. It makes things a bit finicky with the type of data structures that you have to use, since typical persistent data structures rely on being able to modify data on the disk. But once you get over that issue, it is actually very simple.

An append only store works in the following manner:

  • On startup, the data is read in reverse, trying to find the last committed transaction.
  • That committed transaction contains pointers to locations in the file where the actual data is stored.
  • A crash in the middle of a write just means garbage at the end of the file that you have to skip when finding the last committed transaction.
  • In memory, the only thing that you have to keep is just the last committed transaction.
  • A reader with a copy of the last committed transaction can execute independently of any other reader / writer. It will not see any changes made by writers made after it started, but it also doesn’t have to wait for any writers.
  • Concurrency control is simple:
    • Readers don’t wait for readers
    • Readers don’t wait for writers
    • Writers don’t wait for readers
    • There can be only one concurrent writer

The last one is a natural fallout from the fact that we use the append only model. Only one thread can write to the end of the file at a given point in time. That is actually a performance boost, and not something that would slow the database down, as you might expect.

The reason for that is pretty obvious, once you start thinking about it. Writing to disk is a physical action, and the head can be only in a single place at any given point in time. By ensuring that all writes go to the end of the file, we gain a big perf advantage since we don’t do any seeks.

Comments

Andrew
06/18/2010 04:20 PM by
Andrew

Good to see you posting again.

I'm not sure if its appropriate for your application, but there is an optimization you can use to possibly do bulk writes instead of doing them one at a time, to improve write performance/concurrency.

When a transaction calls Commit(), instead of taking the only writer and writing that transaction to disk and then releasing the writer (causing other transactions to repeat this step in sequence in a blocking fashion, waiting for IO to complete), you could do async or bulk writes.

This technique has been used before on InnoDB database by having a Prepare() method simply reserve the amount of space needed in the data file by advancing the 'next write position' pointer by the size of the data to be written (so the next transaction can start after our reserved space).

Then the actual Commit() method does the writing in one hit (eg several transactions are written to disk at once).

Of course this means you lose transaction safety when Commit() supposedly returns after the client requests it be committed, it still may be in memory pending a bulk write.

Ayende Rahien
06/18/2010 04:28 PM by
Ayende Rahien

Andrew,

Yeah, that is a nice optimization, but I consider the downside to be pretty bad.

Uriel Katz
06/18/2010 05:52 PM by
Uriel Katz

isn`t a single write/flush atomic?

what i mean if you take all your record in a buffer then write it to a file and flush,is it written as a whole or not?

if yes you can have a sequential id(those can be made thread safe with CAS operations) in each record of the append only file,and when you read backwards you check if you got the biggest id.

Ayende Rahien
06/18/2010 09:40 PM by
Ayende Rahien

Uriel,

Writes to a single sector are supposed to be atomic. The problem is that HD no longer use sectors anymore.

I find it safest not to assume that and assume that writes aren't really atomic at the HD level until fsync is called.

Frans Bouma
06/19/2010 09:39 AM by
Frans Bouma

In Windows, Disk IO is abstracted to a level where you'll never be able to say "this byte is now physically stored on disk at this very moment", you'll leave that to windows, to the HDD low level subsystem, to the HDD hardware and command queue. So this comes down to bundling commands to the HDD in such a way that the HDD diskhead steps as less as possible due to YOUR actions. You can never anticipate on another process getting the CPU and doing a disk step as well, and you shouldn't, that's what the low level subsystems are for. So you bundle as much commands as possible in such a way that diskstepping is avoided as much as possible inside your bundle, and that's what you can do, nothing more. What Andrew suggests seems to me impossible due to the nature of how windows handles IO.

About the article: what if you delete elements from the data? rdbms's use their own filesystem inside files, using pages (which typically are the size of a disk block or smaller, but fit in a diskblock without splitting them up) and have to have a form of re-use to avoid fragmentation. Appending to a file ultimately runs you into the situation where deletes fragment the file a lot. Unless your db is insert only, which is what you aren't striving for IMHO.

Ayende Rahien
06/19/2010 10:06 AM by
Ayende Rahien

Frans,

In Windows you can call FlushFileBuffers to ensure that the data is on the disk.

Without that facility, you couldn't build databases on Windows.

Yes, in a mutli application system, it is entirely possible that two different applications would queue different requests to different parts of the disk at the same time.

This reduce the benefit of seekless writes, but doesn't eliminate them.

Deletes are done by removing the data from the index and writing the new metadata.

Yes, that wastes space, and append only data stores usually handle that by running a compaction process every now and then.

OmariO
06/22/2010 11:56 PM by
OmariO

Has SSDs changed anything?

Ayende Rahien
06/23/2010 10:29 AM by
Ayende Rahien

OmariO,

SSDs means that random reads are much faster, but random writes are still more expensive than sequential writes.

zihotki
07/13/2010 11:37 AM by
zihotki

Also one of the good points of append only data store is that you don't have to lock the data store or shut down it to do a backup. Backups a really fast and cheap.

Thanks a lot for the series and for pointing me to very interesting topics.

Comments have been closed on this topic.