Ayende @ Rahien

Jun 18 2010

Building data stores – Append Only

Tags:

Databases

One of the interesting aspects of building a data store is that you run head-on into things that you would generally leave to the infrastructure. Most developers, by far, deal with concurrency by handing that responsibility to a database.

When you write your own database, you have to build this sort of thing. In essence, we have two separate issues here:

Maximizing Concurrency – do readers wait for writers? Do writers wait for readers? Do writers wait for writers?
Ensuring Consistency – can I read uncommitted data? Can I read partially written data?

As I mentioned in my previous post, there are two major options when building a data store: Transaction Log and Append Only. There is probably a better name for each, but that is how I know them.

This post is going to focus on append only. An append only store is a very simple idea in both concept and implementation. It requires that you always append to the file and never modify what is already there. That makes things a bit finicky with the type of data structures that you can use, since typical persistent data structures rely on being able to modify data on the disk. But once you get over that issue, it is actually very simple.

An append only store works in the following manner:

On startup, the data is read in reverse, trying to find the last committed transaction.
That committed transaction contains pointers to locations in the file where the actual data is stored.
A crash in the middle of a write just means garbage at the end of the file that you have to skip when finding the last committed transaction.
In memory, the only thing that you have to keep is the last committed transaction.
A reader with a copy of the last committed transaction can execute independently of any other reader / writer. It will not see any changes that writers make after it started, but it also doesn’t have to wait for any writers.
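The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the layout of any real store: the `COMMIT!!` marker, the footer format, the JSON index and all function names are hypothetical, and a `bytearray` stands in for the file.

```python
import json
import struct
import zlib
from typing import Optional, Tuple

MAGIC = b"COMMIT!!"                 # hypothetical 8-byte commit marker
FOOTER = struct.Struct("<8sQQI")    # marker, index offset, index length, CRC32 of index


def append_value(log: bytearray, value: bytes) -> Tuple[int, int]:
    """Append raw data and return the (offset, length) pointer to it."""
    offset = len(log)
    log += value
    return offset, len(value)


def append_commit(log: bytearray, index: dict) -> None:
    """Append the index (key -> [offset, length]), then the commit footer."""
    body = json.dumps(index).encode("utf-8")
    offset = len(log)
    log += body
    log += FOOTER.pack(MAGIC, offset, len(body), zlib.crc32(body))


def find_last_commit(log: bytes) -> Optional[dict]:
    """Scan backwards for the newest commit footer that verifies."""
    end = len(log)
    while True:
        pos = log.rfind(MAGIC, 0, end)
        if pos < 0:
            return None                       # no committed transaction at all
        if pos + FOOTER.size <= len(log):     # footer is not cut off by a crash
            _, offset, length, crc = FOOTER.unpack_from(log, pos)
            body = log[offset:offset + length]
            if offset + length == pos and zlib.crc32(body) == crc:
                return json.loads(body)
        end = pos + len(MAGIC) - 1            # keep looking before this match


def read(log: bytes, snapshot: dict, key: str) -> bytes:
    """Follow the pointer stored in the committed index."""
    offset, length = snapshot[key]
    return bytes(log[offset:offset + length])


log = bytearray()
index = {"users/ayende": append_value(log, b"ayende@ayende.com")}
append_commit(log, index)

append_value(log, b"half-written")   # a later write that never got its commit
log += MAGIC + b"\x00\x01"           # torn footer, as if the process crashed here

snapshot = find_last_commit(bytes(log))
print(read(log, snapshot, "users/ayende"))   # b'ayende@ayende.com'
```

The checksum is what lets the scan tell a real commit from garbage at the end of the file. A production store would also have to handle the marker bytes showing up inside user data, which this sketch ignores.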
Concurrency control is simple:

Readers don’t wait for readers
Readers don’t wait for writers
Writers don’t wait for readers
There can be only one writer at a time

The last one is a natural fallout from the fact that we use the append only model. Only one thread can write to the end of the file at a given point in time. You might expect that to slow the database down, but it is actually a performance boost.
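A minimal in-memory sketch of those four rules, assuming a Python list as a stand-in for the append only file. The class and method names are hypothetical. Readers never take a lock here because the committed index is replaced as a whole and never modified in place; that relies on reference assignment being a single step in CPython, which is an assumption of this sketch.

```python
import threading


class AppendOnlyStore:
    def __init__(self):
        self._data = []                       # stands in for the append only file
        self._committed = {}                  # key -> position; replaced, never mutated
        self._write_lock = threading.Lock()   # only one writer at a time

    def snapshot(self):
        """Readers grab the last committed index without taking any lock."""
        return self._committed

    def read(self, snapshot, key):
        return self._data[snapshot[key]]

    def write(self, changes):
        with self._write_lock:
            index = dict(self._committed)     # private copy for this transaction
            for key, value in changes.items():
                self._data.append(value)      # always append, never overwrite
                index[key] = len(self._data) - 1
            self._committed = index           # publishing the new index is the commit


store = AppendOnlyStore()
store.write({"users/ayende": "AYENDE@AYENDE.COM"})

old = store.snapshot()                        # a reader starts here
store.write({"users/ayende": "ayende@ayende.com"})

print(store.read(old, "users/ayende"))               # AYENDE@AYENDE.COM
print(store.read(store.snapshot(), "users/ayende"))  # ayende@ayende.com
```

The reader that started before the second write keeps seeing the old value, and at no point did it wait for the writer.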

The reason for that is pretty obvious, once you start thinking about it. Writing to a spinning disk is a physical action, and the head can only be in a single place at any given point in time. By ensuring that all writes go to the end of the file, we gain a big perf advantage since we don’t do any seeks.

Building data stores – Transaction log

Tags:

Databases

When you write your own database, you have to build this sort of thing. In essence, we have two separate issues here:

Maximizing Concurrency – do readers wait for writers? Do writers wait for readers? Do writers wait for writers?
Ensuring Consistency – can I read uncommitted data? Can I read partially written data?

As I mentioned in my previous post, there are two major options when building a data store: Transaction Log and Append Only. There is probably a better name for each, but that is how I know them.

This post is going to focus on the transaction log. A transaction log is actually a pretty simple idea, conceptually. It simply requires that you state what you intend to do before you do it, in such a way that you can reverse it.

For example, let us say that I want to store "users/ayende" -> "ayende@ayende.com". All I need to do is write the following to the transaction log:

{
   "Key": "users/ayende",
   "OldVal": "AYENDE@AYENDE.COM",
   "NewVal": "ayende@ayende.com",
   "TxId": 19474774
}

If the data store crashes before the transaction is completed, we can run a recovery process that would resolve any issues in the actual data from the transaction log. Once a transaction is committed, we can safely delete it from the transaction log.
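To make the recovery step concrete, here is a rough Python sketch built on the record shape shown above. Everything beyond the `Key` / `OldVal` / `NewVal` / `TxId` fields is an assumption made for illustration: the `Commit` marker record, the function names, and the use of a plain list and dict in place of the log file and the data file.

```python
import json


def log_intent(log, key, old_val, new_val, tx_id):
    """State what we are about to do, including how to undo it."""
    log.append(json.dumps(
        {"Key": key, "OldVal": old_val, "NewVal": new_val, "TxId": tx_id}))


def log_commit(log, tx_id):
    """Hypothetical marker record: everything in tx_id is now durable."""
    log.append(json.dumps({"Commit": tx_id}))


def recover(log, data):
    """Undo, newest first, every change whose transaction never committed."""
    records = [json.loads(line) for line in log]
    committed = {r["Commit"] for r in records if "Commit" in r}
    for r in reversed(records):
        if "Commit" in r or r["TxId"] in committed:
            continue
        if r["OldVal"] is None:
            data.pop(r["Key"], None)       # the key did not exist before
        else:
            data[r["Key"]] = r["OldVal"]   # put the old value back
    log.clear()                            # nothing left to recover
    return data


data = {"users/ayende": "AYENDE@AYENDE.COM"}
log = []

log_intent(log, "users/ayende", "AYENDE@AYENDE.COM", "ayende@ayende.com", 19474774)
data["users/ayende"] = "ayende@ayende.com"   # change applied, then we "crash"

print(recover(log, data))   # {'users/ayende': 'AYENDE@AYENDE.COM'}
```

The order matters: the intent has to reach the log before the data itself is touched, otherwise there is nothing to roll back from.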

As I said, conceptually it is a very simple idea, but it leads to some interesting implementation challenges:

You can optimize things by not writing to the data file immediately; only the transaction log has to be written to disk right away.
You need to keep track of concurrent transactions touching the same records.
You need to handle background flushing to disk.
The crash recovery process can be... finicky to write.

Concurrency control is something that you essentially have to roll on your own, and you can make it as granular as you feel like. There is some complexity involved in ensuring that you read the appropriate data from the transaction log / data file (based on whether you are inside a transaction, reading data you modified, or outside it, reading the old data), but where it gets really complex is the number of moving parts that you have to deal with.
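The read rule in that paragraph can be sketched as follows, assuming one dict for the committed data file and one dict of pending changes per transaction. The `TxStore` class and its methods are hypothetical, and the sketch leaves out the hard parts (conflict detection, flushing, recovery) on purpose.

```python
class TxStore:
    def __init__(self, data):
        self.data = dict(data)   # committed values, i.e. the data file
        self.pending = {}        # tx_id -> {key: value written inside that transaction}

    def write(self, tx_id, key, value):
        self.pending.setdefault(tx_id, {})[key] = value

    def read(self, key, tx_id=None):
        """Inside a transaction you see your own writes; everyone else sees old data."""
        own = self.pending.get(tx_id, {})
        if key in own:
            return own[key]
        return self.data.get(key)

    def commit(self, tx_id):
        self.data.update(self.pending.pop(tx_id, {}))


store = TxStore({"users/ayende": "AYENDE@AYENDE.COM"})
store.write(19474774, "users/ayende", "ayende@ayende.com")

print(store.read("users/ayende", tx_id=19474774))   # ayende@ayende.com
print(store.read("users/ayende"))                   # AYENDE@AYENDE.COM

store.commit(19474774)
print(store.read("users/ayende"))                   # ayende@ayende.com
```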

Oren Eini

CEO of RavenDB
