Ayende @ Rahien

It's a girl

The building blocks of a database: Transactional & persistent hash table

I am working on managed storage solution for RavenDB in an off again on again fashion for a while now. The main problem Is that doing something like this is not particularly hard, but it is complex. You either have to go with a transaction log or an append only model.

There are more than enough material on the matter, so I won’t touch that. The problem is that building that is going to take time, and probably a lot of it I decided that it is better off to have something than nothing, and scaled back the requirements.

The storage has to have:

  • ACID
  • Multi threaded
  • Fully managed
  • Support both memory and files
  • Fast
  • Easy to work with

The last one is important. Go and take a look at the codebase of any of the available databases. They can be… pretty scary.

But something has to be give, so I decided that to make things easier, I am not going to implement indexing on the file system. Instead, I’ll store the data on the disk, and keep the actual index completely in memory. There is an overhead of roughly 16 bytes per key plus the key itself, let us round it to 64 bytes per held per key. Holding 10 million keys would cost ~600 MB. That sounds like a lot, because it is. But it actually not too bad. It isn’t that much memory for modern hardware. And assuming that our documents are 5 KB in size, we are talking about 50 GB for the database size, anyway.

Just to be clear, the actual data is going to be on the disk, it is the index that we keep in memory. And once we have that decision, the rest sort of follow on its own.

It is intentionally low level interface, and mostly it gives you is a Add/Read/Remove interface, but it gives you multiple tables, the ability to do key and range scans and full ACID compliance (including crash recovery).

And the fun part, it does so in 400 lines!

http://github.com/ayende/Degenerated.Storage

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

Peter Morris
10/29/2010 11:35 AM by
Peter Morris

I remember years ago reading how Interbase does versioned transactions so that people can read existing data with repeatable read whilst someone else is still in a transaction that has updated the data.

Is that the approach you meant when you said "an append only model"?

Patrick Huizinga
10/29/2010 12:08 PM by
Patrick Huizinga

@Peter Morris

I don't know anything about Interbase, and just a little bit of CouchDB, but this is how I understand CouchDB works.

CouchDB uses immutable trees for their tables. When you modify the table, a new tree is created (creating only one new leaf and only the branches from the leaf up to the root). This way you can/could ask CouchDb for the current root and then at your own pace browse through that snapshot, while the rest of the world races by, creating updates by creating new trees.

Patrick Huizinga
10/29/2010 12:17 PM by
Patrick Huizinga

cont. (heh heh..)

CouchDB uses an append-only file to store your changes. At the end of the file, your data, the new leaf and all the new branches will be written.

'Snapshotting' doesn't require an immutable structure and an immutable structure doesn't require append-only, but with my feeble mind I wouldn't want to implement them any other way.

Torvin
10/29/2010 08:07 PM by
Torvin

looks cool!

i have a question though

public Stream CreateTemporaryStream()

{

  string tempFile = Path.Combine(basePath, Path.GetFileName(Path.GetTempFileName()));

  return File.Open(tempFile, FileMode.Create, FileAccess.ReadWrite);

}

who gonna delete that file?

Rickrock
10/29/2010 10:55 PM by
Rickrock

Dear Ayende,

what exactly makes you think that you as a (unquestionable exceptionally talented) developer and architect will not into the same troubles that all those huge clumsy databases ran after a while? There is a big codebase not only because all other coders suck.

Don't get me wrong: Sometimes small projects change the whole way we see things (Ruby is the most prominent example).

How about though starting with a more "humble" idea like: "Let's skip some things a see how far we get with it and whether it is usable for some scenarios."

Simplifying DBs is good, I support that. Conveying or implying the message that they are complicated without any reason is misleading.

Or is a bit of "Not Invented Here" again?

Ayende Rahien
10/30/2010 11:10 AM by
Ayende Rahien

Peter,

I am not aware of the Interbase model, but an append only model make MVCC trivial to work with.

Ayende Rahien
10/30/2010 11:11 AM by
Ayende Rahien

Tovin,

That is done in the ReplaceAtomically function

Ayende Rahien
10/30/2010 11:12 AM by
Ayende Rahien

Rickrock,

Mostly, because I am not trying to target the same feature set.

And by not trying to hit the same feature set, I have drastically simplified my problem set.

Jesús López
10/31/2010 09:24 AM by
Jesús López

Something sounds wrong in this code:

            WriteCommands(cmds, persistentSource.Log);


            persistentSource.FlushData(); // sync the data to disk before doing anything else

            WriteCommands(cmds, persistentSource.Log);

            persistentSource.FlushLog(); // flush all the index changes to disk

You are writting commands to log twice. When do you write commands to data?

moshe
11/01/2010 02:12 PM by
moshe

I spent time going back into Ayende's archives starting from the beginning while he was on jail duty and coding. It reminded me a little of a science fiction short story I once read about a child who was a gifted composer whom they kept apart from all other people in order that he should not hear any music composed by others and have his creativity tainted (whether through imitation or by consciously trying to avoid imitation).

I think Ayende is the type who should try and rewrite everything from scratch.

For example even microsoft re-implemented javascript regex for IE9, as mentioned here:

blogs.msdn.com/.../...-in-internet-explorer-9.aspx

Dmytrii Nagirniak
11/03/2010 07:35 AM by
Dmytrii Nagirniak

If the indexes are kept in memory it will take significant amount of time to rebild them every time.

Can you comment on it please?

Ayende Rahien
11/03/2010 08:10 AM by
Ayende Rahien

Dmyrii,

Not really. We tested it with millions of items, and it just works

Luc
11/17/2010 12:06 PM by
Luc

My first question would be exactly the same as Torvin already wrote: In "Path.Combine(basePath, Path.GetFileName(Path.GetTempFileName()));", who is going to delete that empty file created in the TEMP directory by Path.GetTempFileName()? Despite its name, GetTempFileName() doesn't just generate a name, it creates an actual file you are supposed to delete...

Comments have been closed on this topic.