Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969

, @ Q c

Posts: 6,189 | Comments: 45,958

filter by tags archive

The Guts n’ Glory of Database Internals: What the disk can do for you

time to read 2 min | 369 words

I’m currently in the process of getting some benchmark numbers for a process we have, and I was watching some metrics along the way. I have mentioned that disk’s speed can be effected by quite a lot of things. So here are two metrics, taken about 1 minute apart in the same benchmark.

This is using a Samsung PM871 512GB SSD drive, and it is currently running on a laptop, so not the best drive in the world, but certainly a respectable one.

Here is the steady state operation while we are doing a lot of write work. Note that the response time is very high, in computer terms, forever and a half:


And here is the same operation, but now we need to do some cleanup and push more data to the disk, in which case, we get great performance.


But oh dear good, just look at the latency numbers that we are seeing here.

Same machine, local hard disk (and SSD to boot), and we are seeing latency numbers that aren’t even funny.

In this case, the reason for this is that we are flushing the data file along side the journal file. In order to allow to to proceed as fast as possible, we try to parallelize the work so even though the data file flush is currently holding most of the I/O, we are still able to proceed with minimal hiccups and stall as far as the client is concerned.

But this can really bring home the fact that we are actually playing with a very limited pipe, and there little that we can do to control the usage of the pipe at certain points (a single fsync can flush a lot of unrelated stuff) and there is no way to throttle things and let the OS know (this particular flush operation should take more than 100MB/s, I’m fine with it taking a bit longer, as long as I have enough I/O bandwidth left for other stuff).

The Guts n’ Glory of Database Internals: The curse of old age…

time to read 5 min | 922 words

This is a fun series to write, but I’m running out of topics where I can speak about the details at a high level without getting into nitty gritty details that will make no sense to anyone but database geeks. If you have any suggestions for additional topics, I would love to hear about them.

This post, however, is about another aspect of running a database engine. It is all about knowing what is actually going on in the system. A typical web application has very little state (maybe some caches, but that is pretty much about it) and can be fairly easily restarted if you run into some issue (memory leak, fragmentation, etc) to recover most problems while you investigate exactly what is going on. A surprising number of production systems actually have this feature that they just restart on a regular basis, for example. IIS will restart a web application every 29 hours, for example, and I have seen production deployment of serious software where the team was just unaware of that. It did manage to reduce a lot of the complexity, because the application never got around to live long enough to actually carry around that much garbage.

A database tend to be different. A database engine lives for a very long time, typically weeks, months or years, and it is pretty bad when it goes down, it isn’t a single node in the farm that is temporarily missing or slow while it is filling the cache, this is the entire system being down without anything that you can do about it (note, I’m talking about single node systems here, distributed systems has high availability systems that I’m ignoring at the moment). That tend to give you a very different perspective on how you work.

For example, if you are using are using Cassandra, it (at least used to) have an issue with memory fragmentation over time. It would still have a lot of available memory, but certain access pattern would chop that off into smaller and smaller slices, until just managing the memory at the JVM level caused issues. In practice, this can cause very long GC pauses (multiple minutes). And before you think that this is an situation unique to managed databases, Redis is known to suffer from fragmentation as well, which can lead to higher memory usage (and even kill the process, eventually) for pretty much the same reason.

Databases can’t really afford to use common allocation patterns (so no malloc / free or the equivalent) because they tend to hold on to memory for a lot longer, and their memory consumption is almost always dictated by the client. In other words, saving increasing large record will likely cause memory fragmentation, which I can then utilize further by doing another round of memory allocations, slightly larger than the round before (forcing even more fragmentation, etc).  Most databases use dedicated allocators (typically some from of arena allocators) with limits that allows them to have better control of that and mitigate that issue. (For example, by blowing the entire arena on failure and allocating a new one, which doesn’t have any fragmentation).

But you actually need to build this kind of thing directly into the engine, and you need to also account for that. When you have a customer calling with “why is the memory usage going up”, you need to have some way to inspect this and figure out what to do about that. Note that we aren’t talking about memory leaks, we are talking about when everything works properly, just not in the ideal manner.

Memory is just one aspect of that, if one that is easy to look at. Other things that you need to watch for is anything that has a linear cost proportional to your runtime. For example, if you have a large LRU cache, you need to make sure that after a couple of months of running, pruning that cache isn’t going to be an O(N) job running every 5 minutes, never finding anything to prune, but costing a lot of CPU time. The number of file handles is also a good indication of a problem in some cases, some databases engines have a lot of files open (typically LSM ones), and they can accumulate over time until the server is running out of those.

Part of the job of the database engine is to consider not only what is going on now, but how to deal with (sometimes literally) abusive clients that try to do very strange things, and how to manage to handle them. In one particular case, a customer was using a feature that was designed to have a maximum of a few dozen entries in a particular query to pass 70,000+ entries. The amazing thing that this worked, but as you can imagine, all sort of assumptions internal to the that features were very viciously violated, requiring us to consider whatever to have a hard limit on this feature, so it is within its design specs or try to see if we can redesign the entire thing so it can handle this kind of load.

And the most “fun” is when those sort of bugs are only present after a couple of weeks of harsh production systems running. So even when you know what is causing this, actually reproducing the scenario (you need memory fragmented in a certain way, and a certain number of cache entries, and the application requesting a certain load factor) can be incredibly hard.

The Guts n’ Glory of Database Internals: Backup, restore and the environment…

time to read 4 min | 684 words

A lot of the complexities involved in actually building a database engine aren’t related to the core features that you want to have. They are related to what looks like peripheral concerns. Backup / restore is an excellent example of that.

Obviously, you want your database engine to have support for backup and restore. But at the same time, actually implementing that (efficiently and easily) is not something that can be done trivially. Let us consider Redis as a good example; in order to backup its in memory state, Redis will fork a new process, and use the OS’s support for copy on write to have a stable snapshot of the in-memory state that it can then write to disk. It is a very elegant solution, with very little code, and it can be done with the assistance of the operation system (almost always a good thing).

It also exposes you to memory bloat if you are backing up to a slow disk (for example, a remote machine) at the same time that you have a lot of incoming writes. Because the OS will create a copy of every memory page that is touched as long as the backup process is running (on its own copy of the data), the amount of memory actually being used is non trivial. This can lead to swapping, and in certain cases, the OS can decide that it runs out of memory and just start killing random processes (most likely, the actual Redis server that is being used, since it is the consumer of all this memory).

Another consideration to have is exactly what kind of work do you have to do when you restore the data. Ideally, you want to be up and running as soon as possible. Given database sizes today, even reading the entire file can be prohibitively expensive, so you want to be able to read just enough to start doing the work, and then complete the process of finishing up the restore later (while being online). The admin will appreciate it much more than some sort of a spinning circle or a progress bar measuring how long the system is going to be down.

The problem with implementing such features is that you need to consider the operating environment in which you are working. The ideal case is if you can control such behaviors, for example, have dedicated backup & restore commands that the admin will use exclusively. But in many cases, you have admins that do all sorts of various things, from shutting down the database and zipping to a shared folder on a nightly basis to taking a snapshot or just running a script with copy / rsync on the files on some schedule.

Some backup products have support for taking a snapshot of the disk state at a particular point in time, but this goes back to the issues we raised in a previous post about low level hooks. You need to be aware of the relevant costs and implications of those things. In particular, most databases are pretty sensitive to the order in which you backup certain files. If you take a snapshot of the journal file at time T1, but the data file at time T2, you are likely to have data corruption when you restore (the data file contains data that isn’t in the journal, and there is no way to recover it).

The really bad thing about this is that it is pretty easy for this to mostly work, so even if you have a diligent admin who test the restore, it might actually work, except when it will fail when you really need it.

And don’t get me started on cloud providers / virtual hosts that offer snapshots. The kind of snapshotting capabilities that a database requires are very specific (all changes that were committed to disk, in the order they were sent, without any future things that might be in flight) in order to be able to successfully backup & restore. From experience, those are not the kind of promises that you get from those tools.

The Guts n’ Glory of Database Internals: The communication protocol

time to read 4 min | 799 words

With all the focus on writing to disk and ensuring the high speed of the database, I almost missed a very important aspect of databases, how do we actually communicate with the outside world.

In many cases, there is no need for that. Many databases just run either embedded only (LevelDB, LMDB, etc) or just assume that this is someone else’s problem, pretty much. But as it turn out, if you aren’t an embedded database, there is quite a bit that you need to get right when you are working on the network code. For the rest of this post, I’m going to assume that we are talking about a database that clients connect to through the network. It can be either locally (shared memory / pipes / UNIX sockets / TCP) or remotely (pretty much only TCP at this point).

The actual protocol format don’t really matter. I’m going to assume that there is some sort of a streaming connection (think TCP socket) and leave it at that. But the details of how the database communicate are quite important.

In particular, what kind of communication do we have between client and server. Is each request independent of one another, are there conversations, what kind of guarantees do we give for the lifetime of the connection, etc?

Relational databases use a session protocol, in which the client will have a conversation with the server, composed of multiple separate request / replies. Sometimes you can tell the server “just process the requests as fast as you can, and send the replies as well, but don’t stop for sending replies”, which helps, but this is probably the worst possible way to handle this. The problem is that you need a connection per client, and those conversation can be pretty long, so you hold up to the connection for a fairly long duration, which means that you have to wait for the next part of the conversation to know what to do next.

A better approach is to dictate that all operations are strictly one way, so once the server has finished processing a request, it can send the reply to the client and move on to process the next request (from the same client or the other one) immediately. In this case, we have several options, we can pipeline multiple requests on the same channel (allowing out of order replies for requests / replies), we can use the serial nature of TCP connections and have a pool of connections that we can share among clients.

An even better method is if you can combine multiple requests from the server to a single reply from the server. This can be useful if you know that the shape of all the commands are going to be same (for example, if you are asked in insert a million records, you can process them as soon as possible, and only confirm that you accepted every ten thousands or so). It may sound silly, but those kind of things really add up, because the client can just pour the data in and the server will process it as soon as it can.

Other considerations that you have to take into account in the communication protocol is security and authentication. If you are allowing remote clients, then they need some way to authenticate. And that can be expensive (typically authentication required multiple round trips between server & client to verify the credentials, and potentially and out of server check in something like LDAP directory), so you want to authenticate once and then keep on to the same connection for as long as you can.

A more problematic issues comes when talking about the content of your conversation with the database, you typically don’t want to send everything over plain text, so you need to encrypt the channel, which has its own costs. For fun, this means dealing with things like certificates, SSL and pretty complex stuff that very few properly understand. If you can get away with letting something else do that for you, see if you can. Setting up a secured connection that cannot be easily bypassed (man in the middle, fake certs, just weak encryption, etc) is none trivial, and should be left for experts in their field.

Personally, I like to just go with HTTPS for pretty much all of that, because it take a lot of the complexity away, and it means that all the right choices has probably already been made for me. HTTPS, of course, is a request/reply, but things like connection pooling, keep-alive and pipelining can do quite a lot to mitigate that in the protocol level. And something like Web Sockets allows to have most of the benefits of TCP connections, with a lot of the complexity handled for you.

The Guts n’ Glory of Database Internals: The enemy of thy database is…

time to read 4 min | 747 words

You might have noticed a theme here in the past few posts. In order to achieve a good performance and ensure the stability of the system, most database engines has a pretty good relationship with the low level hardware and operating system internals.

You pretty much have to do that, because you want to be able to squeeze every last bit of performance of the system. And it works, quite well, until people start lying to you. And by people, I mean all sort of stuff.

The most obvious one is the hardware. If I’m asking the hardware “please make sure that this is on persisted medium”, and the hardware tell me “sure”, but does not such thing. There is very little that the database engine can do. There are quite a lot of drives there that flat out lie about this. Typically enterprise grade drives will not do that unless you have them configured to survive power loss (maybe they have a battery backup?), but I have seen some production systems that were hard to deal with because of that.

Another common case is when administrators put the database on a remote server. This can be because they have a shared storage setup, or are used to putting all their eggs in one basket, or maybe they already have backup scripts running in that location. Whatever the actual reason, it means that every I/O operation we do (already quite expensive) is now turned into a network call (which still need to do I/O on the other side), so that mess up completely the cost benefit analysis the database does on when to actually call the hardware.

Sometimes you have an attached storage directly to the server, with high end connection that provides awesome I/O and allow you to stripe among multiple drives easily. Sometimes, that is the shared storage for the entire company, and you have to compete for I/O with everything else.

But by far the most nefarious enemy we have seen is Anti Virus of various kinds. This problem is mostly on Windows, where admins will deploy an Anti Virus almost automatically, and set it to the most intrusive levels possible, but I have seen similar cases on Linux with various kernel extensions that interfere with how the system works. In particular, timing and contracts are sometimes (often) are broken by such products, and because it is done in an extremely low level manner, the database engine typically has no idea that this happened, or how to recover. For fun, trying to figure out if an Anti Virus is installed (so you can warn the admin to set it up correctly) is one of the behaviors that most Anti Virus will look for and flag you as a virus.

Now, we have run into this with Anti Virus a lot, but the same applies for quite a lot of things. Allowing an indexing service to scan the database files, or putting them on something like Dropbox folder or pretty much anything that interfere with how the data gets persisted to disk will cause issues. And when that happens, it can be really tricky to figure out who is at fault.

Finally, and very common today, we have the cloud. The I/O rates in the cloud are typically metered, and in some clouds you get I/O rates that you would from a bad hard disk 10 years ago. What is worst, because the cloud environment is often shared, it means that you are very vulnerable to noisy neighbors. And that means that two identical I/O requests will complete, the first in 25ms, and the second in 5,000 ms (not a mistake, that is 5 seconds!).  Same file, same system, same size of request, same everything, spaced two seconds apart, and you hit something, and your entire performance work goes down the drain. You can get reserved IOPS, which helps, but you need to check what you are getting. On some clouds you get concurrent IOPS, which is nice, but the cost of serial IOPS (critical for things like journals) remains roughly the same. This is especially true if you need to make unbuffered I/O, or use fsync() on those systems.

We have actually had to add features to the product to measure I/O rates independently, so we can put the blame where it belongs (your drives gives me 6 MB/sec on our standard load, this is the maximum performance I can give under this situation).

The Guts n’ Glory of Database Internals: Writing to a data file

time to read 3 min | 556 words

imageSo, we now know how we can write to a journal file efficiently, but a large part of doing that is relying on the fact that we are never actually going to read from the journal file. In other words, this is like keeping your receipts in case of an audit. You do that because you have to, but you really don’t want to ever need it, and you just throw it in the most efficient way to a drawer and will sort if out when you need to.

In most database engines that implement a journal, there is this distinction, the journal is strict for durability and recovery, and the data file(s) are used to actually store the data in order to operate. In our case, we’ll assume a single journal and a single data file.

On every commit, we’ll write to the journal file, as previously discussed, and we ensure that the data is safely on the disk. But what happens to writes on the data file?

Simple, absolutely nothing. Whenever we need to write to the data file, we make buffered writes into the data file, which goes into the big chain of buffers that merge / reorder and try to optimize our I/O. Which is great, because we don’t really need to care about that. We can let the operating system handle all of that.

Up to a point, of course.

Every now and then, we’ll issue an fsync on the data file, forcing the file system to actually flush all those buffered writes to disk. We can do this in an async manner, and continue operations on the file. At the time that the fsync is done (which can be a lot of time, if we had a lot of write and a busy server), we know what is the minimum amount of data that was already written to the data file and persisted on disk. Since we can match it up to the position of the data on the journal, we can safely say that the next time we recover, we can start reading the journal from that location.

If we had additional writes, from later in the journal file, that ended up physically in the data file, it doesn’t matter, because they will be overwritten by the journal entries that we have.

Doing it this way allows us to generate large batches of I/O, and in most cases, allow the operating system the freedom to flush things from the buffers on its own timeline, we just make sure that this doesn’t get into degenerate case (where we’ll need to read tens of GB of journal files) by forcing this every now and then, so recovery is fast in nearly all cases.

All of this I/O tend to happen in async thread, and typical deployments will have separate volumes for logs and data files, so we can parallelize everything instead of competing with one another.

By the way, I’m running this series based on my own experience in building databases, and I’m trying to simplify it as much as it is possible to simplify such a complex topic. If you have specific questions / topics you’ll like me to cover, I’ll be happy to take them.

The Guts n’ Glory of Database Internals: Getting durable, faster

time to read 3 min | 464 words

I mentioned that fsync is a pretty expensive operation to do. But this is pretty much required if you need to get proper durability in the case of a power loss. Most database system tend to just implement fsync and get away with that, with various backoff strategies to avoid the cost of it.

LevelDB by default will not fsync, for example. Instead, it will rely on the operating system to handle writes and sync them to disk, and you have to take explicit action to tell it to action sync the journal file to disk. And most databases give you some level of choice in how you call fsync (MySQL and PostgresSQL, for example, allow you do select fsync, O_DSYNC, none, etc). MongoDB (using WiredTiger) only flush to disk every 100MB (or 2 GB, the docs are confusing), dramatically reducing the cost of flushing, at the expense of potentially losing data.

Personally, I find such choices strange, and when had a direct goal that after every commit, pulling the plug will have no affect on the database. We started out with using fsync (and its family of friends, fdatasync, FlushFileBuffers, etc) and quickly realized that this isn’t going to be sustainable, we could only achieve nice performance by grouping multiple concurrent transactions and get them to the disk in one shot (effectively, trying to buffer ourselves). Looking at the state of other databases, it was pretty depressing.

In an internal benchmark we did, we were in 2nd place, ahead of pretty much everything else. The problem was that the database engine that was ahead of us was faster by x40 times. You read that right, it was forty times faster than we were. And that sucked. Monitoring what it did showed that it didn’t bother to call fsync, instead, it used direct unbuffered I/O (FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH on Windows). Those flags have very strict usage rules (specific alignment for both memory and position in file), but the good thing about them is that they allow us to send the data directly from the user memory all the way to the disk while bypassing all the caches, that means that when we write a few KB, we write a few KB, we don’t need to wait for the entire disk cache to be flushed to disk.

That gave us a tremendous boost. Other things that we did was compress the data that we wrote to the journal, to reduce the amount of I/O, and again, preallocation and writing in sequential manner helps, quite a lot.

Note that in this post I’m only talking about writing to the journal here, since that is typically what is slowing down writes, in my next post, I’ll talk about writes to the data file itself.

The async/await model and its perils

time to read 3 min | 462 words

This blog post about  the color of a function is a really good explanation of the major issues with sync and async code in modern programming.

In C#, we have this lovely async/await model, which allows us to have the compiler handle all the hard work of yielding a thread while there is some sort of an expensive I/O bound operation going on. Having worked with that for quite a while, I can tell you that I really agree with the Bob’s frustrations on the whole concept.

But from my perspective, this come at several significant costs. The async machinery isn’t free, and in some cases (discussed below), the performance overhead of using async is actually significantly higher cost than using the standard blocking model. There is also the issue of the debugability of the solution, if you have a lot of async work going on, it is very hard to see what the state of the overall system is.

In practice, I think that we’ll fall down into the following rules:

For requests that are common, short and most of the work is either getting the data from the client / sending the data to the client, with a short (mostly CPU bound) work, we can use async operations, because they free a thread to do useful work (processing the next request) while we are spending most of our time in doing I/O with the remote machine.

For high performance stuff, where we have a single request doing quite a lot of stuff, or long living, we typically want to go the other way. We want to have a dedicated thread for this operation, and we want to do blocking I/O. The logic is that this operation isn’t going to be doing much while we are waiting for the I/O, so we might as well block the thread and just wait for it in place. We can rely on buffering to speed things up, but there is no point in giving up this thread for other work, because this is rare operation that is we want to be able to explicitly track all the way through.

In practice, with RavenDB, this means that a request such as processing a query is going to be handled mostly async, because we have a short compute bound operation (actually running the query), then we send the data to the client, which should take most of the time. In that time frame, we can give up the request processing thread to do another query. On the other hand, an operation like bulk insert shouldn’t want to give up its thread, because another request coming in and interrupting us means that we will slow down the bulk insert operation.

The Guts n’ Glory of Database Internals: Durability in the real world

time to read 4 min | 729 words

After looking at the challenges involved in ensuring durability, let us see how database engines typically handle that. In general, I have seen three different approaches.

The append only model, the copy-on-write model and journaling. We discussed the append only mode a few times already, the main advantage here is that we always write to the end of the file, and we can commit (and thus make durable), but just calling fsync* on the file after we complete all the writes.

* Nitpicker corner: I’m using fsync as a general term for things like FlushFileBuffers, fdatasync, fsync, etc. I’m focused on the actual mechanics, rather than the specific proper API to use here.

There are some issues here that you need to be aware of, though. A file system (and block devices in general) will freely re-order writes as they wish, so the following bit of code:


May not actually do what you expect it to do. It is possible that during crash recovery, the second write was committed to disk (fully valid and functioning), but the first write was not. So if you validate just the transaction header, you’ll see that you have a valid bit of data, while the file contains some corrupted data.

The other alternative is to copy-on-write, instead of modifying the data in place, we write it (typically at the end of the file), fsync that, then point to the new location from a new file, and fsync that in turn. Breaking it apart into two fsyncs means that it is much more costly, but it also forces the file system to put explicit barriers between the operations, so it can’t reorder things. Note that you can also do that on a single file, with fsync between the two operations. But typically you use that on separate files.

Finally, we have the notion of explicit journaling. The idea is that you dedicate a specific file (or set of files), and then you can just write to them as you go along. Each transaction you write is hashed and verified, so both the header and the data can be checked at read time. And after every transaction, you’ll fsync the journal, which is how you commit the transaction.

On database startup, you read the journal file and apply all the valid transactions, until you reach the end of the file or a transaction that doesn’t match its hash, at which point you know that it wasn’t committed properly. In this case, a transaction is the set of operations that needs to be applied to the data file in order to sync it with the state it had before the restart. That can be modifying a single value, or atomically changing a whole bunch of records.

I like journal files because they allow me to do several nice tricks. Again, we can pre-allocate them in advance, which means that we suffer much less from fragmentation, but more importantly, most of the writes in journal systems are done at the same location (one after another), so we get the benefit of having sequential writes, which is pretty much the best thing ever to getting good performance from the hard disk.

There are things that I’m skipping, of course, append only or copy on write typically write to the data file, which means that you can’t do much there, you need the data available. But a journal file is rarely read, so you can do things like compress the data to the file on the fly, and reduce the I/O costs that you are going to pay. Other things that you can do is release the transaction lock before you actually write to the journal file, let the next transaction start, but not confirm the current transaction to the user until the disk let us know that the write has completed. That way, we can parallelize the costly part of the old transaction (I/O to disk) with the compute bound portion of the new transaction, gaining something in the meantime.

This is typically called early lock release, and while we played with it, we didn’t really see good numbers here to actually implement it for production.

The Guts n’ Glory of Database Internals: Understanding durability with hard disks

time to read 6 min | 1012 words

One of the most important aspect of a database engine is that it needs to support durability. Now, that is a personal opinion as a database author, but I consider this to be a pretty important consideration when selecting a database. Other database engines disagree, from pure in memory databases that lose all the data on restart to databases that makes “best effort” and will work as long as they don’t crash in the wrong place.

Because there are several different metrics for what durable means, I’ll provide several levels of possible durability requirements.

  • None – if the application / server restarts, all the data is lost. Typically is used for in memory databases, explicitly giving up durability for performance.
  • Try – the data is written to disk, but no attempt is made to make sure that it is coherent / up to date. Typically is used to startup an in memory / near in memory database from cold state.
  • Crash – if the database has confirmed a write, and then immediately crashed, the data is still there. Most databases try to get to this level.
  • Power loss – if the database has confirmed a write, even complete power loss of the entire machine will still keep the written data. This is where you wanna be.

Note that in all those cases, I’m talking about single node databases, distributed stuff is a lot more complex, so I’ll not touch it here.

This is when talking about durability, but there is also the notion of atomicity, in other words, a transaction that is composed of multiple operations should either be there complete (and certain if confirmed to the client) or not be there at all (rollback), there should never be a situation where some of the records went in, and some didn’t.

Finally, there is the paranoid mode for durability, in which you don’t trust the hardware, so you write to multiple locations, hash it and validate. Personally, at that level, I think that this is the purpose of the file system to verify that, and this is where the responsibility of the database ends, but if you are stuck with a poor file system choice (like FAT a decade or two ago), that was certainly something that you’ll have to consider.

At any rate, one of the major problems with gaining durability is that you have to pay so much for it. In order to actually be durable, you have to write your changes to the actual disk platter, and that is something that might actually require physical parts to move, so that is something that is very costly. How costly? A high end (15,000 RPM) hard disk can do a theoretical maximum of 250 such writes per second, and that is an extremely theoretical number. In most cases, even on high end hard disks, you’ll see a maximum of a 100 – 150 per second. You can probably double or triple that for high end SSD drive, but those are still very poor numbers, compared to the number of operations you can do in memory and in the CPU in that time frame.

That puts a pretty hard limit on the number of times you can hit the disk, but that is not something that we can tolerate, so the process of making a write in most operation systems, looks like this:


Notice the number of buffers in the middle? Each of them is going to gather the I/O as it can before issuing a (large) write to the layer below them. With the idea that this way, we can amortize the cost of going to the physical disk among many different writes. It works, it works quite well, in fact, to the point where most of the time, you don’t really think how slow the hardware is really is.

But for databases, this is a horrendous problem. To start with, if I’m issuing a write to the disk, I have no idea when this actually hit the platter. For fun, all through this chain (which is actually simplified), each component is free to re-order the writes in any way it feels like it, so we can’t even rely on the order of the writes to ensure consistency. No, in order to protect itself from data loss / corruption in the presence of power loss, we have to tell the operating system to actually skip all those layers and flush directly to disk.

As I said, this is expensive to do. In particular, the normal way to do to make your writes, and then to to call fsync(fd) in order to flush those changes down the chain, all the way to the disk. This has a few issues, however. In particular, note that in the gap between the file system and the disk driver, we’ll lose the correlation between writes made to a particular file and any other writes made to that device. That end result is that the fsync command forces us to flush the entire disk driver buffer (and the disk buffers, etc). In production systems, that can be hundreds of MB that were innocently sitting there, slowly being written to disk, and suddenly you have this disruptive fsync that need to be written, so everything is flushed to disk, and the fsync (which you expect to be short, because you wrote on only a few dozen KB) now takes a minute, because it is actually flushing 100 MB writes from totally different process.

This post is getting long enough, so I’ll defer the actual discussion on how databases actually achieve durability to the next one, in the meantime, just consider the complexities involved in making sure that the data is on the disk, and how much of the design of modern databases is spent in optimizing just this part.


  1. Challenge: The race condition in the TCP stack - one day from now
  2. Challenge: The race condition in the TCP stack, answer - about one day from now
  3. API Design: robust error handling and recovery - 3 days from now
  4. Debugging CoreCLR applications in WinDBG - 4 days from now
  5. Non reproducible / intermittent error handling - 5 days from now

And 21 more posts are pending...

There are posts all the way to Aug 29, 2016


  1. Production postmortem (16):
    05 Feb 2016 - A null reference in our abstraction
  2. The Guts n’ Glory of Database Internals (20):
    18 Jul 2016 - What the disk can do for you
View all series



Main feed Feed Stats
Comments feed   Comments Feed Stats