Ayende @ Rahien

Hi! My name is Oren Eini, founder of Hibernating Rhinos LTD and RavenDB.

The Guts n’ Glory of Database Internals: The curse of old age…

time to read 5 min | 922 words

This is a fun series to write, but I’m running out of topics that I can cover at a high level without getting into nitty-gritty details that will make no sense to anyone but database geeks. If you have any suggestions for additional topics, I would love to hear about them.

This post, however, is about another aspect of running a database engine: knowing what is actually going on in the system. A typical web application has very little state (maybe some caches, but that is pretty much it) and can be fairly easily restarted if you run into an issue (memory leak, fragmentation, etc.), recovering from most problems while you investigate what is actually going on. A surprising number of production systems actually rely on regular restarts. IIS, for example, will recycle a web application every 29 hours, and I have seen production deployments of serious software where the team was simply unaware of that. It did manage to reduce a lot of the complexity, because the application never lived long enough to actually accumulate that much garbage.

A database tends to be different. A database engine lives for a very long time, typically weeks, months or years, and it is pretty bad when it goes down. It isn’t a single node in the farm that is temporarily missing or slow while it fills its cache; this is the entire system being down, without anything that you can do about it (note that I’m talking about single node systems here; distributed systems have high availability mechanisms that I’m ignoring at the moment). That tends to give you a very different perspective on how you work.

For example, if you are using Cassandra, it (at least used to) have an issue with memory fragmentation over time. It would still have a lot of available memory, but certain access patterns would chop that up into smaller and smaller slices, until just managing the memory at the JVM level caused issues. In practice, this can cause very long GC pauses (multiple minutes). And before you think that this is a situation unique to managed databases, Redis is known to suffer from fragmentation as well, which can lead to higher memory usage (and even kill the process, eventually) for pretty much the same reason.

Databases can’t really afford to use common allocation patterns (so no malloc / free or the equivalent) because they tend to hold on to memory for a lot longer, and their memory consumption is almost always dictated by the client. In other words, saving increasingly large records will likely cause memory fragmentation, which a client can then push further by doing another round of allocations, slightly larger than the round before (forcing even more fragmentation, etc.). Most databases use dedicated allocators (typically some form of arena allocator) with limits, which gives them much better control over this and mitigates the issue (for example, by blowing away the entire arena on failure and allocating a new one, which doesn’t have any fragmentation).
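To make the arena idea concrete, here is a minimal sketch of a bump allocator in C; the names and sizes are made up for illustration, and a real engine would layer limits, thread ownership and accounting on top of it:

    #include <stdlib.h>
    #include <stddef.h>

    /* A trivial bump allocator: carve allocations out of one block,
       never free them individually, reclaim everything by resetting. */
    typedef struct {
        char  *start;
        size_t used;
        size_t capacity;
    } arena_t;

    int arena_init(arena_t *a, size_t capacity) {
        a->start = malloc(capacity);
        a->used = 0;
        a->capacity = capacity;
        return a->start != NULL;
    }

    void *arena_alloc(arena_t *a, size_t size) {
        size = (size + 15) & ~(size_t)15;   /* keep allocations 16-byte aligned */
        if (a->used + size > a->capacity)
            return NULL;                    /* caller decides: reset, or start a new arena */
        void *p = a->start + a->used;
        a->used += size;
        return p;
    }

    void arena_reset(arena_t *a) {          /* "blow the entire arena": no fragmentation left */
        a->used = 0;
    }

    void arena_free(arena_t *a) {
        free(a->start);
        a->start = NULL;
    }

The point is that reclaiming memory is an O(1) reset of the whole arena, so no matter what allocation pattern the client forces on us, nothing stays fragmented behind.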

But you actually need to build this kind of thing directly into the engine, and you also need to account for it operationally. When a customer calls with “why is the memory usage going up?”, you need some way to inspect the system and figure out what to do about it. Note that we aren’t talking about memory leaks; we are talking about when everything works properly, just not in the ideal manner.

Memory is just one aspect of that, if one that is easy to look at. Another thing you need to watch for is anything whose cost grows with how long the process has been running. For example, if you have a large LRU cache, you need to make sure that after a couple of months of running, pruning that cache isn’t an O(N) job that runs every 5 minutes, never finds anything to prune, but costs a lot of CPU time. The number of file handles is also a good indication of a problem in some cases; some database engines have a lot of files open (typically LSM ones), and those can accumulate over time until the server runs out of them.
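As a small, Linux-only sketch of that kind of self-monitoring (not taken from any particular engine), a process can count its own open file descriptors by listing /proc/self/fd and report the number alongside its other stats:

    #include <dirent.h>
    #include <stdio.h>

    /* Count the file descriptors currently open in this process (Linux).
       Watching this number trend upward over weeks is a cheap early warning. */
    static int count_open_fds(void) {
        DIR *d = opendir("/proc/self/fd");
        if (d == NULL)
            return -1;
        int count = 0;
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (e->d_name[0] != '.')
                count++;
        }
        closedir(d);
        return count - 1;   /* exclude the descriptor used by opendir itself */
    }

    int main(void) {
        printf("open file descriptors: %d\n", count_open_fds());
        return 0;
    }

That is a much cheaper signal than waiting for the “too many open files” error to show up in production.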

Part of the job of the database engine is to consider not only what is going on now, but how to deal with (sometimes literally) abusive clients that try to do very strange things, and how to handle them. In one particular case, a customer used a feature that was designed for a maximum of a few dozen entries in a particular query to pass 70,000+ entries. The amazing thing is that this worked, but as you can imagine, all sorts of assumptions internal to that feature were very viciously violated, requiring us to consider whether to put a hard limit on the feature, so it stays within its design specs, or to see if we can redesign the entire thing so it can handle this kind of load.

And the most “fun” bugs are the ones that only show up after a couple of weeks of harsh production use. So even when you know what is causing the problem, actually reproducing the scenario (you need memory fragmented in a certain way, a certain number of cache entries, and the application requesting a certain load factor) can be incredibly hard.

The Guts n’ Glory of Database Internals: Backup, restore and the environment…

time to read 4 min | 684 words

A lot of the complexities involved in actually building a database engine aren’t related to the core features that you want to have. They are related to what looks like peripheral concerns. Backup / restore is an excellent example of that.

Obviously, you want your database engine to have support for backup and restore. But at the same time, actually implementing that (efficiently and easily) is not something that can be done trivially. Let us consider Redis as a good example: in order to back up its in-memory state, Redis will fork a new process and use the OS’s support for copy-on-write to get a stable snapshot of the in-memory state that it can then write to disk. It is a very elegant solution, with very little code, and it is done with the assistance of the operating system (almost always a good thing).
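A stripped-down sketch of that technique (not Redis’s actual code, and with error handling removed) looks like this; the parent keeps mutating its memory while the child sees a frozen, copy-on-write view of it:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* Toy "database": one in-memory buffer that the parent keeps mutating. */
    #define DB_SIZE (16 * 1024 * 1024)
    static char *db;

    /* Child process: write the snapshot to disk. Thanks to copy-on-write,
       it sees the memory exactly as it was at the moment of fork(). */
    static void write_snapshot(const char *path) {
        FILE *f = fopen(path, "wb");
        if (f == NULL)
            exit(1);
        fwrite(db, 1, DB_SIZE, f);
        fclose(f);
        exit(0);
    }

    int main(void) {
        db = calloc(1, DB_SIZE);
        pid_t pid = fork();
        if (pid == 0)
            write_snapshot("snapshot.bin");   /* child: persist the frozen view */
        /* Parent: keep accepting writes; each touched page now gets its own
           physical copy for as long as the child is alive. */
        for (int i = 0; i < DB_SIZE; i += 4096)
            db[i]++;
        waitpid(pid, NULL, 0);
        free(db);
        return 0;
    }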

It also exposes you to memory bloat if you are backing up to a slow disk (for example, a remote machine) at the same time that you have a lot of incoming writes. Because the OS will create a copy of every memory page that is touched for as long as the backup process is running (on its own copy of the data), the amount of extra memory being used is non trivial. This can lead to swapping, and in certain cases the OS can decide that it has run out of memory and just start killing random processes (most likely the actual Redis server, since it is the one consuming all this memory).

Another consideration is exactly what kind of work you have to do when you restore the data. Ideally, you want to be up and running as soon as possible. Given database sizes today, even reading the entire file can be prohibitively expensive, so you want to be able to read just enough to start doing useful work, and then complete the rest of the restore later (while being online). The admin will appreciate that much more than some sort of spinning circle or a progress bar measuring how long the system is going to be down.

The problem with implementing such features is that you need to consider the operating environment in which you are working. The ideal case is when you can control such behaviors, for example by having dedicated backup & restore commands that the admin will use exclusively. But in many cases, you have admins that do all sorts of things, from shutting down the database and zipping it to a shared folder on a nightly basis, to taking a snapshot, to just running a script with copy / rsync on the files on some schedule.

Some backup products have support for taking a snapshot of the disk state at a particular point in time, but this goes back to the issues we raised in a previous post about low level hooks. You need to be aware of the costs and implications of those things. In particular, most databases are pretty sensitive to the order in which you back up certain files. If you take a snapshot of the journal file at time T1, but the data file at time T2, you are likely to have data corruption when you restore (the data file contains data that isn’t in the journal, and there is no way to recover it).
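To illustrate the ordering point with a deliberately naive sketch (hypothetical file names, plain file copies rather than a real backup tool), a redo-style journal like the one described in this series wants the data file copied before the journal, so the journal copy is at least as new as the data file copy:

    #include <stdio.h>
    #include <stdlib.h>

    /* Naive file copy, for illustration only. */
    static void copy_file(const char *src, const char *dst) {
        FILE *in = fopen(src, "rb");
        FILE *out = fopen(dst, "wb");
        if (in == NULL || out == NULL)
            exit(1);
        char buf[64 * 1024];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            fwrite(buf, 1, n, out);
        fclose(in);
        fclose(out);
    }

    int main(void) {
        /* Data file first, journal last: recovery can then replay whatever
           the data file copy is missing. Reversing the order reproduces the
           T1 / T2 corruption described above. */
        copy_file("db.data",    "backup/db.data");
        copy_file("db.journal", "backup/db.journal");
        return 0;
    }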

The really bad thing about this is that it is pretty easy for this to mostly work, so even if you have a diligent admin who tests the restore, it might actually work, right up until it fails when you really need it.

And don’t get me started on cloud providers / virtual hosts that offer snapshots. The kind of snapshotting capabilities that a database requires are very specific (all changes that were committed to disk, in the order they were sent, without any future writes that might be in flight) in order to be able to successfully back up & restore. From experience, those are not the kind of promises that you get from those tools.

Fast transaction log: Windows

time to read 5 min | 806 words


In my previous post, I tested journal writing techniques on Linux. In this post, I want to do the same for Windows, and see what the impact of the various options is on system performance.

Windows has slightly different options than Linux. In particular, on Windows the various flags and promises are very clear, and it is quite easy to figure out what it is that you are supposed to do.

We have tested the following scenarios:

  • Doing buffered writes (pretty useless for any journal file, which needs to be reliable, but a good baseline metric).
  • Doing buffered writes and calling FlushFileBuffers after each transaction (a pretty common way to handle committing to disk in databases); this is the equivalent of calling fsync.
  • Using the FILE_FLAG_WRITE_THROUGH flag, asking the kernel to make sure that after every write, everything is flushed to disk. Note that the disk may or may not buffer things.
  • Using the FILE_FLAG_NO_BUFFERING flag to bypass the kernel’s caching and go directly to disk. This has special memory alignment considerations.
  • Using FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING to ensure that we don’t do any caching and actually force the disk to do its work. On Windows, this is guaranteed to ask the disk to flush to persistent medium (but the disk can ignore this request).

Here is the code:
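As a rough sketch of what such a benchmark can look like (this is not the exact program that produced the numbers below; the file name and structure are illustrative and error handling is trimmed), here is the FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING variant:

    #include <windows.h>
    #include <stdio.h>
    #include <string.h>

    #define TX_SIZE   (16 * 1024)   /* one simulated transaction                 */
    #define TX_COUNT  65535         /* number of commits, ~1 GB written in total */

    int main(void)
    {
        /* FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING: no OS caching, and
           the write completes only after it has been handed to the device. */
        HANDLE file = CreateFileA("journal.bin", GENERIC_WRITE, 0, NULL,
                                  CREATE_ALWAYS,
                                  FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING,
                                  NULL);
        if (file == INVALID_HANDLE_VALUE)
            return 1;

        /* FILE_FLAG_NO_BUFFERING requires sector aligned buffers and write sizes;
           VirtualAlloc returns page aligned memory, and 16 KB is sector aligned. */
        BYTE *buffer = (BYTE *)VirtualAlloc(NULL, TX_SIZE,
                                            MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        if (buffer == NULL)
            return 1;
        memset(buffer, 'x', TX_SIZE);

        DWORD start = GetTickCount();
        for (int i = 0; i < TX_COUNT; i++) {
            DWORD written;
            if (!WriteFile(file, buffer, TX_SIZE, &written, NULL))
                return 1;
            /* For the "Buffered + FlushFileBuffers" scenario, open the file
               without the flags above and call FlushFileBuffers(file) here. */
        }
        DWORD elapsed = GetTickCount() - start;
        printf("%d commits in %lu ms (%.3f ms per write)\n",
               TX_COUNT, (unsigned long)elapsed, (double)elapsed / TX_COUNT);

        VirtualFree(buffer, 0, MEM_RELEASE);
        CloseHandle(file);
        return 0;
    }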

We have tested this on an AWS machine (i2.2xlarge: 61 GB RAM, 8 cores, 2x 800 GB SSD drives, 1 GB/sec EBS), which was running Microsoft Windows Server 2012 R2 RTM 64-bit. The code was compiled for 64 bits with the default release configuration.

What we are doing is writing a 1 GB journal file, simulating 16 KB transactions, with 65,535 separate commits to disk. That is a lot of work that needs to be done.

First, again, I ran it on the system drive, to compare how it behaves:

Method                                              Time (ms)   Write cost (ms)
Buffered                                                  396             0.006
Buffered + FlushFileBuffers                           121,403             1.8
FILE_FLAG_WRITE_THROUGH                                58,376             0.89
FILE_FLAG_NO_BUFFERING                                 56,162             0.85
FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING       55,432             0.84

Remember, this is running on the system disk, not on the SSD drive. Here are the SSD numbers, which are much more interesting for us.

Method                                              Time (ms)   Write cost (ms)
Buffered                                                  410             0.006
Buffered + FlushFileBuffers                            21,077             0.321
FILE_FLAG_WRITE_THROUGH                                10,029             0.153
FILE_FLAG_NO_BUFFERING                                  8,491             0.129
FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING        8,378             0.127

And those numbers are very significant. Unlike the system disk, where we basically get whatever spare cycles we have, on both Linux and Windows the SSD disk provides really good performance. But even on identical machines, running nearly identical code, there are significant performance differences between them.

Let me draw it out for you:

Options                                                                  Windows (ms)   Linux (ms)   Difference
Buffered                                                                        0.006         0.03      80% Win
Buffered + fsync() / FlushFileBuffers()                                         0.32          0.35       9% Win
O_DSYNC / FILE_FLAG_WRITE_THROUGH                                               0.153         0.29      48% Win
O_DIRECT / FILE_FLAG_NO_BUFFERING                                               0.129         0.14       8% Win
O_DIRECT | O_DSYNC / FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING           0.127         0.31      60% Win

In pretty much all cases, Windows has been able to outperform Linux on this specific scenario, in many cases by a significant margin. In particular, in the scenario that I actually really care about, we see a 60% performance advantage for Windows.

One of the reasons for this blog post, and for the detailed code and scenario, is to get external verification of these numbers. I would love to learn that I missed something that would make Linux’s speed comparable to Windows, because right now this looks pretty miserable.

I do have a hunch about those numbers, though. SQL Server is a major business for Microsoft, so that team has a lot of pull in the company. And SQL Server uses FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING internally to handle its transaction log. Like quite a bit of the other Win32 APIs (WriteFileGather, for example), it looks tailor made for database journaling. I’m guessing that this code path has been gone over multiple times over the years, trying to optimize SQL Server by smoothing out anything in the way.

As a result, if you know what you are doing, you can get some really impressive numbers on Windows in this scenario. Oh, and just to quiet the nitpickers:

[screenshot]

Fast transaction log: Linux

time to read 5 min | 910 words

We were doing some perf testing recently, and we got some odd results when running a particular benchmark on Linux. So we decided to check this at a much deeper level.

We got an AWS machine (i2.2xlarge: 61 GB RAM, 8 cores, 2x 800 GB SSD drives, 1 GB/sec EBS, running Ubuntu 14.04 with kernel version 3.13.0-74-generic) and ran the following code. It works against a 1 GB file (preallocated), “committing” 65,536 (64K) transactions with 16 KB of data in each. The idea is that we are testing how fast we can write those transactions to the journal file, so that we can consider them committed.

We have tested the following scenarios:

  • Doing buffered writes (pretty useless for any journal file, which needs to be reliable, but a good baseline metric).
  • Doing buffered writes and calling fsync after each transaction (a pretty common way to handle committing to disk in databases).
  • Doing buffered writes and calling fdatasync (which is supposed to be slightly more efficient than fsync in this scenario).
  • Using the O_DSYNC flag, asking the kernel to make sure that after every write, everything is synced to disk.
  • Using the O_DIRECT flag to bypass the kernel’s caching and go directly to disk.
  • Using O_DIRECT | O_DSYNC to ensure that we don’t do any caching and actually force the disk to do its work.

The code is written in C, and it is intentionally pretty verbose and ugly. I apologize for how it looks, but the idea was to get some useful data out of this, not to generate beautiful code. It is quite probable that I made some mistake in writing it, which is part of why I’m talking about this publicly.

Here is the code, and the results of execution are below:
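Here is a rough sketch of the O_DIRECT | O_DSYNC variant of such a benchmark (not the exact program that produced the numbers below; the file name and structure are illustrative), mainly to show the alignment that O_DIRECT requires:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define TX_SIZE  (16 * 1024)   /* one simulated transaction        */
    #define TX_COUNT 65536         /* 64K commits => 1 GB journal file */

    int main(void) {
        /* O_DIRECT | O_DSYNC: bypass the page cache and make every write durable. */
        int fd = open("journal.bin", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires the buffer (and the write size) to be sector aligned. */
        void *buf;
        if (posix_memalign(&buf, 4096, TX_SIZE) != 0) return 1;
        memset(buf, 'x', TX_SIZE);

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < TX_COUNT; i++) {
            if (write(fd, buf, TX_SIZE) != TX_SIZE) { perror("write"); return 1; }
            /* For the "Buffered + fsync" scenario, drop the flags above and
               call fsync(fd) (or fdatasync(fd)) here instead. */
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ms = (end.tv_sec - start.tv_sec) * 1000.0 +
                    (end.tv_nsec - start.tv_nsec) / 1000000.0;
        printf("%d commits in %.0f ms (%.3f ms per write)\n", TX_COUNT, ms, ms / TX_COUNT);

        close(fd);
        free(buf);
        return 0;
    }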

It was compiled and run using: gcc journal.c -O3 -o run && ./run

The results are quite interesting:

Method                 Time (ms)   Write cost (ms)
Buffered                     525             0.03
Buffered + fsync          72,116             1.10
Buffered + fdatasync      56,227             0.85
O_DSYNC                   48,668             0.74
O_DIRECT                  47,065             0.71
O_DIRECT | O_DSYNC        47,877             0.73

The buffered call is useless for a journal, but important as a baseline to compare against. The rest of the options ensure that the data resides on disk* after the call to write, and are suitable for actually getting safety from the journal.

* The caveat here is the use of O_DIRECT; the docs (and Linus) seem to be pretty much against it, and there are few details on how it works with regard to instructing the disk to actually bypass all of its buffers. O_DIRECT | O_DSYNC seems to be the recommended option, but that probably deserves more investigation.

Of course, I had a big long discussion of the numbers planned. And then I discovered that I was actually running this on the boot disk, and not on one of the SSD drives. That was a face palm of epic proportions, and I was only saved from it by the problematic numbers that I was seeing.

Once I realized what was going on and fixed that, we got very different numbers.

Method                 Time (ms)   Write cost (ms)
Buffered                     522             0.03
Buffered + fsync          23,094             0.35
Buffered + fdatasync      23,022             0.35
O_DSYNC                   19,555             0.29
O_DIRECT                   9,371             0.14
O_DIRECT | O_DSYNC        20,595             0.31

There is about 10% difference between fsync and fdatasync when using the HDD, but there is barely any difference as far as the SSD is concerned. This is because the SSD can do random updates (such as updating both the data and the metadata) much faster, since it doesn’t need to move the spindle.

Of more interest to us is that O_DSYNC is significantly faster on both the SSD and the HDD; about 15% can be saved by using it.

But I’m just drooling over the O_DIRECT write profile performance on the SSD. That is so good; unfortunately, it isn’t safe to use on its own. We need both it and O_DSYNC to make sure that we only get confirmation on the write when it hits the physical disk, instead of the drive’s cache (that is why O_DIRECT alone is actually faster: we are writing directly to the disk’s cache, basically).

The tests were run using ext4. I played with some options (noatime, nodiratime, etc.), but there wasn’t any big difference between them that I could see.

In my next post, I’ll test the same on Windows.

The Guts n’ Glory of Database Internals: The communication protocol

time to read 4 min | 799 words

With all the focus on writing to disk and ensuring the high speed of the database, I almost missed a very important aspect of databases: how do we actually communicate with the outside world?

In many cases, there is no need for that. Many databases run embedded only (LevelDB, LMDB, etc.) or just assume that this is someone else’s problem, pretty much. But as it turns out, if you aren’t an embedded database, there is quite a bit that you need to get right in the network code. For the rest of this post, I’m going to assume that we are talking about a database that clients connect to over the network. It can be either locally (shared memory / pipes / UNIX sockets / TCP) or remotely (pretty much only TCP at this point).

The actual protocol format doesn’t really matter. I’m going to assume that there is some sort of streaming connection (think TCP socket) and leave it at that. But the details of how the database communicates are quite important.

In particular, what kind of communication do we have between client and server? Is each request independent of the others, are there conversations, what kind of guarantees do we give for the lifetime of the connection, etc.?

Relational databases use a session protocol, in which the client has a conversation with the server, composed of multiple separate request / reply exchanges. Sometimes you can tell the server “just process the requests as fast as you can, and send the replies as well, but don’t stop to wait between replies”, which helps, but this is probably the worst possible way to handle this. The problem is that you need a connection per client, and those conversations can be pretty long, so you hold on to the connection for a fairly long duration, and you have to wait for the next part of the conversation to know what to do next.

A better approach is to dictate that all operations are strictly one way, so once the server has finished processing a request, it can send the reply to the client and immediately move on to process the next request (from the same client or another one). In this case, we have several options: we can pipeline multiple requests on the same channel (allowing out of order replies by matching replies to requests), or we can use the serial nature of TCP connections and have a pool of connections that we can share among clients.

An even better method is if you can combine the replies to multiple requests into a single reply from the server. This can be useful if you know that the shape of all the commands is going to be the same (for example, if you are asked to insert a million records, you can process them as soon as possible, and only confirm that you accepted them every ten thousand or so). It may sound silly, but those kinds of things really add up, because the client can just pour the data in and the server will process it as soon as it can.
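Here is a sketch of that batched acknowledgment idea on the server side, assuming a made-up wire format of length-prefixed records over a plain TCP socket and one ack byte per 10,000 records:

    #include <stdint.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Read exactly `len` bytes from the socket (handles short reads). */
    static int read_full(int fd, void *buf, size_t len) {
        char *p = buf;
        while (len > 0) {
            ssize_t n = read(fd, p, len);
            if (n <= 0) return -1;
            p += n; len -= n;
        }
        return 0;
    }

    /* Consume a stream of length-prefixed records and acknowledge in batches,
       so the client can keep pouring data in without waiting per record. */
    static void bulk_insert_loop(int client_fd) {
        char record[64 * 1024];
        uint32_t len;
        long since_last_ack = 0;
        while (read_full(client_fd, &len, sizeof len) == 0 && len <= sizeof record) {
            if (read_full(client_fd, record, len) != 0) break;
            /* ... apply the record to the storage engine here ... */
            if (++since_last_ack == 10000) {      /* confirm every 10,000 records */
                char ack = 1;
                write(client_fd, &ack, 1);
                since_last_ack = 0;
            }
        }
    }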

Other considerations that you have to take into account in the communication protocol are security and authentication. If you are allowing remote clients, then they need some way to authenticate. And that can be expensive (typically authentication requires multiple round trips between server & client to verify the credentials, and potentially an out of process check against something like an LDAP directory), so you want to authenticate once and then hold on to the same connection for as long as you can.

A more problematic issue comes up when talking about the content of your conversation with the database. You typically don’t want to send everything over plain text, so you need to encrypt the channel, which has its own costs. For fun, this means dealing with things like certificates, SSL and pretty complex stuff that very few people properly understand. If you can get away with letting something else do that for you, see if you can. Setting up a secured connection that cannot be easily bypassed (man in the middle, fake certs, just weak encryption, etc.) is non trivial, and should be left to experts in the field.

Personally, I like to just go with HTTPS for pretty much all of that, because it takes a lot of the complexity away, and it means that all the right choices have probably already been made for me. HTTPS, of course, is request/reply, but things like connection pooling, keep-alive and pipelining can do quite a lot to mitigate that at the protocol level. And something like WebSockets allows you to have most of the benefits of raw TCP connections, with a lot of the complexity handled for you.

The Guts n’ Glory of Database Internals: The enemy of thy database is…

time to read 4 min | 747 words

You might have noticed a theme here in the past few posts. In order to achieve good performance and ensure the stability of the system, most database engines have a pretty close relationship with the low level hardware and operating system internals.

You pretty much have to do that, because you want to be able to squeeze every last bit of performance out of the system. And it works, quite well, until people start lying to you. And by people, I mean all sorts of things.

The most obvious one is the hardware. If I ask the hardware “please make sure that this is on persistent medium”, and the hardware tells me “sure” but does no such thing, there is very little that the database engine can do. There are quite a lot of drives out there that flat out lie about this. Typically enterprise grade drives will not do that unless you have configured them to survive power loss (maybe they have a battery backup?), but I have seen production systems that were hard to deal with because of that.

Another common case is when administrators put the database files on a remote server. This can be because they have a shared storage setup, or are used to putting all their eggs in one basket, or maybe they already have backup scripts running in that location. Whatever the actual reason, it means that every I/O operation we do (already quite expensive) is now turned into a network call (which still needs to do I/O on the other side), which completely messes up the cost / benefit analysis the database does on when to actually call the hardware.

Sometimes you have storage attached directly to the server, with a high end connection that provides awesome I/O and allows you to stripe among multiple drives easily. Sometimes, that is the shared storage for the entire company, and you have to compete for I/O with everything else.

But by far the most nefarious enemy we have seen is Anti Virus software of various kinds. This problem is mostly on Windows, where admins will deploy an Anti Virus almost automatically, and set it to the most intrusive levels possible, but I have seen similar cases on Linux with various kernel extensions that interfere with how the system works. In particular, timing and contracts are sometimes (often) broken by such products, and because it is done at an extremely low level, the database engine typically has no idea that this happened, or how to recover. For fun, trying to figure out if an Anti Virus is installed (so you can warn the admin to configure it correctly) is one of the behaviors that most Anti Virus products will look for and flag you as a virus for.

Now, we have run into this with Anti Virus a lot, but the same applies to quite a lot of other things. Allowing an indexing service to scan the database files, or putting them in something like a Dropbox folder, or pretty much anything that interferes with how the data gets persisted to disk, will cause issues. And when that happens, it can be really tricky to figure out who is at fault.

Finally, and very common today, we have the cloud. The I/O rates in the cloud are typically metered, and in some clouds you get I/O rates that you would expect from a bad hard disk 10 years ago. What is worse, because the cloud environment is often shared, you are very vulnerable to noisy neighbors. And that means that two identical I/O requests will complete, the first in 25ms and the second in 5,000 ms (not a mistake, that is 5 seconds!). Same file, same system, same size of request, same everything, spaced two seconds apart, and you hit something, and your entire performance work goes down the drain. You can get reserved IOPS, which helps, but you need to check what you are getting. On some clouds you get concurrent IOPS, which is nice, but the cost of serial IOPS (critical for things like journals) remains roughly the same. This is especially true if you need to do unbuffered I/O, or use fsync() on those systems.

We have actually had to add features to the product to measure I/O rates independently, so we can put the blame where it belongs (“your drive gives me 6 MB/sec on our standard load, this is the maximum performance I can give in this situation”).
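A bare-bones version of that kind of independent measurement (a sketch, not the actual feature) just writes a fixed amount of data, fsyncs, and reports the effective throughput, so the conversation with the admin can start from a number:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    /* Write `total_mb` megabytes in 1 MB chunks, fsync at the end, report MB/sec. */
    static double measure_write_rate(const char *path, int total_mb) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1.0;
        static char chunk[1024 * 1024];
        memset(chunk, 'x', sizeof chunk);

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < total_mb; i++)
            if (write(fd, chunk, sizeof chunk) != (ssize_t)sizeof chunk) {
                close(fd);
                return -1.0;
            }
        fsync(fd);   /* don't let the page cache flatter the number */
        clock_gettime(CLOCK_MONOTONIC, &end);
        close(fd);

        double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
        return total_mb / secs;
    }

    int main(void) {
        printf("sequential write rate: %.1f MB/sec\n",
               measure_write_rate("io-probe.tmp", 256));
        return 0;
    }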

The Guts n’ Glory of Database Internals: Writing to a data file

time to read 3 min | 556 words

So, we now know how we can write to a journal file efficiently, but a large part of doing that relies on the fact that we are never actually going to read from the journal file. In other words, this is like keeping your receipts in case of an audit. You do it because you have to, but you really don’t want to ever need them, so you just throw them into a drawer in the most efficient way possible and will sort them out if you ever need to.

In most database engines that implement a journal, there is this distinction: the journal is strictly for durability and recovery, and the data file(s) are used to actually store the data in order to operate. In our case, we’ll assume a single journal and a single data file.

On every commit, we’ll write to the journal file, as previously discussed, and we ensure that the data is safely on the disk. But what happens to writes to the data file?

Simple, absolutely nothing. Whenever we need to write to the data file, we make buffered writes, which go into the big chain of buffers that merge / reorder and try to optimize our I/O. Which is great, because we don’t really need to care about that. We can let the operating system handle all of it.

Up to a point, of course.

Every now and then, we’ll issue an fsync on the data file, forcing the file system to actually flush all those buffered writes to disk. We can do this in an async manner and continue operating on the file. By the time the fsync is done (which can take a lot of time, if we had a lot of writes and a busy server), we know the minimum amount of data that has already been written to the data file and persisted on disk. Since we can match that up to a position in the journal, we can safely say that the next time we recover, we can start reading the journal from that location.
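Here is a sketch of that checkpoint dance (the file layout and names are made up for illustration): the data file writes stay buffered, and only after the fsync completes do we durably record the journal position that the data file is known to cover:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Journal position (in bytes) that the last completed data-file fsync covers.
       On recovery we only need to replay the journal from this offset onward. */
    static void checkpoint(int data_fd, uint64_t journal_pos_at_sync_start) {
        /* 1. Flush all the buffered data-file writes the OS has accumulated. */
        if (fsync(data_fd) != 0)
            return;   /* keep the old checkpoint, try again later */

        /* 2. Only now is it safe to claim that the data file covers the journal
              up to the position captured *before* the fsync started. */
        FILE *f = fopen("checkpoint.pos", "w");
        if (f == NULL)
            return;
        fprintf(f, "%llu\n", (unsigned long long)journal_pos_at_sync_start);
        fflush(f);
        fsync(fileno(f));   /* the checkpoint record itself must be durable */
        fclose(f);
    }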

If additional writes, from later in the journal file, ended up physically in the data file, it doesn’t matter, because they will simply be overwritten when we replay the journal entries that we have.

Doing it this way allows us to generate large batches of I/O and, in most cases, gives the operating system the freedom to flush things from its buffers on its own timeline. We just make sure this doesn’t degenerate into a case where we’d need to read tens of GB of journal files on recovery, by forcing a sync every now and then, so recovery is fast in nearly all cases.

All of this I/O tends to happen in an async thread, and typical deployments will have separate volumes for journals and data files, so we can parallelize everything instead of having them compete with one another.

By the way, I’m writing this series based on my own experience in building databases, and I’m trying to simplify it as much as it is possible to simplify such a complex topic. If you have specific questions / topics you would like me to cover, I’ll be happy to take them.

The Guts n’ Glory of Database Internals: Getting durable, faster

time to read 3 min | 464 words

I mentioned that fsync is a pretty expensive operation. But it is pretty much required if you need proper durability in the case of a power loss. Most database systems tend to just implement fsync and get away with that, with various backoff strategies to avoid its cost.

LevelDB by default will not fsync, for example. Instead, it relies on the operating system to handle writes and sync them to disk, and you have to take explicit action to tell it to actually sync the journal file to disk. And most databases give you some level of choice in how you call fsync (MySQL and PostgreSQL, for example, allow you to select fsync, O_DSYNC, none, etc.). MongoDB (using WiredTiger) only flushes to disk every 100MB (or 2 GB, the docs are confusing), dramatically reducing the cost of flushing, at the expense of potentially losing data.

Personally, I find such choices strange, and we had a direct goal that after every commit, pulling the plug would have no effect on the database. We started out using fsync (and its family of friends: fdatasync, FlushFileBuffers, etc.) and quickly realized that this wasn’t going to be sustainable; we could only achieve nice performance by grouping multiple concurrent transactions and getting them to the disk in one shot (effectively, buffering ourselves). Looking at the state of other databases, it was pretty depressing.

In an internal benchmark we did, we were in 2nd place, ahead of pretty much everything else. The problem was that the database engine that was ahead of us was faster by 40 times. You read that right, it was forty times faster than we were. And that sucked. Monitoring what it did showed that it didn’t bother to call fsync; instead, it used direct unbuffered I/O (FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH on Windows). Those flags have very strict usage rules (specific alignment for both the memory and the position in the file), but the good thing about them is that they allow us to send the data directly from user memory all the way to the disk while bypassing all the caches. That means that when we write a few KB, we write a few KB; we don’t need to wait for the entire disk cache to be flushed to disk.
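The alignment rules those flags impose boil down to rounding every write size and file offset up to a multiple of the sector size (and allocating sector-aligned buffers); here is a tiny sketch of the kind of helper that ends up everywhere in such code (the sector size and payload are just illustrative values):

    #include <stdint.h>
    #include <stdio.h>

    /* FILE_FLAG_NO_BUFFERING / O_DIRECT style writes must be sector aligned:
       buffer address, write size and file offset all rounded to the sector size. */
    static uint64_t align_up(uint64_t value, uint64_t alignment) {
        return (value + alignment - 1) & ~(alignment - 1);   /* alignment: power of two */
    }

    int main(void) {
        const uint64_t sector = 4096;          /* assumed physical sector size     */
        uint64_t tx_size = 10 * 1024 + 357;    /* an arbitrary transaction payload */
        printf("payload %llu bytes -> write %llu bytes\n",
               (unsigned long long)tx_size,
               (unsigned long long)align_up(tx_size, sector));
        return 0;
    }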

That gave us a tremendous boost. Another thing we did was to compress the data that we write to the journal, to reduce the amount of I/O, and again, preallocation and writing in a sequential manner help, quite a lot.

Note that in this post I’m only talking about writing to the journal, since that is typically what slows down writes. In my next post, I’ll talk about writes to the data file itself.

The cost of the async state machine

time to read 3 min | 483 words

In my previous post, I talked about the high performance cost of the async state machine in certain use cases. I decided to test this out using the following benchmark code.

Here we have two identical pieces of code, sending 16 MB over the network (actually over the loopback device, which is important, and we’ll look into exactly why later).

One of them is using the synchronous, blocking API, and the second is using the async API. Besides the API differences, they both do the exact same thing: send 16MB over the network.

  • Sync - 0.055 seconds
  • Async - 0.380 seconds

So the async version is about 7 times worse than the sync version. And when I try this out with a 256MB test, the results are roughly the same.

  • Sync - 0.821 seconds
  • Async - 5.669 seconds

The reason this is the case is likely related to the fact that we are using the loopback device: in this setup, we probably almost never block in the sync case, since we almost always have the data available for us to read. In the async case, we have to pay for the async machinery, but we never actually get to yield the thread.

So I decided to test what would happen when we introduce some additional I/O latency. It was simplest to write the following:

And then I piped both versions through the same proxy, which gave the following results for 16MB:

  • Sync - 29.849 seconds
  • Async - 29.900 seconds

In this case, we can see that the sync benefit has effectively been eliminated. The cost of the async state machine is swallowed by the cost of the blocking waits. But unlike the sync version, when the async state machine is yielding, something else can run on the thread.

So it is all good, right?

Not quite. As it turns out, there are certain specific scenarios where you actually want to run a single, long, important operation, and you can set things up so the TCP connection will (almost) always have buffered data in it. This is really common when you send a lot of data from one machine to the other. Not a short request, but something where you can get big buffers and assume that the common setup is a steady state of consuming the data as fast as it comes. In RavenDB’s case, for example, one such very common scenario is the bulk insert mode. The client is able to send the data to the server fast enough, in almost all cases, that by the time the server processes a batch, the next batch is already buffered and ready.

And in that specific case, I don’t want to be giving up my thread for something else to run on. I want this particular process to be as fast as possible, potentially at the expense of much else.

This is likely to have some impact on our design, once we have the chance to run more tests.
