reDifferent I/O Access Methods for Linux
The “Different I/O Access Methods for Linux, What We Chose for Scylla, and Why” is quite fascinating. It is a pleasure to be able to read in depth into another database implementation strategy and design decisions. In particular where they don’t match what we are doing, because we can learn from the differences.
The article is good both in terms of discussing the general I/O approaches on Linux in general and the design decisions and implications for ScyllaDB in particular.
ScyllaDB chose to use AIO/DIO (async direct I/O) using a dedicated library called Seastar. And they are effectively managing their own memory and I/O scheduling internally, skipping the kernel entirely. Given how important I/O is for a database, I can say that I strongly resonate with this approach, and it is something that we have tried for a while with RavenDB. That included paying careful attention to how we are sending I/O, controlling our own caching, etc.
We are in a different position from Scylla, of course, since we are using managed code, which introduced a different set of problems (and benefits), but during the design of RavenDB 4.0, we chose to go in a very different direction.
But first, let me show you the single statement in the post that caused me to write this blog post:
The great advantage of letting the kernel control caching is that great effort has been invested by the kernel developers over many decades into tuning the algorithms used by the cache. Those algorithms are used by thousands of different applications and are generally effective. The disadvantage, however, is that these algorithms are general-purpose and not tuned to the application. The kernel must guess how the application will behave next, and even if the application knows differently, it usually has no way to help the kernel guess correctly.
This statement is absolutely correct. The kernel caching algorithms (and the kernel behavior in general) is tailored to suit a generic set of requirements, which means that if you deviate from the way the kernel expects, it is going to be during extra work and you can experience significant problems (performance and otherwise).
Another great resource that I want to point you to is the following article: “You’re Doing It Wrong” which had a major effect on the way RavenDB 4.0 is designed.
What we did, basically, is to look at how the kernel is doing things, and then see how we can fit our own behavior to what the kernel is expecting. Actually, I’m lying here, because there is no one kernel here. RavenDB runs on Windows, Linux and OSX (Darwin kernel). So we have three very different systems with wildly different optimizations that we need to run optimally on. Actually, to be fair, we consider Windows & Linux as the main targets for deployment, but that still give us very different stacks to work on.
The key here was to be predictable in all things and be sure that whatever operations we make, the kernel can successfully predict them. This can be a lot of work, but not something that you’ll usually see in the code. It involves laying out the data so it is nearby in the file and ensuring that we have hotspots that the kernel can recognize and optimize for us, etc. And it involves a lot of work with the guts of the system to make sure that we match what the kernel expects.
For example, consider this statement from the article:
…application-level caching allows us to cache not only the data read from disk but also the work that went into merging data from multiple files into a single cache item.
This is a good example of the different behavior. ScyllaDB is using LSM model, which means that in order to read data, they typically need to lookup in multiple files. RavenDB uses a different model (B+Tree with MVCC) which typically means that store all the data in a single file. Furthermore, the way we store the information, we can access it directly via memory map without doing any work to prepare it ahead of time. That means that we can lean entirely on the page cache and gain all the benefits thereof.
The ScyllaDB approach also limits them to running only on Linux and only on specific configurations. Because they rely on async direct I/O, which is a… fiddly beast in Linux at the best of times, you need to make sure that everything matches just so in order to be able to get it working. Just running this on a stock Ubuntu won’t work, since ext4 will block for many operations. Another problem in my view is that this assumes that they are the only player on the machine. If you need to run with additional software on the machine, that can cause fights over resources. For production, this is less of a problem, but for running on a developer machine, that is frequently something that you need to take into account. The kernel will already do that for you (which is useful even in production when people put SQL Server & RavenDB on the same box) so you don’t have to worry about it too much. I’m not sure that this concern is even valid for ScyllaDB, since they tend to be deployed in clusters of dedicated machines (or at least Docker instances) so they have better control over the environment. That certainly make it easier if you can dictate such things.
Another consideration for the RavenDB approach is that we want to be, as much as possible, friendly to the administrator. When we lean on what the kernel does, we usually get that for free. The administrator can usually dictate policies to the kernel and have it follow them, and good sys admins both know how and know when to do that. On the other hand, if we wrote it all ourselves, we would also need to provide the hooks to modify the behavior (and monitor it, and train users in how it works, etc).
Finally, it is not a small thing to remember that if you let the kernel cache your data, that means that that cache is still around if you restart the database (but not the machine), which means that your mostly alleviate the issue of slow cold start if you needed to do things like update configuration or the database binaries.
More posts in "re" series:
- (06 Sep 2018) Summary
- (05 Sep 2018) When the data hits the disk
- (04 Sep 2018) Reading data from disk
- (03 Sep 2018) The hash structure
- (31 Aug 2018) Working with the file system
- (30 Aug 2018) Digging into the C++ impl
- (29 Aug 2018) Let’s check these numbers
- (28 Aug 2018) Let’s start with managed code
- (27 Aug 2018) Reading the paper