Ayende @ Rahien


Challenge: The race condition in the TCP stack

time to read 3 min | 463 words

Occasionally, one of our tests hangs. Everything seems to be hunky dory, but it just freezes and does not complete. This is a new piece of code, and thus it is suspect unless proven otherwise, but an exhaustive review of it looked fine. It took over two days of effort to narrow it down, but eventually we managed to point the finger directly at this line of code:

[image: the line of code in question]

In certain cases, this line would simply not read anything on the server, even though the client most definitely sent the data. Now, given that TCP is being used, dropped packets might be expected. But we are actually testing on the loopback device, which I expect to be reliable.

We spent a lot of time investigating this, ending up with a very high degree of certainty that the problem was in the TCP stack somewhere. Somehow, on the loopback device, we were losing packets. Not always, and not consistently, but we were absolutely losing packets, which led the server to wait indefinitely for the client to send the message it already did.

Now, I’m as arrogant as the next developer, but even I don’t think I found that big a bug in TCP. I’m pretty sure that if it was this broken, I would have known about it. Besides, TCP is supposed to retransmit lost packets, so even if there were lost packets on the loopback device, we should have recovered from that.

Trying to figure out what was going on there sucked. It is hard to watch packets on the loopback device in Wireshark, and tracing just told me that a message was sent from the client to the server, but the server never got it.

But we continued, and we ended up with a small reproduction of the issue. Here is the code, and my comments are below:

This code is pretty simple. It starts a TCP server, listens for a connection, and then reads from and writes to the client. Nothing much here, I think you’ll agree.

If you run it, however, it will mostly work, except that sometimes (anywhere between 10 runs and 500 runs on my machine), it will just hang. I’ll save you some time and let you know that there are no dropped packets; TCP is working properly in this case. But the code just doesn’t. What is frustrating is that it mostly works; it takes a lot of effort to actually get it to fail.

Can you spot the bug? I’ll continue discussion of this in my next post.

Reducing allocations and resource usage when using Task.Delay

time to read 2 min | 399 words

In my previous posts, I talked about tri state waiting, which included the following line:

[image: the line using Task.Delay]

And then I did a deep dive into how timers on the CLR are implemented. Taken together, this presents me with something of a problem. What is the cost of calling Task.Delay? Luckily, the code is there, so we can check.

The relevant costs are here. We allocate a new DelayPromise, and a Timer instance. As we previously saw, actually creating a Timer instance will take a global lock, and the cost of actually firing those timers is proportional to the number of timers that we have.

Let’s consider the scenario for the code above. We want to support a very large number of clients, and we typically expect them to be idle, so every second we’ll have the delay fire, we send them a heartbeat, and go on with the waiting. What will happen in such a case for the code above?

All of those async connections are going to be allocating several objects per run, and each of them is going to contend on the same lock. All of them are also going to run a lot more slowly than they should.

In order to resolve this issue, given what we know now about the timer system on the CLR, we can write the following code:
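The original code block is missing from this copy. The sketch below is my reconstruction of the idea described next: a single shared static timer and one task completion per tick. The HeartbeatTimer name (and everything else here) is a placeholder of mine, not the original code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// A single static timer that all connections share. Instead of each
// connection allocating its own Task.Delay (DelayPromise + Timer + the
// global timer lock), everyone awaits the same task, which is swapped
// out and completed once per second.
public static class HeartbeatTimer
{
    private static TaskCompletionSource<object> _tcs =
        new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);

    private static readonly Timer _timer = new Timer(_ =>
    {
        // Swap in a fresh task, then complete the old one, waking
        // every connection that was waiting on it.
        var old = Interlocked.Exchange(ref _tcs,
            new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously));
        old.TrySetResult(null);
    }, null, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(1));

    // All callers get the same task instance until the next tick.
    public static Task NextTick => _tcs.Task;
}

// Usage inside a connection loop: instead of `await Task.Delay(1000)`,
// a connection would do something like:
//     await HeartbeatTimer.NextTick;
//     SendHeartbeat();
```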

In this case, we have a single static timer (so there is no lock contention per call), and we have a single task allocation per second. All of the connections will use the same instance. In other words, if there are 10,000 connections, we just saved 20K allocations per second, as well as 10K lock contentions, not to mention that instead of waking up every single connection one at a time, a single task instance is completed once, resulting in all the timed-out tasks being woken up together.

This code does have the disadvantage of allocating a task every second, even if no one is listening. It is a pretty small price to pay, and fixing it can be left as an exercise for the reader.

How timers work in the CLR

time to read 3 min | 490 words

One of the coolest things about the CoreCLR being open sourced is that I can trawl through the source code and read random parts of the framework. One of the reasons to do this is to be able to understand the implementation concerns, not just the design, which then allows us to produce much better output.

In this case, I was investigating a hunch, and I found myself deep inside the code that runs the timers in .NET. The relevant code is here, and it is clearly commented as well as quite nice to read.

I’m just going to summarize a few interesting things I found in the code.

There is actually only a single real timer for the entire .NET process. I started out thinking this is handled via CreateTimerQueueTimer on Windows, but I couldn’t find a Linux implementation. Reading the code, the CLR actually implements this directly. Simplified, it does the following:

[image: simplified sketch of the native timer loop]
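Since the screenshot isn’t available here, this is my own rough, compilable paraphrase of what the native loop does. The helper names are placeholders, not actual CoreCLR functions.

```csharp
using System.Threading;

// Grossly simplified paraphrase (not the actual CoreCLR code) of what the
// single native timer thread does: sleep until the earliest due time, then
// hand off to the managed timer code on the thread pool.
static class NativeTimerLoopSketch
{
    public static void Run()
    {
        while (true)
        {
            // The due time is tracked in milliseconds as an int, which is
            // why very long timers run into int32 range issues.
            int dueInMs = GetShortestTimerDueTimeInMs(); // hypothetical helper
            Thread.Sleep(dueInMs);

            // The callbacks themselves are not run on this thread; the
            // managed side is asked (via the thread pool) to fire whatever
            // timers are now due.
            ThreadPool.QueueUserWorkItem(_ => FireManagedTimers()); // hypothetical helper
        }
    }

    static int GetShortestTimerDueTimeInMs() => 1000; // placeholder
    static void FireManagedTimers() { }               // placeholder
}
```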

This has some interesting implications. It means that timers are all going to be fired from the same thread, at the same time (not quite true, see below), and that there is likely going to be a problem with very long timers (a timer for three months from now will overflow int32, for example).

The list of timers is held in a linked list, and every time it is awakened, it runs through the list, finding the timer to trigger, and the next time to be triggered. The code in this code path is called with only a single timer, which is then used in the managed code for actually implementing the managed timers. It is important to note that actually running the timer callback is done by queuing that on the thread pool, not executing it on the timer thread.

On the managed side, there are some interesting comments explaining the expected usage and data structures used. There are two common cases, one is the use of timeout, in which case this is typically discarded before actual use, and the other is having the recurring timers, which tend to happen once in a long while. So the code favors adding / removing timers over actually finding which need to be executed.

Another thing to note is that this adding / removing / changing / iterating over timers is protected by a single lock. Every time the unmanaged timer wakes, it queues the callback on the thread pool, and then FireNextTimers is called, which takes the lock, iterates over all the timers, and queues those that are due to be executed on the thread pool.
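To illustrate the shape of this, here is a much simplified, illustrative version, not the actual TimerQueue code: one lock guarding the timer list, with due callbacks pushed to the thread pool.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Illustrative simplification: one lock guards the list of registered
// timers; firing walks the whole list and queues the due callbacks on the
// thread pool instead of running them on the timer thread.
class SimplifiedTimerQueue
{
    private readonly object _lock = new object();
    private readonly LinkedList<(DateTime due, Action callback)> _timers =
        new LinkedList<(DateTime due, Action callback)>();

    public void Add(TimeSpan dueIn, Action callback)
    {
        lock (_lock) // adding / removing / firing all contend on this lock
        {
            _timers.AddLast((DateTime.UtcNow + dueIn, callback));
        }
    }

    // Called when the single native timer wakes up.
    public void FireNextTimers()
    {
        lock (_lock)
        {
            var node = _timers.First;
            while (node != null)
            {
                var next = node.Next;
                if (node.Value.due <= DateTime.UtcNow)
                {
                    var cb = node.Value.callback;
                    // The callback runs on the thread pool, not here.
                    ThreadPool.QueueUserWorkItem(_ => cb());
                    _timers.Remove(node);
                }
                node = next;
            }
        }
    }
}
```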

This behavior is interesting, because it has some impact on commonly used cases. But I’ll discuss that in my next post.

The cadence of tests speed

time to read 3 min | 402 words

RavenDB has quite a few tests. Over five thousand of them, at last count. They test common things (like saving a document works) and esoteric things (like spatial query behavior near the equator). They are important, but they also take quite a lot of time to run. In our current setup, running the full test suite can take 3 – 6 hours, depending on which build agent is running and what configuration we have.

This means that in terms of actually running the tests, developers can’t really get good feedback. We have taken some steps to address that, by creating a core set of tests that will examine the major and most important features, fast, which developers can run in reasonable time as part of their work, and then we let the build server sort out the pull requests on its own time.

For RavenDB 4.0, we decided to go in a different direction. We split the tests into two separate projects: FastTests and SlowTests. The requirement for the FastTests is that they will run in under 1 minute. In order to do that we use xunit’s parallel test support, which allows us to run concurrent tests (this actually required a lot of changes in the test code, which assumed that we are running in a serial fashion and that each test is the sole owner of the entire system).

Part of making sure that we are running in under 1 minute is knowing when we breached that. Because the tests run concurrently, it is a bit difficult to know what is taking so long. Thankfully, it is easy to get a report out. We use the following script to do that.
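The script itself didn’t survive this copy. A hypothetical equivalent, assuming xUnit v2’s XML results file as input, could be as simple as the following sketch.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical stand-in for the missing script: parse an xUnit v2 results
// file and print the slowest tests first.
class SlowTestsReport
{
    static void Main(string[] args)
    {
        var doc = XDocument.Load(args.Length > 0 ? args[0] : "results.xml");

        var slowest = doc.Descendants("test")
            .Select(t => new
            {
                Name = (string)t.Attribute("name"),
                Seconds = (double)t.Attribute("time")
            })
            .OrderByDescending(t => t.Seconds)
            .Take(20);

        foreach (var test in slowest)
            Console.WriteLine($"{test.Seconds,8:F2}s  {test.Name}");
    }
}
```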

We now have the top slowest tests (some of them take tens of seconds to run, but we don’t usually care, as long as other tests run in parallel and the whole suite is fast), and we can figure out whether we can optimize the test or move it to the SlowTests project.

Another change we made was to move a lot of the test code to use the async API, which allows us to run more tests concurrently without putting too much of a burden on the system. Our current test times range between 30 and 55 seconds, which is fast enough to run them at pretty much the drop of a hat.

Tri state waiting with async TCP streams

time to read 4 min | 714 words

We recently had the need to develop a feature that requires a client to hold a connection to the server and listen to a certain event. Imagine that we are talking about a new document arriving to the database.

This led to a very simple design:

  • Open a TCP connection and let the server know about which IDs you care about.
  • Wait for any of those IDs to change.
  • Let the client know about it.

Effectively, it was:
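The original snippet is missing from this copy; conceptually, the naive approach amounted to something like the sketch below, with one connection per watched ID. The host, port, and message format are placeholders of mine.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Threading.Tasks;

// Sketch of the naive design: one TCP connection per watched document ID.
static class NaiveWatcher
{
    public static async Task WatchAsync(string host, int port, string docId, Action<string> onChanged)
    {
        using var client = new TcpClient();
        await client.ConnectAsync(host, port);
        using var stream = client.GetStream();
        using var writer = new StreamWriter(stream) { AutoFlush = true };
        using var reader = new StreamReader(stream);

        // Tell the server which ID we care about...
        await writer.WriteLineAsync(docId);

        // ...and block until the server tells us it changed.
        var notification = await reader.ReadLineAsync();
        onChanged(notification);
    }
}
```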

Unfortunately, this simple design didn’t quite work. As it turns out, having a dedicated TCP connection per id is very expensive, so we would like to be able to use a single TCP connection in order to watch multiple documents. And we don’t know about all of them upfront, so we need to find a way to talk to the server throughout the process. Another issue that we have is the problem of steady state. If none of the documents we care about actually change for a long time, there is nothing going on over the network. This is going to lead the TCP connection to fail with a timeout.

Actually, a TCP connection that passes no packets is something that is perfectly fine in theory, but the problem is that it requires resources that need to be maintained. As long as you have systems that are not busy, it will probably be fine, but the moment it reaches the real world, the proxies / firewalls / network appliances along the way use a very brutal policy, “if I’m not seeing packets, I’m not maintaining the connection”, and it will be dropped, usually without even a RST packet. That makes debugging this sort of thing interesting.

So our new requirements are:

  • Continue to accept IDs to watch throughout the lifetime of the connection.
  • Let the client know of any changes.
  • Make sure that we send the occasional heartbeat to keep the TCP connection alive.

This is much more fun to write, to be honest, and the code we ended up with was pretty. Here it is:
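The code block didn’t make it into this copy. The sketch below is my reconstruction of the shape described in the notes that follow; all names and the wire format are placeholders, not the original code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Sketch of the three-state wait loop described below.
class Watcher
{
    private readonly StreamReader _reader;     // reads new IDs from the client
    private readonly StreamWriter _writer;     // sends notifications / heartbeats
    private readonly ConcurrentQueue<string> _changedDocs = new ConcurrentQueue<string>();
    private TaskCompletionSource<object> _documentChanged = NewTcs();

    public Watcher(Stream stream)
    {
        _reader = new StreamReader(stream);
        _writer = new StreamWriter(stream) { AutoFlush = true };
    }

    private static TaskCompletionSource<object> NewTcs() =>
        new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);

    // Called by the notification pipeline: enqueue first, then complete the task.
    public void OnDocumentChanged(string docId)
    {
        _changedDocs.Enqueue(docId);
        _documentChanged.TrySetResult(null);
    }

    public async Task RunAsync()
    {
        var idsToWatch = new HashSet<string>();
        var readLine = _reader.ReadLineAsync(); // started, but not awaited yet

        while (true)
        {
            var changed = _documentChanged.Task;
            var timeout = Task.Delay(TimeSpan.FromSeconds(5));
            var completed = await Task.WhenAny(changed, readLine, timeout);

            if (completed == changed)
            {
                // Replace the task first, then drain the queue, so we can't miss anything.
                _documentChanged = NewTcs();
                while (_changedDocs.TryDequeue(out var docId))
                {
                    if (idsToWatch.Contains(docId))
                        await _writer.WriteLineAsync("changed: " + docId);
                }
            }
            else if (completed == readLine)
            {
                // New ID from the client; copy the set to avoid threading issues,
                // then start the next read for the following iteration.
                var newId = await readLine;
                if (newId == null)
                    return; // client disconnected
                idsToWatch = new HashSet<string>(idsToWatch) { newId };
                readLine = _reader.ReadLineAsync();
            }
            else
            {
                // Nothing happened for a while; keep the connection alive.
                await _writer.WriteLineAsync("heartbeat");
            }
        }
    }
}
```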

There are a couple of things to note here. We are starting an async read operation from the TCP stream without waiting for it to complete, and then we go into a loop and wait for one of three options.

  1. A document that we watched has been changed (we are notified about that by the documentChanged task completion), in which case we notify the user. Note that we first replace the documentChanged task and then we drain the queue of all pending document changes for this collection, after which we’ll go back to waiting. On the doc changed event, we first enqueue the document that was changed, and then complete the task. This ensures that we won’t miss anything.
  2. New data is available from the client. In this case we read it and add it to the IDs we are watching, while starting another async read operation (for which we’ll wait on the next loop iteration). I’m creating a new instance of the IDs collection here to avoid threading issues, and also because the number of items is likely to be very small and rarely change. If there were a lot of changes, I would probably go with a concurrent data structure, but I don’t think it is warranted at this time.
  3. Simple timeout.

Then, based on which task has been completed, we select the appropriate behavior (send message to client, accept new doc ID to watch, send heartbeat, etc).

The nice thing about this code is that errors are also handled quite nicely. If the client disconnects, we will get an error from the read, and know that it happened and exit gracefully (well, we might be getting that just when we are writing data to the client, but that is pretty much the same thing in terms of our error handling).

The really nice thing about this code is that for the common cases, where there isn’t actually anything for us to do except maintain the TCP connection, this code is almost never in runnable state, and we can support a very large number of clients with very few resources.

The Guts n’ Glory of Database Internals: What the disk can do for you

time to read 2 min | 369 words

I’m currently in the process of getting some benchmark numbers for a process we have, and I was watching some metrics along the way. I have mentioned that a disk’s speed can be affected by quite a lot of things. So here are two metrics, taken about 1 minute apart in the same benchmark.

This is using a Samsung PM871 512GB SSD drive, and it is currently running on a laptop, so not the best drive in the world, but certainly a respectable one.

Here is the steady state operation while we are doing a lot of write work. Note that the response time is very high, in computer terms, forever and a half:

[image: disk metrics during steady-state writes]

And here is the same operation, but now we need to do some cleanup and push more data to the disk, in which case, we get great performance.

[image: disk metrics while doing cleanup and pushing more data to disk]

But oh dear god, just look at the latency numbers that we are seeing here.

Same machine, local hard disk (and SSD to boot), and we are seeing latency numbers that aren’t even funny.

In this case, the reason for this is that we are flushing the data file alongside the journal file. In order to allow operations to proceed as fast as possible, we try to parallelize the work, so even though the data file flush is currently holding most of the I/O, we are still able to proceed with minimal hiccups and stalls as far as the client is concerned.

But this really brings home the fact that we are actually playing with a very limited pipe, that there is little we can do to control the usage of the pipe at certain points (a single fsync can flush a lot of unrelated stuff), and that there is no way to throttle things and let the OS know (this particular flush operation shouldn’t take more than 100 MB/s; I’m fine with it taking a bit longer, as long as I have enough I/O bandwidth left for other stuff).

The Guts n’ Glory of Database Internals: The curse of old age…

time to read 5 min | 922 words

This is a fun series to write, but I’m running out of topics where I can speak about the details at a high level without getting into nitty gritty details that will make no sense to anyone but database geeks. If you have any suggestions for additional topics, I would love to hear about them.

This post, however, is about another aspect of running a database engine. It is all about knowing what is actually going on in the system. A typical web application has very little state (maybe some caches, but that is pretty much it) and can be fairly easily restarted if you run into some issue (memory leak, fragmentation, etc.), recovering from most problems while you investigate exactly what is going on. A surprising number of production systems actually have this as a feature: they just restart on a regular basis. IIS will restart a web application every 29 hours, for example, and I have seen production deployments of serious software where the team was simply unaware of that. It did manage to reduce a lot of the complexity, because the application never lived long enough to actually carry around that much garbage.

A database tends to be different. A database engine lives for a very long time, typically weeks, months, or years, and it is pretty bad when it goes down. It isn’t a single node in the farm that is temporarily missing or slow while it is filling the cache; this is the entire system being down without anything that you can do about it (note, I’m talking about single node systems here; distributed systems have high availability mechanisms that I’m ignoring at the moment). That tends to give you a very different perspective on how you work.

For example, if you are using Cassandra, it has (or at least used to have) an issue with memory fragmentation over time. It would still have a lot of available memory, but certain access patterns would chop that up into smaller and smaller slices, until just managing the memory at the JVM level caused issues. In practice, this can cause very long GC pauses (multiple minutes). And before you think that this is a situation unique to managed databases, Redis is known to suffer from fragmentation as well, which can lead to higher memory usage (and even kill the process, eventually) for pretty much the same reason.

Databases can’t really afford to use common allocation patterns (so no malloc / free or the equivalent) because they tend to hold on to memory for a lot longer, and their memory consumption is almost always dictated by the client. In other words, saving increasingly large records will likely cause memory fragmentation, which I can then exploit further by doing another round of memory allocations, slightly larger than the round before (forcing even more fragmentation, etc.). Most databases use dedicated allocators (typically some form of arena allocator) with limits, which allows them to have better control and mitigate that issue (for example, by blowing away the entire arena on failure and allocating a new one, which doesn’t have any fragmentation).
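As a very rough illustration of the arena idea, not any particular database’s allocator, a bump allocator can be as simple as the following sketch.

```csharp
using System;

// Very rough illustration of an arena (bump) allocator: allocations are
// just a pointer bump inside one big buffer, and instead of freeing
// individual allocations we reset (or throw away) the whole arena, which
// leaves no fragmentation behind.
class Arena
{
    private readonly byte[] _buffer;
    private int _used;

    public Arena(int size) => _buffer = new byte[size];

    public ArraySegment<byte> Allocate(int size)
    {
        if (_used + size > _buffer.Length)
            throw new OutOfMemoryException("Arena exhausted; the caller resets or replaces the arena.");
        var segment = new ArraySegment<byte>(_buffer, _used, size);
        _used += size;
        return segment;
    }

    // Free everything at once, e.g. at the end of a transaction.
    public void Reset() => _used = 0;
}
```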

But you actually need to build this kind of thing directly into the engine, and you need to also account for that. When you have a customer calling with “why is the memory usage going up”, you need to have some way to inspect this and figure out what to do about that. Note that we aren’t talking about memory leaks, we are talking about when everything works properly, just not in the ideal manner.

Memory is just one aspect of that, if one that is easy to look at. Another thing that you need to watch for is anything that has a cost proportional to how long the system has been running. For example, if you have a large LRU cache, you need to make sure that after a couple of months of running, pruning that cache isn’t going to be an O(N) job running every 5 minutes, never finding anything to prune, but costing a lot of CPU time. The number of file handles is also a good indication of a problem in some cases; some database engines have a lot of files open (typically LSM ones), and they can accumulate over time until the server runs out of them.

Part of the job of the database engine is to consider not only what is going on now, but also how to deal with (sometimes literally) abusive clients that try to do very strange things, and how to handle them. In one particular case, a customer used a feature that was designed for a maximum of a few dozen entries in a particular query to pass 70,000+ entries. The amazing thing is that this worked, but as you can imagine, all sorts of assumptions internal to that feature were viciously violated, requiring us to consider whether to put a hard limit on the feature, so it stays within its design specs, or to see if we can redesign the entire thing so it can handle this kind of load.

And the most “fun” part is when those sorts of bugs only show up after a couple of weeks of harsh production use. So even when you know what is causing this, actually reproducing the scenario (you need memory fragmented in a certain way, a certain number of cache entries, and the application requesting a certain load factor) can be incredibly hard.

The Guts n’ Glory of Database Internals: Backup, restore and the environment…

time to read 4 min | 684 words

A lot of the complexities involved in actually building a database engine aren’t related to the core features that you want to have. They are related to what looks like peripheral concerns. Backup / restore is an excellent example of that.

Obviously, you want your database engine to have support for backup and restore. But at the same time, actually implementing that (efficiently and easily) is not something that can be done trivially. Let us consider Redis as a good example; in order to back up its in-memory state, Redis will fork a new process and use the OS’s support for copy-on-write to have a stable snapshot of the in-memory state that it can then write to disk. It is a very elegant solution, with very little code, and it can be done with the assistance of the operating system (almost always a good thing).

It also exposes you to memory bloat if you are backing up to a slow disk (for example, a remote machine) at the same time that you have a lot of incoming writes. Because the OS will create a copy of every memory page that is touched as long as the backup process is running (on its own copy of the data), the amount of memory actually being used is non trivial. This can lead to swapping, and in certain cases, the OS can decide that it runs out of memory and just start killing random processes (most likely, the actual Redis server that is being used, since it is the consumer of all this memory).

Another consideration to have is exactly what kind of work do you have to do when you restore the data. Ideally, you want to be up and running as soon as possible. Given database sizes today, even reading the entire file can be prohibitively expensive, so you want to be able to read just enough to start doing the work, and then complete the process of finishing up the restore later (while being online). The admin will appreciate it much more than some sort of a spinning circle or a progress bar measuring how long the system is going to be down.

The problem with implementing such features is that you need to consider the operating environment in which you are working. The ideal case is if you can control such behaviors, for example, have dedicated backup & restore commands that the admin will use exclusively. But in many cases, you have admins that do all sorts of various things, from shutting down the database and zipping to a shared folder on a nightly basis to taking a snapshot or just running a script with copy / rsync on the files on some schedule.

Some backup products have support for taking a snapshot of the disk state at a particular point in time, but this goes back to the issues we raised in a previous post about low level hooks. You need to be aware of the relevant costs and implications of those things. In particular, most databases are pretty sensitive to the order in which you backup certain files. If you take a snapshot of the journal file at time T1, but the data file at time T2, you are likely to have data corruption when you restore (the data file contains data that isn’t in the journal, and there is no way to recover it).

The really bad thing about this is that it is pretty easy for this to mostly work, so even if you have a diligent admin who tests the restore, it might actually work, except that it will fail when you really need it.

And don’t get me started on cloud providers / virtual hosts that offer snapshots. The kind of snapshotting capabilities that a database requires are very specific (all changes that were committed to disk, in the order they were sent, without any future things that might be in flight) in order to be able to successfully backup & restore. From experience, those are not the kind of promises that you get from those tools.

Fast transaction log: Windows

time to read 5 min | 806 words


In my previous post, I tested journal writing techniques on Linux. In this post, I want to do the same for Windows, and see what the impact of the various options is on system performance.

Windows has slightly different options than Linux. In particular, on Windows, the various flags and promises are very clear, and it is quite easy to figure out what it is that you are supposed to do.

We have tested the following scenarios:

  • Doing buffered writes (pretty useless for any journal file, which needs to be reliable, but a good baseline metric).
  • Doing buffered writes and calling FlushFileBuffers after each transaction (which is a pretty common way to handle committing to disk in databases), the equivalent of calling fsync.
  • Using the FILE_FLAG_WRITE_THROUGH flag and asking the kernel to make sure that after every write, everything will be flushed to disk. Note that the disk may or may not buffer things.
  • Using the FILE_FLAG_NO_BUFFERING flag to bypass the kernel’s caching and go directly to disk. This has special memory alignment considerations.
  • Using the FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING flags to ensure that we don’t do any caching, and actually force the disk to do its work. On Windows, this is guaranteed to ask the disk to flush to the persisted medium (but the disk can ignore this request).

Here is the code:
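The benchmark source isn’t included in this copy. The sketch below is not the original benchmark; it only shows, roughly, how the flag combinations map to managed code. Note that FILE_FLAG_NO_BUFFERING has no official FileOptions value and also imposes sector-alignment requirements on buffers, so it is only referenced in a comment here.

```csharp
using System.IO;

// Sketch of the tested flag combinations, not the original benchmark.
class JournalWriteSketch
{
    // FILE_FLAG_NO_BUFFERING has no official FileOptions value; passing the
    // raw flag is a known trick, but it requires sector-aligned buffers,
    // offsets, and sizes, so it is not exercised in this sketch.
    const FileOptions NoBuffering = (FileOptions)0x20000000;

    static void Run(string path, FileOptions options, bool flushFileBuffers)
    {
        var buffer = new byte[16 * 1024]; // one 16 KB "transaction"

        using var file = new FileStream(path, FileMode.Create, FileAccess.Write,
            FileShare.None, 4096, options);

        for (int i = 0; i < 65_536; i++) // ~1 GB in total
        {
            file.Write(buffer, 0, buffer.Length);
            if (flushFileBuffers)
                file.Flush(flushToDisk: true); // calls FlushFileBuffers
        }
    }

    static void Main()
    {
        // Buffered baseline:
        Run("journal.bin", FileOptions.None, flushFileBuffers: false);
        // Buffered + FlushFileBuffers per transaction:
        Run("journal.bin", FileOptions.None, flushFileBuffers: true);
        // FILE_FLAG_WRITE_THROUGH:
        Run("journal.bin", FileOptions.WriteThrough, flushFileBuffers: false);
        // The NO_BUFFERING variants would additionally need sector-aligned
        // native buffers, so they are omitted from this sketch.
    }
}
```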

We have tested this on an AWS machine (i2.2xlarge – 61 GB, 8 cores, 2x 800 GB SSD drives, 1 GB/sec EBS), which was running Microsoft Windows Server 2012 R2 RTM 64-bit. The code was compiled for 64 bits with the default release configuration.

What we are doing is writing a 1 GB journal file, simulating 16 KB transactions and 65,536 separate commits to disk. That is a lot of work that needs to be done.

First, again, I ran it on the system drive, to compare how it behaves:

Method                                              Time (ms)   Write cost (ms)
Buffered                                                  396             0.006
Buffered + FlushFileBuffers                           121,403             1.8
FILE_FLAG_WRITE_THROUGH                                58,376             0.89
FILE_FLAG_NO_BUFFERING                                 56,162             0.85
FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING       55,432             0.84

Remember, this is us running on the system disk, not on the SSD drive. Here are those numbers, which are much more interesting for us.

Method                                              Time (ms)   Write cost (ms)
Buffered                                                  410             0.006
Buffered + FlushFileBuffers                            21,077             0.321
FILE_FLAG_WRITE_THROUGH                                10,029             0.153
FILE_FLAG_NO_BUFFERING                                  8,491             0.129
FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING        8,378             0.127

And those numbers are very significant. Unlike the system disk, where we basically get whatever spare cycles we have, on both Linux and Windows the SSD disk provides really good performance. But even on identical machines, running nearly identical code, there are significant performance differences between them.

Let me draw it out for you:

Options                                                                Windows (ms)   Linux (ms)   Difference
Buffered                                                                      0.006         0.03      80% Win
Buffered + fsync() / FlushFileBuffers()                                       0.32          0.35       9% Win
O_DSYNC / FILE_FLAG_WRITE_THROUGH                                             0.153         0.29      48% Win
O_DIRECT / FILE_FLAG_NO_BUFFERING                                             0.129         0.14       8% Win
O_DIRECT | O_DSYNC / FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING         0.127         0.31      60% Win

In pretty much all cases Windows has been able to outperform Linux in this specific scenario, in many cases by a significant margin. In particular, in the scenario that I actually really care about, we see a 60% performance advantage for Windows.

One of the reasons for this blog post, and for the detailed code and scenario, is to get external verification of these numbers. I would love to learn that I missed something that would make Linux’s speed comparable to Windows, because right now this is pretty miserable.

I do have a hunch about those numbers, though. SQL Server is a major business for Microsoft, so they have a lot of pull in the company. And SQL Server uses FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING internally to handle the transaction log it uses. Like quite a few other Win32 APIs (WriteFileGather, for example), it looks tailor made for database journaling. I’m guessing that this code path has been gone over multiple times over the years, trying to optimize SQL Server by smoothing anything in the way.

As a result, if you know what you are doing, you can get some really impressive numbers on Windows in this scenario. Oh, and just to quiet the nitpickers:

[image: supporting screenshot]

Fast transaction log: Linux

time to read 5 min | 910 words

We were doing some perf testing recently, and we got some odd results when running a particular benchmark on Linux. So we decided to check this on a much deeper level.

We got an AWS machine (i2.2xlarge – 61 GB, 8 cores, 2x 800 GB SSD drives, running Ubuntu 14.04, using kernel version 3.13.0-74-generic, 1 GB/sec EBS drives) and ran the following code on it. This tests the scenarios below on a 1 GB file (pre-allocated), “committing” 65,536 (64K) transactions with 16 KB of data in each. The idea is that we are testing how fast we can write those transactions to the journal file, so we can consider them committed.

We have tested the following scenarios:

  • Doing buffered writes (pretty useless for any journal file, which needs to be reliable, but good baseline metric).
  • Doing buffered writes and calling fsync after each transaction (which is a pretty common way to handle committing to disk in databases).
  • Doing buffered writes and calling fdatasync (which is supposed to be slightly more efficient than calling fsync in this scenario).
  • Using O_DSYNC flag and asking the kernel to make sure that after every write, everything will be synced to disk.
  • Using O_DIRECT flag to bypass the kernel’s caching and go directly to disk.
  • Using O_DIRECT | O_DSYNC flag to ensure that we don’t do any caching, and actually force the disk to do its work.

The code is written in C, and it is written to be pretty verbose and ugly. I apologize for how it looks, but the idea was to get some useful data out of this, not to generate beautiful code. It is quite probable that I made some mistake in writing the code, which is partly why I’m talking about this.

Here is the code, and the results of execution are below:

It was compiled using: gcc journal.c -O3 -o run && ./run

The results are quite interesting:

Method                   Time (ms)   Write cost (ms)
Buffered                       525              0.03
Buffered + fsync            72,116              1.10
Buffered + fdatasync        56,227              0.85
O_DSYNC                     48,668              0.74
O_DIRECT                    47,065              0.71
O_DIRECT | O_DSYNC          47,877              0.73

The buffered call is useless for a journal, but important as something to compare to. The rest of the options ensure that the data resides on disk* after the call to write, and are suitable for actually getting safety from the journal.

* The caveat here is the use of O_DIRECT; the docs (and Linus) seem to be pretty much against it, and there are few details on how this works with regard to instructing the disk to actually bypass all buffers. O_DIRECT | O_DSYNC seems to be the recommended option, but that probably deserves more investigation.

Of course, I had this big long discussion on the numbers planned. And then I discovered that I was actually running this on the boot disk, and not one of the SSD drives. That was a face palm of epic proportions that I was only saved from by the problematic numbers that I was seeing.

Once I realized what was going on and fixed that, we got very different numbers.

Method                   Time (ms)   Write cost (ms)
Buffered                       522              0.03
Buffered + fsync            23,094              0.35
Buffered + fdatasync        23,022              0.35
O_DSYNC                     19,555              0.29
O_DIRECT                     9,371              0.14
O_DIRECT | O_DSYNC          20,595              0.31

There is about 10% difference between fsync and fdatasync when using the HDD, but there is barely any difference as far as the SSD is concerned. This is because the SSD can do random updates (such as updating both the data and the metadata) much faster, since it doesn’t need to move the spindle.

Of more interest to us is that O_DSYNC is significantly faster on both SSD and HDD. About 15% can be saved by using it.

But I’m just drooling over the O_DIRECT write profile performance on SSD. That is so good; unfortunately, it isn’t safe to use on its own. We need both it and O_DSYNC to make sure that we only get confirmation on the write when it hits the physical disk, instead of the drive’s cache (that is why it is actually faster: we are writing directly to the disk’s cache, basically).

The tests were run using ext4. I played with some options (noatime, nodiratime, etc.), but there wasn’t any big difference between them that I could see.

In my next post, I’ll test the same on Windows.
