I was always somewhat baffled that something so obviously required so often could be so expensive with RDBMSes. When diving deep into a B-Tree implementation, I suddenly found it obvious why that is the case.
Let us look at the following B-Tree:
Now, if I asked you how many entries you had in this tree, you would have to visit each and every page. And that can be very expensive when you have a large number of items.
But why can’t we just keep track of the total number ourselves? There are counting B-Trees that we can use that will give us the answer in O(1) time.
The answer to that is that it is too expensive to maintain this. Imagine a big tree, with 10 million records, and a fan out of 16 keys per page. That means that it has a depth of 6. Let us say that we want to add a new record. We can do that by modifying and saving the page. That would cost us 4KB of writes.
However, if we use a counting B-Tree, we have to update all of the parent pages. With a depth of 6, that means that instead of writing 4KB to disk we will have to write 24KB to disk. And probably have to do multiple seeks to actually do the write properly. That is an unacceptable performance cost, and the reason why you have to manually check all the pages when you want the full count.
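The arithmetic above can be sketched as a quick back-of-the-envelope calculation (a Python sketch using the numbers from the post):

```python
import math

PAGE_SIZE = 4 * 1024  # 4KB pages, as in the example above

def btree_depth(records, fan_out):
    # Number of levels needed to hold `records` keys at `fan_out` keys per page
    return math.ceil(math.log(records, fan_out))

def write_cost(records, fan_out, counting=False):
    # A plain B-Tree insert touches one leaf page; a counting B-Tree must
    # also update the running count in every ancestor page up to the root.
    pages_touched = btree_depth(records, fan_out) if counting else 1
    return pages_touched * PAGE_SIZE

depth = btree_depth(10_000_000, 16)           # 6 levels
plain = write_cost(10_000_000, 16)            # 4KB per insert
counting = write_cost(10_000_000, 16, True)   # 24KB per insert
```

That factor of six in write amplification (plus the extra seeks) is the whole story.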
One of the annoying things about doing performance work is that you sometimes feel like you found the problem. You fix it, and another pops out.
After the previous run of work, we have tackled two important perf issues:
- Immutable collections – which were replaced with more memory intensive but hopefully faster safe collections.
- Transaction contention from background work – which we reduced and came close to eliminating.
Once we have done that, it is time to actually look at the code in the profiler again. Note that I am writing this while I am traveling on the train (on battery power), so you can pretty much ignore any actual numbers, we want to look at the percentages.
Let us look at the contention issue first. Here is the old code:
And the new one:
We still have some cost, but that appears to have been drastically cut down. I’ll call it good for now.
What about transaction commit? Here is the old code:
And here is the new one:
What we can see is that our change didn’t actually improve anything much. In fact, we are worse off. We’ll probably go with a more complex system of chained dictionaries, rather than just the plain copying that appears to be causing so much trouble.
However, there is something else to note now. Both Commit and NewTransaction together are less than 60% of the cost of the operation. The major cost is actually in adding an item to the in memory tree:
In particular, it looks like finding the right page to place the item is what is costing us so much. We added 2 million items (once sequential, once random), which means that we need to do an O(logN) operation 2 million times. I don’t know if we can improve on the way it works for random writes, but we should be able to do something about how it behaves for sequential writes.
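I don’t know what we will end up doing here, but one common trick for the sequential case is to keep a hint to the last page used and check it before doing the full descent. A toy Python sketch of the idea (a sorted list stands in for the tree pages):

```python
import bisect

class TreeWithHint:
    """Stand-in for a B-Tree: a sorted key list plus an end-of-tree hint."""
    def __init__(self):
        self.keys = []
        self.descents = 0  # how many full O(log N) searches we had to do

    def insert(self, key):
        # Fast path: sequential inserts land at the end, no search needed.
        if not self.keys or key >= self.keys[-1]:
            self.keys.append(key)
            return
        # Slow path: full search, as any random write would need.
        self.descents += 1
        bisect.insort(self.keys, key)

t = TreeWithHint()
for i in range(1000):  # sequential writes all hit the fast path
    t.insert(i)
```

For the random-write case the search cost is unavoidable, but for a sequential load the hint turns almost every insert into an O(1) append.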
After the issues we ran into with immutable structures, I decided that I still wanted to maintain the same guarantees that immutable data structures gave us, while keeping something close to the speed of mutable structures.
I managed to do that by making different design choices when building my implementation. While the BCL immutable collections are based on binary trees, and attempt to spare GC pressure in favor of CPU cycles, I find that for a lot of the things that we do, we can spare the memory in favor of actually being faster overall. In particular, this is true since the immutable collections are an order of magnitude slower for reads than their mutable counterparts. I could have dealt with the slow writes, somehow. But we do a lot of reads, and that is utterly unacceptable.
Here is how I implemented SafeList:
The implementation for SafeDictionary is pretty similar as well. Note that I have dedicated & optimized methods for the type of things that I need in my code.
But the key part here is that I gain the safety by always creating another copy of the underlying implementation. We are never actually mutating anything.
Sure, this results in us having to deal with a lot more allocations, but we get the same speed for reads as you do for the standard mutable collections. And more importantly, we are still faster for writes.
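The actual SafeList code isn’t shown here, but a minimal Python sketch of the copy-on-write idea (reads go straight to the underlying list, every write produces a fresh copy) might look like:

```python
class SafeList:
    """Copy-on-write list: readers always see an immutable snapshot."""
    def __init__(self, items=()):
        self._items = list(items)  # never mutated after construction

    def __getitem__(self, index):
        return self._items[index]  # read: exactly as fast as a plain list

    def __len__(self):
        return len(self._items)

    def add(self, item):
        # Write: copy the entire backing store, then append.
        # More allocations, but existing readers are never disturbed.
        return SafeList(self._items + [item])

a = SafeList([1, 2, 3])
b = a.add(4)  # `a` is untouched; `b` is a new snapshot
```

The trade is explicit: every write pays a full copy, and in exchange every read is a plain array access with no tree traversal.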
Why is the performance of an immutable list over 16 times slower than a standard list? I took a peek at what it was actually doing, and it made a lot of sense. In order to maintain efficient indexing access, the actual storage of the data in the immutable list is a binary tree. With the key being used as the indexer.
This results in a much higher cost for pretty much everything. Let us look at the following:
With a for loop over the indexer, List<T> takes 0.23 seconds while ImmutableList<T> takes 1.28 seconds. When I use foreach, instead, we get 0.22 seconds vs. 2.29 seconds.
As you can see from the blog post describing them, because immutable collections are mostly implemented as binary trees, I don’t really think that there is a good way to approach this as is. The problem is that the immutable collections actually need to do a lot more than what I need them to do.
Now, it might have been acceptable to use them if the perf cost was limited to just writing to them. But as you can see, we have the same problem when we are reading, and that is quite unacceptable.
After finding out that a lot of our costs in Voron are actually related to immutable collections, I decided to run some isolated perf tests.
1 minute and 58 seconds
Yes, that is correct, the immutable version is over thirty times slower. And given that we are only using 10 million items, that is a ridiculously high cost.
I decided to do some tests to see if it is just the number of calls that we are making here:
So it appears that it isn’t the number of calls, but something intrinsic to the way it works. And how about lists?
I think we are going to need to do something about this, even though I love the idea of immutable collections, 16 – 32 times slower is just going to be utterly unacceptable.
But this is for writes, how about reads?
I had written the following for both of them:
So the summary is, the performance for our workloads is probably going to be simply unacceptable. We’ll have to find another way to handle this. The way to handle this nicely is to basically copy the values. So I wanted to know how long it would take to fully clone a 10 million item dictionary. The answer: 0.6 seconds. Doing the same with immutable dictionary? 16 seconds.
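The clone test itself was roughly of this shape (a Python rendition, with a smaller item count so it runs quickly; the post used 10 million items):

```python
import time

def time_clone(d):
    start = time.perf_counter()
    clone = dict(d)  # full shallow copy, analogous to copying the backing store
    return clone, time.perf_counter() - start

source = {i: str(i) for i in range(100_000)}
clone, elapsed = time_clone(source)
```

The point of the measurement: a flat copy of the backing store is a single bulk operation, whereas rebuilding an immutable tree pays per-node allocation and rebalancing.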
So we are going to need to do something else.
Earlier this month I posted our results for the initial performance of Voron. We learned a lot from that, and it led us to do some major refactoring to the codebase. Now, we are back on a more or less stable ground, so it is time to actually see what is going on. Where before we had a baseline against other storage engines, now we are testing against the previous version as well as Esent.
To get good, consistent numbers, I decided to run the tests on an EC2 machine, in particular an m1.large instance. The idea is that this gives us a simple way to get a standard machine config that anyone can use to test out as well.
For the purpose of this benchmark, we are doing writes to one of the temporary storage disks, rather than a permanent EBS drive. This will presumably be somewhat faster than a real world configuration, but I think that running in an environment where we are actually using a common production machine (rather than a dedicated development machine) should give us better numbers, certainly more realistic ones.
The first thing to check, of course, is the simplest. Sequential writes:
Here we can see that we are actually worse off than we were before. That is actually kind of sad. But we’ll get to that later on.
For sequential reads, we are still very good, and it is obvious that we can pretty easily make use of all of the resources that the machine gives us. Here we are consistently slightly faster than the older version, but that is something that is probably accidental. We certainly did no work on improving read performance (there was very little need to do so, after all).
Random writes were a big surprise:
As it happens, it appears that we are faster, significantly so, than Esent in this scenario. Which was quite a surprise. For some reason, we are also faster in the new code than the old version. Which is quite interesting.
Finally, we have random reads.
No, it isn’t an error. We see a rate of ~1,600 reads/sec for Esent single threaded. I am not quite sure why this is behaving in this fashion, but I am willing to say that there is something wrong and just discount this result.
I mentioned before that it appears that we are actually slower than the old code. So I decided to check what is actually going on there. My tool of choice, the profiler. Here is the old version:
You can see that we have done two runs (one sequential, one random) of 10,000 transactions, writing a total of 2 million items. And roughly 26% of the time is spent committing the transaction.
Digging into that, a lot of that time is spent just flushing to disk:
Note, however, that a decidedly non trivial amount of time is actually spent on ImmutableDictionary, but I’ll touch on that later on.
With our changes, the new code looks like:
The one thing that jumps at me is that the cost of creating a transaction is now very significant. And here is why:
The actual reason for this is likely that we also have the background journal flushing, that now also needs to take the transaction lock. Which is introducing contention into the mix. We’ll need to look into how to resolve that.
Let us dig into the commit. The change there was pretty much the whole reason for what we were doing.
You can see that we are using WriteFileGather, and unlike before, when syncing took about 15% of the overall time, now it is taking just 7%. So that is better than a 50% improvement on that score.
But the really interesting bit? Look at what was by far the most expensive operation there. It was the calls to immutable dictionary. In fact, let us look at the results in the profiler for the immutable collections:
So now we have two new things that we need to investigate. One is reducing contentions, and the second is checking how we can optimize our usage of the immutable collections.
I’ll report on our findings…
I was asked to review the book (and received a free electronic copy).
As someone that is very into storage engines, I was quite excited about this. After going over the leveldb codebase, I would finally get to read a real book about how it works.
I was disappointed, badly.
This book isn’t really about leveldb. It contains pretty much no background, explanation, history or anything much at all about how leveldb works. Instead, it is pretty much a guide on how to use LevelDB to write iOS applications. There are a lot of chapters dealing with Objective-C, NSString and variants, how to do binding, how to handle drag and drop.
However, the things that I would expect, such as explanations of how it works, what it does, alternative use cases, etc., are very rare, if there at all. Only chapter 10 is really worth reading, and even so, I got the feeling that it only made sense to me because I already knew quite a lot about leveldb. I can’t imagine actually starting from scratch and being able to understand leveldb from this book.
If you are working on iOS apps / OS X, I guess that this might be a good choice, but only if you want a guide to actually putting leveldb into your app. You’ll need to do your actual leveldb learning elsewhere.
The book does contain some interesting tidbits. Chapter 10 talks about tuning and key policies, and it did have some interesting things to say, but it also contains wrong information* (and if I could spot it, with my relatively little experience with leveldb, I’m pretty sure that there are other things there that are wrong too).
* The book said that it is better to write keys in order, to reduce I/O. But leveldb writes to a skip list in memory, then flushes that entire thing in sorted fashion to disk. Your writes would have to be bigger than the buffer size for that to actually matter, and even then it wouldn’t help you much.
In short, feel free to skip this book, unless you are very focused on writing leveldb apps on iOS. In which case it might be worth it, but I don’t think so. You are better off reading the docs or any of the tutorials.
I got an interesting question from Thomas:
How can the OS actually ensure that the write through calls go to disk before returning, when it cannot do the same for fsync. I mean couldn't there still be some driver / raid controller / hard disk doing some caching but telling windows the data is written?
And I think that the answer is worth a post.
Let us look at the situation from the point of view of the operating system. You have an application that issues a write request. The OS will take the data to be written and write it to its own buffers, or maybe it will send it to the disk, with instructions to write the data, but nothing else. The disk driver is then free to decide what the optimal way to actually do that would be. In many cases, that means not writing the data right now, but placing it in its own buffer and doing a lazy write when it feels like it. This is obviously a very simplified view of how it works, but it is good enough for what we are doing.
Now, when we call fsync, we have to do that with a file handle. But as it turned out, that isn’t quite as useful as you might have thought it would be.
The OS is able to use the file handle to find all of the relevant data that has been written to this file and wasn’t sent to the disk yet. And it will call the disk and tell it, “hi, how about writing those pieces too, if you don’t mind*”. However, that is only part of what it needs to do. What about data that has already been written by the OS to the disk drive, but is still in the disk drive cache?
* It is a polite OS.
Well, we need to force the drive to flush it to the actual physical media, but here we run into an interesting problem. There is actually no way for the OS to tell a disk drive “flush just the data belonging to file X”. That is because at the level of the disk drive, we aren’t actually talking about files, we are talking about sectors. Therefore, there isn’t any way to say: flush just this data. And since the disk drive won’t tell the OS when it actually flushed the data to disk, the OS has no way of telling (nor does it need to track) what specific pieces actually need to be flushed.
Therefore, what it does is go to the disk driver and tell it: flush everything that is in your cache, and tell me when you are done. As you can imagine, if you are currently doing any writes and someone calls fsync, that can be a killer for performance, because the disk needs to flush the entire cache. It is pretty common for disks to come with 64MB or 128MB caches. That means that when fsync is called, it might be doing a lot of work. The Firefox fsync issue is probably the most high profile case where this was observed. There have been a lot of people looking into that, and you can read a lot of fascinating information about it.
Now, what about Write Through? Well, for that the OS does something slightly differently. Instead of just handing the data to the disk driver and telling it do whatever it wants with it, what it does is to tell the disk driver that it needs to write this data right now. Because we can give the disk driver the specific instructions about what to flush to disk, it can do that without having to flush everything in its cache. That is the difference between writing a few KB and writing tens of MB.
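The difference is visible at the API level too. Here is a POSIX-flavored Python sketch (not the actual Voron code; `O_SYNC` is the closest analog to Windows’ `FILE_FLAG_WRITE_THROUGH`):

```python
import os
import tempfile

fd0, path = tempfile.mkstemp()
os.close(fd0)

# Option 1: buffered write, then force the *entire* drive cache out.
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"buffered data")
os.fsync(fd)  # the drive must flush everything in its cache, not just this file
os.close(fd)

# Option 2: write-through - this specific write reaches stable media
# before the call returns, without flushing unrelated cached data.
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
os.write(fd, b"synchronous!!")  # same length, overwrites in place
os.close(fd)
```

Both end with the data durable; the difference is how much unrelated work the drive was forced to do to get there.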
I said that this is a great oversimplification. There are some drivers that would choose to just ignore fsync. They can do that, and they should do that, under certain circumstances. For example, if you are using a disk that comes with its own battery backed memory, there is no reason to actually wait for the flush; we are already assured that the data cannot go away if someone pulls the plug. However, there are plenty of drives that would just ignore fsync (or honor only every 3rd fsync, or whatever) because it leads to better performance.
This also ignores things like the various intermediaries along the way. If you are using hardware RAID, for example, you also have the RAID cache, etc. And yes, I think that there are drivers out there that would ignore write through as well.
At the low level Write Through uses SCSI commands with Force Unit Access, and fsync uses SYNCHRONIZE_CACHE for SCSI and FLUSH_CACHE for ATAPI. I think that ATAPI 7 has Force Unit Access, as well, but I am not sure.
Also known as: Please check the subtitle of this blog.
This benchmark doesn’t actually provide much useful information. It is too short and compares fully featured DBMS systems to storage engines. I always stress very much that people never make decisions based on benchmarks like this.
These paint the fully featured DBMS systems in a negative light that isn’t a fair comparison. They are doing a LOT more work. I’m sure the FoundationDB folks will not be happy to know they were roped into an unfair comparison in a benchmark where the code is not even available.
This isn’t a benchmark. This is just an interim step along the way of developing Voron. It is a way for us to see where we stand and where we need to go. A benchmark includes full details about what you did (machine specs, running environment, full source code, etc). This is just us putting stress on our machine and comparing where we are at. And yes, we could have done it in isolation, but that wouldn’t really give us any major advantage. We need to see how we compare to other databases.
And yes, we compare apples to oranges here when we compare a low level storage engine like Voron to SQL Server. I am well aware of that. But that isn’t the point. For the same reason that we are currently doing a lot of micro benchmarks rather than the 48 hours ones we have in the pipeline.
I am trying to see how users will evaluate Voron down the road. A lot of the time, that means users doing micro benchmarks to see how good we are. Yes, those aren’t very useful, but they are a major way people make decisions. And I want to make sure that we come out in a good light under that scenario.
With regards to FoundationDB, I am sure they are as happy about it as I am about them making silly claims about RavenDB transaction support. And the source code is available if you really want it; in fact, we included FoundationDB there because we had an explicit customer request, and because they contributed the code for that.
Next, let us move to something else equally important. This is my personal blog. I publish here things that I do on a daily basis. And if I am currently in a performance boost stage, you’re going to be getting a lot of details on that. Those are the results of performance runs, they aren’t benchmarks. They don’t get anywhere beyond this blog. When we’ll put the results on ravendb.net, or something like that, then it will be a proper benchmark.
And while I fully agree that making decisions based on micro benchmarks is a silly way to go about doing so, the reality is that many people do just that. So one of the things that I’m focusing on is exactly those things. It helps that we currently see a lot of places to improve in those micro benchmarks. We already have a plan (and code) to see how we do on a 24 – 48 hours benchmark, which would also allow us to see all sort of interesting things (mixed reads & writes, what happens when you go beyond physical memory size, longevity issues, etc).
Those are interim results, but they give me some hope. This is running on a different machine than the numbers I posted before, but the runs here can be compared to each other. You can see it here. All numbers are in operations / sec.
Writing 100,000 sequential items (100 items per transaction in 1,000 transactions):
And writing 100,000 random items:
And here is what we really care about. Comparing Voron to Esent:
This is actually pretty amazing, because we are now at about 25% write speed of Esent for sequential writes (we used to be at less than 5%). When talking about random writes (what we really care about) we are neck and neck.
I am currently in the States, and I can’t go anywhere without seeing a lot of signs for Black Friday. Since it seems that this is a pretty widespread attempt to do a load test on everyone’s servers (and physical stores, for some reason), I decided that I might as well join the fun and see how we handle the load.
You can use one of the following coupon codes (each one has between 4 – 16 uses) to get a 20% discount for any of our products, if you buy in the next 48 hours.
I keep getting asked by people: “What is the configuration option to make NHibernate run faster?”
People sort of assume that NHibernate is configured to be slow by default because it amuses someone.
Well, while there isn’t a “Secret_incantation” = “chicken/sacrifice” option in NHibernate, there is this one:
And it pretty much does the same thing.
No, I won’t explain why. Go read the docs.
I have a method that does a whole bunch of work, then submit an async write to the disk. The question here is how should I design it?
In general, all of the method’s work is done, and the caller can continue doing whatever they want to do with their life; all locks are released, all the data structures are valid again, etc.
However, it isn’t until the async work completes successfully that we are actually done with the work.
I could do it like this:
But this implies that you have to wait for the task to complete, which isn’t true. The commit function looks like this:
So by the time you have the task from the method, all the work has been done. You might want to wait on the task to ensure that the data is actually durable on disk, but in many cases, you can get away with not waiting.
I am thinking about doing it like this:
To make it explicit that there isn’t any async work done that you have to wait for.
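The two shapes can be sketched in Python (the names are mine, not the actual Voron API). In the first, the returned future suggests you must wait; in the second, the signature makes it explicit that the work is already done and only durability is pending:

```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)

class Transaction:
    def commit_async(self):
        """Shape 1: the return type implies you must await the commit."""
        self._do_all_the_work()               # all locks released here
        return _executor.submit(self._flush)  # only durability is pending

    def commit(self):
        """Shape 2: commit is done when this returns; the pending flush
        is exposed separately, for callers that care about durability."""
        self._do_all_the_work()
        self.flush_task = _executor.submit(self._flush)

    def _do_all_the_work(self):
        self.committed = True  # stand-in for the real commit work

    def _flush(self):
        return "durable"       # stand-in for the async disk write

tx = Transaction()
tx.commit()  # usable immediately; wait on tx.flush_task only if you must
```

Both do the same thing; the difference is purely what the API communicates to the caller.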
I really like the manner in which C# async tasks work. And while building Voron, I ran into a scenario in which I could really make use of the Windows async API. This is exposed via Overlapped I/O. The problem is that those are pretty different models, and they don’t appear to want to play together very nicely.
Since I don’t feel like having those two cohabitate in my codebase, I decided to see if I could write a TPL wrapper that would provide nice API on top of the underlying Overlapped I/O implementation.
Here is what I ended up with:
Note that I create the file with overlapped enabled, as well as write_through & no buffering (I need them for something else, not relevant for now).
It is important to note that I bind the handle (which effectively issues a BindIoCompletionCallback under the covers, I think), so we won’t have to use events, but can use callbacks. This is a much more natural manner of working when using the TPL.
Then, we can just issue the actual work:
As you can see, all the actual details are handled in the helper functions, we can just run the code we need, passing it the overlapped structure it requires. Now, let us look at those functions:
The complexity here is that we need to handle 3 cases:
- Successful completion
- Error (no pending work)
- Error (actually success, work is done in an async manner).
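The shape of that wrapper can be sketched in Python with a fake overlapped-style API (the names and the `IO_PENDING` sentinel are mine; the three cases map to the write completing synchronously, failing outright, or completing asynchronously):

```python
from concurrent.futures import Future

IO_PENDING = object()  # stand-in for ERROR_IO_PENDING

def write_async(issue_write):
    """Wrap a callback-style write into a Future, covering all 3 cases."""
    future = Future()

    def completion_callback(error, bytes_written):
        if error is None:
            future.set_result(bytes_written)
        else:
            future.set_exception(IOError(error))

    status = issue_write(completion_callback)
    if status is True:
        pass  # case 1: completed synchronously, the callback already ran
    elif status is IO_PENDING:
        pass  # case 3: an "error" that actually means: completing async
    else:
        future.set_exception(IOError(status))  # case 2: a real error
    return future

# A fake device that completes synchronously:
def fake_write(callback):
    callback(None, 4096)
    return True
```

The subtlety is entirely in case 3: the status code looks like a failure, but the right response is to do nothing and let the completion callback resolve the future later.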
But that seems to be working quite nicely for me so far.
For the past few days I have been talking about our findings with regards to creating an ACID storage solution. And mostly I’ve been focusing on how it works on Windows, using Windows specific terms and APIs.
The problem is that I am not sure if those are still relevant if we talk about Linux. I know that fsync perf is still an issue (if only because both Win & Lin are running on the same hardware). But would the same solutions apply?
For example, the nearest that I can find to FILE_FLAG_NO_BUFFERING is O_DIRECT and FILE_FLAG_WRITE_THROUGH appears to be similar to O_SYNC. But I am not sure if they are actually behaving in the same fashion.
Any ideas? Does anyone have something like Process Monitor for Linux, so we can look at the actual commit behavior of industry grade databases?
From my exploring, it appears that PostgreSQL is using fdatasync() as the default approach, but it can use O_DIRECT and O_DSYNC as well, so that is promising. But I would like to have someone who actually knows Linux intimately tell me if I am even headed in the right direction.
In Voron, we use a double buffer approach. We use the first two pages of the file to alternately write the last version of the database info. For example, the last transaction id, among other things.
The problem is that when we make those changes, we have to call fsync on that, and as we have seen, that is something that we would like to avoid if possible. Because of that, we are going to try something different. We are going to extract the header information from the first few pages of the file in favor of holding them as separate files: header.one and header.two
The idea is that they are very small files, and as such, it would be cheap to fsync them independently. Moreover, we can take advantage of the fact that very small files (and in this case, I am not sure we are even above 256 bytes) are usually stored in the MFT in NTFS and in the inode in ext4. That means that fsync would get both data and metadata at the same time, hopefully just writing out a single block.
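The recovery side of that scheme can be sketched in Python (the file names are from the post; the header format here is my invention): write headers alternately, and on startup pick whichever valid header has the highest transaction id.

```python
import json
import os
import tempfile

dir_path = tempfile.mkdtemp()
HEADERS = [os.path.join(dir_path, "header.one"),
           os.path.join(dir_path, "header.two")]

def write_header(tx_id, root_page):
    # Alternate between the two files, so a torn write can only ever
    # corrupt the *older* of the two copies.
    path = HEADERS[tx_id % 2]
    with open(path, "w") as f:
        json.dump({"tx_id": tx_id, "root_page": root_page}, f)
        f.flush()
        os.fsync(f.fileno())  # tiny file: data + metadata in one cheap flush

def read_current_header():
    best = None
    for path in HEADERS:
        try:
            with open(path) as f:
                header = json.load(f)
        except (OSError, ValueError):
            continue  # missing or torn header: fall back to the other copy
        if best is None or header["tx_id"] > best["tx_id"]:
            best = header
    return best

write_header(1, root_page=7)
write_header(2, root_page=9)
```

On recovery, a torn write to one header simply loses the last transaction, never the database.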
I am not sure how useful that is going to be, but I have hopes.
This post is here to answer several queries in the mailing list, and some questions that were raised in this blog post. I think that this is important enough to warrant a post here, instead of an email to the list, or just a comment.
To summarize, we had a few issues recently that impacted our users’ systems. Those are usually (but not always) cases where a combination of features wasn’t working properly (feature intersection), or just actual bugs. That led to some questions that are worth answering. You can find all the details below, but I would like to talk about what we are actually doing.
In the past 4 or 5 years, we have managed to create a NoSQL database for the .NET platform, and it has been doing nothing but picking up steam ever since we released it. We have been working hard to provide performance, features and stability for our users. On a personal note, it has been quite an amazing ride, seeing more people put RavenDB to use and creating interesting applications and features.
First, there seems to be some concern about the new things that we are doing. Voron, in particular, appears to be a cause for concern. We have relied on Esent as our storage engine for the past four or five years, to great success. Not least of its properties is the fact that Esent has been around the block for a while now, and has been proven robust and safe by the simplest of methods: high and constant use over multiple decades. Esent also has its share of problems, but we didn’t forget why we chose it in the first place. Indeed, I still think that it was an excellent choice. With Voron, the only change you’ll see is that it won’t be the only choice.
Voron is meant to allow us to run on Linux machines, and to provide us with a fully owned stack, so we can do more interesting things across the board. But we aren’t letting go of Esent, and in any way you care to name, Esent is still going to be the core (and default) option we have for storage in RavenDB. With RavenDB 3.0, you’ll have the option to make an informed choice about selecting Voron as a storage engine, with a list of pros & cons.
Second, we do acknowledge that we suffer from a typical blindness for how we approach RavenDB. Since we built it, we know how things are supposed to be, and that is how we usually test them. Even when we try to go for the edge cases, we are constrained by our own thinking. We are currently working on getting an external testing team to do just that. Actively work to make use of RavenDB in creative ways specifically to try to break it.
Third, our own internal policies for releasing RavenDB need to be adjusted slightly. In particular, we are usually faced with two competing pressures: Release Already and Super Stable. We have always tried to release both unstable and stable versions, and the process for moving from unstable to stable is a pretty good one, I think. We have:
- The test suite, now clocking at just over 3,000 tests.
- A separate test suite that is meant to stress test the database.
- Performance test suite, to make sure that we are in line for general performance.
- Longevity tests, making sure that we don’t have any issues in long term usage.
- Finally, as an act of dog fooding, we upgrade our own servers to the new build, and let it run in production for a while, just to make absolutely sure.
We are going to add additional tests (see the 2nd point) to the process, and we are going to extend the duration of all of those steps. I think that in the past few months we have leaned too far toward the “Release Already” mode, so we are going to try to lean back (hopefully not too much) the other way.
Fourth, with regards to licensing. It has been our policy to provide anyone with a free trial license of RavenDB if they want to test it on a temporary basis. We require permanent non developer servers to have a license. I think that this strikes the appropriate balance.
Fifth, we are going to be working on additional tooling around deployment and upgrades. For customers that jump multiple versions (moving from 1.x to 2.5, for example), the update process of the RavenDB internal storage data during upgrades can be lengthy, and there is too little visibility into it at the moment. We are also working on building tools that help figure out what is going on with a production instance (more ops endpoints, more visibility into internal operations, etc).
In summary, we are grateful to our users for bringing any issues to our attention. We are trying hard to have a very responsive feedback cycle, and we can usually resolve most issues within 24 – 48 hours. But I know we need to do better in making sure that users have a more streamlined experience.
So far, we have come to the conclusion that we are going to ditch fsync in favor of unbuffered write through calls. We also saw that this can play nicely with memory mapped files, which is what we are using for Voron.
However, there is a problem here. Before we can write the data to the journal file, we need some place to actually put it. Previously, we could use the memory directly from the memory mapped journal file, and then just flush it. However, now we cannot do that; the only writes that we can do to the journal are using the unbuffered write through I/O. Otherwise, we would have to call fsync again. And sadly, we cannot call WriteFile on memory that is mapped to the same part of the file that we are writing to.
That means that we need some scratch space to work with. And that means that we need to make some choices here. The obvious place to handle this scratch space is memory. The problem with that is that this means that we are going to compete with the rest of the system for available memory. In particular, we would need some way to free up memory after we use it, or we may hold onto it forever. But if we free the memory, we might need it again, in which case we have a free/alloc pattern that isn’t going to be good.
Ideally, we want to get a contiguous range of memory, which probably explains why we care about its size and why we don’t release it early. One thing that I should note is that we are worried mostly about big transactions, ones that might need to touch hundreds or thousands of megabytes. Those tend to be rare, yes, but I hate to have any sort of hard limits in my software.
So what we’ll probably do is create another memory mapped file, of a size that is at least as big as the current journal file, and put all of our in flight transactional data in there. The good news is that we can re-use the space on every transaction, just overwriting the previous values. It also means that we can easily expand the size of the current transaction buffer, so to speak. And under high memory pressure, we have an easier way to handle things.
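To make that concrete, here is a rough sketch of the shape this could take; all the names here are hypothetical, and this ignores alignment, error handling, and growing the scratch file:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical sketch: transaction pages go into a reusable memory mapped
// scratch file, then are copied to the journal in one sequential write on commit.
public class TransactionScratchSpace : IDisposable
{
    const int PageSize = 4096;
    readonly MemoryMappedFile _scratch;
    readonly MemoryMappedViewAccessor _accessor;
    long _used; // bytes used by the current transaction, reset on commit

    public TransactionScratchSpace(string path, long size)
    {
        // The scratch file is at least as big as the journal file.
        _scratch = MemoryMappedFile.CreateFromFile(path, FileMode.Create, null, size);
        _accessor = _scratch.CreateViewAccessor();
    }

    public void WritePage(byte[] page)
    {
        _accessor.WriteArray(_used, page, 0, PageSize);
        _used += PageSize;
    }

    public void Commit(FileStream journal)
    {
        // A single, sequential write of the whole transaction to the journal.
        var buffer = new byte[_used];
        _accessor.ReadArray(0, buffer, 0, buffer.Length);
        journal.Write(buffer, 0, buffer.Length);
        _used = 0; // the scratch space is reused by the next transaction
    }

    public void Dispose()
    {
        _accessor.Dispose();
        _scratch.Dispose();
    }
}
```

The important property is the reuse: the same scratch pages are overwritten by every transaction, so there is no free/alloc churn.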
When the transaction actually commits, we will write directly from the scratch space to the journal file, as a single, sequential, unbuffered write through write. Externally, there wouldn’t be much of a change, and most of what does change would have to do with the transaction commit semantics. Speaking of which, we probably want to talk about the way we store the header information…
After going over all the options for handling ACID, I am pretty convinced that the fsync approach isn’t a workable one for high speed transactional writes. It is just too expensive.
Indeed, when looking at how both SQL Server and Esent handle this, they are using unbuffered write through writes. Those options are available to us as well: we have the FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING options on Windows (I’ll discuss Linux in another post).
Usually FILE_FLAG_NO_BUFFERING is problematic, because it requires you to write with specific memory alignment. However, we are already doing only paged writes, so that isn’t an issue; we can already satisfy exactly what FILE_FLAG_NO_BUFFERING requires.
However, using FILE_FLAG_NO_BUFFERING comes with a cost. If you are using unbuffered I/O, you cannot make use of the buffer cache. In fact, in order to test our code on cold start, we do an unbuffered I/O pass to reset the cache, and the results are pretty drastic.
However, the only place where we actually need to do all of this is the journal file, and we only have a single active one at any given point in time. The problem is, of course, that we want to both read & write from the same file. So I decided to run some tests to see how the system would behave.
I wrote the following code:
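A minimal sketch of such a test, with the file name and sizes being my own choices:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class Program
{
    static void Main()
    {
        const int pageSize = 4096;
        const long fileSize = 1024 * pageSize;

        // Handle #1: standard file I/O, used for the writes.
        var fs = new FileStream("data.bin", FileMode.Create, FileAccess.ReadWrite,
            FileShare.ReadWrite, pageSize);
        fs.SetLength(fileSize); // pre-allocate the data

        // Handle #2: a memory map over the same file, used for the reads.
        using (var mmf = MemoryMappedFile.CreateFromFile(fs, null, fileSize,
            MemoryMappedFileAccess.ReadWrite, HandleInheritability.None, leaveOpen: true))
        using (var accessor = mmf.CreateViewAccessor())
        {
            var page = new byte[pageSize];
            var readBack = new byte[pageSize];
            new Random().NextBytes(page);

            for (long i = 0; i < fileSize / pageSize; i++)
            {
                fs.Write(page, 0, page.Length);

                // Read the page back through the memory map.
                accessor.ReadArray(i * pageSize, readBack, 0, pageSize);
            }
        }
        fs.Dispose();
    }
}
```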
As you can see, what we are doing is writing to a file using standard file I/O, and reading via a memory mapped file. I’m pre-allocating the data, and I am using two handles. Nothing strange is happening here.
And here is the system behavior below. Note that we don’t have any ReadFile calls; the memory mapped reads were answered directly from the file system buffers, with no need to touch the disk.
Note that this is my baseline test. I want to start adding write through & no buffering and see how it works.
I changed the fs constructor to be:
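Presumably something along these lines, with FileOptions.WriteThrough added (the file name and buffer size here are my assumptions):

```csharp
// FileOptions.WriteThrough maps to FILE_FLAG_WRITE_THROUGH in CreateFile.
var fs = new FileStream("data.bin", FileMode.Open, FileAccess.ReadWrite,
    FileShare.ReadWrite, 4096, FileOptions.WriteThrough);
```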
Which gave us the following:
I am not really sure about this behavior, but I am guessing that what actually happened here is that we are seeing several levels of calls (probably an unbuffered write followed by a memory map write?). Our write to send a page ended up writing a bit more than that, but that is fine.
Next, we want to see what is going on with no buffering & write through, which means that I need to write the following:
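FileOptions has no NoBuffering value, so the usual trick is to cast the raw Win32 flag; a sketch, continuing with the same assumed file name:

```csharp
// FILE_FLAG_NO_BUFFERING (0x20000000) has no FileOptions equivalent,
// so we cast the raw Win32 value.
const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

var fs = new FileStream("data.bin", FileMode.Open, FileAccess.ReadWrite,
    FileShare.ReadWrite, 4096, FileOptions.WriteThrough | FileFlagNoBuffering);
```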
And we get the following behavior:
And now we can actually see the behavior that I was afraid of. After making the write to the file, we lose that part of the buffer, so we need to read it again from the file.
However, it is smart enough to know that the data hasn’t changed, so subsequent reads (even if there have been writes to other parts of the file) can still use the buffered data.
Finally, we have the final try, with just NoBuffering and no WriteThrough:
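Which, continuing the same sketch, would be:

```csharp
// NoBuffering only, without WriteThrough this time.
const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

var fs = new FileStream("data.bin", FileMode.Open, FileAccess.ReadWrite,
    FileShare.ReadWrite, 4096, FileFlagNoBuffering);
```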
According to this blog post, NoBuffering without WriteThrough had a significant performance benefit. However, I don’t really see this, and both observation through Process Monitor and the documentation suggests that both Esent and SQL Server are using both flags.
All versions of SQL Server open the log and data files using the Win32 CreateFile function. The dwFlagsAndAttributes member includes the FILE_FLAG_WRITE_THROUGH option when opened by SQL Server.
This option instructs the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.
The FILE_FLAG_WRITE_THROUGH option ensures that when a write operation returns successful completion the data is correctly stored in stable storage. This aligns with the Write Ahead Logging (WAL) protocol specification to ensure the data.
So I think that this is where we will go for now. There is still an issue here, regarding current transaction memory, but I’ll address it in my next post.
On my laptop, fsync has effectively no cost. That is probably because of some configuration setting (it is a battery backed system, so it doesn’t need to actually pay for the fsync call). On my desktop machine (significantly more powerful than my laptop), I see fsync times that are an order of magnitude or more higher.
In practice, if you work with fsync, you can expect to get a maximum of about 200 – 300 fsync calls per second on SSD, and significantly less on HDD. If you are seeing higher numbers than that, you are probably not really doing fsync, and are exposed to data integrity issues if you have a hard crash.
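If you want to sanity check those numbers on your own machine, FileStream.Flush(true) calls FlushFileBuffers under the covers, so a crude measurement (with a hypothetical file name) can look like this:

```csharp
using System;
using System.Diagnostics;
using System.IO;

class FsyncBench
{
    static void Main()
    {
        const int iterations = 100;
        var page = new byte[4096];

        using (var fs = new FileStream("bench.bin", FileMode.Create, FileAccess.Write))
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                fs.Write(page, 0, page.Length);
                fs.Flush(true); // flushToDisk: true -> FlushFileBuffers
            }
            sw.Stop();
            Console.WriteLine("{0:F1} fsyncs / second",
                iterations / sw.Elapsed.TotalSeconds);
        }
    }
}
```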
In particular, it appears that for high performance code, you really want to forget all about fsync for ensuring your transactional needs.
And that is before we even start talking about the cost of fsync (FlushFileBuffers, to be more accurate) as the file size grows. It appears that there is at least some correlation between the size of the file and the cost of calling fsync/FlushFileBuffers on it.
Considering that we are talking about potentially very large files, we really want to be careful about it. All in all, I think that we need to say goodbye to relying on fsync for ensuring ACID.
But how can we ensure that we’ll be properly ACID? Well, the answer is in the previous post: look at how other people are doing it. I’ll expand on that in my next post.
Let me start this post by stating that I am not even sure if what I am trying to do is legal here. But from reading the docs, it does appear to be a valid use of the API, and it does work, most of the time.
The full code can be found here: https://gist.github.com/ayende/7495987
The gist of it is that I am trying to do two things:
- Write to a file opened with FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING.
- Read from this file using a memory map.

Occasionally, I get into situations where, after I wrote to the file, I am not reading back what I wrote.
I have a repro, and we reproduced this on multiple machines, on both Windows 7 and Windows 8.
Here is the relevant code (the full code is in the link), explanation on it below:
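In outline, with illustrative names and sizes (the authoritative version is in the gist linked above), the shape of the code is:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

class Repro
{
    const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;
    const int PageSize = 4096;

    static void Main()
    {
        const long fileSize = 1024 * PageSize;

        // One handle with NoBuffering | WriteThrough, also mapped into memory.
        var fs = new FileStream("repro.bin", FileMode.Create, FileAccess.ReadWrite,
            FileShare.ReadWrite, PageSize, FileOptions.WriteThrough | FileFlagNoBuffering);
        fs.SetLength(fileSize);

        using (var mmf = MemoryMappedFile.CreateFromFile(fs, null, fileSize,
            MemoryMappedFileAccess.ReadWrite, HandleInheritability.None, leaveOpen: true))
        using (var accessor = mmf.CreateViewAccessor())
        {
            // Another thread keeps reading the same range through the map.
            Task.Run(() =>
            {
                var buf = new byte[PageSize * 3];
                while (true)
                    accessor.ReadArray(0, buf, 0, buf.Length);
            });

            var pages = new byte[PageSize * 3]; // write 3 pages (12KB) at a time
            for (byte marker = 1; marker < 100; marker++)
            {
                for (int i = 0; i < pages.Length; i++)
                    pages[i] = marker;

                fs.Position = 0;
                fs.Write(pages, 0, pages.Length);

                // Verify through the memory map what we just wrote.
                for (int i = 0; i < pages.Length; i++)
                {
                    if (accessor.ReadByte(i) != marker)
                        throw new InvalidOperationException("Read stale data at " + i);
                }
            }
        }
    }
}
```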
This code is doing the following:
- We setup a file handle using NoBuffering | Write_Through, and we also map the file using memory map.
- We write 3 pages (12KB) at a time to the file.
- After the write, we use the memory map to verify that we actually wrote what we wanted to the file.
- _At the same time_ we are reading from the same memory in another thread.
- Occasionally, we get an error where the data we just wrote to the file cannot be read back.
Now, here is what I think is actually happening:
- When we do an unbuffered write, Windows has to mark the relevant pages as invalid.
- I _think_ that it does so before it actually performs the write.
- If another thread accesses that particular range of memory at the same time, it can load the _previously_ written data.
- The WriteFile call actually performs the write, but the pages that map to that portion of the file have already been marked as loaded.
- At that point, when we use the memory mapped pointer to access the data, we get the data that was there before the write.
As I said, the code above can reproduce this issue (you might have to run it multiple times).
I am not sure if this is a valid issue or just me misusing the API. The docs are pretty clear about mixing regular I/O & memory mapped I/O: the OS is responsible for keeping them coherent with respect to one another. However, that is certainly not the case here.
It might be that I am using a single handle for both, and Windows does less checking when that happens? For what it is worth, I have also tried it using different handles, and I don’t see the problem in the code above, but I have a more complex scenario where I do see the same issue.
Of course, FILE_FLAG_OVERLAPPED is not specified, so what I would actually expect here is serialization of the I/O, according to the docs. But mostly I need a sanity check to tell me if I am crazy.
As mentioned, we are doing some more performance work in Voron, and we got some really surprising results. Voron is writing at a really good rate (better than anything else we tested against), just not a good enough rate.
To be fair, if we hadn’t seen the Esent benchmark with close to 750k writes / second, we might have been happy. But obviously it is possible to be much faster than we are now, so I decided to figure it out.
To start with, I ran Voron through a profiler and verified that the actual cost was almost entirely in calling FlushFileBuffers (the Windows equivalent of fsync). In fact, in our tests, about 75% of the time was spent just calling this function. The test in question does 1 million inserts, using 10,000 transactions of 100 items each. But Esent can basically do so many that it doesn’t even count. So how do they do that?
I’m going to dedicate this post to discussing the process for finding it out, then spend the next one or two discussing the implications. At this level, we can’t really use something like a profiler to figure out what is wrong, we need a more dedicated tool. And in this case, we are talking about Process Monitor. It gives you the ability to see what system calls are being made on your system.
Here is what it looks like when we are committing a transaction with Voron:
And here is what it looks like when we are committing a transaction with Esent:
I was curious to test SQL Server too, and here is what it looks like when SQL Server is committing a transaction:
And if I’m already doing this, here is SQL CE transaction commit:
No, this isn’t a mistake. It didn’t do anything. By default, SQL CE only flushes to memory. You have to force it to flush to disk by using tx.Commit(CommitMode.Immediate); if you do that, the transaction commit looks like this:
Not a mistake, you still get nothing. It appears that even with Immediate, it is only writing to disk when it feels like it. At a guess, it is using memory mapped files and doing FlushViewOfFile, instead of calling FlushFileBuffers, but I am not really sure.
Since I ran the benchmarks without Immediate, I decided to try running the SQL CE tests there again. Here are the numbers:
This brings to mind an interesting question: what the hell is it doing that takes so long if it doesn’t even flush to disk?
Anyway, let us look at the SQLite version:
And… I don’t really know how to comment on that, to tell you the truth. I can’t figure out what it is doing, and I probably don’t really want to.
Now, let us look at LMDB:
I am not really sure how to explain the amount of work done here. I think that this is because it uses manual file I/O by default. When I use the WriteMap option, I get:
Which is more reasonable and expected.
I would have shown leveldb as well, but I can’t run it on Windows.
I think that this is enough for now. I’ll discuss the implications of the difference in behavior in my next post. In the meantime, I would love to know what you think about this.
After finishing up the major change of moving Voron to a Write Ahead Journal, it was time to actually start doing some performance testing.
To make things interesting, I decided that we shouldn’t just compare this in isolation, but we should actually compare it to its peers.
Those are early results, and we are going to have to do a lot more work to make sure that everything works faster.
We have run those tests on the following machine:
All the tests were run on a freshly formatted 512GB SSD drive. Note that we are currently showing only the fast runs; we also have a set of tests for much larger data sets (tens of GB) and another for performance over time, but we will deal with those separately. All of the current tests write 1 million items, each consisting of a 4 byte integer key and a 128 byte value.
We have tested: SQLite, SQL CE, LMDB, Esent and Voron.
For LMDB, because it needed a fixed file size, we set the initial file size to be 64 GB. All the databases were run using the default configuration options, no secondary indexes were used. All the tests were done using a single thread.
Note that in all cases we used managed code to run the test. This may impact some of the results because some of those engines are native, and there might be some overhead there.
The first test was to see how it performs with sequential writes:
Esent really shines in this, probably because this is pretty much the sweet spot for it. Voron is the second best, but the reason that we do those sorts of tests is to see where we have problems, and I think that we have one here: we are supposed to be doing much better. In fact, we have earlier tests that show much better performance, so we appear to have a regression. We’ll work on that next.
Next, let us look at sequential reads:
Here, LMDB eclipses everyone else by far, this is its sweet spot. I am pretty happy about Voron’s performance here, especially since it appears to be close to twice as fast as Esent is for this scenario.
Next, we have random writes:
Surprisingly, Voron is doing pretty badly here, even though it is doing much better than LMDB (random writes are LMDB’s weak spot) or SQLite.
For random reads, however, the situation is nicer to us:
So, we have our baseline. And I want to see how we can do better. Expect the future posts to focus on what exactly is slowing our writes down.
In the meantime, we do have some really good news, we tested Voron with and without concurrent flushing to the data file, and there isn’t any meaningful difference between the performance of the two options in our current test run.
So far, we have shown the following features:
- HTML5 Studio
- OWIN / Web API based infrastructure.
- Native Java API
I think that you can guess that RavenDB running on Voron is one of the other remaining features, but there are two more in the pipeline that we haven’t talked about…
Can you guess what they are?