Ayende @ Rahien


Getting 600 times better performance

time to read 1 min | 73 words

During the design process for the next release of RavenDB, we set ourselves a pretty crazy goal. We wanted to get a tenfold performance improvement across the board…

This is how my article about gaining a 600x performance improvement in the new DZone Guide - Performance: Optimization & Monitoring starts. For the rest, head there and read it all.

What did all this optimization give us?

time to read 2 min | 332 words

I’ve been writing a lot about performance and optimizations, and mostly I’m giving out percentages, because it is useful to compare against the state before the optimizations.

But when you start looking at the raw numbers, you see a whole different picture.

On the left, we have RavenDB 4.0 doing work (import & indexing) over about 4.5 million documents. On the right, you have RavenDB 3.5, doing the same exact work.

We are tracking allocations here, as part of the work we have been doing to measure our relative change in costs. In particular, we focused on the cost of using strings.

A typical application will use about 30% of memory just for strings, and you can see that RavenDB 3.5 (on the right) is no different.

image

On the other hand, RavenDB 4.0 is using just 2.4% of its memory for strings. But what is even more interesting is to look at the total allocations. RavenDB 3.5 allocated about 300 GB to deal with the workload, and RavenDB 4.0 allocated about 32 GB.

image

Note that those are allocations, not total memory used, but the gap shows up on just about every metric. Take a look at those numbers:

image

RavenDB 4.0 is spending less time overall in GC than RavenDB 3.5 will spend just on blocking collections.

Amusingly enough, here are the saved profile runs:

image

Fast Dictionary and struct generic arguments

time to read 2 min | 281 words

One of the most common issues that come up with performance tuning is that dictionaries are expensive. It isn’t so much that a single dictionary lookup is expensive, it is the sheer number of them. Dictionaries are used everywhere, and they are often used in very hot codepaths (as caching).

Numerous times we have dealt with that by trying to avoid the dictionary access (often favoring an array based lookup if we can get away with it). But at some point we decided to implement our own dictionary. Here is what it looks like:

image

The actual dictionary impl is very close to the standard one, but that isn’t what makes it fast. Note the generic argument? If we pass a struct implementing IEqualityComparer as the generic argument, then in most cases the compiler and the JIT are going to generate code that is able to eliminate all virtual calls. And if the equality comparison is trivial, that means you can eliminate the method calls entirely and inline the whole thing inside that generic dictionary implementation.

In other words, we eliminate a minimum of two virtual calls per key lookup, and in some cases we can eliminate even the method calls themselves, and that turns out to be quite important when the number of key lookups is in the billions.
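To make that concrete, here is a minimal sketch of the pattern (the types and names here are mine, not the actual RavenDB implementation): a struct comparer passed as a generic argument lets the JIT specialize the code for that exact comparer, so the Equals call becomes a direct, inlinable call instead of an interface dispatch.

```csharp
using System;
using System.Collections.Generic;

public struct LongComparer : IEqualityComparer<long>
{
    public bool Equals(long x, long y) => x == y;
    public int GetHashCode(long value) => value.GetHashCode();
}

// Constraining TComparer to a struct means the JIT generates a specialized copy
// of this type per comparer, so Equals/GetHashCode become direct calls that can
// be inlined, instead of virtual calls through IEqualityComparer<TKey>.
public class StructComparerSet<TKey, TComparer>
    where TComparer : struct, IEqualityComparer<TKey>
{
    private readonly TComparer _comparer = default(TComparer);
    private readonly TKey[] _keys;

    public StructComparerSet(TKey[] keys)
    {
        _keys = keys;
    }

    public bool Contains(TKey key)
    {
        for (var i = 0; i < _keys.Length; i++)
        {
            // No interface dispatch here: the call target is known at JIT time.
            if (_comparer.Equals(_keys[i], key))
                return true;
        }
        return false;
    }
}

class Program
{
    static void Main()
    {
        var set = new StructComparerSet<long, LongComparer>(new long[] { 1, 2, 3 });
        Console.WriteLine(set.Contains(2)); // True
    }
}
```

The same trick doesn't help if you pass the comparer as an interface reference or as a class: the struct constraint is what forces the JIT to emit a dedicated, devirtualized body.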

image

And this is from midway through the optimizations.

1st class patching in RavenDB 4.0

time to read 2 min | 317 words

One of the most common operations in RavenDB is to load a document, make a simple change and save it back. Usually, we tell users to just rely on the change tracking on the session and just save the document, but while it is the easiest way, it isn’t always the best. If I have a large document, I might not want to send it all the way back to the server just for a small change. That is why RavenDB has had a Patch operation for a long time. But while we had this feature, it was always a bit clumsy. It either required you to build a patch request using a somewhat cryptic and limited object graph or to write your own inline JavaScript to make the modifications.

With RavenDB 4.0, we are introducing patching as a first class concept, baked directly into the session, for example:
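Something along these lines (a sketch only; the overloads, type names and document id here are from memory and assumptions, so they may not match the final 4.0 client exactly):

```csharp
using System;
using Raven.Client.Documents;

public class BlogPost
{
    public string Id { get; set; }
    public DateTime LastModified { get; set; }
}

class Example
{
    static void Main()
    {
        using (var store = new DocumentStore { Urls = new[] { "http://localhost:8080" }, Database = "Blog" }.Initialize())
        using (var session = store.OpenSession())
        {
            // Queue a patch for a single property; the document is never loaded
            // or sent back over the network.
            session.Advanced.Patch<BlogPost, DateTime>(
                "blogposts/1",
                post => post.LastModified,
                DateTime.UtcNow);

            session.SaveChanges(); // the patch request is sent here
        }
    }
}
```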

In this case, we’ll send the server a request to update the last modified date when the SaveChanges method is called. The syntax is not all that I wish it could be, but we have to operate within the limitations of what LINQ syntax can accept.

A more interesting case is when you want to use patching not to reduce the network load, but to allow multiple concurrent operations on the document. Let us consider the case of adding a comment to this blog post. It is fine if two users post a comment at the same time, and we can express that using:
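Roughly like this (again a sketch that extends the BlogPost class above with an assumed List<string> Comments property; the collection patching overload shown is my assumption about the API shape):

```csharp
using (var session = store.OpenSession())
{
    // Two clients can both run this at the same time; each call turns into a
    // JavaScript patch that appends to the array on the server, so neither
    // overwrites the other's comment.
    session.Advanced.Patch<BlogPost, string>(
        "blogposts/1",
        post => post.Comments,
        comments => comments.Add("Great post!"));

    session.SaveChanges();
}
```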

This gives us an easy way to express such things and exposes a RavenDB capability that too few users have taken advantage of. Behind the scenes, the LINQ expressions are turned into JavaScript patches, which are then used to send just the right commands for the server to work with.

It’s a really cool feature, even if I do say so myself.

Data checksums in Voron

time to read 7 min | 1285 words

Every time that I think about this feature, I am reminded of this song. This is a feature that is only ever going to be used when everything fails. In fact, it isn’t a feature, it is an early warning system, whose sole purpose is to tell you when you are screwed.

Checksums in RavenDB (actually, Voron, but for the purpose of discussion, there isn’t much difference) are meant to detect when the hardware has done something bad. We told it to save a particular set of data, and it didn’t do it properly, even though it had been very eager to tell us that this can never happen.

The concept of a checksum is pretty simple: whenever we write a page to disk, we’ll hash it and store the hash in the page. When we read the page from disk, we’ll check if the hash matches the actual data that we read. If not, there is a serious error. It is important to note that this isn’t actually related to the way we recover from failures midway through transactions.
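In sketch form, the idea is roughly this (not Voron's actual page layout or hash function; FNV-1a is used here only because it fits in a few lines):

```csharp
using System;

// The first 8 bytes of each page hold a checksum over the remaining bytes.
public static class PageChecksum
{
    private const int ChecksumSize = sizeof(ulong);

    private static ulong Hash(byte[] page, int offset, int count)
    {
        ulong hash = 14695981039346656037UL; // FNV-1a, stand-in for the real hash
        for (var i = offset; i < offset + count; i++)
        {
            hash ^= page[i];
            hash *= 1099511628211UL;
        }
        return hash;
    }

    public static void WriteChecksum(byte[] page)
    {
        var hash = Hash(page, ChecksumSize, page.Length - ChecksumSize);
        BitConverter.GetBytes(hash).CopyTo(page, 0);
    }

    public static bool Validate(byte[] page)
    {
        var stored = BitConverter.ToUInt64(page, 0);
        var actual = Hash(page, ChecksumSize, page.Length - ChecksumSize);
        return stored == actual; // mismatch means the disk gave back something we never wrote
    }
}
```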

That is handled by the journal, and the journal is also protected by a checksum, on a per transaction basis. However, this sort of error is both expected and well handled. We know where the data is likely to fail and we know why, and we have the information required (in the journal) to recover from it.

This is different: this is validating that data that we have successfully written to disk, and flushed successfully, is actually still resident in the form that we expect. Corruption can happen because the hardware outright lied to us (which usually happens with cheap hardware) or because of some failure (cosmic rays are just one of the many options that you can run into). In particular, if you are running on crappy hardware, this can be caused simply by overheating or too much load on the system. As a hint, another name for crappy hardware is a cloud machine.

There are all sorts of ways that this can happen, and the literature makes for very sad reading. In a CERN study, about 900 TB were written in the course of six months, and about 180 MB resulted in errors.

The following images are from a NetApp study showing that over a time period of 2.5 years, 8.5% of disks had silent data corruption errors. You can assume that those are not cheap off the shelf disks. Some of the causes are great reading if you are a fan of mysteries and puzzles, but kind of depressing if you build databases for a living (or rely on databases in general).

image

Those are just the failures that had interesting images, mind, there are a lot more there. But from the point of view of the poor database, it ends up being the same thing. The hardware lied to me. And there is very little that a database can do to protect itself against such errors.

Actually, that is a lie. There is a lot that a database can do to protect itself. It used to be common to store critical pages in multiple locations on disks (usually making sure that they are physically far away from one another), as a way to reduce the impact of the inevitable data corruption. This way, things like the pages that describe where all the rest of the data in the system reside tend to be safe from most common errors, and you can at least recover a bit.

As you probably guessed, Voron does checksums, but it doesn’t bother to duplicate information. That is already something that is handled by RavenDB itself. Most of the storage systems that deal with data duplication (ZFS has this notion with the copies property, for example) were typically designed to work primarily on a single node (such as file systems that don’t have distribution capabilities). Given that RavenDB replication already does this kind of work for us, there is no point duplicating it at the storage layer. Instead, the checksum feature is meant to detect a data corruption error and abort any future work on suspect data.

In a typical cluster, this will generate an error on access, and the node can be taken down and repaired from a replica. This serves as both an early warning system and as a way to make sure that a single data corruption in one location doesn’t “infect” other locations in the database, or worse, across the network.

So now that I have written oh so much about what this feature is, let us talk a bit about what it is actually doing. Typically, a database would validate the checksum whenever it reads the data from disk, and then trust the data in memory (it isn’t really safe to do that either, but let us not pull up the research on that, otherwise you’ll be reading the next post on papyrus) as long as it resides in its buffer pool.

This is simple, easy and reduces the number of validations you need to do. But Voron doesn’t work in this manner. Instead, Voron maps the entire file into memory and accesses it directly. We don’t have a concept of reading from disk, or a buffer pool to manage. Instead of doing the OS’ work, we assume that it can do what it is supposed to do and concentrate on other things. But it does mean that we don’t control when the data is loaded from disk. Technically speaking, we could have tried to hook into the page fault mechanism and do the checks there, but that is so far outside my comfort zone that it gives me the shivers. “Wanna run my database? Sure, just install this rootkit and we can now operate properly.”

I’m sure that this would be a database administrator’s dream. I mean, sure, I can package that in a container, and then nobody would probably mind, but… the insanity has to stop somewhere.

Another option would be to validate the checksum on every read. That is possible, and quite easy, but it is going to incur a substantial performance penalty just to ensure that something that shouldn’t happen didn’t happen. Doesn’t seem like a good tradeoff to me.

What we do instead is make the best of it. We keep a bitmap of all the pages in the data file, and we’ll validate them the first time that we access them (there is a bit of complexity here regarding concurrent access, but we are racing it to success and at worst we’ll end up validating the page multiple times), and afterward, we know that we don’t need to do that again. Once we have loaded the data into memory even once, we assume that it isn’t going to change beneath our feet. This isn’t an axiom: there are situations where a page can be loaded from disk, valid, and then become corrupted on disk. The OS will discard it at some point, and then read the corrupt data again, but this is a much rarer circumstance than before.
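A sketch of that "validate on first access" idea, reusing the PageChecksum helper from the sketch above (the real code has much more to handle, but the racing-to-success part looks roughly like this):

```csharp
using System;
using System.Threading;

public class PageValidationTracker
{
    private readonly int[] _validated; // one bit per page; a set bit means "already checked"

    public PageValidationTracker(long numberOfPages)
    {
        _validated = new int[(int)((numberOfPages + 31) / 32)];
    }

    public void EnsureValidated(long pageNumber, byte[] page)
    {
        var index = (int)(pageNumber / 32);
        var mask = 1 << (int)(pageNumber % 32);

        // Already validated by us or by a concurrent reader? Then we trust it.
        if ((Volatile.Read(ref _validated[index]) & mask) != 0)
            return;

        if (PageChecksum.Validate(page) == false)
            throw new InvalidOperationException(
                $"Checksum mismatch on page {pageNumber}, the data file is corrupted");

        // Racing to success: two threads may both validate the same page, which
        // wastes a little work but is never incorrect.
        int current;
        do
        {
            current = Volatile.Read(ref _validated[index]);
        } while (Interlocked.CompareExchange(ref _validated[index], current | mask, current) != current);
    }
}
```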

The fact that we have recently verified that the page is valid is a good indication that it will remain valid, and anything else has too much overhead for us to be able to use (and remember that we also have those replicas for those extremely rare cases).

Independent of this post, I just found this article, which injected errors into the data of multiple databases and examined how they behaved. Fascinating reading.

The occasionally failing test

time to read 2 min | 211 words

This piece of code is part of a test that runs a scenario, and checks that the appropriate errors are logged. Very occasionally, this test would fail, and it would be nearly impossible to figure out why.

I’ve extracted the offending code from the test. ReadFromWebSocket returns a Task<string>, since that isn’t obvious from the code. Do you see the error?

image

Think about it, what is this doing?

This is reading from a socket, and there is absolutely no guarantee about the amount of data that will go over the wire at any particular point. Because this test assumes that the entire string will be read in a single call from the socket, if the expected value we are looking for is actually returned across two calls, we’ll miss it, and this will never return, leading to sadness and worry everywhere*.

* At least everywhere that cares about our tests.

The fix was to remember the previous values and compare against all the data read from the socket so far, not just the values that were returned in the last call.
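The shape of the fix is roughly this (a sketch; readFromWebSocket stands in for the test's actual helper, which returns whatever happened to arrive on the socket):

```csharp
using System;
using System.Text;
using System.Threading.Tasks;

public static class SocketAssertions
{
    // Keep accumulating reads and search everything read so far, instead of
    // assuming the expected value arrives in a single read.
    public static async Task<bool> WaitForMessage(
        Func<Task<string>> readFromWebSocket, string expected, int maxReads = 100)
    {
        var accumulated = new StringBuilder();
        for (var i = 0; i < maxReads; i++)
        {
            accumulated.Append(await readFromWebSocket());
            if (accumulated.ToString().Contains(expected))
                return true;
        }
        return false; // give up instead of hanging the test forever
    }
}
```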

Big data work on 32 bits

time to read 7 min | 1326 words

Ever put on a pair of shoes that were just a bit too small for you? You think that it would be fine, but then time goes by, and it pinches. And then it hurts, and then you just can’t walk any longer.

That is how it feels to work with any reasonable amount of data (even low hundreds of MB) in 32 bits. The virtual address space is so limited that it is incredibly easy to just run out due to address space fragmentation and fail, and there is very little that you can actually do to recover from such a scenario. We have been working on making RavenDB 4.0 work reliably in 32 bits mode*, and it has been one PITA after another. In particular, Voron was quite explicitly designed for an environment where it is hard to impossible to run out of address space, and our typical modus operandi is to map the entire data file (which can be hundreds of GB in size) as a single continuous memory region.

* And yes, I can’t imagine how much of a PITA it was to run in 16 bits. When I programmed on those sorts of platforms, I never actually built anything that needed that much memory (I was in middle school at the time, IIRC).

That allows us to do all sorts of tricks and optimizations, and it is utterly impossible to do on 32 bits. In fact, on most 32 bits systems, after the process has been running for a while, just doing a single 256MB allocation is probably going to fail. You might have that much memory free, but you don’t have it in a continuous run. So we had to make drastic changes. At the same time, 32 bits is an important requirement, but mostly it is on the sidelines: I want to support it, and I’m willing to put in the effort to get there, but I’m not willing to pay for it if it costs me when running outside of 32 bits mode.

In other words, anything that would have a drastic impact on the “normal” mode of running in 64 bits was out, and we didn’t want to re-architect the whole thing. Luckily, we already had a place to put the different behavior. In Voron, the Pager is the entity that is responsible for mapping the file to memory and handing out pointers into the file based on the page number asked for. That meant that we had a well defined interface:

image

Because the file can grow, and the pager might hold multiple overlapping maps to the same file, we only allow pointers to the data file in the scope of a transaction. That ended up being a very good thing, since this enabled us to build a pager that could work in 32 bits mode. Instead of mapping the entire file, we’ll map just the pages that are required, and only for the duration of the transaction, after which we can unmap everything.  The idea is that we’ll only have the data that we are actually using mapped, so we’ll save a lot of address space.

That was the theory, at least, in practice, we run into several interesting issues.

  • We want to reduce the number of system calls we make, and the allocation granularity on Windows is 64KB, so we always need to map at least that much, instead of mapping one page at a time.
  • A value may actually be longer than 64KB, or not be aligned on 64KB boundary.
  • Transactions are concurrent, and they typically need to access the same areas (the top branches in the B+Trees we use, for example).

The end result is that we map in multiples of 64KB (on both Windows & Linux), and we check, based on the actual data, whether we need to remap if the data is longer than what is allocated in the current block. We are also sharing all the maps among all the concurrent transactions, to reduce the total amount of virtual address space we are using. This is a bit hard to do concurrently & safely, so we are racing it. In most cases, we’ll have a single mapping, but it is fine to map the same section twice from different transactions (there can only ever be a single write transaction, so all the others are just reading, and never from the same memory that the write tx is writing to). The alternative would have been to use more invasive locking, and the performance cost isn’t worth it.
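The arithmetic involved is simple enough; here is a sketch (the page size and the exact rounding are assumptions for illustration, not the pager's real code):

```csharp
using System;

public static class MapMath
{
    private const long PageSize = 8 * 1024;               // assumed page size for the example
    private const long AllocationGranularity = 64 * 1024; // Windows allocation granularity

    // Given a page number and the number of bytes actually needed, return the
    // file offset and length of the region to map, rounded out to 64KB.
    public static (long Offset, long Size) GetMapRange(long pageNumber, long sizeInBytes)
    {
        var start = pageNumber * PageSize;

        // Round the start down to a 64KB boundary...
        var offset = (start / AllocationGranularity) * AllocationGranularity;

        // ...and the end up, so a value longer than 64KB, or one that straddles
        // a 64KB boundary, still ends up fully mapped in one go.
        var end = start + Math.Max(sizeInBytes, PageSize);
        var size = (end - offset + AllocationGranularity - 1)
                   / AllocationGranularity * AllocationGranularity;

        return (offset, size);
    }
}
```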

Once we got this working, we were most of the way there, but there were still a lot of optimizations that we had to reverse. For example, in 64 bits mode, it makes a lot of sense to pre-allocate data in advance to maintain locality, and as the data we dealt with became larger, so would our pre-allocations. But those would end up being around 32MB of data that we pre-allocate (so we need to prepare and touch all of it), and under load it was actually one of the more common causes of failure due to fragmentation of the address space, because it would allocate in steps (1MB, 2MB, 4MB, 8MB, 16MB, 32MB) and we had many such operations cutting the address space into fine bits.

Another location that caused us problems was indexes, where we allocated a 16MB range for a bloom filter. That made perfect sense to optimize things in 64 bits, but in 32 bits mode it is a huge chunk of my address space that I’m giving up. In both cases, we drastically reduced the sizes involved (to 64KB in the case of the bloom filter, and a max pre-allocation of 1 MB).

Another thing that was actually in our favor is that our memory usage is already policed quite carefully. We didn’t do that intentionally for 32 bits, we did it because memory allocation is so expensive, but because we rarely ask for memory from the OS and typically reuse buffers, we were pretty much already there in terms of proper behavior of the system in 32 bits mode. The CLR is also pretty good about allocating large sections from the OS and being able to move memory around to avoid internal fragmentation.

Overall, it works. We have tested that on the Stack Overflow dataset, and it is quite amazing to see it (you can see it chugging along, using anything between 300 MB and 600 MB as it is inserting 60+ GB of data). It is obviously less efficient than the 64 bits version, but given that we are so much faster in 4.0, I don’t think that anyone will actually notice, and if they do… Well, that is why we have hardware built in this decade.

There were other places where we had to take the 32 bits nature of the system into account and actively undermine previous optimizations. Instead of allowing transactions to merge as much as possible to reduce I/O, we placed a hard limit on how much data can be accessed or modified by a transaction before we force it to commit. The same goes for indexing: instead of letting batches run as long as we would usually like them to, address space considerations force us to reduce the batch size to account for how much address space is used in the current batch, and close it early (freeing up the address space for other uses).

The overall process is very unpleasant, because things that I would consider to be obvious and easy are suddenly so much harder than I would expect them to be.

When fsync fails

time to read 3 min | 593 words

I/O is a strange beast: it is slow, ponderous and prone to all sorts of madness. In particular, there is no correlation between when you issue an operation and when it will actually reach its destination.

Case in point, this StackOverflow question, which describes a failure that led to data corruption in a database (from context, it seems to be PostgreSQL). The basic problem seems to be pretty simple: fsync can fail, which is fine, but the problem is what happens when it fails.

This is a much more interesting story, and you can read a deep dive into the Linux Kernel source code to figure out the exact behavior.

But I’m actually going to take a couple of steps higher in the system to talk about this issue. I/O is so slow that an I/O call is effectively a queueing operation, with fsync serving as the “I’ll wait until all the previous I/O has completed” call.

So what happens if fsync fails? That can happen for any number of reasons, several of which are actually transient. Ideally, I would like to get an error saying “try again later, might work” or “the world has ended, nothing can be done”. I would like that, sure, but putting myself in the shoes of the dev writing fsync, I can’t see how it can be done. So effectively, if fsync failed, it says “I have no idea what the state of the previous writes is, and I can’t figure it out, you are on your own”. Note that calling fsync again in this case means “ensure that all the writes since the previous (failed) fsync are persisted”, and doesn’t help you at all to avoid data corruption.

An extremely short trawl through the PostgreSQL codebase gave me at least one case where they are ignoring the fsync return value. I’m not sure how important that case is (flushing of the PGDATA directory), but it doesn’t seem minor.

Now, here is the deal: if you are a database with a Write Ahead Log, this isn’t actually all that hard to resolve, since you already have a way to replay all your writes. This is annoying, but it is perfectly recoverable. With standard applications?

A lot of applications are trying to use fsync to ensure that the data has been properly persisted, but if fsync returns an error, there is pretty much nothing that an application can do, and in most cases, you don’t really even have a way to recover.

After seeing this post I went and checked what Voron’s behavior (and hence RavenDB’s) would be in such a case. If fsync fails, we treat it as a catastrophic error. This sounds really scary, but it basically means that we detected a deviation between the in memory state and the persisted state of the database, and we are going to shut down the database in question so we can run full recovery and ensure that we are running in a consistent state. This is pretty much the only thing we can do, because otherwise we are risking data corruption.
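In outline, that policy looks something like this (a sketch of the behavior, not the actual Voron code):

```csharp
using System;
using System.IO;

public class DataFile : IDisposable
{
    private readonly FileStream _stream;
    private Exception _catastrophicFailure;

    public DataFile(string path)
    {
        _stream = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
    }

    public void Sync()
    {
        if (_catastrophicFailure != null)
            throw new InvalidOperationException("The data file is in a failed state", _catastrophicFailure);

        try
        {
            // Flush(true) asks the OS to flush its buffers all the way to the
            // device (FlushFileBuffers on Windows, fsync on Linux).
            _stream.Flush(flushToDisk: true);
        }
        catch (IOException e)
        {
            // We no longer know which writes actually made it to disk, so refuse
            // any further work; recovery will replay the journal on restart.
            _catastrophicFailure = e;
            throw;
        }
    }

    public void Dispose() => _stream.Dispose();
}
```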

This case will result in an interruption of service, since the database will need to replay the journal to ensure that everything matches, but if the error is transient, it is likely it will just work. And if it isn’t a transient error, well, that is what you have admins for.

The previously sorted algorithm

time to read 4 min | 775 words

The performance team noticed that a particular scenario (importing a large number of documents) was spending a lot of time sorting data. The finger was pointed directly at this guy:

image

That is 11.3% of the total operation time, in an area that is already extremely optimized. Before I go forward and explain what we did, I need to explain what is going on here. Consider the following JSON document (taken from the JSON entry on Wikipedia):
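Something along these lines (reproduced from memory, so the exact values may differ from the Wikipedia sample):

```json
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "fax", "number": "646 555-4567" }
  ]
}
```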

 

RavenDB has progressed far beyond storing JSON as text. I have a 7 part series of posts that talks about how we store things. But the basic idea is that instead of storing the data as text, we store it in a well crafted binary format that is optimized for reading parts of the document without having to do any work. In other words, if I wanted to know the type of the second phone number in this document, I wouldn’t have to parse it all. I would go to the root document’s table, find the ‘phoneNumbers’ position, jump to that, then jump to the second value, then find the ‘type’ entry in that object’s table, and then I’ll have the value.

Those searches inside the table require us to have the property names sorted, so we can do a binary search on them. Binary search is really good, giving us an efficiency of O(log N). But we actually do a lot better than that, with an amortized cost of O(1). How do we do that? By exploiting a very interesting aspect of the data: we keep seeing the same (or very similar) documents, so we can learn from previous searches and start the search for a property name at the position where we previously found it. This tends to have a high probability of success and adds at most a single operation, so it is a very successful optimization.
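A sketch of that trick, with hypothetical names and a much simpler structure than the real property table; the point is only the "check where the last lookup landed before searching" part:

```csharp
using System;

public class CachedPropertyLookup
{
    private readonly int[] _sortedPropertyIds; // the object's properties, sorted by id
    private int _lastPosition;

    public CachedPropertyLookup(int[] sortedPropertyIds)
    {
        _sortedPropertyIds = sortedPropertyIds;
    }

    public int Find(int propertyId)
    {
        // Documents tend to repeat the same shape, so the position where the
        // previous lookup landed is the most likely answer; check it first.
        if (_lastPosition < _sortedPropertyIds.Length &&
            _sortedPropertyIds[_lastPosition] == propertyId)
            return _lastPosition;

        // Fall back to the regular binary search and remember where we landed.
        var index = Array.BinarySearch(_sortedPropertyIds, propertyId);
        if (index >= 0)
            _lastPosition = index;
        return index; // negative if the property isn't present
    }
}
```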

But in order to do that, we need to make sure that the data is sorted. And as you can see in the profiler trace above, this is decidedly non trivial. In fact, we pay over 11% (!) of our JSON parsing time just in sorting those tables. That is crazy. We considered switching to a tailored sort algorithm, but this part was already heavily optimized: instead of sorting by strings, we pre-sort all seen values, then sort on their previously calculated order. That means that we don’t need to do a lot of string comparisons, but this still killed us.

Then I remembered that we aren’t trying to optimize a single run. Instead, we are trying to optimize the entire process, and given that observation, we can utilize that knowledge. The basic idea is simple: we assume that most of the time we are going to see similar data, so we cache the sort order. The next time we get an object to sort, we can check if we have sorted this sequence of properties before, and just use that sort. This turns the operation into going over the list of properties a few times, with very little actual code and behavior. The details are a bit gnarly, of course, because we have a lot of stuff to deal with (anything from duplicate properties to dynamic changes in property ids between documents to deciding how much we should cache and how we should do it), but we got it working.
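A sketch of the caching idea (property ids stand in for the interned property names, and none of the gnarly details above are handled):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class PropertySortCache
{
    private readonly Dictionary<string, int[]> _cache = new Dictionary<string, int[]>();

    // Given the property ids of an object in document order, return the indexes
    // that would put them in sorted order, reusing a previous answer if we have
    // already seen this exact sequence.
    public int[] GetSortedOrder(int[] propertyIds)
    {
        var key = string.Join(",", propertyIds);
        if (_cache.TryGetValue(key, out var cached))
            return cached; // the common case: same shape of document as before

        var order = Enumerable.Range(0, propertyIds.Length)
            .OrderBy(i => propertyIds[i])
            .ToArray();

        _cache[key] = order;
        return order;
    }
}
```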

The result is beautiful:

image

It is even more impressive if you look at the performance comparison directly:

image

Total runtime dropped by close to 3 seconds, and we saved our 11% back with interest. Overall we are 13% faster.

This is us testing on production data, which has a high degree of real world issues, because the data has been through multiple releases and mutations.

But the key here is that this change actually improved all of RavenDB’s JSON reading capabilities. As you might imagine with a JSON Document DB, we tend to do quite a lot of JSON parsing, and all of that just got an amazing performance boost.
