Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,319 | Comments: 46,927

filter by tags archive

Low level Voron optimizationsRecyclers do it over and over again.

time to read 5 min | 884 words

One of the key rules in optimization work is that you want to avoid work as much as possible. In fact, any time that you can avoid doing work that is a great help to the entire system. You can do that with caching, buffering, pooling or many other such common patterns.

With Voron, one of our most common costs is related to writing to files. We are doing quite a lot of work around optimizing that, but in the end, this is file I/O and it is costly.

A big reduction in the cost of doing such I/O is to pre-allocate the journal files. That means that instead of each write extending the file, we ask the operation system to allocate it to its full expected size upfront. This saves time and also ensures that the OS has a chance to allocate the entire file in as few fragments as it possible can.

However, كل كلب له يومه (every dog has its day), and eventually a journal has outlived its usefulness, which means that it is time to make a hotdog. Or, as the case may be, delete the now useless journal file.

Of course, eventually the current journal file will be full, and we’ll need a new journal file, in which case we’ll ask the OS to allocate us a new one, and pay the cost of doing all of this I/O and the cost of file allocations.

Hm… that seems pretty stupid, isn’t it, when you think about the whole system like that…

Instead we now reuse those journals. We rely on the fact that file rename is atomic in both Windows and Posix, and so we can avoid expensive allocation calls and reuse the buffers.

Here is what this looks like, when doing heavy writes benchmark:

image

It is important to note that we also have to do some management here (to only keep pending journals for a period of time if they aren’t being used) but also need to handle a very strange case. Because we are now reusing a valid journal file, we now have a case where we might read valid transactions, but ones that are obsolete. This means that we need to be aware that beyond just garbage, we might have to encounter some valid data that is actually invalid. That made us tighten our journal validation routine by quite a bit. 

There is also another advantage of this approach is that this also plays very well with the underlying hardware. The reuse of the already allocated files means that the disk has to do a lot less work, it reduces fragmentation and it allows much faster responses overall. According to research papers, the difference can be a factor of 4 difference on modern SSD drives. This is a really good thing, since this means that this approach has wide applicability across mass storage devices (SSD, HDD, etc). I actually had a meeting with a storage company to better understand the low level details of how a disk manages the bits, and some of this behavior is influenced by those discussions.

I’m ignoring a lot of previous work that we have done around that (aligned writes, fixed sizes, pre-allocation, etc) of course, and just focusing on the new stuff.

Some of that only applies to that particular manufacturer disks, but a lot of that has broader applicability. In short, the idea is that if we can keep the amount of writes we do to a few hot spots, the disk can recognize that and organize things so this would be optimized. You can read a bit more about this here, where it discusses the notion of multiple internal storage tiers inside a disk. The idea is that we provide the disk with an easily recognizable pattern of work that it can optimize. We looked at using the disk low level options to tell it directly what we expect from it, but that is both hard to do and will only work in specific brand of disks. In particular, with cloud storage, it is very common to just lose all such notions of being able to pass hints to the disk itself, even while the underlying storage could handle it. (In the previous presentation, this is call I/O tagging and latency / priority hints).

Instead, by intentionally formatting our I/O in easily recognizable pattern, we have much higher applicability and ensure that the Right Thing will happen. Sequential writes, in particular (the exact case for journals) will typically hit a non volatile buffer and stay there for a while, letting the disk optimize its I/O behavior even further.

Another good read on this is here, where it talks about StableBuffer (you can ignore all the other stuff about decomposing and reoredering I/O), just the metrics about how much a focused write like that can help is very good.

Other resource also indicate that this is an optimal data access pattern, preserving the most juice from the drive and giving us the best possible performance.

Low level Voron optimizationsHigh data locality

time to read 3 min | 592 words

After talking about increasing the Voron page size, let us talk about another very important optimization. High data locality. The importance of locality comes up again and again in performance.The cost of getting the next bit of data can be so prohibitedly expensive that it dominates everything else, including standard Big O time complexity metrics. A fun discussion of that is here.

Remember that Voron actually stores all data in pages, and that means that it needs some way to allocate new pages. And by default, whenever you allocate a page, we use a page from the end of the file. In certain scenarios (pure sequential inserts), that generates some pretty good allocation pattern, but even there it can cause issues. Let us consider what the database file looks like after a while:

image

Imagine that the green sections are all pages that belong to the same B+Tree inside Voron. Traversing the B+Tree now means that we have a very high probability of having to jump around in the file a lot. Since we are memory mapped, we wouldn’t typically feel this, since we aren’t actually hitting the disk that often, but it has several interesting implications:

  • Startup time can increase rapidly, since we need to issue many I/O requests to different places in the file
  • Flush / sync time is also increased, because it need to touch more of the disk

Trees are typically used for indexes in Voron, and a document collection would typically have a few different storage indexes (lookup by etag, lookup by name, etc). Because they store different data, they have different growth pattern, so they are going to allocate pages at different rate, which means that the scattering of the pages across the data file is even more sever.

The change we just finished implementing is going to do several important things all at once:

  • Pages for all the storage indexes of a collection are going to be pre-allocated, and when they run out, be allocated again in batches.
  • The indexes will ask the storage to allocate pages nearby the sibling page, to increase locality even further.
  • All indexes will use the same pre-allocation buffer, so they all reside in roughly the same place.

That also give us some really interesting optimizations opportunities. Since indexes are typically order of magnitude smaller than the data they cover, it is possible to ask the operation system to prefetch the sections that we reserved for indexes for each collection in advance, leading to far less paging in the future and improving the startup time.

It also means that the operation system can issue a lot more continuous reads and writes, which is perfectly in line with what we want.

The new allocation strategy ends up looking like this:

image

In this case, we have enough data to fill the first pre-allocated section, and then we allocate a new one. So instead of 4 operations to load things, we can do this in 2.

Even without explicit prefetching on our end, this is going to be great because the operating system is going to be able to recognize the pattern of access and optimize the access itself.

Low level Voron optimizationsThe page size bump

time to read 5 min | 864 words

Explaining the usage pages seems to be one of the things is either hit of miss for me. Either people just get it, or they struggle with the concept. I have written extensively on this particular topic, so I’ll refer it to that post for the details on what exactly pages in a database are.

Voron is currently using 4KB pages. That is pretty much the default setting, since everything else also works in units of 4KB. That means that we play nice with requirements for alignment, CPU page sizes, etc.  However, 4KB is pretty small, and that lead to trees that has higher depth. And the depth of the tree is one of the most major reasons for concern for database performance (the deeper the tree, the more I/O we have to do).

We previously tested using different page sizes (8KB, 16KB and 32KB), and we saw that our performance decreased as a result. That was surprising and completely contrary to our expectations. But a short investigation revealed what the problem was. Whenever you modify a value, you dirty up the entire page. That means that we would need to write that entire page back to storage (which means making a bigger write to the journal, then applying a bigger write to the data filed, etc).

In effect, when increasing the page size to 8KB, we also doubled the amount of I/O that we had to deal with. That was a while ago, and we recently implemented journal diffing, as a way to reduce the amount of unnecessary data that we write to disk. A side affect of that is that we no longer had a 1:1 correlation between a dirty page and full page write to disk. That opened up the path to increasing the page sizes. There is still an O(PageSize) cost to doing the actual diffing, of course, but that is memory to memory cost and negligible in compared to the saved I/O.

Actually making the change was both harder and easier then expected. The hard part was that we had to do a major refactoring working to split a shared value. Both the journal and the rest of Voron used the notion of Page Size. But while we want the page size of Voron to change, we didn’t want the journal write size to change. That led to a lot of frustration where we had to go over the entire codebase and look at each value and figure out whatever it meant writing to the journal, or pages as they are used in the rest of Voron. I’ve got another post scheduled talking about how you can generate intentional compilation errors to make this easy for you to figure it out.

Once we were past the journal issue, the rest was mostly dealing with places that made silent assumptions on the page size. That can be anything from “the max value we allow here is 512 (because we need to fit at least so many entries in)” to tests that wrote 1,000 values and expected the resulting B+Tree to be of a certain depth.

The results are encouraging, and we can see them mostly on the system behavior with very large data sets, those used to generate very deep trees, and this change reduced them significantly. To give some context, let us assume that we can fit 100 entries per page using 4KB pages.

That means that if we have as little as 2.5 million entries, we’ll have (in the ideal case):

  • 1 root page holding 3 entries
  • 3 branch pages holding 250 entries
  • 25,000 leaf pages holding the 2.5 million entries

With 8 KB pages, we’ll have:

  • 1 root page holding 63 entries
  • 12,500 lead pages holding 2.5 million entries

That is a reducing of a full level. The nice thing about B+Trees is that in both cases, the branch pages are very few and usually reside in main memory already, so you aren’t directly paying for their I/O.

What we are paying for is the search on them.

The cost of searching the 4KB tree is:

  • O(log2 of 3) for searching the root page
  • O(log2 of 100) for searching the relevant branch page
  • O(log2 of 100) for searching the leaf page

In other words, about 16 operations. For the 8 KB page, that would be:

  • O(log2 of 63) for searching the root page
  • O(log2 of 200) for searching the leaf page

It comes to 14 operations, which doesn’t seems like a lot, but a lot of our time goes on key comparisons on the key, so anything helps.

However, note that I said that the situation above was the ideal one, this can only happen if the data was inserted sequentially, which it doesn’t usually do. Page splits can cause the tree depth to increase very easily (in fact, that is one of the core reasons why non sequential keys are so strongly discourage in pretty much all databases.

But the large page size allows us to pack many more entries into a single page, and that also reduce the risk of page splits significantly. 

RavenDB 4.0 Indexing Benchmark

time to read 4 min | 645 words

I run into this post, which was pretty interesting to read. It compares a bunch of ways to index inside CouchDB, so I decided to see how RavenDB 4.0 compares.

I wrote the following code to generate the data inside RavenDB:

In the CouchDB post, this took… a while. With RavenDB, this took 7.7 seconds and the database size at the end was 48.06 MB. This is my laptop, 6th gen i7 with 16 GB RAM and SSD drive. The CouchDB tests were run over a similar machine, but with 8 GB RAM, but RavenDB didn’t get to use much memory at all throughout the benchmark.

It is currently sitting on around 205MB working set, and allocated a total of 70 MB managed memory and 19 MB of native memory.

I then created the following index:

image

This does pretty much the same as the indexes created in CouchDB. Note that in this case, this is running in process, so the nearest equivalent would be to Erlang native views.

This took 1.281 seconds to index 100,000 documents, giving us a total of 78,064 indexed documents per second.

The working during the indexing process grew to 312 MB.

But those were small documents, what about larger ones? In the linked post, there was another test done, this time with each index also having a 300KB text property. Note that this is a single property.

The math goes like this: 300KB * 100,000 documents = 30,000,000 KB = 29,296 MB = 28.6 GB.

Inserting those took 1 minute and 40 seconds. However, the working set for RavenDB peaked at around 500 MB, and the total size of the database at the end was  512.06 MB. It took me a while to figure out what was going on.

One of the features that we have with the blittable format is the fact that we can compress large string values. The large string property was basically lorem ipsum repeated 700 times, so it compressed real well. And the beauty of this is that unless we actually need to touch that property and do something to it, it can remain compressed.

Indexing that amount of data took 1.45 seconds, for 68,965 indexes documents per second.

My next test was to avoid creating a single large property and go with a lot of smaller properties.

This generates 100,000 documents in the 90KB – 900 KB range. Inserting this amount of data took a while longer. This is mostly because of the amount of data that we write in this scenario which is in the many GB range. In fact, this is what my machine looked like while the insert was going on:

image

Overall, the insert process took 8:01 minutes to complete. And the final DB size was around 12 GB.

Indexing that amount of information took a lot longer, and consume a few resources:

image

Time to index? Well, it was about twice as much as before, with RavenDB taking a total of 2.58 seconds to index 100,000 documents totaling 12 GB in size.

That comes to 38,759 indexes documents per second.

A large part of that is related to the way we are storing information in blittable format. We don’t need to do any processing to it, so we are drastically faster.

Comparing to the CouchDB results in the linked post, we have:

  RavenDB CouchDB
Small documents 78,064 19,821
Large documents 38,759   9,363

In other words, we are talking about being 4 times faster than the faster CouchDB option.

The edge case is in the timing

time to read 1 min | 141 words

In a previous post, I showed how to get 10x performance boost by batching remote calls. The code included the following method:

This code has a subtle bug in it. Take a look and see if you can find it.

Yes, I’ll wait, honestly.

Go read the code and think about interesting ways to break it.

Okay, found it? Awesome, now let me explain anyway Smile.

This code suffer from the slow arrival syndrome. If we have a new request coming in every 100 ms, then the first request here will languish in the queue for over 25 seconds!

Luckily, the fix is simple:

We just need to make sure that we aren’t waiting for too long. In this case, we will wait a maximum of less than half a second for the messages to go over the wire.

10x speedup utilizing Nagle Algorithm in business application

time to read 4 min | 793 words

Nagle algorithm is a pretty annoying thing. Basically, it means that when we write to the network, the TCP stack will wait a bit to see if we have more stuff to send to that destination before actually emitting the packet. Most people run into this when they wonder why the minimum time for a remote operation is 200ms, even on the local network. Then you figure out how to disable Nagle Algorithm, and you are happy again.

Nagle algorithm was designed for remote terminals, where the speed difference between a human typing and the machine sending packet was big enough that each single letter you typed would be sent as a separate packet. That led to a really high overhead, 40 bytes overhead to send just a single byte to the server, and the number of packets that were sent may be high enough to cause the pipe to choke. So you buffer it, the basic algorithm goes like this, if we don’t have enough data to send a full packet, and if 200 ms didn’t pass since the first buffered data, wait up to 200 ms for more data.  In this manner, if you type fast enough, you will send more than a single character to the server, dramatically reducing the cost of talking with the server, and speeding up everything significantly.

So this is Nagle Algorithm, it is a pretty low level TCP detail, and often overlooked by people. If you don’t study networks, you’ll typically only find out about it when you have a perf problem that you can’t figure out. So how is this relating to business applications?

Imagine that you work on a system that does snail mail sending. A user will call the API and then a few days later a physical letter will show up at your door. Nothing really special about that, right? Except that we charge users based on the plan they choose. For simplicity’s sake, we’ll say that we have two plans:

  • Pay as you go – can send as many letters as they want, and we charge a certain amount for each.
  • Budgetted plan – can send up to certain number of letters per month.

In either case, it is pretty important to us to record that the mail was sent, and if the user is on a budget plan (or has spending alerts on his account), we need to respond to it in certain ways.

Overall, there is nothing really surprising here, and the code to handle this is pretty simple, right?

The problem is that under load, we’re going to have query a few requests going on to the billing service, and that is likely to be a bottleneck for us. Note that this is the case because while processing a single RecordMail request is pretty fast, the problem is that we are actually going to have to wait for both the actual processing and the back and forth between the machines. If the cost of processing a RecordMail request is 1 ms, and the cost of going over the network is 0.5 ms, that adds up quickly.

Now, let us see how we can do it better. We’ll start with moving from making the call immediately to placing the details in a queue.

You can see that we use a task completion source to wait for the result of the call. Next, we need to actually handle this, this is done in the following methods:

What this ends up doing is to reduce the network costs. Instead of going to the server once for every request, we’ll go to the server once per 200 ms or every 256 requests. That dramatically reduce network traffic. And the cost of actually sending 256 requests instead of just one isn’t really that different. Note that this gives us a higher latency per request, since we may have to wait longer for the request to go out to the server, but it also gives me much better throughput, because I can process more requests in a shorter amount of time.

A change like that can push your performance up by an order of magnitude or more. It also gives you economy of scale benefits on the other side. Instead of having to do a fixed amount of work around processing each request, we can now do it over a whole bunch of them.

For example, if I need to load the customer entity, instead of doing it once per mail, I can do that once per batch, and it is likely that this customer will be in the batch more than once, reducing the amount of work I have to do.

The performance regression in the optimizationPart II

time to read 4 min | 762 words

In my previous post I showed a performance conundrum. A code that has been optimized to reduced heavy allocation usage that became over twice as slow.

In particular, we had a problem here, the new code it 3.4 times slower than the new one, but how?

image

Now, the real scenario we had involved concurrent access, so it was much harder to figure out, but I cheated a bit when producing this image, I used sampling profiling, instead of tracing one. The major difference between the two is that tracing profiler will also give you the number of calls. This is called out as something that you would typically do because you want to analyze algorithmic complexity, but I find it incredibly useful to figure out what my code is actually doing.

And indeed, looking at the same code using tracing profiler gives us the following two calls:

image

image

And when looking at the diffs between those two, we have:

image

So for some reason we are making 54 million more calls to the Equals method in the optimized version, but why? Both of those are using the exact same dictionary, using the exact same key type and the same keys, even.

In the real scenario we were facing, that wasn’t the case, so that made it hard to analyze the issue. We started looking into whatever we were doing some sort of cache poisoning by having the buffer holder as the dictionary value, instead of the array directly, but that didn’t pan out. We kept circling around the number of Equals calls. Note that the number of calls to TryGetValue is the same, as well as the number of calls to GetHashCode. So what is the diff?

The diff, quite simple, is not here at all.

The problem is in the RemoveBefore method. In the old version, if we removed all the entries, we’ll remove it completely from the dictionary. In the new version, we’ll reset the buffer so it can be used again next time. The problem with that approach is that it means that the dictionary is pretty big, much bigger than it would be in the case of the old version of the code. And that means that we’ll need to find the value (which is empty), then check its content. On the old version, we’ll just do a GetHashCode, then find that the table entry is over, and exit.

Indeed, all we had to do was change RemoveBefore to look like this:

And that gives us:

  • 14.0 seconds & 1.1 GB of memory for old version
  • 12.8 seconds & 0.4 GB of memory for new version

Which is pretty good result overall. It gets better when you break it apart to its component parts.

image

This is actually surprising, since we didn’t really set out to optimize this call very much, and it is pretty much unchanged in both versions. I think that this is likely because we keep the buffers around longer, so they are more likely to be in the cache.

image

This shows more than double the speed we previous had, which is pretty awesome, since this code is actually called per transactions, so anything that reduces that cost is golden.

image

This happens during a flush, and reducing its speed is important to reducing the time we hold the write lock, so this is pretty sweet.

The performance regression in the optimizationPart I

time to read 4 min | 629 words

PageTable is a pretty critical piece of Voron. It is the component responsible for remapping modified pages in transactions and is the reason why we support MVCC and can avoid taking locks for the most part. It has been an incredibly stable part of our software, rarely changing and pretty much the same as it was when it was initially written in 2013. It has been the subject for multiple performance reviews in that time, but acceptable levels of performance from our code in 2013 is no longer acceptable today. PageTable came up recently in one of our performance reviews as a problematic component. It was responsible for too much CPU and far too many allocations.

Here is a drastically simplified implementation, which retain the salient points:

Here is the sample workout for this class, which just simulates ten thousand transactions. This little scenario takes 15.3 seconds and allocates a total of 1.1GB of memory! That is a lot of allocations, and must have tremendous amount of time spent in GC. The most problematic issue here is the SetItems methods, which will allocate two different delegates for each modified page in the transaction. Then we have the total abandon in which we’ll allocate additional memory in there. As you can imagine, we weren’t very happy about this, so we set out to fix this issue.

We can take advantage off the fact that SetItems and RemoveBefore are only called under lock, while TryGetValue is called concurrently with everything else.

So I wrote the following code:

This relies on allowing stale reads from concurrent readers, which we don’t care about since they wouldn’t be able to make use of the data anyway, and it was able to reduce the allocations to just 320 MB, but the runtime actually went up to 32 seconds.

That is quite annoying, as you can imagine, and much cursing enthused as a result. I then pulled my trusty profiler ask it kindly to figure out what piece of code needs to be hit with a rolling pin and have a stern talk to about what is expected from code after it has been laboriously and carefully optimized. It is expected to sit nicely and be fast, or by Git I’ll revert you.

What the hell?! Here are the original implementation costs, and you can clearly see how much time we are spending on garbage collection.

image

And here is the optimized version, which is actually slower, and actually used more memory?!

image

There are a bunch of interesting things going on here. We can see that we are indeed using spending a little less time in GC, and that both RemoveBefore and SetItems methods are much cheaper, but the cost of TryGetValue is so much higher, in fact, if we compare the two, we have:

image

So we are 3.4 times higher, and somehow, the cost of calling the concurrent dictionary TryGetValue has risen by 88%.

But the implementation is pretty much the same, and there isn’t anything else that looks like it can cause that much of a performance gap.

I’ll leave this riddle for now, because it drove me crazy for two whole days and give you the details on what is going on in the next post.

Digging into the CoreCLRSome bashing on the cost of hashing

time to read 2 min | 372 words

Note: This post was written by Federico.

Recently at CoreFX there has been a proposal to deal with the typical case of everyone writing their own hash combining logic. I really like the framework to push that kind of functionality because hashing is one of those things that unless you really know what you are doing, it can backfire pretty bad (and even if you know, you can step over the cliff too).  But this post is not about the issue itself, but it showcases a few little details that happen at JIT land (and that we usually abuse, of course J). This is the issue: https://github.com/dotnet/corefx/issues/8034#issuecomment-260733285

Let me illustrate with some examples.
ValueTuple is a new type that is being introduced that is necessary for some of the new tuples functionality in C# 7. Because hashing is important, of course they implemented the ability to combine hashes. Now, let’s suppose that we take the actual code hashing code that is being used for ValueTuple and use constants to call it. 

Now under an optimizing compiler chances are that there shouldn't be any difference, but in reality there is.

This is the actual machine code for ValueTuple:

So now what can be seen here? First we are creating an struct in the stack, then we are calling the actual hash code.

Now compare it with the use of HashHelper.Combine which for all purposes it could be the actual implementation of Hash.Combine

I know!!! How cool is that???
But let’s not stop there... let’s use actual parameters:

We use random values here to force the JIT to treat them as parameters and not being able to eliminate the code and convert it yet again into a constant.

The good thing, this is extremely stable. But let’s compare it with the alternative:

The question that you may be asking yourselves is: Does that scale?. So, let’s go overboard...

And the result is pretty illustrative:

The takeaway of the analysis is simple: even if the holding type is a struct, it doesn't mean it is free :)

Making code fasterMicro optimizations and parallel work

time to read 2 min | 339 words

I really wanted to leave this series of posts alone. Getting 135 times faster should be fast enough for everyone, just like 640KB was.

Unfortunately, performance optimization is addictive. Last time, we left it at 283 ms per run. But we still left some performance on the table. I mean, we had inefficient code like this:

image

Just look at it. Analysis showed that it is always called with 2, 4 or 8 only. So we naturally simplified things:

image

Forcing the inlining of those methods also helped, and pushed us further toward 240 ms.

Another cost that we had was date diff calculation, we optimized for the case where the day is the same, but in our dataset, we have about 2 million records that cross the day line. So we further optimized for the scenario where the year & month are the same, and just the day is different. That pushed us further toward 220 ms.

At this point the profiler was basically laughing at us, and we had no real avenues to move forward, so I made the code use 4 threads, each processing the file at different locations.

That gave me: 73 ms and allocated 5,640 kb with peak working set of 300,580 kb

  • 527 times faster than the original version.
  • Allocate 1350 times less memory.
  • 1/3 of the working set.
  • Able to process 3.7 GB / sec.

Note that at this point, we are relying on this being in the file system cache, because if I was reading it from disk, I wouldn’t be able to do more than 100 – 200 MB / sec.

Here is the full code, write code like this at your peril.

FUTURE POSTS

  1. Low level Voron optimizations: Transaction lock handoff - 13 hours from now
  2. RavenDB Conference Videos: Delving into Documents with Data Subscriptions - about one day from now
  3. Low level Voron optimizations: Primitives & abstraction levels - 3 days from now
  4. RavenDB Conference Videos: Replication changes in 3.5 - 4 days from now
  5. Analyzing RavenDB 4.0 with NDepends - 7 days from now

And 6 more posts are pending...

There are posts all the way to Mar 14, 2017

RECENT SERIES

  1. RavenDB Conference videos (12):
    27 Feb 2017 - Building Codealike
  2. Low level Voron optimizations (5):
    20 Feb 2017 - Recyclers do it over and over again.
  3. Implementing low level trie (4):
    26 Jan 2017 - Digging into the C++ impl
  4. Answer (9):
    20 Jan 2017 - What does this code do?
  5. Challenge (48):
    19 Jan 2017 - What does this code do?
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats