Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,206 | Comments: 46,147

filter by tags archive

Stack arena memory allocation context

time to read 4 min | 705 words

Controlling memory allocations is something that any performant service need to do. In RavenDB 4.0 we have taken that to heart, and most of our memory usage is manual, instead of relying on the GC.

Now, there are a lot of really good reasons to want to avoid manual memory management. We have decades of usage telling us that this is a hard problem, with some nasty issues associated with it. But when we are talking about performance different of orders of magnitude, I’ll accept that burden.

The first thing that we did when making this decision is to see if we can change the way we approach unmanaged memory management to make it easier to avoid all of the common issues. We made several decisions:

  • All native allocations must go through a context.
  • A context is a unit of work that is single threaded.
  • A context is bounded in scope.
  • A context is reusable.

By placing those sort of limits on the native memory usage, we managed to gain a lot of simplicity. To start with, because the context is single threaded, we don’t have to implement an allocation strategy that is thread safe. Instead, we can just allocate a bunch of memory in the context, and parcel the memory from there. This looks like this:

image

The arena has allocated a certain size from the operation system. Allocating from that point is just a matter of bumping the next allocation pointer and handing that memory to the caller.

The nice thing about is that we don’t have the possibility for memory leaks, when the context scope ends, we can reset the next allocation pointer to the start, and we have “freed” all that memory. And we can start using it  again, without having to release it to the OS and get it back again. There are a whole bunch of edge cases (what happens when the arena is completely full, etc, but they are pretty obvious).

Not having to deal with multiple threads also present an interesting optimization. Once we are done with the memory, we can return it to the arena, but instead of maintaining free lists, etc, we do the following. If the returned memory is at the top of the arena (the last thing allocated), then we just bump the next allocation down, and the next allocation can reuse that memory. If that isn’t the case, we’ll do nothing, and rely on the context scope to handle this.

That, in turn, means that if I’m careful to free in the reverse order of allocations, I can get, within a single scope, memory reuse very cheaply. That is important, as we’ll see in a bit.

The context is also scoped to a certain operation, such as processing a request or an index batch. Because of that, even if we need to grow the arena, that is limited to however many operations we have to do inside that scope. And once the scope is over, we aren’t disposing of the context and releasing the memory, instead, we pool them so the next action will be able to use them. So a single OS level allocation is used across many actions, and the memory free cost is just setting a pointer.

But some scope need to do a lot of work, consider an index that is processing a big batch, in order to reduce the amount of memory that is being used, that is where the ability to free the top memory allocation comes in handy, because even inside the same scope, we can get memory reuse.

All in all, the allocation strategy was chosen to play to our strengths, and allow us to avoid doing multi threaded work in the hot paths, it also means that we get quite a lot of memory reuse, both within a scoped context and in different scopes, which reduce the overall amount of memory that we need to allocate. And the really great thing, it doesn’t come with a big pile of time spent in GC.

Voron Performance: A snapshot

time to read 2 min | 221 words

Yesterday the perf team let me know that they managed to get ~18% improvement on Voron by utilizing another on disk data structure. I’ll post more on that when we have it merged and working.

Talking about this, we decided to run a few benchmarks on our current status.

Nitpicker – this post talks about the performance of low level storage engines, the benchmark used was storing sequential values with 8 bytes key and 8 bytes value.

Here are the results for writing a billion values (16 bytes total) in ten thousands transactions.

  • 1,000,000,000 values  (1 billion)
  • 9.13 minutes
  • 1.824 million writes / sec
  • 31 GB final size

Then we spiked a small optimization that should allow us to defer major I/O costs, and we got:

  • 100,000,000 values (hundred million)
  • 40 seconds
  • 2.449 million writes / sec
  • 4.1 GB final size

We pretty much cheat here, because we defer the I/O cost under load to a later time, by not purging the journals, but that is something that I’ll post in detail once we have this merged.

Oh, and those are the number before the additional 18% optimization Smile. And they were run on a single commodity hardware node over SSD drive.

The tale of the intermittently failing test

time to read 2 min | 350 words

We recently started seeing a failing test in our RavenDB 4.0 test suite. This test was a relatively simple multi-map/reduce test.  Here it is:

image

I checked the history, and this test has been part of our test suite (and never failed us) since 2012. So I was a bit concerned when it started failing. Of course, it would only fail sometimes, which is the worst kind of failures.

After taking a deep breath and diving directly into the map/reduce implementation and figuring out all the parts that were touched by this test, I was stumped. Then I actually sat down and read through the test and tried to figure out what it is doing. This particular test is one that was sent by a user, so there was business logic to penetrate too.

The strange thing is that this test can never pass, it is inherently flawed, on several levels. To start with, it isn’t waiting for non stale results, which was the obvious racy issue. But once we fixed that, the test always failed. The problem is probably a copy/paste error. There supposed to be two lines for clients/1 and two lines for clients/2. But there are three lines for clients/1 and only one for clients/2. So this test should always fail.

But, because we didn’t have WaitForNonStaleResults, it will always return no results (it didn’t have time to finish indexing from the SaveChanges to the index) and the test would pass with empty result set.

This has been the case since 2012(!), mind you.

I fixed the copy/paste issue and the WaitForNonStaleResults, and the test consistently pass now.

The most interesting observation that I have here is that RavenDB is now able to run a full map/reduce cycle in the time it takes the test to move from the SaveChanges line to the query itself. And that is a damn impressive way to find bugs in your tests.

N+1 queries are hardly a feature

time to read 3 min | 531 words

I run into this article, talking about N+1 issues from a “fresh” perspective. In the article, there is a quote by DHH there that goes like this:

If you have N+1 query it means you’re executing one SQL query per element so if you have 50 emails in an inbox, that’d be 50 SQL calls, right? That sounds like a bug. Well in a Russian doll caching setup, it’s not a bug, it’s a feature. The beauty of those individual calls are that they’re individually cached, on their own timeline, and that they’re super simple.

Now, I have been involved with OR/Ms and databases for over a decade now, and I have spent large parts of my career working to resolve performance problems in database driven applications.

In a word, the idea that having a larger amount of simpler queries is better is nonsense. In particular, it completely ignores the cost of going to the database. Sure, a more complex query may require the database to do additional work, and if you are using caching, then you’ll not have the data in the cache in neat “cache entry per row”. But in practice, this leads to applications doing hundreds of queries per page view, absolute reliance on the cache and tremendous cost at startup.

In RavenDB, most queries are so fast that when we measured, the network component was the largest cost we had to face.

Let us do the numbers, shall we? Let us assume that we are talking about the best case, we have the database machine and the web machine in the same datacenter. Cost of roundtrip in the same datacenter is 0.5 ms. Now, let us go back to the 50 emails example above, shall we? We need to send 50 queries to the database. We’ll assume that we have a perfect database that takes no time at all to answer. That is still 25 ms wasted just on roundtrip times.

And the problem is that this is usually a lot more than 50 queries per page when you adopt this kind of silliness. You typically see hundreds or thousands of them, and the database isn’t really able to answer you in no time, so expect to see much bigger delays in practice.

But wait, I hear you say, this is all about caching, you are ignoring that part. Well, no, I’m not. If you are using a distributed cache, the costs over the network are exactly the same. So this is only relevant if you are using a cache on the same server, but then you run into issues when you have different caches on different machines hold different information. Not to mention that you are now in the wonderful world of having to worry about cache invalidation strategies and aligning them with the business requirements.

And for fun, what happens one of your cache nodes goes down? You got it, you just created a DoS attach on your own database.

On the other hand, you can actually create proper queries, get the data in as few roundtrips as possible and trust the database engine to do its work.

Hibernating Rhinos is hiring

time to read 1 min | 155 words

It’s that time again, we are looking for more people to work on RavenDB. I’m going to assume that if you are reading this, you know what we do, so I’ll skip telling you how exciting, dynamic and buzzword of the day this position is. I’ll say that we are doing a lot of fun things, one of our guys just finished taking us from 200 req/sec in a particular scenario to 30,000 req/sec, for example Smile.

We are looking for someone who can build system software in C#, with really good understanding of the way computers work, and how to get the best out of them. If you have OSS contributions, that puts you at the head of the line.

Just ping us at jobs@ravendb.net with your CV.

This position is for Hadera, Israel, and is not available for remote work.

Voron InternalsThe diff is the way

time to read 3 min | 486 words

I talked about the goals of using diffs for the journals in a previous post, but in this one, I want to talk about what it actually did. To start with, it turn out that using diffs in the journal invalidates quite a few optimizations that we had. We had a whole bunch of stuff that we could do to avoid writing data to the data file if there are additional modifications to it. When using diffs, we can’t do that, because the next diff is building on the previous version. That actually ended up improving the code quality, because a bunch of pretty tricky code had to go away.

It was tricky code because we tried to reduce the data file I/O, but only in the cases that it was allowed, which wasn’t always. Now we always write the latest version of all the modified pages, which might add some additional I/O in some cases, but in practice, this didn’t show up in our benchmarks. And the reduction of complexity might have been worth it even if it was.

We actually have two diff implementation, one that tests two different versions of a page and find the differences, and one that test a page against zeros. That, in addition to collapsing runs of zeros, is the entire implementation. We are actually not as small as we could get, because we insert some metadata into the diff stream to allow easier debugging and error detection. But the nice thing about the diff output is that we are still compressing it, so working extra hard to reduce the uncompressed data isn’t that important if compression will cover it anyway.

We tested before & after of using diffs for journal writes, and we found the following:

Journal size (inserts) 37% reduction in size
Journal size (updates) 80% reduction in size
Bulk insert speed 5% – 10% speed improvement *
Writes / sec (inserts) 12% improvement
Writes / sec (updates) 18% improvement

* If the sizes went down so significantly, why haven’t the insert & update speed improve by a comparable amount?

The answer to that is that for the most part, reducing the amount of data we are writing is awesome, a major limiting factor is the number of times we write to the journal, rather than the number of bytes. And we have to write once per transaction, so that is a limiting factor.

However, the current numbers we are looking at is a benchmark showing roughly 30,000 write requests / second and are unable to get anything higher, because we have saturated the local network. We are currently setting up a dedicated testing environment to see how far we can push this.

How the debugger lied to my face with no shame at all

time to read 1 min | 155 words

Take a look at the following code:

image

As you can see, something pretty freaky is going on here.

We put a value in there, and then we try to get it out, we get a totally different value.

The real repro happened in a pretty complex piece of code, and because the old value was different than the value in the debugger, we were certain that we had something that was changing data behind our back. Only after we marked the memory as read only and still saw that behavior we realized that this is actually a debugger eval bug. Instead of reading the Id property as a long, it is reading it as a byte, leading to this misleading value.

Only cost me most of my hair.

RavenDB 3.5 RC2 is released

time to read 2 min | 224 words

You can now get RavenDB 3.5 RC2 in the download page. This is going to be the last RC before the RTM release, which we expect in the next week or two.

Please take it for a spin, we appreciate any feedback that you have. While the RC build is out there, we are running internal tests, stressing RavenDB under load in various configuration, to see if we can break it.

The changes since the last RC are:

  • Reduce the number of threads being used in the system.
  • Support for dynamic document in More Like This queries
  • Display number of alerts in the studio
  • Reduced studio load times
  • Warn about configuration values that suspect
  • Reduced allocations when reading large number of documents
  • Better optimistic concurrency control on the client, can now specify that on a per document basis.
  • Better handling for clusters, resolved a race condition on the client deciding who the leader is and added better logging and operational support.
  • Fixed index & side by side index replication in master master scenario.
  • Alert if the paging file is on HDD and there is an SSD on the system.
  • Clusters now only select an up to date node as the leader.
  • Fixed perf issue when starting out and loading a large number of dbs concurrently.
  • Better CSV import / export semantics.

Voron InternalsReducing the journal

time to read 7 min | 1285 words

We spend a lot of time trying to reduce our sync I/O cost with Voron, namely, the actual journal write to the disk. This is very expensive, because we have to hit the actual disk, forgoing any buffering.

So anything that can reduce that cost is a really good idea. We spent some time looking at dynamic compression ratios heuristics, to see if it is worth it. Basically, we tried to figure out which option to use:

image

The idea is that based on the speed of the hard disk in use, we can decided whatever it is worth it or not to spend more time compressing the journal entry before saving it. We tested a system where the I/O duration would be balanced against compression speed and size, and adjust automatically.

It failed, horribly. Basically, even on the fastest drives we could find, it was almost always better to compress at the highest level, because the cost of going to disk is so high.

imageThere is another aspect of this, however. The cost of going to disk isn’t linear to the size you are writing. I used the example of putting your groceries in the trunk. The fuel cost of the trip is not really going to be dominated by the weight of the groceries. After writing this statement, I fact checked myself. According to Auto Blog, each 100 pounds (50 KG) of added weight will increase the fuel utilization by about 1%. What is going to dominate the cost, however, is how much do you have to drive.

In the same manner, writing to the disk is impacted by the amount you write, but writing 4KB or 20KB has roughly the same cost anyway. Writing 2 MB is much longer, but not as much as you would expect. Note that all of those numbers assume no buffering all the way to disk, and using DMA.

We then tried to see what happen if we would just avoid compressing small writes. Anything smaller than 64KB is going to be compressed to less than 64KB, but the actual cost of writing to disk isn’t going to change, so we can save the compression costs. That actually improved performance a little bit for fast drives, but it hurt us on slow ones.

I had an interesting discussion with Alex on the usage of diff compression in the journal. This can take advantage on the fact that in many cases, we don’t modify full pages, so we can write just the changes out to disk. He was kind enough to include a few implementations of that for us to look at, those are RLE0 (Zero Run Length Encoding) implementations, and I’ll use RLE to refer to it from now on.

Reducing I/O is always good, and this promised to give a substantial boost, but the actual design details that cropped us are really interesting.  Diff compression can be simple, like the RLE0 in this link, effectively, outputting something like:

... [x bytes unchanged][y bytes changed][byte 1 .. y][z bytes unchanged] ...

Or they can be much more complex, like bsdiff or xdelta. RLE handles the scenario where some bytes changes nicely, but fails badly if there is a single added byte (since it simply check for equality, we’ll see all the bytes are different). Algorithms like bsdiff or xdelta can handle much more complex differences, but they are drastically more expensive. For my purposes, bsdiff has runtime complexity of O( 2N * logN ) and memory utilization of 17N. It other words, to get the diff of 4 pages, we’ll need 272KB and about 230K operations.

Algorithms like that are usually meant for distributions. In other words, they are meant for cases where you can spend as much time as you want generating the diff, and you benefit from reduced download times. A modern usage of those is the Courgette  project, for reducing the size of Chrome updates. It doesn’t matter if generating the update takes 3 hours, since it will be downloaded millions of times, and a 600KB saved in this manner will pay themselves many time over.

But those kind of costs are not something that we can pay. Analysis of our memory usage patterns also showed that in many cases, we are using mostly fixed addressing. In other words, we’ll typically change only small parts of the page, and we don’t tend to have moving writes. When we do (typically on defrag), we do them on a page boundary, so RLE implementation should generate good savings.

We have an implementation that we are currently testing, but while you can read the code, what is more interesting is the assumptions that we are making.

We scan the original and modified buffers using longs. We can safely assume that the buffers we scan are always sized in pages, so we don’t need to worry about buffers whose size isn’t divisible in sizeof(long), this make the code much simpler. We also don’t bother to encode identical parts, instead, we record the (start, count, raw bytes) differences from the original. There is a small optimization there for long runs of zeros (to make it cheaper to delete data), but beyond that, we do very little.  I’ll have a separate post to dive into the actual implementation details and considerations that drove it, but that is for later.

An important reason why we don’t keep track of the unmodified data is that we don’t need it, and that we can’t actually trust the original data. Consider the case where we actually need to use the journal to recover. We do that by running through all of the transactions, and applying the diffs to the data. The problem is that we may fail midway through the recovery process, so the state of the data is not known. When applying a diff, if we use the original data, we might actually see data from a later transaction (which was applied, but we don’t know about it since we crashed before we can make a note of that). Because of this, we only use the modified data, which is safe to apply multiple times. Note that this assumes that modifying a page can not corrupt the page. In other words, if I have a 4 KB page, and I write a value to the 3rd byte, it isn’t going to cause any change to any other byte. Aside from that, we don’t require that the bytes that we modified will be there on restart, because we’ll overwrite them until we are sure that we properly synced them.

Another aspect of the diff operation that we aren’t actually all that worried about the data size (which is interesting, since we really want to reduce it), the reason for that is that we are going to throw all the diffed data into the compressor anyway. The idea is that even after the diff, we are still likely to find data to compress among the modifications on the transaction.

Currently, the steps to write a transaction to disk are:

  • Get all the modified pages.
  • For each of those, compute the difference between it and the previous version of that page.
  • Compress all the diffs of all the pages.
  • Write the compressed data to disk in a safe manner.

FUTURE POSTS

  1. RavenDB Restorspective - 4 days from now

There are posts all the way to Sep 30, 2016

RECENT SERIES

  1. Voron internals (5):
    13 Sep 2016 - The diff is the way
  2. Database Building 101 (8):
    25 Aug 2016 - Graph querying over large datasets
  3. Production postmortem (16):
    23 Aug 2016 - The insidious cost of managed memory
  4. Digging into the CoreCLR (3):
    12 Aug 2016 - Exceptional costs, Part II
  5. The Guts n’ Glory of Database Internals (20):
    08 Aug 2016 - Early lock release
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats