Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

The hidden costs of allocations

time to read 4 min | 625 words

I mentioned in a previous post that we have a code quality gateway to make sure that all logging statements are wrapped in an if statement, to reduce the number of allocations when the user has logging turned off. This is done because logging can be expensive, and is often turned off, so there is no point in paying a penalty for something that isn’t being used.

There seems to be some confusion about why this is done. Let us assume that we have the following logging code:

void Debug(string msg) { Console.WriteLine(msg); }

void Debug(string format, params object[] args) { Console.WriteLine(format, args); }

void Debug(Func<string> generateMsg) { Console.WriteLine(generateMsg()); }

Now, the obvious bad example would be to use:

Debug("Hello "+ user.Name);

Since that is going to allocate a new string, and this will happen regardless of whether logging is enabled or not. On high frequency call sites, this can end up allocating a lot of useless stuff.

So we will move to this mode:

Debug("Hello {0}", user.Name);

And we saved the allocation, right? Except that this actually generates the following code. Do you see the allocation now?

Debug("Hello {0}", new[] { user.Name });

So let us introduce a better option, shall we? We’ll add a few common overloads without the use of params.

void Debug(string format, object p1);
void Debug(string format, object p1, object p2);
void Debug(string format, object p1, object p2, object p3);

And now we saved the allocation. Unless…

int requestNumber = ...;
Debug("Request # {0}", requestNumber);

Do you see the allocation now? We pass an int as an object, which requires boxing, which is an allocation.

So let us try using the lambda method; this way, nothing is evaluated unless logging is actually enabled!

Debug(() => "Hello " + user.Name);

Except… this is actually translated to:

var loggerMsg = new LoggerMessage(user); // closure object, allocated
Func<string> func = new Func<string>(loggerMsg.Write); // delegate, allocated
Debug(func);

Here are all the allocations.

There is another issue with logging via lambdas, consider the following code:

void Index(JsonDocument[] docs)
{
	var batch = new IndexBatchStats();
	database.Stats.Add(batch); // long lived

	batch.Completed += () => database.Stats.IncrementCompletedBatches(batch);
	Log(() => "Indexing " + docs.Length + " documents");
}

You might notice that we have two lambdas here. C# optimizes the number of types it generates, and will generally output a single closure type for all the lifted members in all the lambdas in a method. This means that we have:

void Index(JsonDocument[] docs)
{
	var batch = new IndexBatchStats();
	database.Stats.Add(batch); // long lived

	var args = new { database, batch, docs }; // all lifted members

	batch.Completed += (args) => args.database.Stats.IncrementCompletedBatches(args.batch);
	Log((args) => "Indexing " + args.docs.Length + " documents");
}

As you can see, we have a long lived lambda, which we think is using only other long lived objects (database & batch), but is actually holding a reference to the docs, which are VERY large.

Except for the last issue, which requires moving the logging to a separate method to avoid this optimization, all of the issues outlined above can be handled by explicitly calling IsDebugEnabled in the calling code.
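
Here is a minimal sketch of that guarded pattern (the log and IsDebugEnabled names are illustrative, not a specific logging API):

if (log.IsDebugEnabled)
    log.Debug("Hello " + user.Name); // the string is only built when debug logging is on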

And this is why we require it.

Code quality gateways

time to read 2 min | 319 words

I just merged a Pull Request from one of our guys. This is a pretty important piece of code, so it went through two rounds of code reviews before it actually hit the PR stage.

That was the point where the tests ran (our test suite takes over an hour to run, so we run a limited set frequently, and leave the rest for later), and we got three failing tests. The funny thing was, it wasn’t a functional test that failed, it was the code quality gateway tests.

The RavenDB team has grown quite a lot, and we are hiring again, and it is easy to lose knowledge of the “unimportant” things. Stuff that doesn’t bite you in the ass right now, for example. But it will most assuredly bite you in the ass (hard) at a later point in time.

In this case, an exception class was introduced, but it didn’t have the proper serialization. Now, we don’t actually use AppDomains all that often (at all, really), but our users sometimes do, and getting a “your exception could not be serialized” error makes for an annoying debug session. Another couple of tests failed because of missing asserts on new configuration values. We want to make sure that a new configuration value is actually connected, parsed and working, and we actually have a test that would fail if you aren’t checking all the configuration properties.
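
For reference, here is a minimal sketch of an exception class that would pass such a gateway (the class name is made up for illustration; SerializationInfo and StreamingContext come from System.Runtime.Serialization):

[Serializable]
public class IndexingFailedException : Exception
{
    public IndexingFailedException(string message) : base(message) { }

    // Without this constructor, the exception cannot be serialized
    // across an AppDomain (or remoting) boundary
    protected IndexingFailedException(SerializationInfo info, StreamingContext context)
        : base(info, context) { }
}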

There are quite a few such tests, from making sure that we don’t have async void methods to ensuring that the tests don’t leak connections or resources (which could harm other tests and complicate root cause analysis).

We also started to make use of code analyzers, for things ranging from keeping all complex log statements inside conditionals (to save allocations) to validating that all async calls use ConfigureAwait(false).
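
For the ConfigureAwait rule, the pattern being validated looks something like this (a generic sketch, not our actual code):

async Task<int> ReadChunkAsync(Stream stream, byte[] buffer)
{
    // In library code, avoid capturing the caller's synchronization context
    return await stream.ReadAsync(buffer, 0, buffer.Length)
                       .ConfigureAwait(false);
}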

Those aren’t big things, but getting them right, all the time, gives us a really cool product, and a maintainable codebase.

Raw data sections in Voron

time to read 5 min | 810 words

When laying out the design work for RavenDB 4.0, it became clear that we needed more from our storage layer. That required us to modify the way we store the information on disk. Voron is inspired by the LMDB project, and that project basically has a single data type, the B+Tree.

Now, LMDB does some really crazy stuff with this, a single LMDB file can contain any number of B+Trees, and you can also have trees whose values are also trees, but that is basically the sole thing that you can use.

This is roughly what this looks like:

[Diagram: a B+Tree where small values are stored inline in the tree pages, and larger values are stored outside the tree, with only a reference kept in the tree]
Note that there are several types of values here. Small items can be stored directly inside the B+Tree, while bigger values require separate storage, and we only store the reference to the data in the tree.

As it turns out, this is actually quite an important aspect of the way you can optimize behavior inside B+Trees. If you wondered about the difference between clustered and non clustered indexes, that is exactly it. In a clustered index, the data is directly in the B+Tree. In a non clustered index, you don’t have the data directly, but need to follow the reference.

Using C#, here are the differences:

// Clustered index: the data is stored directly in the tree
var entry = clusteredIndex.Find(key);
return entry.Data;

// Non clustered index: we need a second lookup to get to the data
var nonClusteredEntry = nonClusteredIndex.Find(key);
var entry = clusteredIndex.Find(nonClusteredEntry.Data);
return entry.Data;

I’m skipping on quite a lot here, but that should give you the gist of it.

Now, where is the problem? The problem is that if all you have is trees, this is pretty much all that you can do here. Now, let us consider the case of storing documents inside Voron. What kind of questions do we need to answer?

  • Load documents by id
  • Scan documents by id prefix
  • Scan documents by etag

As it stands, doing this using just trees would require us to:

  • Define a tree to hold the data, where the key is the document id. This allows us to both find a document by id and by key prefix.
  • Define an index for the etag, where the key is the etag, and the value is the document id.

So, let us assume that we have a piece of code that needs to read documents by etag (for indexing, replication, export, etc.):

var enumerator = etagsIndex.ScanFrom(etag); // O(log N)
foreach (var entry in enumerator) // O(~1) for each MoveNext()
{
	var actualDocument = documentsIndex.Find(entry); // O(log N)
	// do something with this document
}

So, if I want to read 32,000 documents based on their etags in a database that has a hundred million documents, that is going to perform close to 900,000 comparison operations. (log2(100 M) = 27, and we issue 32K O(logN) queries on the documentsIndex).

That can get expensive. And this is a relatively simple example. In some cases, a data item might have five covering indexes, and the access rate and size get very high.

In order to resolve that, we are going to change how we layout the data on the disk. Instead of storing the data directly in the tree, we are going to move the data into separate storage. We call this raw data sections.

A raw data section can be small (for values that are less than 2KB in size) or large. Small values are grouped together in sections that are 2 – 8 MB in length, while each larger value is placed independently, rounded up to the next page size (4KB, by default). The trees will no longer store the data itself, only references to the data on disk (effectively, the file position of the data).
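
To make that concrete, here is a hedged sketch of what a tree might store instead of the value itself (the names and layout here are illustrative, not Voron’s actual format):

// What a tree leaf holds once the value moves to a raw data section
struct RawDataRef
{
    public long FilePosition; // where the value starts on disk
    public int Size;          // how many bytes to read from there
}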

Note that this has a few implications:

  • Small data is grouped together, and the likely access method will group together commonly inserted values.
  • We won’t need to move the data if something changes (as long as it doesn’t grow too much), so updating the value will usually mean that we don’t need to update all the indexes (because the location on disk didn’t change), unless the value they are covering did change.
  • Trees are going to be smaller, allowing more of them to reside in memory, and it allows us to do interesting tricks with things like PrefetchVirtualMemory calls on the locations from the index before accessing them, so we pay the I/O cost (if any) only once. See the sketch after this list.
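
For the curious, PrefetchVirtualMemory is a Win32 call (Windows 8 / Server 2012 and up); a minimal P/Invoke sketch looks something like this:

[StructLayout(LayoutKind.Sequential)]
struct WIN32_MEMORY_RANGE_ENTRY
{
    public IntPtr VirtualAddress;
    public UIntPtr NumberOfBytes;
}

[DllImport("kernel32.dll", SetLastError = true)]
static extern bool PrefetchVirtualMemory(
    IntPtr hProcess,                             // GetCurrentProcess() for ourselves
    UIntPtr numberOfEntries,
    WIN32_MEMORY_RANGE_ENTRY[] virtualAddresses,
    uint flags);                                 // must be 0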

This is also quite exciting, because it means that we can share indexes on the data very easily. For example, we can have an “etags per collection” index for each collection, which will be much faster to query because it doesn’t need all the extra lookups.

Micro benchmark, reading the results

time to read 1 min | 159 words

In the previous posts, I discussed various options for accessing a value through a pointer, either by using an already-cast pointer, or casting it whenever we need to use it.

The good news from the benchmark is that all results are close enough to one another to be effectively equal. That means that there really isn’t any difference between the two options.

Except, that there is. By only having to keep track of a single pointer field (instead of two fields, one that is byte* and one that is PageHeader*), we can save a field. That means that we save 8 bytes on every page we allocate. And we can allocate a lot of pages.
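
To illustrate (a sketch, using the FooHeader struct from the benchmark posts below as the header type):

unsafe class PageBefore
{
    byte* _ptr;         // 8 bytes on x64
    FooHeader* _header; // another 8 bytes, pointing at the same memory
}

unsafe class PageAfter
{
    byte* _ptr;
    FooHeader* Header => (FooHeader*)_ptr; // cast on access, no extra field
}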

The end result is that the casting approach is much faster, not because it runs faster, but because it reduces the size of the allocations we make. And reducing the size of allocations ends up using less memory, which ends up being faster overall.

Mistakes in micro benchmarks

time to read 5 min | 900 words

In my last post I showed a bunch of small micro benchmarks, and aside from the actual results, I wasn’t really sure what was going on there. Luckily, I know a few perf experts, so I was able to lean on them.

In particular, the changes that were recommended were:

  • Don’t benchmark just a single tiny operation; it is easy to get too much jitter in the setup for the call if the op is too cheap.
  • Pay attention to potential data issues; the compiler / JIT can decide to put something in a register, in which case you are benchmarking the CPU directly, which won’t be the case in the real world.

I also learned how to get the actual assembly being run, which is great. All in all, we get the following benchmark code (the method bodies below are reconstructions that show the intent: bump a field through a pre-cast pointer vs. casting on each access):

[BenchmarkTask(platform: BenchmarkPlatform.X86,
               jitVersion: BenchmarkJitVersion.RyuJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X86,
               jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
               jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
               jitVersion: BenchmarkJitVersion.RyuJit)]
public unsafe class ToCastOrNotToCast
{
    byte* p1, p2, p3, p4;
    FooHeader* h1, h2, h3, h4;

    public ToCastOrNotToCast()
    {
        p1 = (byte*)Marshal.AllocHGlobal(1024);
        p2 = (byte*)Marshal.AllocHGlobal(1024);
        p3 = (byte*)Marshal.AllocHGlobal(1024);
        p4 = (byte*)Marshal.AllocHGlobal(1024);
        h1 = (FooHeader*)p1;
        h2 = (FooHeader*)p2;
        h3 = (FooHeader*)p3;
        h4 = (FooHeader*)p4;
    }

    [Benchmark]
    public void NoCast() // body reconstructed: touch a field through the pre-cast pointers
    {
        h1->PageNumber++;
        h2->PageNumber++;
        h3->PageNumber++;
        h4->PageNumber++;
    }

    [Benchmark]
    public void Cast() // body reconstructed: cast from byte* on every access
    {
        ((FooHeader*)p1)->PageNumber++;
        ((FooHeader*)p2)->PageNumber++;
        ((FooHeader*)p3)->PageNumber++;
        ((FooHeader*)p4)->PageNumber++;
    }
}

And the following results:

          Method | Platform |       Jit |   AvrTime |    StdDev |             op/s |
---------------- |--------- |---------- |---------- |---------- |----------------- |
            Cast |      X64 | LegacyJit | 0.2135 ns | 0.0113 ns | 4,683,511,436.74 |
          NoCast |      X64 | LegacyJit | 0.2116 ns | 0.0017 ns | 4,725,696,633.67 |
            Cast |      X64 |    RyuJit | 0.2177 ns | 0.0038 ns | 4,593,221,104.97 |
          NoCast |      X64 |    RyuJit | 0.2097 ns | 0.0006 ns | 4,769,090,600.54 |
---------------- |--------- |---------- |---------- |---------- |----------------- |
            Cast |      X86 | LegacyJit | 0.7465 ns | 0.1743 ns | 1,339,630,922.79 |
          NoCast |      X86 | LegacyJit | 0.7474 ns | 0.1320 ns | 1,337,986,425.19 |
            Cast |      X86 |    RyuJit | 0.7481 ns | 0.3014 ns | 1,336,808,932.91 |
          NoCast |      X86 |    RyuJit | 0.7426 ns | 0.0039 ns | 1,346,537,728.81 |

Interestingly enough, the NoCast approach is faster in pretty much all setups.

Here is the assembly code for LegacyJit in x64:

[assembly listing omitted]

For RyuJit, the code is identical for the cast version, and the only difference in the no casting version is that the mov edx, ecx is mov rdx, rcx in RyuJit.

As an aside, X64 assembly code is much easier to read than x86 assembly code.

In short, casting or not casting has a very minor performance difference, but not casting allows us to save a pointer field in the object, which means it will be somewhat smaller, and if we are going to have a lot of them, that can be a pretty nice space saving.

Micro benchmarks and hot paths

time to read 3 min | 592 words

I’m doing some work to refactor a complex piece of code, and I have a piece of memory that is allocated, then accessed via a struct pointer. This piece of code gets called a lot, so I wondered about the tradeoff of holding a pointer that is already cast to the right struct pointer vs. the cost of another pointer field in the object.

Note, those are objects that are created a lot, and used a lot. Those are not the kind of things that you would usually need to worry about. Because I wasn’t sure, I decided to test this out. And I used BenchmarkDotNet to do so:

public struct FooHeader
{
    public long PageNumber;
    public int Size;
}

[BenchmarkTask(platform: BenchmarkPlatform.X86,
               jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
               jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
               jitVersion: BenchmarkJitVersion.RyuJit)]
public unsafe class ToCastOrNotToCast
{
    byte* p;
    FooHeader* h;

    public ToCastOrNotToCast()
    {
        p = (byte*)Marshal.AllocHGlobal(1024);
        h = (FooHeader*)p;
    }

    [Benchmark]
    public void NoCast() // body reconstructed: use the pre-cast pointer
    {
        h->PageNumber++;
    }

    [Benchmark]
    public void Cast() // body reconstructed: cast on access
    {
        ((FooHeader*)p)->PageNumber++;
    }

    [Benchmark]
    public void DirectCastArray() // body reconstructed: manual offset, array style
    {
        ((long*)p)[0]++; // PageNumber is the first 8 bytes
    }

    [Benchmark]
    public void DirectCastPtr() // body reconstructed: manual offset, pointer style
    {
        (*(long*)p)++;
    }
}
The last two tests are pretty much just to have something to compare to, because if needed, I would just calculate memory offsets manually, but I doubt that those would be needed.

The one downside of BenchmarkDotNet is that it takes a very long time to actually run those tests. I’ll save you the suspense, here are the results:

[results table omitted]
I was expecting the NoCast method to be faster, to be honest. But the Cast method is consistently (very slightly) the fastest one. Which is surprising.

Here is the generated IL:

[IL listing omitted]

And the differences in assembly code are:

[assembly diff omitted]

Note that I’m not 100% sure about the assembly code. I got it from the disassembly window in VS, and it is possible that it changed what is actually going on.

So I’m not really sure why there would be a difference, and it really is a small difference. But it is there.

The useless text book algorithms

time to read 2 min | 256 words

I was listening to the Programming Touchdown podcast, and on Episode 43, around the 13:00 minute mark, there was the following quote:

I can count on one, maybe two hands, the number of times in my entire career where I need to use… like the algorithm I used made an impactful difference. Most of the time, it doesn’t matter, architecture can matter, sometimes. Like, true, textbook algorithms.

This is roughly what it felt like to hear this…

I mean, seriously! Okay, I write database engines for a living. Part of my job is to look at academic literature to see if there are good ways to improve what we are doing. But you know what, let us say that building database engines is something special, and that your regular web developer doesn’t need any of it.

Let us see how right that is, shall we?

I’m going to use The Art of Computer Programming by Donald E. Knuth as an example here, because that certainly matches the definition of a textbook. Reading just the table of contents:

  • Data structures - Stacks, Queues, Linked Lists, Arrays, Binary Trees.
  • Concepts: Co-routines (async in C#, promises in JS, etc.), Random Numbers, how floating point numbers actually work.
  • Sorting algorithms, searching sorted & unsorted data.

Saying that algorithms don’t matter is taking decades of research and throwing it down the tubes, because the machine is powerful enough for me to not notice that I’m brute forcing a solution at O(N^2).

The insidious cost of allocations

time to read 5 min | 835 words

One of the things we learned from building high performance systems is that algorithmic complexity is important for system performance, but it is (usually) easy to see.

Controlling allocations is something that you don’t see. But it can have an even greater impact on your system. Allocations require some CPU cost immediately, but the .NET framework is pretty good about keeping that small; in most cases, it requires only bumping a counter value and handing you some memory. The true cost of allocations happens when you need to release that memory: GC, compactions, multiple generations, etc.

There are whole fields of study that deal with optimizing this issue. The downside of allocating a lot of memory is that you are going to pay the cost later, and figuring out what exactly allocated that memory can often be quite hard, especially if you are trying to figure it out from a production trace.

And even with state of the art GC and memory management subsystems, your best bet for reducing memory management pressure is to reduce the number of allocations. We are far from the first to run into this; in fact, see the guidelines for contributing to Roslyn.

With RavenDB, here are a few commit messages dealing (only) with this issue:

[screenshot of commit messages omitted]
The results of the performance improvements on those cannot really be overstated. In some cases, we save an allocation of just 8 or 16 bytes. But in a hot path, that adds up fast: 16 bytes saved on a call site that is hit a hundred million times is about 1.5GB that never gets allocated.

The problem with allocations is that they can be pretty well hidden. Here is a code sample with hidden allocations:

var fileNames = new List<Tuple<string, bool>>();
foreach (var file in System.IO.Directory.GetFiles(dirToCheck, "*.dat"))
{
    using (var fs = new FileStream(file, FileMode.Open))
    {
        uint crc = CalculateCrc(fs);
        var crcStr = File.ReadAllText(file + ".crc");
        var crcMatch = crc == uint.Parse(crcStr);
        fileNames.Add(Tuple.Create<string, bool>(file, crcMatch));
    }
}

We will assume that CalculateCrc does absolutely no allocations, just calling ReadByte() until the end. We’ll further assume that there are 10,000 files in the directory.

How much memory is going to be allocated here? I’ll freely admit that I’ll probably miss some, but here is what I found:

    • The fileNames list, which will allocate (using powers of 2) a total of 256KB (32,764 item slots total in the arrays below, times 8 bytes per pointer):

new Tuple<string,bool>[4]
new Tuple<string,bool>[8]
new Tuple<string,bool>[16]
new Tuple<string,bool>[32]
new Tuple<string,bool>[64]
new Tuple<string,bool>[128]
new Tuple<string,bool>[256]
new Tuple<string,bool>[512]
new Tuple<string,bool>[1024]
new Tuple<string,bool>[2048]
new Tuple<string,bool>[4096]
new Tuple<string,bool>[8192]
new Tuple<string,bool>[16384] <-- this one goes in the large object heap; it is 128KB in size

  • The call to GetFiles will allocate about 0.5MB:
    • It also uses a list internally, then it calls ToArray on it, so the same costs as the list, plus another 78Kb for the array. And another 128Kb buffer is in the large object heap.
    • Note that I didn’t go too deep into the code; there are other allocations there, but it uses an enumerator internally (to feed the list), so I’ll assume no further allocations.
  • Creating a FileStream isn’t all that expensive, but reading from it (which is done in the CalculateCrc method) will force it to allocate a 4KB buffer. So that is 39MB right there.
  • File.ReadAllText is a really nice convenience method, but it uses a StreamReader internally, which has a 1Kb buffer, and that uses a FileStream, which has yet another 4Kb buffer. So another 48MB.
  • 10,000 strings for the file names returned from GetFiles, assuming that each file name is 8.3 characters, that gives us 107Kb.
  • 10,000 strings for the CRC values we read using ReadAllText, assuming each is also 11 characters (text value of 0x576B6624), we get another 107Kb.
  • Oh, and I almost forgot, we have another 10,000 strings allocated here, the file + ".crc" filenames, which is another 146Kb.
  • Finally, we have 10,000 tuples, which comes to about 90Kb.

Overall, calling this method will allocate about 90MB(!), including several allocations that fall into the large object heap.

Note that most of those allocations are done on your behalf, some inside the List handling, but mostly inside the I/O routines.

It has been our experience that allocations inside various I/O routines are a really common problem, especially when you have no way to send your own buffer (which you can pool, reuse, etc). For example, there is no way to share the buffer between different FileStreams instances.
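
To sketch what reuse could look like if you could bring your own buffer (ArrayPool is from the System.Buffers package, which arrived later than this post; UpdateCrc is a hypothetical incremental CRC helper; fs and crc are from the sample above):

// Rent one buffer and reuse it across files, instead of letting
// every stream allocate its own 4KB buffer
byte[] buffer = ArrayPool<byte>.Shared.Rent(4096);
try
{
    int read;
    while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
        crc = UpdateCrc(crc, buffer, read);
}
finally
{
    ArrayPool<byte>.Shared.Return(buffer);
}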

I created a Pull Request in CoreCLR for this feature. But in the meantime, profiling memory allocations (vs. just the memory used) is a great way to find such issues.

Concurrent max value

time to read 2 min | 324 words

For a feature in RavenDB, I need to figure out the maximum number of outputs per document that an index has. Now, indexing runs in multiple threads at the same time, so we ended up with the following code:

var actualIndexOutput = maxActualIndexOutput;
if (numberOfAlreadyProducedOutputs > actualIndexOutput)
{
    // okay, now let us verify that this is indeed the case, in a thread safe manner.
    // This way, most of the time we don't need volatile reads, and other sync operations
    // in the code ensure we don't have too stale a view of the data (besides, a stale view
    // has to mean a smaller number, which we then verify).
    actualIndexOutput = Thread.VolatileRead(ref maxActualIndexOutput);
    while (numberOfAlreadyProducedOutputs > actualIndexOutput)
    {
        // if it changed, we don't care, it is just another max, and another thread probably
        // set it for us, so we only retry while ours is still bigger
        actualIndexOutput = Interlocked.CompareExchange(
            ref maxActualIndexOutput,
            numberOfAlreadyProducedOutputs,
            actualIndexOutput);
    }
}

The basic idea is that this code path is hit a lot, once per document indexed, per index. And it also needs to be thread safe, so we first do an unsafe check, then a thread safe update.

The idea is that we’ll quickly arrive at the actual max number of index outputs, but we don’t have to pay the price of volatile reads or thread synchronization most of the time.
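
Extracted into a reusable helper, the same lock-free max pattern might look like this (a sketch, not RavenDB’s code; this version stops as soon as the exchange succeeds):

static void InterlockedMax(ref int target, int candidate)
{
    var current = Volatile.Read(ref target);
    while (candidate > current)
    {
        var witnessed = Interlocked.CompareExchange(ref target, candidate, current);
        if (witnessed == current)
            break; // we won the race, the new max is recorded
        current = witnessed; // another thread changed it, re-check against their value
    }
}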

The “you broke the build!” game

time to read 2 min | 344 words

Recently I pulled some code from a colleague, and tried to test it. It worked, which was fine, so I let it run the tests, and went out to lunch.

When I came back, I was surprised to discover that the build had failed, not because of some test failing, but because it couldn’t compile. To be rather more exact, we got the following error:

 [optimized-build] Using type script compiler: C:\Program Files (x86)\Microsoft SDKs\TypeScript\1.4\tsc.exe
 [optimized-build] System.ComponentModel.Win32Exception thrown:
 [optimized-build] --------------------------
 [optimized-build] The filename or extension is too long
 [optimized-build] --------------------------

That was strange. I checked several times, and we had no such thing. No one had a veryVeryLongFileNameThatNeededToBeVeryExplicitAboutWhatItWasDoingAndDidNotCareAboutLength.ts.

And the tsc.exe location was in its normal place. This is from a part of our build process that gathers all the TypeScript files and merges them into a single optimized bundle. And it suddenly failed. Now, on my colleague’s machine, it worked. The previous commit, before I merged, worked. The merge was a clean one, and it was very obvious that nothing unusual was going on there.

It took me a while, but I finally figured out that the error occurred because my colleague had added a new TypeScript file.

How can adding a file break the build?

As it turns out, the code we were calling did something like this:

tsc.exe file1.ts file2.ts file3.ts

And at some point, the size of the command line we were passing to the process exceeded 32KB. And that is when everything broke loose.

Luckily, we were able to write those things to a file and ask the compiler to load them directly, instead of passing them via the command line:

tsc.exe @arguments.txt
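
In code, the change amounts to something like this (a sketch; the variable names are illustrative):

// Before: every file name goes on the command line, which has a hard length limit
// Process.Start("tsc.exe", string.Join(" ", typeScriptFiles));

// After: write the arguments to a response file and pass just its name
File.WriteAllLines("arguments.txt", typeScriptFiles);
Process.Start("tsc.exe", "@arguments.txt");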

And everything was okay again…

