Ayende @ Rahien

Excerpts from the RavenDB Performance team report: Optimizing Compare, Don’t you shake that branch at me!

Note, this post was written by Federico. In the previous post, after inspecting the decompiled source using ILSpy, we were able to uncover potential things we could do.

By now we have already squeezed out almost all the obvious inefficiencies that we had uncovered through static analysis of the decompiled code, so we need another strategy. For that we need to analyze the runtime behavior in the average case. We did something like that in this post, when we worked through an example using a 16 byte compare with equal arrays.

To achieve that analysis live, we needed to somehow know the size of the typical memory block while running the test under a line-by-line profiler. We built a modified version of the code that stored the size of each memory chunk to compare and then built a histogram from that (that’s why exact replicability matters). For our workload, the histogram showed a couple of clusters for the length of the memory being compared. The first cluster was near 0 bytes, but not exactly 0. The other cluster was centered around 12 bytes, which makes sense as the keys of the documents were around 11 bytes. This gave us a very interesting insight. Armed with that knowledge, we made our first statistically based optimization.

You can notice the if statement at the start of the method, which is a pretty standard bounding condition: if the memory blocks are empty, then they are equal. In a normal environment such a check is so simple that nobody would bother, but in our case, when we are measuring the runtime in nanoseconds, 3 extra instructions and a potential branch miss do count.

That code looks like this:

image

That means that not only are we making the check, we are also forcing a short jump every single time. But our histogram also tells us that memory blocks of size 0 almost never happen. So we are paying 3 instructions and a branch for something that almost never happens. We also knew that there was a cluster near 0 that we could exploit. The problem is that we would be paying 3 cycles (1 nanosecond on our idealized processor) per branch. As our average is 97.5 nanoseconds, getting rid of it gives us roughly a 1% improvement on almost every call (except in the statistically unlikely case).

Resistance is futile, that branch needs to go.

In C, assembler, and almost any low-level environment such as GPUs, there are two common approaches to optimizing this kind of scenario.

  • The ostrich method (hide your head in the sand and pray it just works).
  • Use a lookup table.

The first is simple: if you don’t check and the algorithm can deal with the condition in the body, zero instructions always beat more than zero instructions (and this is a corner case anyway, so no damage is dealt). This approach is usually used in massively parallel computing, where the cost of instructions is negligible while memory access is not. But it has its uses in more traditional superscalar and branch-predicting architectures too (you just don’t have as much instruction budget to burn).

The second is more involved. You need to be able to somehow “index” the input, paying fewer instructions than the actual branches would cost (at a minimum of 1 nanosecond each, i.e. 3 instructions on our idealized processor). Then you create a branch table and jump to the appropriate index, which itself jumps to the proper code block, using just 2 instructions.

Note: Branch tables are very well explained at http://en.wikipedia.org/wiki/Branch_table. If you made it this far you should read it; don’t worry, I will wait.

The key takeaway: if your algorithm has a sequence of 0..n, you are in the best possible world, because you already have your index. Which we did.

I know what you are thinking: Will the C# JIT compiler be smart enough to convert such a pattern into a branch table?

The short answer is yes, if we give it a bit of help. The if-then-elseif pattern won’t cut it, but what about switch-case?

The compiler will create a switch opcode (in short, our branch table) if our values are small and contiguous.

Therefore that is what we did. The impact? Big, but this is just starting. Here is what this looks like in our code:

image
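To make the idea concrete, here is a minimal sketch of the pattern (hypothetical names and cases, not the actual Voron method): a switch over a small, contiguous range of sizes, which the C# compiler turns into the switch IL opcode and the JIT can lower to a single indexed jump.

private static unsafe int CompareSmall(byte* lhs, byte* rhs, int size)
{
    int diff;
    switch (size)   // contiguous cases 0..3 -> `switch` IL opcode -> branch table
    {
        case 0:
            return 0;                                   // empty blocks are equal
        case 1:
            return lhs[0] - rhs[0];
        case 2:
            if ((diff = lhs[0] - rhs[0]) != 0) return diff;
            return lhs[1] - rhs[1];
        case 3:
            if ((diff = lhs[0] - rhs[0]) != 0) return diff;
            if ((diff = lhs[1] - rhs[1]) != 0) return diff;
            return lhs[2] - rhs[2];
        default:
            // Larger blocks take the general path discussed in the other posts.
            for (int i = 0; i < size; i++)
                if ((diff = lhs[i] - rhs[i]) != 0)
                    return diff;
            return 0;
    }
}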

I’ll talk about the details of branch tables in C# more in the next post, but I didn’t want to leave you hanging too much.

Excerpts from the RavenDB Performance team report: Optimizing Memory Comparisons, size does matter

Note, this post was written by Federico. In the previous post, after inspecting the decompiled source using ILSpy, we were able to uncover potential things we could do. In this fragment we have a pretty optimized method that compares 4 bytes per loop iteration. What if we could do that with 8 bytes?

To achieve that we will use a ulong instead of a uint. This type of optimization makes sense for 2 reasons.

Most of our users are already running RavenDB on x64, where the native word is 8 bytes, and Voron is compiled for x64 only. But even if that were not true, since the late 2000s most CPUs have had a 64 byte L1 cache line with roughly half a cycle cost for a hit. So even if you can’t handle 64 bits in one go and the JIT or processor has to issue 2 instructions, you are still getting an L1 cache hit and no pipeline stall. Which is GREAT.

So without further ado, this is the resulting code:

image

Ayende’s note: In the code, the lp += (IntPtr)8/8; is actually written as lp += 1; in the source. What is actually happening is that we are advancing by 8 bytes (the size of ulong), and this is how ILSpy decided to represent that for some reason.

The actual IL generated for this is good:

image

It is just that the translation here is kind of strange.
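For reference, a minimal sketch of what an 8-bytes-at-a-time compare loop looks like in source form (an illustration with hypothetical names, not the exact Voron method shown in the screenshots):

private static unsafe int Compare(byte* lhs, byte* rhs, int size)
{
    ulong* lp = (ulong*)lhs;
    ulong* rp = (ulong*)rhs;

    // Walk the blocks one ulong (8 bytes) at a time; stop at the first differing word.
    while (size >= sizeof(ulong))
    {
        if (*lp != *rp)
            break;          // the difference is somewhere inside these 8 bytes
        lp++;               // the increment ILSpy renders as `lp += (IntPtr)8 / 8`
        rp++;
        size -= sizeof(ulong);
    }

    // Finish (or pinpoint the differing byte) one byte at a time, which also
    // keeps the ordering independent of endianness.
    byte* lb = (byte*)lp;
    byte* rb = (byte*)rp;
    for (int i = 0; i < size; i++)
    {
        int diff = lb[i] - rb[i];
        if (diff != 0)
            return diff;
    }
    return 0;
}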

Therefore the question to ask here is: will skipping over the equal parts of the memory block at a faster rate compensate for the cost of doing the final check over 8 bytes instead of 4 bytes?

Well, the answer is a resounding yes. It doesn’t have much impact for the first 32 bytes (around 3% or less); we don’t lose, but we don’t win much either. But after that it skyrockets.

// Bandwidth optimization kicks in
Size:    32 Original:   535 Optimized:   442 Gain:   5.01%
Size:    64 Original:   607 Optimized:   493 Gain:   7.08%
Size:   128 Original:   752 Optimized:   573 Gain:  11.77%
Size:   256 Original: 1,080 Optimized:   695 Gain:  35.69%
Size:   512 Original: 1,837 Optimized:   943 Gain:  74.40%
Size: 1,024 Original: 3,200 Optimized: 1,317 Gain: 122.25%
Size: 2,048 Original: 5,135 Optimized: 2,110 Gain: 123.13%
Size: 4,096 Original: 8,753 Optimized: 3,690 Gain: 117.29%

Those are real measurements. You can see that when the bandwidth optimization kicks in, the gains start to get really high. This means that changing the compare width alone from 4 bytes to 8 bytes got us a very large improvement, stabilizing at around a 120% gain.

Not bad for 2 lines of work.

Excerpts from the RavenDB Performance team report: Optimizing Memory Comparisons, Digging into the IL

Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such. In the previous post, after inspecting the decompiled source using ILSpy, we were able to uncover potential things we could do.

Do you remember how dotPeek and ILSpy didn’t agree on the last for-loop?

dotPeek

image

ILSpy

image

Well, to really know which one is right, let’s dig deeper. It looks like dotPeek is just too smart for our purposes.

image

MSIL is a stack machine, so everything has to be pushed onto the stack to be operated on. And the lower you go, the less context you have to make optimization choices. The compiler knows a lot more, and therefore it can make sensible choices that the JIT can’t. Well, this is one of those cases: the problem here is that the compiler is treating those native memory references in a very “un-native” way, leaving little room for the JIT to do its magic. Therefore we are going to give the compiler a nudge to point it in the right direction.

We know that most architectures have a set of indexed instructions that allow you to load from memory at a base address plus an offset, and special variants optimized to operate with constants. Yeah, all that magic in a single opcode.

Therefore, if we can find a way to make the compiler emit such a sequence, there is a high chance that the JIT will understand it and emit such a load instruction. What could appear to be a long shot is actually quite easy. Instead of doing pointer arithmetic (pre/post increment and dereferencing) as usual, we will do something we would never do in C/C++; we will just ask for it at face value.

So what would be:

var v = *(lhs++) - *(rhs++); 

Now becomes:

var v = lhs[0] - rhs[0]; 

lhs++; 

rhs++;

What if we need the next one?

var v = lhs[1] - rhs[1]; 

And so on… However, that is true if and only if the number can be loaded onto the stack using a special short instruction (a shortcut) that encodes the value to load as a named constant.
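Putting the pieces together, a hedged sketch of how a small fixed-size compare might be unrolled with indexed access instead of pointer increments (illustrative only, not the actual Voron code):

private static unsafe int CompareFourBytes(byte* lhs, byte* rhs)
{
    // Constant indices (lhs[0], lhs[1], ...) let the compiler use the short
    // ldc.i4.N opcodes and hand the JIT a clean base + constant offset load.
    int v;
    if ((v = lhs[0] - rhs[0]) != 0) return v;
    if ((v = lhs[1] - rhs[1]) != 0) return v;
    if ((v = lhs[2] - rhs[2]) != 0) return v;
    return lhs[3] - rhs[3];
}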

Why does this work?

Because the MSIL pattern is unequivocal:

image

  • We push the first pointer (lhs).
  • We load a byte from it and put it into an int32 register on the stack.
  • We push the second pointer (rhs).
  • We load a byte from it and put it into an int32 register on the stack.
  • We subtract the two loaded int32 values.
  • We store the result into a stack variable (v).
  • We load it back onto the stack from (v).
  • We check whether it is distinct from 0.

The JIT can now easily figure out how to optimize this with a load + offset instruction. Moreover, the offset is also a constant; did anyone say “special opcode”? Now let’s compare the IL code from each approach.

Before Optimization:

image

After Optimization

image

While the number of instructions is the same, as the avid reader will have figured out by now, the code is not that different either.

However, the former translates to far more native instructions than the latter. Why? We would have to ask the JIT or the compiler guys, but my hypothesis is that the first version requires a much deeper analysis than the second, and in an effort to keep JIT overhead low, that pattern isn’t optimized as much.

The bottom line is: “Do not optimize pointers in C# as you do in C/C++. Translating an optimized algorithm that uses pointers from C/C++ to C# will not be optimal.”

Remember this, it will make sense soon, because in the next post, we’ll tie it all together.

Excerpts from the RavenDB Performance team report: Optimizing Memory Comparisons

Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such. In the previous post, after inspecting the decompiled source using ILSpy, we were able to uncover potential things we could do.

Getting rid of unwanted type conversions may seem like a small cost, but let’s work through an example. Let’s say we are comparing 2 memory blocks of 16 bytes and they are equal (our worst case scenario).

Just for the sake of simplification, of the 3 potential sources, assume the memory is aligned, so there is no need for the first 2 unwanted conversions. That leaves us with the main body as the only source of unwanted conversions.

image

Now, this loop moves our pointer 4 bytes each time and causes 2 conversions per iteration. Therefore, for a 16 byte array (a pretty average size) we are performing a grand total of 8 conversions. Assuming our idealized processor, at 0.33 ns per conversion instruction that is 2.64 ns, or roughly 3% of the total time of an average call. Getting rid of it is easy, as the size of an unsigned int is a constant.

private const int sizeOfUint = sizeof(uint);

Therefore the final executable code will be:

image

Here we have 2 interesting side effects:

  • We no longer have the conversion, and the constant is inlined instead of an indirection through a stack variable.
  • Almost every comparison you do over a constant that is 2^n based can be converted to a shift operation.

If the JIT is smart enough, this check can be compiled into a shift by 2 places and a test of whether the result is bigger than 0, squeezing 4 instructions into 2 for each while iteration.

You guessed right, the JIT is.
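As an illustration (a guess at the shape of the loop, not the actual Voron code), this is the kind of condition we are talking about:

private const int sizeOfUint = sizeof(uint); // 4, a compile-time constant

private static unsafe bool WordsAreEqual(byte* lhs, byte* rhs, int size)
{
    // Because sizeOfUint is a constant power of two, the JIT can compile the
    // division below down to a shift plus a compare instead of a real division.
    while (size / sizeOfUint > 0)
    {
        if (*(uint*)lhs != *(uint*)rhs)
            return false;
        lhs += sizeOfUint;
        rhs += sizeOfUint;
        size -= sizeOfUint;
    }
    return true;
}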

Excerpts from the RavenDB Performance team report: Optimizing Memory Compare/Copy Costs

Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such.

TLDR: Optimizing at this level is really hard. To achieve gains of 20%+ for Compare and from 200% to 6% in Copy (depending on the workload) we will need to dig very deep at the IL level.

Another area we looked deeply into is how we move and access memory. This type of optimization work is especially relevant if you are using Voron to handle big workloads. With small databases the improvements can be felt, but where they shine is when dealing with multi-gigabyte databases or high-throughput key-value retrieve and put workloads (did anyone think bulk inserts?).

Using FreeDB as in this previous post, we built an experiment that we could use to pinpoint the pain points, making sure we could also repro it every single time (no indexes, no extra calls around). Under the microscope, 2 pain points were evident when dealing with memory: comparing and moving memory around.

We usually compare memory straight from the mmapped unmanaged memory when looking for a particular document in Voron trees, and we copy from and to Voron pages when storing and retrieving documents. These are very core operations for any storage engine; Voron is not a special case. Before we started the optimization effort we already had a pretty optimized routine.

What this method does is (a rough sketch follows the list):

  • If the memory blocks have zero size, there is no doubt they are equal.
  • If the memory blocks are bigger than the size of a word (32 bits) we do a pre-compare over the aligned memory blocks (for performance) in order to rule out all the equals.
  • As we cannot use words to calculate the output (handling the Endianness would cost us), we do a byte by byte comparison for the final check.                   
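Putting those three steps together, a rough sketch of the overall shape of such a routine (my own illustration; the decompiled original appears further down in this post):

private static unsafe int MemoryCompare(byte* lhs, byte* rhs, int size)
{
    if (size == 0)
        return 0;                              // empty blocks are trivially equal

    // Pre-compare a word (32 bits) at a time to rule out equality cheaply.
    // (The real routine also takes care of aligning the pointers first.)
    while (size >= sizeof(uint))
    {
        if (*(uint*)lhs != *(uint*)rhs)
            break;                             // the difference lies in this word
        lhs += sizeof(uint);
        rhs += sizeof(uint);
        size -= sizeof(uint);
    }

    // Compute the actual ordering byte by byte, which sidesteps endianness.
    for (int i = 0; i < size; i++)
    {
        int diff = lhs[i] - rhs[i];
        if (diff != 0)
            return diff;
    }
    return 0;
}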

For our insert workload we were at roughly 97.5 nanoseconds per memory compare on average. To put that in context, if each assembler instruction could be executed in exactly 1 cycle (which usually is not true), then 3 instructions take an entire nanosecond, and therefore our average instruction budget is 291 instructions. Remember this idealized processor; we will use the same comparison later for more complex analysis.

Memory compares can be of different sizes, which is why controlling the environment is very important for this type of optimization work.

To deal with that, we were already using many tricks from the optimization book: from ensuring that memory alignment is optimal, to batching compares with bigger primitive sizes, to pointer arithmetic. At first sight this is the kind of method you wouldn’t try to optimize at all; it is pretty damn tight.

Ayende’s note: We had already done an optimization step on memory comparisons. We initially just shelled out to the native memcmp method, but the cost of doing a P/Invoke call ended up being noticeable, and we wrote the previously optimized version (and had several rounds of that) to alleviate that cost.

However, we took up the challenge because the payoff can be huge. For a very small bulk insert of 50,000 documents inserted into an empty database, we are talking in the ballpark of 5 million compares (yeah, you read that right). Even if we manage to squeeze only 1% off, the sheer volume of calls will make it worthwhile. To achieve that we had to do the unthinkable: we had to resort to digging into the MSIL that method was generating. Armed with ILSpy, we found that we might have a way to shave off some inefficiencies.

Here is what this looks like when we start actually putting the analysis into action. You can see the method code (after decompilation, so we can be closer to the IL) as well as the issues that were discovered in the process.

image

Because of the size of the method, the fastest way was to use a C# decompilation, even though we then matched it against the generated IL. The trick of using the C# decompiled version requires a decompiler that is not too smart when dealing with the code. If the decompiler had understood the original code’s intention and acted upon it, we would never have spotted some of the optimizations at this level. For example, the last loop decompiled with JetBrains dotPeek would look like this:

image

Always keep an old version of a decompiler around, just in case you may need it.

Ayende’s note: In most cases, you can select the level of details that a decompiler can give you. With Reflector, for example, you can select how deeply it will decompile things, but even so, doing stupid decompilation can be very beneficial by showing us what is actually going on.

Understanding where the inefficiencies may be is one thing; being able to prove them is another matter. And we will tackle all of them in future posts.

We will also leave the memcpy analysis for later because it builds on the optimizations used in memcmp and also requires a deep analysis of the Buffer.Memcpy method already available in the .Net Framework (for internal use of course).

If what we did to the poor Etags was evil, you are now arriving at the gates of the underworld.

Ayende’s note: This is a pretty complex topic, and it goes on for quite a while. In order to maintain interest, and to avoid having people getting lost in the details, I broke it apart for several posts. In the meantime, given the details in this post, how would you suggest improving this?

Excerpts from the RavenDB Performance team report: Expensive headers, and cache effects

This ended up being pretty obvious, in retrospect. We noticed in the profiler that we spent a lot of time working with headers. Now, RavenDB is using REST as the communication layer, so it is doing a lot with that, but we should be able to do better.

Then Tal dug into the actual implementation and found:

public string GetHeader(string key)
{
	if (InnerHeaders.Contains(key) == false)
		return null;
	return InnerHeaders.GetValues(key).FirstOrDefault();
}

public List<string> GetHeaders(string key)
{
	if (InnerHeaders.Contains(key) == false)
		return null;
	return InnerHeaders.GetValues(key).ToList();
}


public HttpHeaders InnerHeaders
{
	get
	{
		var headers = new Headers();
		foreach (var header in InnerRequest.Headers)
		{
			if (header.Value.Count() == 1)
				headers.Add(header.Key, header.Value.First());
			else
				headers.Add(header.Key, header.Value.ToList());
		}

		if (InnerRequest.Content == null)
			return headers;

		foreach (var header in InnerRequest.Content.Headers)
		{
			if (header.Value.Count() == 1)
				headers.Add(header.Key, header.Value.First());
			else
				headers.Add(header.Key, header.Value.ToList());
		}

		return headers;
	}
}

To be fair, this implementation was created very early on, and no one ever actually spent any time looking at it since (why would they? it worked, and quite well). The problem is the number of copies that we have, and the fact that to pull a single header, we have to copy all the headers, sometimes multiple times. We replaced this with code that wasn’t doing stupid stuff, and we couldn’t even find the cost of working with headers in the profiler any longer.
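For illustration, here is a hedged sketch of a lookup that avoids the copies by asking the underlying headers directly (using HttpHeaders.TryGetValues; this is not necessarily the exact code we ended up with):

public string GetHeader(string key)
{
	// Look the header up in place instead of materializing a copy of
	// every header on each call.
	IEnumerable<string> values;
	if (InnerRequest.Headers.TryGetValues(key, out values))
		return values.FirstOrDefault();

	if (InnerRequest.Content != null &&
		InnerRequest.Content.Headers.TryGetValues(key, out values))
		return values.FirstOrDefault();

	return null;
}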

But that brings up a really interesting question. How could we not know about this sort of thing? I mean, this isn’t the first time that we are doing a performance pass on the system. So how come we missed this?

The answer is that in this performance pass, we are doing something different. Usually we perf-test RavenDB the way you would use it on your own systems. But for the purpose of this suite of tests, and in order to find more stuff that we can optimize, we are actually working with a stripped down client: no caching, no attempt to optimize things across the entire board. In fact, we have put RavenDB in the worst possible situation, all new work, and no chance to do any sort of optimization, and then we start seeing how all of those code paths that were rarely hit light up quite nicely.

Excerpts from the RavenDB Performance team report: The long tale of a lambda

This nugget was discovered by Tal, who was measuring write throughput and noticed that a lot of the time wasn’t being handled in the proper code path, but on something on the side that seemed… off.

prefetchingQueue.Aggregate(0, (x,c) => x + SelectSerializedSizeOnDiskIfNotNull(c)) > context.Configuration.AvailableMemoryForRaisingBatchSizeLimit)

This piece of code is responsible for heuristics deep inside the bowels of RavenDB. In fact, it is the piece that decides whether we have enough memory to index directly from memory or whether we need to start dumping stuff and pay the I/O cost later.

As a result, this piece of code is called once for every document save. It was also wildly inefficient. The Aggregate implementation was pretty much as you would imagine it, and it took three times as long as actually processing the rest of the request. The underlying reason was that we kept doing a foreach and a delegate invocation on each and every call. The more documents we had coming in, the more work we had to do. Shlemiel the painter at his best.

We first saw a major improvement by just removing the Aggregate() call in favor of a purpose built function that summed all those details for us directly. Next, we changed things so that instead of doing O(N) work per request, we could do O(1) work by doing it one bit at a time and aggregating on the fly. So whenever we added or removed something from the prefetching queue, we would also make sure to add / remove that amount from a global tally.
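A hedged sketch of that idea, with hypothetical type names (not the actual RavenDB classes): the queue maintains its own running total, so the memory heuristic becomes a single read instead of an O(N) sum.

// Requires System.Collections.Concurrent and System.Threading.
public class PrefetchedDocument
{
    public long SizeOnDisk;
}

public class PrefetchingQueue
{
    private readonly ConcurrentQueue<PrefetchedDocument> _queue =
        new ConcurrentQueue<PrefetchedDocument>();
    private long _serializedSizeOnDisk;

    // O(1): just read the tally that Enqueue/TryDequeue keep up to date.
    public long SerializedSizeOnDisk
    {
        get { return Interlocked.Read(ref _serializedSizeOnDisk); }
    }

    public void Enqueue(PrefetchedDocument doc)
    {
        _queue.Enqueue(doc);
        Interlocked.Add(ref _serializedSizeOnDisk, doc.SizeOnDisk);
    }

    public bool TryDequeue(out PrefetchedDocument doc)
    {
        if (_queue.TryDequeue(out doc) == false)
            return false;
        Interlocked.Add(ref _serializedSizeOnDisk, -doc.SizeOnDisk);
        return true;
    }
}

The heuristic then becomes a plain comparison along the lines of prefetchingQueue.SerializedSizeOnDisk > context.Configuration.AvailableMemoryForRaisingBatchSizeLimit, with no enumeration of the queue at all.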

Once that was done, we saw an almost 18% improvement in high write scenarios, because we were no longer busy counting how much stuff we had in memory just to figure out whether we could put more things in memory.

I can’t emphasize enough how important it is that the work throughout was done using a profiler (in our case, the excellent dotTrace), because if dotTrace hadn’t pointed a big finger at this line of code, we would never have considered it to be problematic.

Excerpts from the RavenDB Performance team report: Dates take a lot of time

RavenDB uses a lot of dates, from the last modified metadata on a document to the timestamp of an index or when a query was started or… you get the point, lots and lots of dates.

Dates in RavenDB are usually formatted in the following manner:

2015-01-15T00:41:16.6616631

This is done using the following date time format:

yyyy'-'MM'-'dd'T'HH':'mm':'ss.fffffff

This is pretty awesome. It generates readable dates that are lexicographically sorted. There is just one problem: this is really expensive to do. How expensive? Well, outputting 10 million dates in the following manner:

dateTime.ToString(Default.DateTimeFormatsToWrite, CultureInfo.InvariantCulture)

This takes 13.3 seconds, or just about 750 dates per millisecond. The costs here are partly the allocations, but mostly it is about the fact that the format provider needs to first parse the format specifier, then do quite a bit of work to get it working. And DateTime itself isn’t very cheap. The solution presented is ugly, but it works, and it is fast.

public unsafe static string GetDefaultRavenFormat(this DateTime dt, bool isUtc = false)
{
    string result = new string('Z', 27 + (isUtc ? 1 : 0));

    var ticks = dt.Ticks;

    // n = number of days since 1/1/0001
    int n = (int)(ticks / TicksPerDay);
    // y400 = number of whole 400-year periods since 1/1/0001
    int y400 = n / DaysPer400Years;
    // n = day number within 400-year period
    n -= y400 * DaysPer400Years;
    // y100 = number of whole 100-year periods within 400-year period
    int y100 = n / DaysPer100Years;
    // Last 100-year period has an extra day, so decrement result if 4
    if (y100 == 4) y100 = 3;
    // n = day number within 100-year period
    n -= y100 * DaysPer100Years;
    // y4 = number of whole 4-year periods within 100-year period
    int y4 = n / DaysPer4Years;
    // n = day number within 4-year period
    n -= y4 * DaysPer4Years;
    // y1 = number of whole years within 4-year period
    int y1 = n / DaysPerYear;
    // Last year has an extra day, so decrement result if 4
    if (y1 == 4) y1 = 3;
    // If year was requested, compute and return it
    var year = y400 * 400 + y100 * 100 + y4 * 4 + y1 + 1;

    // n = day number within year
    n -= y1 * DaysPerYear;
    // Leap year calculation looks different from IsLeapYear since y1, y4,
    // and y100 are relative to year 1, not year 0
    bool leapYear = y1 == 3 && (y4 != 24 || y100 == 3);
    int[] days = leapYear ? DaysToMonth366 : DaysToMonth365;
    // All months have less than 32 days, so n >> 5 is a good conservative
    // estimate for the month
    int month = n >> 5 + 1;
    // m = 1-based month number
    while (n >= days[month]) month++;
    // If month was requested, return it

    // Return 1-based day-of-month
    var day = n - days[month - 1] + 1;

    fixed (char* chars = result)
    {
        var v = _fourDigits[year];
        chars[0] = v[0];
        chars[1] = v[1];
        chars[2] = v[2];
        chars[3] = v[3];
        chars[4] = '-';
        v = _fourDigits[month];
        chars[5] = v[2];
        chars[5 + 1] = v[3];
        chars[7] = '-';
        v = _fourDigits[day];
        chars[8] = v[2];
        chars[8 + 1] = v[3];
        chars[10] = 'T';
        v = _fourDigits[(ticks / TicksPerHour) % 24];
        chars[11] = v[2];
        chars[11 + 1] = v[3];
        chars[13] = ':';
        v = _fourDigits[(ticks / TicksPerMinute) % 60];
        chars[14] = v[2];
        chars[14 + 1] = v[3];
        chars[16] = ':';
        v = _fourDigits[(ticks / TicksPerSecond) % 60];
        chars[17] = v[2];
        chars[17 + 1] = v[3];
        chars[19] = '.';

        long fraction = (ticks % 10000000);
        v = _fourDigits[fraction / 10000];
        chars[20] = v[1];
        chars[21] = v[2];
        chars[22] = v[3];

        fraction = fraction % 10000;

        v = _fourDigits[fraction];
        chars[23] = v[0];
        chars[24] = v[1];
        chars[25] = v[2];
        chars[26] = v[3];
    }

    return result;
}

We use the same general pattern that we used with etags, although here we are also doing a lot of work to figure out the right parts of the date. Note that we don’t have any allocations, and we again use the notion of a lookup table, this time of all the pre-computed 4 digit numbers. That allows us to process 10,000,000 dates in just over 2 seconds (2,061 ms, to be exact), or roughly 4,850 dates per millisecond. In other words, we run in about 15% of the time of the original implementation.
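The _fourDigits lookup table itself isn’t shown above; a minimal sketch of how such a table might be built (an assumption on my part, not the exact RavenDB code):

// Pre-compute the string form of every number in 0..9999, so that at format
// time each date component is an array lookup plus a few char copies.
private static readonly string[] _fourDigits = CreateFourDigitsTable();

private static string[] CreateFourDigitsTable()
{
    var table = new string[10000];
    for (int i = 0; i < table.Length; i++)
        table[i] = i.ToString("D4", CultureInfo.InvariantCulture); // e.g. 42 -> "0042"
    return table;
}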

This code is ugly; in fact, the last few posts have contained pretty ugly code that is hard to understand. But it is significantly faster than the alternative, and what is even more important, those pieces of code are in RavenDB’s hot path. In other words, we have actually seen significant performance improvements from introducing them to the codebase.

Excerpts from the RavenDB Performance team report: Etags and evil code, part II

In my previous post, I talked about how we improved the performance of Etag parsing from 5 etags/ms to 3,500 etags/ms. In this post, I want to talk about the exact opposite problem, how we take an Etag and turn it into a string. Here is the original code that we had:

public unsafe override string ToString()
{
    var sb = new StringBuilder(36);
    var buffer = stackalloc byte[8];

    *((long*)buffer) = restarts;

    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[7]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[6]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[5]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[4]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[3]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[2]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[1]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[0]]);
    sb.Append('-');

    *((long*)buffer) = changes;

    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[7]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[6]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[5]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[4]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[3]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[2]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[1]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[0]]);

    var etagAsString = sb.ToString();
    Debug.Assert(etagAsString.Length == 36); //prevent stupid bugs if something is refactored

    return etagAsString;
}

As you can see, we had already optimized this a bit. It is using a string builder, and it is using a lookup table to avoid costly byte to string conversions. Note also that we use a stackalloc buffer, so there isn’t an actual heap allocation; we copy the value once, and then just access its bytes directly, which is cheaper than doing a lot of bit shifting.
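For reference, a hedged guess at how a byte-to-hex lookup table like GenericUtil.ByteToHexAsStringLookup might be built (the actual implementation may differ):

// Pre-compute the two-character hex string for every possible byte value, so
// formatting a byte becomes a single array lookup.
public static readonly string[] ByteToHexAsStringLookup = CreateByteToHexLookup();

private static string[] CreateByteToHexLookup()
{
    var table = new string[256];
    for (int i = 0; i < table.Length; i++)
        table[i] = i.ToString("X2"); // e.g. 0x2C -> "2C"
    return table;
}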

So far so good. Running on 10 million Etags, this completes in 8.9 seconds. That is good; it gives us 1,125 Etags per millisecond.

Here is the optimized version:

public unsafe override string ToString()
{
    var results = new string('-', 36);

    fixed (char* buf = results)
    {
        var buffer = stackalloc byte[8];
        *((long*)buffer) = restarts;
        var duget = GenericUtil.ByteToHexAsStringLookup[buffer[7]];
        buf[0] = duget[0];
        buf[1] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[6]];
        buf[2] = duget[0];
        buf[3] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[5]];
        buf[4] = duget[0];
        buf[5] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[4]];
        buf[6] = duget[0];
        buf[7] = duget[1];
        //buf[8] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[3]];
        buf[9] = duget[0];
        buf[10] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[2]];
        buf[11] = duget[0];
        buf[12] = duget[1];
        //buf[13] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[1]];
        buf[14] = duget[0];
        buf[15] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[0]];
        buf[16] = duget[0];
        buf[17] = duget[1];
        //buf[18] = '-';

        *((long*)buffer) = changes;

        duget = GenericUtil.ByteToHexAsStringLookup[buffer[7]];
        buf[19] = duget[0];
        buf[20] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[6]];
        buf[21] = duget[0];
        buf[22] = duget[1];
        //buf[23] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[5]];
        buf[24] = duget[0];
        buf[25] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[4]];
        buf[26] = duget[0];
        buf[27] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[3]];
        buf[28] = duget[0];
        buf[29] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[2]];
        buf[30] = duget[0];
        buf[31] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[1]];
        buf[32] = duget[0];
        buf[33] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[0]];
        buf[34] = duget[0];
        buf[35] = duget[1];

        return results;
    }
}

Note that here we don’t bother with a string builder, we directly manipulate the string. And we still use all the other tricks (the lookup table, the no allocation, the works). This code managed to get to 5.5 seconds for 10,000,000 etags, or roughly 1,800 etags per millisecond. That is roughly a 37.5% reduction in runtime for a pretty important piece of code.

Do you see anything else that we can do to reduce the cost even further?

Excerpts from the RavenDB Performance team report: Etags and evil code, Part I

As part of the performance work we have been doing, we focused on the Etag class. For this post, I’m going to talk about Etag.Parse(string) and what we did to improve its performance.

Etag is a core type in RavenDB, it is how we represent changes in the database, and we deal with it a lot. As such, it turned up as a performance hot spot in our profiling. This post is based mostly around Maxim’s work to improve the overall performance.

One of the things that we do with Etag is to take a string and turn that into an instance of an Etag. An Etag looks just like a Guid, but its structure has meaning that we care about. Here is what a typical Etag looks like:

01000000-0000-0005-0000-00000000012C

We send them over the wire as strings for a lot of purposes. Here is the initial code that we used for parsing Etags:

public static Etag Parse2(string str)
{
    if (string.IsNullOrEmpty(str))
        throw new ArgumentException("str cannot be empty or null");
    if (str.Length != 36)
        throw new ArgumentException(string.Format("str must be 36 characters (Etag::Parse is {0})", str));

    var buffer = new byte[16]
    {
        byte.Parse(str.Substring(16, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(14, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(11, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(9, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(6, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(4, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(2, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(0, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(34, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(32, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(30, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(28, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(26, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(24, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(21, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(19, 2), NumberStyles.HexNumber)
    };

    return new Etag
    {
        restarts = BitConverter.ToInt64(buffer, 0),
        changes = BitConverter.ToInt64(buffer, 8)
    };
}

You can note several things here. First, we are being kind of funky here with the order of parsing. This is because we read the string in big-endian format, to allow direct comparison of the byte strings.

The other issue is the sheer number of allocations and calls we are making. Because this is such a small piece of code, we created 10,000 etags and parsed them a million times. That took 17.3 seconds, for a throughput of roughly 5 Etags per millisecond.

Then we sat down and re-wrote it using a lot of somewhat nasty tricks.

private readonly static int[] _asciisOfHexToNum = CreateHexCharsToNumsTable();

private static int[] CreateHexCharsToNumsTable()
{
    var c = new int['z' + 1];
    for (var i = '0'; i <= '9'; i++)
    {
        c[i] = (char)(i - '0');
    }
    for (var i = 'A'; i <= 'Z'; i++)
    {
        c[i] = (char)((i - 'A') + 10);
    }
    for (var i = 'a'; i <= 'z'; i++)
    {
        c[i] = (char)((i - 'a') + 10);
    }

    return c;
}

public unsafe static Etag Parse(string str)
{
    if (str == null || str.Length != 36)
        throw new ArgumentException("str cannot be empty or null");

    fixed (char* input = str)
    {
        var etag = new Etag();
        int fst = ((byte)(_asciisOfHexToNum[input[0]] * 16 + _asciisOfHexToNum[input[1]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[2]] * 16 + _asciisOfHexToNum[input[3]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[4]] * 16 + _asciisOfHexToNum[input[5]])) << 8 |
            (byte)(_asciisOfHexToNum[input[6]] * 16 + _asciisOfHexToNum[input[7]]);
        int snd = ((byte)(_asciisOfHexToNum[input[9]] * 16 + _asciisOfHexToNum[input[10]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[11]] * 16 + _asciisOfHexToNum[input[12]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[14]] * 16 + _asciisOfHexToNum[input[15]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[16]] * 16 + _asciisOfHexToNum[input[17]]));
        etag.restarts = (uint)snd | ((long)fst << 32);


        fst = ((byte)(_asciisOfHexToNum[input[19]] * 16 + _asciisOfHexToNum[input[20]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[21]] * 16 + _asciisOfHexToNum[input[22]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[24]] * 16 + _asciisOfHexToNum[input[25]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[26]] * 16 + _asciisOfHexToNum[input[27]]));
        snd = ((byte)(_asciisOfHexToNum[input[28]] * 16 + _asciisOfHexToNum[input[29]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[30]] * 16 + _asciisOfHexToNum[input[31]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[32]] * 16 + _asciisOfHexToNum[input[33]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[34]] * 16 + _asciisOfHexToNum[input[35]]));
        etag.changes = (uint)snd | ((long)fst << 32);
        return etag;
    }
}

Here is what we did. We created a translation table, so the value of every possible hex character is pre-calculated. That means that the cost of actually parsing the string is basically a series of lookups into an array, which is constant time. There are no allocations, nor is there anything expensive going on. On the same machine where the previous code took 17.3 seconds to run, this code can process a million Etags in just 287 milliseconds, or close to 3,500 Etags per millisecond.

I leave the rate of improvement as an exercise for the reader. Also, can you think of any way in which we can improve this code even further? We weren’t able to figure out anything, but you never know.

Excerpts from the RavenDB Performance team report: Voron vs. Esent

Another thing that turned up in the performance work was the Esent vs. Voron issue. We keep testing everything on both, trying to see which one can outdo the other, fixing a hotspot, then trying again. When we ran the YCSB benchmark we also compared Esent and Voron as storage for our databases, and we found that Voron was very good in read operations while Esent was slightly better in write operations. During the YCSB tests we found one of the reasons why Voron was a bit slower than Esent for writing: it was consuming 4 times the expected disk space.

The reason for this high disk space consumption was that the benchmark by default generates documents of exactly 1KB; with metadata the actual size was 1.1KB. Voron’s internal implementation uses a B+ tree whose leaves are 4KB in size, and 1KB was the threshold at which we decide not to store the data in the leaf itself but to keep a reference in it and save the data on a new page. We ended up creating a new 4KB page to hold each 1.1KB document that we saved. The benchmark hit the worst case scenario for our implementation, and caused us to use 4 times more disk space and write 4 times more data than we needed. Changing this threshold reduced the disk space consumption to the expected size, and gave Voron a nice boost.

We are also testing our software on a wide variety of systems, and with Voron specifically we ran into an annoying issue. Voron is a write-ahead log system, and we are very careful to write to the log in a very speedy manner. This is one of the ways in which we are getting really awesome speed from Voron. But when running on a slow I/O system, and putting a lot of load on Voron, we started to see very large stalls after a while. Tracing the issue took a while, but eventually we figured out what was going on. Writing to the log is all well and good, but we also need to send the data to the actual data file at some point.

The way Voron does it, it batches a whole bunch of work, writes it to the data file, then syncs the data file to make sure it is actually persisted on disk. Usually, that isn’t really an issue. But on slow I/O, and especially under load, you get results like this:

Start to sync data file (8:59:52 AM). Written but unsynced data size 309 MB
FlushViewOfFile duration 00:00:13.3482163. FlushFileBuffers duration: 00:00:00.2800050.
End of data pager sync (9:00:05 AM). Duration: 00:00:13.7042229

Note that these are random writes, because we may be writing to any part of the file, but that is still way too long. What was worse, and the reason we actually care, is that we were doing that while holding the transaction lock.

We were able to re-design that part so that even under slow I/O, we take the lock only for a very short amount of time, update the in-memory data structure, and then release the lock and spend some quality time gazing at our navel in peace while the I/O proceeds at its own pace, now without blocking anyone else.
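A hedged sketch of the pattern (hypothetical names, not Voron’s actual flush code): do only the in-memory bookkeeping while holding the transaction lock, and perform the slow write and sync of the data file after releasing it.

private void FlushDataFile()
{
    List<DirtyPage> pagesToWrite;

    lock (_transactionLock)
    {
        // Only the cheap part happens under the lock.
        pagesToWrite = _pagesPendingFlush;
        _pagesPendingFlush = new List<DirtyPage>();
    }

    // Writers only need the lock above, so they are no longer blocked while
    // the (potentially very slow) random writes and the fsync happen here.
    foreach (var page in pagesToWrite)
        WritePageToDataFile(page);

    _dataFile.Flush(flushToDisk: true); // the FlushFileBuffers step from the log above
}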

Excerpts from the RavenDB Performance team report: Routing

We have been quite busy lately doing all sorts of chores. The kind of small features that you don’t really notice, but that make a lot of difference over time. One of the ways that we did that was to assign a team for the sole purpose of improving the performance of RavenDB. This is done by writing a lot of stress tests, running under profilers and in general doing very evil things to the code to get it to work much faster. The following is part of the results that we have found so far, written mostly by Tal.

We decided to use the Yahoo! Cloud Serving Benchmark (AKA YCSB) for that purpose. While YCSB is great at creating stress on the target database servers, it does not offer a complex interface to stress complex scenarios, and seems to be targeting key-value solutions. YCSB basically allows you to control some parameters of the workload engine and the implementation of five basic operations: Insert, Update, Read, Scan and Delete of a single record, in a single table/collection.

Note: I think that this is because it focused (at least initially) primarily on the Big Table type of NoSQL solutions.

This very thin interface allows us to mainly stress document PUT/GET. The first thing we found is that running with indexes actually slowed us down, since we were “wasting” precious CPU time on indexing without doing any querying. That is enough to show you the true face of benchmark miracles, since where have you ever seen a database that you didn’t wish to run a query against? But this is a topic for another post… let’s go back to performance.

The second thing we found was a hotspot in the route data retrieval mechanism of Web API. While Web API routing seems to work like magic, when you get to approximately four hundred endpoints you will start suffering from performance issues retrieving route data. In our case we were spending about 1ms per HTTP request just to retrieve the route data for the request. That may not seem like much, but for 1,500 requests that took 5 seconds to execute, 1.5 seconds went just to route data retrieval. Please do not try to extrapolate the throughput of RavenDB from those numbers, since we ran this test on a weak machine, and under a profiling tool that slows it down significantly.

Note: Effectively, for our use case, we were doing an O(400) operation on every request, just to find what code we needed to run to handle this request.

We decided to cache the route data per request, but straightforward caching wouldn’t really help in our case. The problem is that we have many unique request URLs that would fill the cache and then never actually be of use again.

Our major issue was requests whose route looked like databases/{databaseName}/docs/{*docId}, while we do want to cache routes such as databases/{databaseName}/stats. The problem is that we have both types of routes and they are used quite often, so we ended up with a greatest-common-route cache system. That means that /databases/Northwind/docs/Orders/order1 and /databases/Northwind/docs/Orders/order2 will go into the same bucket, which is /databases/Northwind/docs.

That way we still have to do some work per request, but now we don’t have to go through all four hundred routes to find the relevant one.
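For illustration, a hedged sketch of what such a bucketed cache might look like (hypothetical, not the actual RavenDB implementation):

private static readonly ConcurrentDictionary<string, IHttpRouteData> _routeDataCache =
    new ConcurrentDictionary<string, IHttpRouteData>();

private static string GetRouteBucket(string path)
{
    // /databases/Northwind/docs/Orders/order1 -> /databases/Northwind/docs
    var idx = path.IndexOf("/docs/", StringComparison.OrdinalIgnoreCase);
    return idx == -1 ? path : path.Substring(0, idx + "/docs".Length);
}

The resolved route data is then stored (or looked up) in _routeDataCache under GetRouteBucket(request.RequestUri.AbsolutePath), so every document request for the same database hits the same cache entry.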

What was the performance impact, you wonder? We took 90% off the route data retrieval time, spending 0.15 seconds instead of 1.5 seconds for those 1,500 requests, and reducing the overall runtime to 3.65 seconds. That is 27% less run time.

Note: Again, don’t pay attention to the actual times, it is the rate of improvement that we are looking at.

As an aside, another performance improvement that we made with regard to routing was route discovery. During startup, we scan the RavenDB assembly for controllers and register the routes. That is all great, as long as you are doing it once. And indeed, in most production scenarios, you’ll only do that once per server start. That means that the few dozen milliseconds you spend on this aren’t really worth thinking about. But as it turned out, there was one key scenario that was impacted by this: tests! In tests, you start and stop RavenDB per test, and reducing this cost is very important to ensuring a pleasant experience. We were able to also optimize the route discovery process, and we actually reduced our startup time from cold boot to below what we had in 2.5.


Message passing, performance–take 2

In my previous post, I did some rough “benchmarks” to see how message passing options behave. I got some great comments, and I thought I’d expand on that.

The baseline for this was a blocking queue, using which we managed to get:

145,271,000 msgs in 00:00:10.4597977 for 13,888,510 ops/sec

And the async BufferBlock, using which we got:

43,268,149 msgs in 00:00:10 for 4,326,815 ops/sec.

Using LMAX Disruptor we got a disappointing:

29,791,996 msgs in 00:00:10.0003334 for 2,979,100 ops/sec

However, it was pointed out that I can significantly improve this if I changed the code to be:

var disruptor = new Disruptor.Dsl.Disruptor<Holder>(() => new Holder(), new SingleThreadedClaimStrategy(256), new YieldingWaitStrategy(), TaskScheduler.Default);

After which we get a very nice:

141,501,999 msgs in 00:00:10.0000051 for 14,150,193 ops/sec

Another request I got was to test this with a concurrent queue, which is what it is actually meant for. The code is the same as the blocking queue version; we just changed Bus<string> to ConcurrentQueue<string>.

Using that, we got:

170,726,000 msgs in 00:00:10.0000042 for 17,072,593 ops/sec

And yes, this is pretty much just because I could. Any of those methods is significantly faster than anything close to what I actually need.
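For context, a minimal sketch of the kind of measurement loop involved (my assumption of the structure, not the original benchmark code): one producer enqueues for roughly ten seconds while a consumer drains the queue, and we report the messages processed per second.

var queue = new ConcurrentQueue<string>();
long processed = 0;
var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));

var consumer = Task.Run(() =>
{
    string msg;
    while (cts.IsCancellationRequested == false)
        if (queue.TryDequeue(out msg))
            processed++;                    // single consumer, so a plain increment is fine
});

var sw = Stopwatch.StartNew();
while (cts.IsCancellationRequested == false)
    queue.Enqueue("message");
consumer.Wait();
sw.Stop();

Console.WriteLine("{0:N0} msgs in {1} for {2:N0} ops/sec",
    processed, sw.Elapsed, processed / sw.Elapsed.TotalSeconds);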

The cost of working with strings

Following my last post, I decided that it might be better to actually show what the difference is between direct string manipulation and working at lower levels.

I generated a sample CSV file with 10 million lines and 6 columns. The file size was 658MB. I then wrote the simplest code that I could possibly think of:

public class TrivialCsvParser
{
    private readonly string _path;

    public TrivialCsvParser(string path)
    {
        _path = path;
    }

    public IEnumerable<string[]> Parse()
    {
        using (var reader = new StreamReader(_path))
        {
            while (true)
            {
                var line = reader.ReadLine();
                if (line == null)
                    break;
                var fields = line.Split(',');
                yield return fields;
            }
        }
    }
}

This ran in 8.65 seconds (with a no-op action) and kept the memory utilization at about 7MB.

The next thing to try was just reading through the file without doing any parsing. So I wrote this:

public class NoopParser
{
    private readonly string _path;

    public NoopParser(string path)
    {
        _path = path;
    }

    public IEnumerable<object> Parse()
    {
        var buffer = new byte[1024];
        using (var stream = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            while (true)
            {
                var result = stream.Read(buffer, 0, buffer.Length);
                if (result == 0)
                    break;
                yield return null; // noop
            }
        }
    }
}

Note that this isn’t actually doing anything with the data. But it took 0.83 seconds, so we see a pretty big difference here. By the way, the amount of memory used isn’t noticeably different; both use about 7 MB, probably because we aren’t actually holding on to any of the data in any meaningful way.

I ran this with a release build, and I ran it multiple times, so the file is probably all in the OS cache, meaning the I/O cost is pretty minimal here. However, note that we aren’t doing a lot of the stuff that is being done by the TrivialCsvParser, for example finding line breaks, splitting the string into fields, etc. Interestingly enough, just removing the split reduces the cost from 8.65 seconds to 3.55 seconds.

Voron & the FreeDB dataset

I got tired of doing arbitrary performance testing, so I decided to take the FreeDB dataset and start working with that. FreeDB is a data set used to look up CD information based on a nearly unique disk id. This is a good dataset, because it contains a lot of data (over three million albums, and over 40 million songs), and it is production data. That means that it is dirty. This makes it perfect for running all sorts of interesting scenarios.

The purpose of this post (and maybe the next few) is to show off a few things. First, we want to see how Voron behaves with a realistic data set. Second, we want to show off the way Voron works, its API, etc.

To start with, I ran my FreeDB parser, pointing it at /dev/null. The idea is to measure what the cost of just going through the data is. We are using freedb-complete-20130901.tar.bz2 from Sep 2013.

After 1 minute, we had gone through 342,224 albums, and after 6 minutes we were at 2,066,871 albums. Reading the whole 3,328,488 albums took a bit over ten minutes. So just parsing and reading the FreeDB dataset is pretty expensive.

The end result is a list of objects that looks like this:

image

Now, let us see how we want to actually use this. We want to be able to:

  • Lookup an album by the disk ids
  • Lookup all the albums by an artist*.
  • Lookup albums by album title*.

This gets interesting, because we need to deal with questions such as: “Given Pearl Jam, if I search for Pearl, do I get them? Do I get them for jam?” For now, we are going to go with case insensitive matching; we won’t be doing full text search, but we will allow prefix searches.

We are using the following abstraction for the destination:

   1: public abstract class Destination
   2: {
   3:     public abstract void Accept(Disk d);
   4:     public abstract void Done();
   5: }

Basically, we read data as fast as we can, and we shove it to the destination, until we are done. Here is the Voron implementation:

   1: public class VoronDestination : Destination
   2: {
   3:     private readonly StorageEnvironment _storageEnvironment;
   4:     private WriteBatch _currentBatch;
   5:     private readonly JsonSerializer _serializer = new JsonSerializer();
   6:     private int counter = 1;
   7:  
   8:     public VoronDestination()
   9:     {
  10:         _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("FreeDB"));
  11:         using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
  12:         {
  13:             _storageEnvironment.CreateTree(tx, "albums");
  14:             _storageEnvironment.CreateTree(tx, "ix_artists");
  15:             _storageEnvironment.CreateTree(tx, "ix_titles");
  16:             tx.Commit();
  17:         }
  18:         _currentBatch = new WriteBatch();
  19:     }
  20:  
  21:     public override void Accept(Disk d)
  22:     {
  23:         var ms = new MemoryStream();
  24:         _serializer.Serialize(new JsonTextWriter(new StreamWriter(ms)), d);
  25:         ms.Position = 0;
  26:         var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
  27:         _currentBatch.Add(key, ms, "albums");
  28:  
  29:         if(d.Artist != null)
  30:             _currentBatch.MultiAdd(d.Artist.ToLower(), key, "ix_artists");
  31:         if (d.Title != null)
  32:             _currentBatch.MultiAdd(d.Title.ToLower(), key, "ix_titles");
  33:  
  34:         if (counter%1000 == 0)
  35:         {
  36:             _storageEnvironment.Writer.Write(_currentBatch);
  37:             _currentBatch = new WriteBatch();
  38:         }
  39:  
  40:     }
  41:  
  42:     public override void Done()
  43:     {
  44:         _storageEnvironment.Writer.Write(_currentBatch);
  45:     }
  46: }

Let us go over this in detail, shall we?

In line 10 we create a new storage environment. In this case, we want to just import the data, so we can create the storage inline. On lines 13 – 15, we create the relevant trees.

You can think about Voron trees in a very similar manner to the way you think about tables. They are a way to separate data into different parts of the storage. Note that this all still resides in a single file, so there isn’t a physical separation.

Note that we created an albums tree, which will contain the actual data. And ix_artists, ix_titles trees. Those are indexes into the albums tree. You can see them being used just a little lower.

In the Accept method, you can see that we use a WriteBatch, a native Voron notion that allows us to batch multiple operations into a single transaction. In this case, for every album, we are making 3 writes.

First, we write all of the data, as a JSON string, into a stream and put it in the albums tree. Then we create a simple incrementing integer to be the actual album key. Finally, we add the artist and title entries (lower case, so we don’t have to worry about case sensitivity in searches) into the relevant indexes.

At 60 seconds, we had written 267,998 values to Voron. In fact, I explicitly designed it so we can see the relevant metrics. At 495 seconds we had read 1,995,385 entries from the FreeDB file, parsed 1,995,346 of them and written 1,610,998 to Voron. As you can imagine, each step runs in a dedicated thread, so we can see how they behave on an individual basis. The good thing about this is that I can physically see the various costs, which is actually pretty cool.

Here is the Voron directory at 60 seconds:

image

You can see that we have two journal files active (haven’t been applied to the data file yet) and the db.voron file is at 512 MB. The compression buffer is at 32 MB (this is usually twice as big as the biggest transaction, uncompressed).

The scratch buffer is used to hold in flight transaction information (until we send it to the data file), and you can see it is sitting on 256MB in size.

At 15 minutes, we have the following numbers: 3,035,452 entries read from the file,  3,035,426 parsed and 2,331,998 written to Voron. Note that we are reading the file & writing to Voron on the same disk, so that might impact the read performance.

At that time, we can see the following on the disk:

image

Note that we increase the size of most of our files by a factor of 2, so some of the space in the db.voron file is probably not used. Note that we needed more scratch space to handle the in-flight information.

The entire process took 22 minutes, start to finish. Although I have to note that this hasn’t been optimized at all, and I know we are doing a lot of stupid stuff throughout.

You might have noticed something else: we actually “crash” closed the Voron db. This was done to see what would happen when we open a relatively large db after an unclean shutdown.

We’ll actually get to play with the data in my next post. So far this has been pretty much just to see how things are behaving. And… I just realized something, I forgot to actually add an index on the disk id.

Which means that I have to import the data again. But before that, I also wrote the following:

public class JsonFileDestination : Destination
{
    private readonly GZipStream _stream;
    private readonly StreamWriter _writer;
    private readonly JsonSerializer _serializer = new JsonSerializer();

    public JsonFileDestination()
    {
        _stream = new GZipStream(new FileStream("freedb.json.gzip", FileMode.CreateNew, FileAccess.ReadWrite), CompressionLevel.Optimal);
        _writer = new StreamWriter(_stream);
    }

    public override void Accept(Disk d)
    {
        // one JSON document per line, so the file can be read back entry by entry
        _serializer.Serialize(new JsonTextWriter(_writer), d);
        _writer.WriteLine();
    }

    public override void Done()
    {
        _writer.Flush();
        _stream.Dispose();
    }
}

This completed in ten minutes, for 3,328,488 entries, or a rate of about 5,538 per second. The result is an 845MB gzip file.

My reasons for doing this were twofold. First, it gave me something to compare ourselves to, and more to the point, I can re-use this gzip file for my next tests, without having to go through the expensive parsing of the freedb file.

I did just that and ended up with the following:

public class VoronEntriesDestination : EntryDestination
{
    private readonly StorageEnvironment _storageEnvironment;
    private WriteBatch _currentBatch;
    private int counter = 1;

    public VoronEntriesDestination()
    {
        _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("FreeDB"));
        using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
        {
            _storageEnvironment.CreateTree(tx, "albums");
            _storageEnvironment.CreateTree(tx, "ix_diskids");
            _storageEnvironment.CreateTree(tx, "ix_artists");
            _storageEnvironment.CreateTree(tx, "ix_titles");
            tx.Commit();
        }
        _currentBatch = new WriteBatch();
    }

    public override int Accept(string d)
    {
        var disk = JObject.Parse(d);

        // we store the raw JSON string; the parsed JObject is only used to extract the index fields
        var ms = new MemoryStream();
        var writer = new StreamWriter(ms);
        writer.Write(d); writer.Flush();
        ms.Position = 0;
        var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
        _currentBatch.Add(key, ms, "albums");
        int count = 1;

        foreach (var diskId in disk.Value<JArray>("DiskIds"))
        {
            count++;
            _currentBatch.MultiAdd(diskId.Value<string>(), key, "ix_diskids");
        }

        var artist = disk.Value<string>("Artist");
        if (artist != null)
        {
            count++;
            _currentBatch.MultiAdd(artist.ToLower(), key, "ix_artists");
        }
        var title = disk.Value<string>("Title");
        if (title != null)
        {
            count++;
            _currentBatch.MultiAdd(title.ToLower(), key, "ix_titles");
        }

        // write the batch out to Voron every 100 albums
        if (counter % 100 == 0)
        {
            _storageEnvironment.Writer.Write(_currentBatch);
            _currentBatch = new WriteBatch();
        }
        return count;
    }

    public override void Done()
    {
        _storageEnvironment.Writer.Write(_currentBatch);
        _storageEnvironment.Dispose();
    }
}

Now we are actually properly disposing of things, and I also decreased the size of the batch, to see how it would respond. Note that it is now being fed directly from the gzip file, at a greatly reduced cost.

I also added tracking not only for how many albums we write, but also for how many entries. By entries I mean Voron entries (which include the values we add to the indexes).

I did find a bug where we would just double the file size without due consideration for its current size, so now we are growing the file in smaller increments.
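For illustration, a hypothetical growth policy of that kind might look like this (the actual numbers in Voron may well differ):

// Hypothetical growth policy (illustration only, not the actual Voron code):
// grow by a fraction of the current size, capped, instead of always doubling.
private long GetNewFileSize(long currentSize)
{
    var increase = Math.Min(currentSize, 256 * 1024 * 1024); // at most 256MB at a time
    increase = Math.Max(increase, 1024 * 1024);              // but at least 1MB
    return currentSize + increase;
}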

Word of warning: I didn’t realize until after I was done with all the benchmarks that I actually ran all of them in DEBUG configuration, which basically means the numbers are useless as a performance metric. That is especially true because we have a lot of verifier code that runs in DEBUG mode. So please don’t take those numbers as actual performance metrics, they aren’t valid.

Time | # of albums | # of entries
4 minutes | 773,398 | 3,091,146
6 minutes | 1,126,998 | 4,504,550
8 minutes | 1,532,858 | 6,126,413
18 minutes | 2,781,698 | 11,122,799
24 minutes | 3,328,488 | 13,301,496

Here is the status of the file system midway through the run. You can see that we now increase the file in smaller increments, and that we are using more scratch space, probably because we are under very heavy write load.

image

After the run:

image

Scratch & compression files are only used when the database is running, and are deleted on close. The database is 7GB in size, which is quite respectable. Now, on to working with it, but I’ll save that for my next post; this one is long enough already.

Performance Testing with Voron OMG

I am not going to have a lot of commentary here; I think that I can let the numbers speak for themselves. Here are the current performance tests of Voron vs. Esent on my machine.

image

Note, the numbers are in millions of writes per second here!

For Random Writes, the numbers are in thousands of writes per second:

image

Okay, I lied about no commentary. What you can see is our test inserting 1.5 million items into storage. In the sequential runs, we get such big variance because the test run itself is so short. (Nitpickers: yes, we’ll post longer benchmarks.)

Note that this was running on a pretty good SSD drive, but what about an HDD? Yes, that isn’t a mistake, Esent is doing 901 writes per second.

imageimage

As we have seen, we are pretty good at reading, but what are the current results?

image

As you can see, we are significantly better than Esent for reads in single and dual threaded modes, but Esent is faster when it can use more threads.

That annoyed me, so I set out to figure out what was blocking us. As it turned out, searching the B-tree was quite expensive, but when I added a very small optimization (remembering the last few searched pages), we saw something very interesting.

image

(The Esent results are different from the previous run because we ran it on a different disk (still an SSD).)

What about random reads?

image

Now, one area where we currently suck is large writes, where we are doing really badly, but we’re working on that.

Overall, I am pretty happy about it all.

Versioned Collections

Last time we talked about collections, I talked about Linked Dictionary and Safe List. The first did some fancy dancing to reduce the cost of allocations, and the second would re-create the list on change.

Those implementations were good enough to get us to hundreds of thousands of writes per second in the sequential scenario, but we noticed them cracking under the pressure of the random writes scenario.

So I started looking at more options. As it turns out, there is a very simple, immutable and highly efficient design for a list: just hold on to the end of the list, and every append creates a new node that points to the old tail. This makes adding to the list an O(1) operation. It is very fast, simple and easy to work with.
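A minimal sketch of that design (the names here are mine, not the actual Voron class) might look like this:

using System.Collections.Generic;

// An immutable list that only remembers its tail: Append is O(1) and never
// mutates existing nodes, so older references keep seeing a stable snapshot.
// The price is that enumeration walks backwards, i.e. in reverse insertion order.
public sealed class AppendOnlyList<T>
{
    public static readonly AppendOnlyList<T> Empty = new AppendOnlyList<T>(default(T), null, 0);

    public readonly T Value;
    public readonly AppendOnlyList<T> Previous;
    public readonly int Count;

    private AppendOnlyList(T value, AppendOnlyList<T> previous, int count)
    {
        Value = value;
        Previous = previous;
        Count = count;
    }

    public AppendOnlyList<T> Append(T value)
    {
        // just allocate one node pointing at the old tail
        return new AppendOnlyList<T>(value, this, Count + 1);
    }

    public IEnumerable<T> InReverseOrder()
    {
        for (var node = this; node.Count > 0; node = node.Previous)
            yield return node.Value;
    }
}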

This has just one issue: it allows efficient iteration only in reverse order. And there are a few places where we actually rely on the ordering, so it would be nice not to lose that. Alex suggested using an internally mutable but externally immutable structure, such as this one. The benefit here is that we still get external immutability, but we get the performance of mutable access.

That works, as long as we use a list, which we needed for some pretty important things. However, the key collection, and the one that caused us the most performance issues, was the page table, which is a dictionary, because of the cost of getting values from it. We needed a really good solution there. I started to think that we would need to change all the code that uses it, but then I realized something very important: I don’t actually need immutability, I just need snapshotting. Another aspect of this is that we have only a single writer thread for all of those items. So while we require that once we gave an instance to a reader, it can never change, that only applies to its externally observable behavior.

I would like to thank Tyler for the suggestion. Basically, we start off with a ConcurrentDictionary (and the only reason we need it is so we can write to the dictionary while we are reading from it, something that isn’t safe to do with a standard dictionary). The values in that dictionary are immutable lists of the values along with their transaction ids. Whenever we modify a value, we can drop all the values older than the oldest existing transaction.

And when we search, we search for a value whose transaction id is not greater than our own. That means that we effectively have a frozen snapshot of the page table, at very little cost. I felt bad calling this by a generic collection name, so we actually named the class PageTable, since it has a very specific role in the code.
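Here is a rough sketch of the idea (the names and details are my assumptions, not the actual Voron PageTable, and I am using the BCL ImmutableList purely to keep the sketch short):

using System.Collections.Concurrent;
using System.Collections.Immutable;

public class PageTableSketch
{
    private class PageValue
    {
        public long TransactionId;
        public long PagePosition;
    }

    // ConcurrentDictionary only so that the single writer can update it while readers read.
    private readonly ConcurrentDictionary<long, ImmutableList<PageValue>> _values =
        new ConcurrentDictionary<long, ImmutableList<PageValue>>();

    // Called only from the single writer thread.
    public void Set(long pageNumber, long txId, long pagePosition)
    {
        var entry = new PageValue { TransactionId = txId, PagePosition = pagePosition };
        _values.AddOrUpdate(pageNumber,
            _ => ImmutableList.Create(entry),
            (_, existing) => existing.Add(entry)); // values older than the oldest active tx could be pruned here
    }

    // Called from reader threads; txId is the reader's snapshot transaction id.
    public bool TryGet(long pageNumber, long txId, out long pagePosition)
    {
        pagePosition = -1;
        ImmutableList<PageValue> list;
        if (_values.TryGetValue(pageNumber, out list) == false)
            return false;

        // Walk from newest to oldest and take the first value our snapshot is allowed to see.
        for (int i = list.Count - 1; i >= 0; i--)
        {
            if (list[i].TransactionId <= txId)
            {
                pagePosition = list[i].PagePosition;
                return true;
            }
        }
        return false;
    }
}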

After this was done, we could no longer find any major cost associated with collections. So it is onward to the next challenge: reducing the cost of key comparisons…

Machine bias in profiler based optimizations

Take a look at the following profiler results. They both showcase pretty much the same codebase, without any major changes in between. However, they have been run on different machines, and they result in very different performance optimization paths.

Main machine:

Laptop:

image

As you can see, when running the code on my main machine, the most expensive thing is actually writing to disk (WriteToJournal). However, on my laptop, writing to disk is actually very cheap. The likely reason is, again, that my laptop is faking writing to disk in favor of buffering writes, even though we explicitly tell it not to. That means that it behaves as if we had already optimized writing to disk. My main machine behaves more like a server, and we see a lot more I/O costs, which pretty much mask any other costs that we might have.

Using the profiler to guide you with regards to how to optimize the system on any one of those machines would lead you in very different paths. With Voron, we are testing it in a variety of scenarios, to make sure that we aren’t optimizing for one particular machine and hurting another.

Reducing the cost of writing to disk

So, we found out that the major cost of random writes in our tests was actually writing to disk. Writing 500K sequential items resulted in about 300 MB being written. Writing 500K random items resulted in over 2.3 GB being written.

Note: I would like to point out Alex’s comment, which helped set up this post.

So the obvious thing to do would be to use compression. I decided to try it and see what that would give us, and the easiest thing to do is to just enable file compression at the NTFS level. But we can probably do better. The major cost we have is writes to the journal file, and we only ever read from the journal file when we recover the database. We are also usually writing multiple pages at a time for most transactions, and for the scenario we care about, writing 100 random values in a single transaction, we usually write about 100 pages or so anyway. That means that we have a pretty big buffer that we can try to compress all at once.
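For illustration, here is roughly what that experiment looks like, using the BCL’s GZipStream as a stand-in for LZ4/Snappy (the real code compresses Voron pages, and the header format here is made up):

using System;
using System.IO;
using System.IO.Compression;

public static class JournalCompression
{
    // Compress the whole multi-page transaction buffer in one go and append it
    // to the journal, prefixed by the compressed length so recovery knows how
    // much to read back and decompress.
    public static void WriteCompressed(Stream journal, byte[] pagesBuffer, int length)
    {
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionLevel.Fastest, leaveOpen: true))
            {
                gzip.Write(pagesBuffer, 0, length);
            }

            var header = BitConverter.GetBytes((int)ms.Length);
            journal.Write(header, 0, header.Length);

            ms.Position = 0;
            ms.CopyTo(journal);
        }
    }
}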

Sadly, we don’t really see a meaningful improvement under that scenario. Using NTFS compression slowed us down considerably, while both LZ4 and Snappy resulted in greatly reduced file writes, but roughly the same performance.

image  image

Note that I have done just the minimal amount of work required to test this out. I changed how we write to the journal, and disabled reading from it. That was pretty disappointing, to tell you the truth; I fully expected that we’d see some interesting improvements there. The LZ4 compression factor is about x10 for our data set. That means that 100 – 121 pages are usually compressed to 10 – 13 pages.

The good thing about this is that running it through the profiler shows that we are running on a machine with high I/O capacity, where the amount and speed of I/O don’t impact our performance much.

To test it out, I ran the same test on an Amazon m3.2xlarge machine, and got the following results (writing to the C: drive):

image

So for now I think that I’ll do the following. We’ll focus on the other costs beyond I/O, and profile heavily on systems with higher I/O costs.

Compression should work; in fact, my first thought was that we would pay for less I/O with more CPU, but that is a guess without looking at the numbers.

Here is the cost of committing using compression (lz4):

image

And here it is without compression:

image

So the good news is that we can visibly see that we reduced the I/O cost from 8 seconds to 3.8 seconds. And even with a compression cost of 2.6 seconds, we are still ahead.

However, the cost of the collections used is actually a lot worse from our perspective. So we are back to reducing computation costs first, and will look at compression again at a later point in time.

The cost of random writes

After pretty much getting to where we wanted with sequential writes, it is time to try to tackle the cost of random writes. Here is Esent & Voron doing random writes for 1 million entries:

image

Yes, it is sad that this is the case, because we can do about 40 times better for sequential writes.

However… what is the reason for the slowness? As usual, the answer is to profile, profile & profile. In fact, we want to do profiling on multiple machines, to check that we don’t have a machine bias.

At any rate, as you can see, we are actually pretty close to Esent in random writes, but why are we so much slower than our performance during sequential writes?

Well, we actually already covered that. The major reason for the cost is the fact that we have to deal with very different B-Tree shapes. You can see the difference between sequentially and randomly inserted trees here. The more random you are, the bigger your tree is, and the more pages you have to touch whenever you update. Our scenario calls for making 100 random modifications in every transaction. That means that on every transaction, we have to touch many different parts of the tree.

Here are the results (in pages) for how much it cost us (in this test, we ran 5,000 transactions of 100 items each):

 

Metric | Sequential | Random
Average | 7.06 | 111.37
Max | 11.00 | 134.00
Total | 35,320.00 | 556,973.00
Std Dev | 0.35 | 7.09

In other words, this means that on average, a sequential transaction wrote 7 pages, or about 28Kb, while the average random transaction wrote about 111 pages, or roughly 450Kb (at 4Kb per page)!

The total size written for sequential was 137MB, with 319MB allocated. Random writes, however, wrote 2.1 GB out of the 2.3 GB that were allocated. And that explains quite a lot about the whole thing. (Note that for writes we count only the journal writes, not the data writes, because those are the limiting factor for write performance.)

Esent, by the way, writes about 465 MB for sequential and 2.3GB for random writes. Now we know why we are pretty close: the vast majority of the time is actually spent just writing to the disk, and we are stalled by the sheer amount of data we have.

Interestingly enough, I don’t think that there is a good way to handle that nicely. Certainly not with a B-Tree. We’ll see how we can work with that in a future post.

Transaction merging, locks, background threads and performance

The following code is taken from the leveldb codebase:

image

This code is responsible for merging concurrent transactions. It operates using a mutex and conditional variables. And the idea really impressed me when I first saw it.

I implemented that in Voron because I think that this is a great way to gain higher concurrency rates while still having a single writer mode (which is much easier to work with).

However, as it turned out, this isn’t a really good idea in practice. The way conditional variables work, you lock the mutex, check your condition, and then wait on the conditional variable. Waking up from a conditional variable means that you have re-acquired the mutex.

What I want to happen:

  • Multiple concurrent threads will try to write at the same time.
  • One of them goes through and starts writing; it takes all the pending writes and writes them to disk.
  • It then releases all the waiting threads whose work it already completed.
  • They all move on from there without having to wait on one another.

However, because of the way conditional variables are implemented, what will actually happen is that they each wake up one at a time, acquiring the mutex lock and then releasing it. While they are being released, they compete with one another, and they also compete with the next write.

We profiled this approach, and the result was that we could see how we were spending pretty much all of our time in just synchronizing threads.

Instead, we moved to a different approach, in which the first write actually starts off a new thread, dedicated to writing batches. The way it works is that when a thread wants to write a batch, it adds it to a queue, wakes up the background writer thread, then waits on an event. The background writer reads all the current batches and merges them into a single write transaction.

When it is done, it will wake up all the sleeping writers. We tried to use TaskCompletionSource for that, initially, but we found that the inline nature of tasks made that too expensive to use. Instead, we use ManualResetEventSlim, and we explicitly wait / wake them. We even reuse events, so we don’t have to keep creating and disposing them.
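A simplified sketch of that approach (the actual Voron code pools and reuses the events and handles failures; this only shows the shape of it):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public class WriterThreadSketch
{
    private class PendingWrite
    {
        public WriteBatch Batch;
        public ManualResetEventSlim Done;
    }

    private readonly ConcurrentQueue<PendingWrite> _pending = new ConcurrentQueue<PendingWrite>();
    private readonly AutoResetEvent _hasWrites = new AutoResetEvent(false);

    public WriterThreadSketch()
    {
        new Thread(BackgroundWriter) { IsBackground = true }.Start();
    }

    public void Write(WriteBatch batch)
    {
        var mine = new PendingWrite { Batch = batch, Done = new ManualResetEventSlim(false) };
        _pending.Enqueue(mine);
        _hasWrites.Set();  // wake the background writer
        mine.Done.Wait();  // block until our batch has actually been written
    }

    private void BackgroundWriter()
    {
        while (true)
        {
            _hasWrites.WaitOne();

            // drain everything that is currently waiting
            var completed = new List<PendingWrite>();
            PendingWrite item;
            while (_pending.TryDequeue(out item))
                completed.Add(item);

            // merge all of the batches into a single write transaction
            WriteMergedBatches(completed.Select(x => x.Batch));

            // release every waiter whose work we just completed
            foreach (var p in completed)
                p.Done.Set();
        }
    }

    private void WriteMergedBatches(IEnumerable<WriteBatch> batches)
    {
        // stand-in for the actual single write transaction
    }
}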

The end result is that we have a pretty sequential process for actually doing concurrent writes, which turns it into a simple producer/consumer problem from a threading perspective, and turns the actual writes into a simple matter of writing things out as fast as you can process them.

This also gives us the chance to do some re-ordering of operations to get better performance overall.

Voron Performance, the end is nigh (sequentially)

For the past few months, we have been doing a lot of work on performance for Voron. Performance for a storage engine is a very complex problem, composed of many competing issues.

For example, sequential writes vs. random writes. Or the cost of writes vs. the cost of reads. In this post, I am going to focus specifically on the single scenario we have focused most of our efforts on, sequential writes.

Sequential writes are important because they present the absolute sweet spot for a storage engine. In other words, this is the very first thing that needs to be fast, because if that ain’t fast, nothing else would be. And because this represents the absolute sweet spot, it is perfectly fine and very common to explicitly design your system to ensure that this is what you’ll use.

We have been doing a lot there. In particular, some of our changes included:

  • Moving from an fsync model (very slow) to unbuffered write-through writes.
  • Moving to vectored writes, reducing sys calls and the need for sequential memory space.
  • Writing our own immutable collection classes, meant specifically for what we’re doing.
  • Reducing the # of tree searches.
  • Optimizing the cost of copying data from the user’s stream to our own store.

There have probably been others, but those have been the major ones. Let us look at the numbers, shall we? I am going to compare us to Esent, since this is our current storage engine.

The test is writing out 10 million items, in 100,000 transactions, with 100 items per transaction. This is done in a purely sequential manner. Each item has a 128 byte value and a 16 byte key.
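For reference, the shape of such a test loop, reusing the same Voron calls shown earlier in this series (this is my sketch, not the actual test suite; it assumes an already-opened environment with a "data" tree, and the key here is 8 bytes rather than 16):

public static void SequentialWriteTest(StorageEnvironment storageEnvironment)
{
    var value = new byte[128];
    long counter = 0;
    for (int tx = 0; tx < 100 * 1000; tx++)
    {
        var batch = new WriteBatch();
        for (int i = 0; i < 100; i++)
        {
            // sequential keys, big endian so they sort in insertion order
            var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
            batch.Add(key, new MemoryStream(value), "data");
        }
        storageEnvironment.Writer.Write(batch);
    }
}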

Here are the original code numbers.

image

Note that this was run on a W520 Lenovo with an SSD drive. The actual details don’t really matter, what matters is that we are comparing two runs happening on the same machine.

And now, let us compare Esent and Voron as they currently stand:

image

Yes, you see clearly.

The big jump in the Esent numbers happened because we also worked on the test suite and optimized things there, but the important thing is that we are now significantly faster than Esent.

In fact, we are fast enough that it took me a while to accept that we are actually doing it. But yes, you are seeing an order of magnitude improvement over what we used to have.

Feel free to take a moment to do a victory dance, that is what I’m doing right now.

However, note the title of this post. This is for single threaded sequential inserts. The numbers are a lot worse for random writes, but we’ll deal with that in another post.

Voron Performance, the single biggest booster

One of the surprising points for improvement in our performance run was the following logic, responsible for copying the data from the user to our own memory:

using (var ums = new UnmanagedMemoryStream(pos, value.Length, value.Length, FileAccess.ReadWrite))
{
    value.CopyTo(ums);
}

Those three lines of code were responsible for no less than 25% of our runtime cost. It was obvious that something needed to be done. My belief is that the unmanaged memory stream is just not optimized for this scenario, resulting in a lot of copying, allocations and costs.

Here is what we did instead. We create a temporary space that is allocated once, like this:

image

You can see that we are doing some interesting stuff there. In particular, we allocate a managed buffer, but also force the GC to pin it. We keep this around for the entire lifetime of the database, too. The idea here is that we want to avoid the cost of pinning & unpinning it all the time, even if it means that we have unmovable memory.

At any rate, the important thing about this is that it gives us access to the same memory from both managed and unmanaged perspectives. And that, in turn, leads to the following code:

image

We first read the values from the stream into the managed buffer, then copy them from the buffer’s pinned (unmanaged) pointer to our own memory.

The idea here is that we accept a Stream abstraction, and that can only work with managed buffers, so we have to go through this route instead of copying the memory directly. The reason we do that is that we don’t want to force the user of our API to materialize the data fully; we want to be able to stream it into the database.
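Here is a minimal sketch of that technique (assumed names, not the actual Voron code):

using System;
using System.IO;
using System.Runtime.InteropServices;

public unsafe class PinnedCopyBuffer : IDisposable
{
    private readonly byte[] _buffer;
    private GCHandle _handle;
    private readonly byte* _pointer;

    public PinnedCopyBuffer(int size = 4096)
    {
        _buffer = new byte[size];
        // pin once and keep it pinned for the lifetime of the database,
        // so we pay the pinning cost only once
        _handle = GCHandle.Alloc(_buffer, GCHandleType.Pinned);
        _pointer = (byte*)_handle.AddrOfPinnedObject();
    }

    // Read from the user's stream into the managed buffer, then copy from the
    // pinned pointer into our own unmanaged destination.
    public void CopyTo(Stream value, byte* destination)
    {
        int read;
        while ((read = value.Read(_buffer, 0, _buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)   // stand-in for a native memcpy
                destination[i] = _pointer[i];
            destination += read;
        }
    }

    public void Dispose()
    {
        _handle.Free();
    }
}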

At any rate, this has made a serious improvement to our performance, but I’ll be showing the details in a future post.

Voron Performance, let go off that tree, dude!

B-Trees have a few really awesome properties; chief among them is that they play very well with the hierarchical nature of data access that we have in our systems (L1 – L3 caches, main memory, disk, etc). Another really nice property is that searches in a B-Tree have very low cost.

However, it became apparent to us during performance tests that actually searching the tree was very costly. Assuming that we have the following pseudo code:

for i in range(1, 1000 * 1000):
   add(i, "val")

What is the cost of actually doing this? Well, an interesting tidbit is that every time we add an item to the tree, we need to search it first, so we can find where in the tree that new value goes.

The cost of searching the tree, based on its size:

Number of entries | Cost
1 | 0.19
1,000 | 1.93
10,000 | 2.57
25,000 | 2.83
50,000 | 3.02
100,000 | 3.21
250,000 | 3.47
500,000 | 3.66
1,000,000 | 3.86
2,000,000 | 4.05

The cost, by the way, is O(log36(N+1)). The log36 comes from the number of entries that we can fit in a single page. This cost ignores the actual search inside a page, but it is a good enough approximation for our needs.

Now, the cost of actually inserting 1 million items is the sum of all of those costs, which means that the cost for 1 million items is 3,576,242.35. This is another example of a Schlemiel the Painter algorithm.
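A quick sanity check of that figure (assuming it is just the per-insert search costs added up):

double total = 0;
for (int i = 1; i <= 1000 * 1000; i++)
    total += Math.Log(i + 1, 36);   // cost of the search when the tree already has i entries
Console.WriteLine(total);           // roughly 3.58 million, essentially the figure above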

What we did was introduce a small cache, which remembers the last few pages that we inserted to. That turned the cost down from searching the tree to checking the cache, where we can usually find the value, and gave us a nice performance boost.
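The shape of that optimization is roughly this (assumed names and types, not the actual Voron code):

// Before walking the tree from the root, check the handful of leaf pages we
// touched most recently; sequential inserts almost always hit one of them.
private readonly LinkedList<Page> _recentPages = new LinkedList<Page>();

private Page FindPageFor(Slice key)
{
    foreach (var page in _recentPages)
    {
        if (page.ContainsRangeFor(key))   // hypothetical check: key falls inside this leaf's key range
            return page;                  // cache hit, no tree walk needed
    }

    var found = SearchTreeFromRoot(key);  // the expensive O(log36 N) path

    _recentPages.AddFirst(found);
    if (_recentPages.Count > 4)           // remember only the last few pages
        _recentPages.RemoveLast();
    return found;
}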

What about F# collections?

After disqualifying the BCL immutable collections for performance reasons, I decided that I probably also need to look at the F# collections package, to see if they do any better.

I referenced the FSharp.Core dll and wrote the following:

private static FSharpMap<long, object> FSharpAdd(int iterations)
{
    var dic = MapModule.Empty<long, object>();

    var sp = Stopwatch.StartNew();
    var rnd = new Random(32);
    for (int i = 0; i < iterations; i++)
    {
        // each Add returns a brand new map, one item at a time
        foreach (var item in Enumerable.Range(rnd.Next(0, i), Math.Max(i * 2, 16)))
        {
            dic = dic.Add(item, null);
        }
    }

    Console.WriteLine(sp.Elapsed + " Adding items, fsharp map");

    return dic;
}

As I am writing this post, the code is still running. I have had time to do some emails, write the entire post, and check the CPU status, and it is still going. It is likely going to be worse than the immutable collections scenario.

That makes sense, since in the immutable collections scenario we had the ability to do many mutations all at once (using SetItems, which isn’t available on the FSharpMap).
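For contrast, this is the kind of batching SetItems allows on the BCL immutable dictionary (a short snippet only; assumes the usual System.Collections.Immutable and System.Linq namespaces):

var dic = ImmutableDictionary<long, object>.Empty;
var items = Enumerable.Range(0, 16)
                      .Select(x => new KeyValuePair<long, object>(x, null));
dic = dic.SetItems(items);   // one new dictionary for the whole batch of changes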

However, it clocks in at just over 1 hour and 16 minutes, making it the slowest contender overall.
