Ayende @ Rahien

Hi!
My name is Ayende Rahien
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,949 | Comments: 44,548

filter by tags archive

Excerpts from the RavenDB Performance team reportOptimizing Memory Comparisons, size does matter


Note, this post was written by Federico. In the previous post after inspecting the decompiled source using ILSpy  we were able to uncover potential things we could do. In this fragment we have a pretty optimized method to compare an entire 4 bytes per loop. What if we could do that on 8 bytes?

To achieve that we will use a ulong instead of a uint. This type of optimization makes sense for 2 reasons.

Most of our users are already running RavenDB in x64 where the native word is 8 bytes and Voron is compiled on x64 only. But even if that were not true, since the late 2000’ most CPUs would have a 64 bytes L1 cache line with half a cycle cost for a hit. So even if you can’t handle 64 bits in one go and the JIT or processor have to issue 2 instructions you are still getting a L1 cache hit and no pipeline stall. Which is GREAT Smile.

So without farther ado, this is the resulting code:

image

Ayende’s note: In the code, the lp += (IntPtr)8/8; is actually defined as lp += 1; What is actually happening is that we are increasing by 8 bytes (size of ulong), and this is how ILSpy decided to represent that for some reason.

The actual IL generated for this is good:

image

It is just that the translation here is kind of strange.

Therefore the question to ask here is: Will skipping over the parts of the memory block that is equal at a faster rate will compensate for the cost of doing a final check with 8  bytes instead of 4 bytes?

Well the answer is a resounding yes. It won’t have much impact in the first 32 bytes (around 3% or less). We won’t lose, but we won’t win much either. But after that it skyrocket.

// Bandwidth optimization kicks in
Size:       32 Original:     535 Optimized:   442 Gain:    5.01%
Size:       64 Original:     607 Optimized:   493 Gain:    7.08%
Size:    128 Original:     752 Optimized:   573 Gain:   11.77%
Size:     256 Original: 1,080 Optimized:   695 Gain:   35.69%
Size:     512 Original: 1,837 Optimized:   943 Gain:   74.40%
Size: 1,024 Original: 3,200 Optimized: 1,317 Gain: 122.25%
Size: 2,048 Original: 5,135 Optimized: 2,110 Gain: 123.13%
Size: 4,096 Original: 8,753 Optimized: 3,690 Gain: 117.29%

Those are real measurements. You can see that when bandwidth optimization kicks in the gains start to get really high. This means that changing the bandwidth size alone from 4 byte to 8 bytes got us an order of magnitude improvement stabilizing around 120%.

Not bad for 2 lines of work.

Excerpts from the RavenDB Performance team reportOptimizing Memory Comparisons, Digging into the IL


Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such. In the previous post after inspecting the decompiled source using ILSpy  we were able to uncover potential things we could do.

Did you remember how dotPeek and ILSpy didn’t agree on the last for-loop?

dotPeek

image

ILSpy

image

Well to really know which one is right, lets dig deeper. Looks like dotPeek is just too smart for our purposes.

image

MSIL is an stack machine, so everything has to be pushed to the stack to be operated. And the lower you go the less context you have to make optimization choices. The compiler knows a lot more, therefore it can make sensible choices that the JIT can’t. Well this is one of those, the problem here is that the compiler is treating those native memory references in a very “un-native” way, leaving small room to the JIT to do its magic. Therefore we are going to give the compiler a nudge to point him in the right direction. 

We know that most architecture have a set of indexed instructions that will allow you to load from memory at a base address plus an offset and special ones optimized to operate with constants. Yeah all that magic in a single opcode.

Therefore if we can find a way that the compiler would emit such  a sequence, there is a high chance that the JIT will understand it and emit such a load statement. What could appear to be a long shot is actually quite easy. Instead of doing pointer arithmetic (pre/post increment and dereferencing) as usual, we will do something we would never do in C/C++; we will just ask for it at face value.

So what would be:

var v = *(lhs++) – *(rhs++); 

Now becomes:

var v = lhs[0] - rhs[0]; 

lhs++; 

rhs++;

What if we need the next one?

var v = lhs[1] - rhs[1]; 

And so on… However, that is true if and only if the number can be loaded into the stack using an special short instruction (a shortcut) that encodes the value to load as a named constant.

Why this work?

Because the MSIL pattern is unequivocal:

image

We push the first pointer (lhs) 

We load a byte from it and put it into an int32 register in the stack 

We push the second pointer (rhs) 

We load a byte from it and put it into an int32 register in the stack 

We subtract the two loaded int32. 

We store it into an stack variable (v) 

We load it into the stack from (v) 

We check if it is distinct from (0) 

The JIT now can figure out how to optimize this with a load + offset instruction easily. Moreover the offset is also a constant, anyone said “special opcode”?. Now let’s compare the IL code  from each approach.

Before Optimization:

image

After Optimization

image

While the amount of instructions is the same and the avid reader would have figured out by now; the code is not that different either.

However, the former translate to far more native instructions than the latter. Why? We will have to ask the JIT or the compiler guys, but my hypothesis is that the first version requires a much more deeper analysis than the second and in an effort to keep the JIT overhead low, that pattern can’t be optimized so much.

The bottom line is: “Do not optimize pointers in C# as you do in C/C++. Translating an optimized algorithm that uses pointers from C/C++ to C# will not be optimal.”

Remember this, it will make sense soon, because in the next post, we’ll tie it all together.

Excerpts from the RavenDB Performance team reportOptimizing Memory Comparisons


Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such. In the previous post after inspecting the decompiled source using ILSpy  we were able to uncover potential things we could do.

Getting rid of unwanted type conversion may seem as an small cost, but let’s make an example. Let’s say we are comparing 2 memory arrays of 16 bytes and they are equal (our worst case scenario).

Just for the sake of simplification from the 3 potential causes the memory is aligned so there is no need to the first 2 unwanted conversions. That leaves us with the main body as the only source of unwanted conversions.

image

Now this loops moves our pointer 4 bytes each time and causes 2 conversions. Therefore for a 16 bytes array (a pretty average size) we are performing 8 conversions, that is grand total of 8 conversions. Assuming our idealized processor, at 0.33 ns per conversion instruction we have 2.64 ns or roughly 3% of the total time per average call. Getting rid of that is easy, as the size of an unsigned int is a constant.

private const int sizeOfUint = sizeof(uint);

Therefore the final executable code will be:

image

Here we have 2 interesting side effects:

  • We no longer have the conversion but also the constant got put instead of the indirection to an stack variable.
  • Almost every comparison you do over a constant that is 2^n based can be converted to a shift operation.

If the JIT is smart enough, this check can be compiled into a shift of 2 places and asking if the result is bigger than 0. Squeezing 4 instructions into 2 per each while cycle.

You guessed right, the JIT is. Smile

Excerpts from the RavenDB Performance team reportOptimizing Memory Compare/Copy Costs


Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such.

TLDR: Optimizing at this level is really hard. To achieve gains of 20%+ for Compare and from 200% to 6% in Copy (depending on the workload) we will need to dig very deep at the IL level.

Another area we looked deeply into is, how do we move and access the memory. This type of optimization work is especially relevant if you are using Voron to handle big workloads. With small databases the improvements can be felt, but where they shine is when dealing with multi-gigabyte databases or high-throughput key-value retrieves and put workloads (did anyone thought Bulk-Inserts?).

Using FreeDB as in this previous post we build an experiment which we could use to pinpoint the pain points making sure we could also repro it every single time (no indexes, no extra call around). Under the microscope 2 pain points were evident when dealing with memory. Comparing and moving memory around.

We usually compare memory straight from the mmaped unmanaged memory when looking for a particular document in Voron Trees; and to copy from and to Voron pages when storing and retrieving documents. These are very core operations for any storage engine, Voron is not an special case. Before we started the optimization effort we already had a pretty optimized routine.

What this method does is:

  • If the memory blocks have zero size, there is no doubt they are equal.
  • If the memory blocks are bigger than the size of a word (32 bits) we do a pre-compare over the aligned memory blocks (for performance) in order to rule out all the equals.
  • As we cannot use words to calculate the output (handling the Endianness would cost us), we do a byte by byte comparison for the final check.                   

For our insert workload we were roughly in the 97.5 nanoseconds per memory compare in average. To put in context, if each assembler instruction could be executed in exactly 1 cycle (which usually is not true) then 3 instruction is an entire nanosecond, therefore our average instruction budget is 291 instructions. Remember this idealized processor, we will use this same comparison later for more complex analysis.

Memory compares can be of different sizes that is why controlling the environment is very important for this type of optimization work.

To deal with that and we were using many tricks from the optimization book. From ensuring that memory alignment is optimal to batch compares with bigger primitive sizes to pointer arithmetic. At first sight this one is the kind of method you won't optimize at all, it is pretty damn tight.

Ayende’s node – We have already done a optimization step on memory comparisons. We initially just shelled out to the native memcmp method, but the cost of doing a pinvoke call ended up being noticable, and we wrote the previously optimized version (and had several rounds of that) to alleviate that cost.

However, we took to the challenge because the payoff can be huge. For a very small bulk insert of 50,000 documents inserted in an empty database, we are talking about in the ballpark of 5 million compares (yeah you read it right). Even if we manage to squeeze 1% off, the sheer volume of calls will make it worthwhile. To achieve that we had to do the unthinkable, we had to resort to dig into the MSIL that method was generating. Armed with ILSpy we found out we may have a way to shave off some inefficiencies.

Here is the what this look like when we start actually putting analysis to action. You can see the method code (after decompilation, so we can be closer to the IL) as well as the issues that were discovered in the process.

image

Because of the size of the method the fastest way was to resort to use a C# decompile, even though we then matched it with the generated IL. The trick to use the C# decompiled version requires that we use a decompiler that is not too smart when dealing with the code. If the decompiler would have understood what was the original code intention and acted upon it, we would have never spotted some of the optimizations at this level. For example, the last loop decompiled with JetBrains dotPeek would look like this:

image

Always keep around an old version of a decompiler just in case you may need it Smile.

Ayende’s note: In most cases, you can select the level of details that a decompiler can give you. With Reflector, for example, you can select how deeply it will decompile things, but even so, doing stupid decompilation can be very beneficial by showing us what is actually going on.

Understanding where the inefficiencies may be, is one thing, being able to prove them is another matter. And we will tackle all of them in future posts.

We will also leave the memcpy analysis for later because it builds on the optimizations used in memcmp and also requires a deep analysis of the Buffer.Memcpy method already available in the .Net Framework (for internal use of course).

If what we did to the poor Etags was evil. You are now arriving at the gates of the underworld.

Ayende’s note: This is a pretty complex topic, and it goes on for quite a while. In order to maintain interest, and to avoid having people getting lost in the details, I broke it apart for several posts. In the meantime, given the details in this post, how would you suggest improving this?

Excerpts from the RavenDB Performance team reportExpensive headers, and cache effects


This ended up being a pretty obvious, in retrospect. We noticed in the profiler that we spent a lot of time working with headers. Now, RavenDB is using REST as the communication layer, so it is doing a lot with that, but we should be able to do better.

Then Tal dug into the actual implementation and found:

public string GetHeader(string key)
{
	if (InnerHeaders.Contains(key) == false)
		return null;
	return InnerHeaders.GetValues(key).FirstOrDefault();
}

public List<string> GetHeaders(string key)
{
	if (InnerHeaders.Contains(key) == false)
		return null;
	return InnerHeaders.GetValues(key).ToList();
}


public HttpHeaders InnerHeaders
{
	get
	{
		var headers = new Headers();
		foreach (var header in InnerRequest.Headers)
		{
			if (header.Value.Count() == 1)
				headers.Add(header.Key, header.Value.First());
			else
				headers.Add(header.Key, header.Value.ToList());
		}

		if (InnerRequest.Content == null)
			return headers;

		foreach (var header in InnerRequest.Content.Headers)
		{
			if (header.Value.Count() == 1)
				headers.Add(header.Key, header.Value.First());
			else
				headers.Add(header.Key, header.Value.ToList());
		}

		return headers;
	}
}

To be fair, this implementation was created very early on, and no one ever actually spent any time looking it since (why would they? it worked, and quite well). The problem is the number of copies that we have, and the fact that to pull a since header, we have to copy all the headers, sometimes multiple times. We replaced this with code that wasn’t doing stupid stuff, and we couldn’t even find the cost of working with headers in the profiler any longer.

But that brings up a really interesting question. How could we not know about this sort of thing? I mean, this isn’t the first time that we are doing a performance pass on the system. So how come we missed this?

The answer is that in this performance pass, we are doing something different. Usually we perf-test RavenDB as you would when using it on your own systems. But for the purpose of this suite of tests, and in order to find more stuff that we can optimize, we are actually working with a stripped down client, no caching, no attempt to optimize things across the entire board. In fact, we have put RavenDB in the worst possible situation, all new work, and no chance to do any sort of optimizations, then we start seeing how all of those code paths that were rarely hit started to light up quite nicely.

Excerpts from the RavenDB Performance team reportThe long tale of a lambda


This nugget was discovered by Tal, who was measuring write throughput and noticed that a lot of the time wasn’t being handled in the proper code path, but on something on the side that seemed… off.

prefetchingQueue.Aggregate(0, (x,c) => x + SelectSerializedSizeOnDiskIfNotNull(c)) > context.Configuration.AvailableMemoryForRaisingBatchSizeLimit)

This piece of code is responsible for heuristics deep inside the bowels of RavenDB. In fact, it is the piece that decide whatever we have enough memory to index directly from memory or do we need to start dumping stuff and pay the I/O cost of them later.

As a result, this piece of code is called once for every document save. It was also quite wildly inefficient. The Aggregate implementation was pretty much as you imagined it, and it took three times as much time as actually process the rest of the request. The underlying reason was that we kept doing a foreach and a delegate invocation on each an every call. The more documents we had coming in, the more work we had to do. Shlemiel the painter at his best.

We first saw a major improvement by just removing the Aggregate() call in favor of a purpose built function that summed all those details for us directly. Next, we changed things so instead of doing O(N) work per request, we could do an O(1) work by doing this work one bit at a time and aggregating it on the fly. So whenever we added or removed something to the prefetching queue, we would also make sure to add / remove that from the global tally.

Once that is done, we saw almost 18% improvement in high write scenarios, because we weren’t just busy counting how much stuff we have in memory to figure out if we can put things in memory.

I can’t emphasize enough how important it is that the work throughout was done using a profiler (in our case, the excellent dotTrace) because if dotTrace wouldn’t have point a big finger at this line of code, we would have never have considered this to be problematic.

Excerpts from the RavenDB Performance team reportDates take a lot of time


RavenDB uses a lot of dates, from the last modified metadata on a document to the timestamp of an index or when a query was started or… you get the point, lots and lots of dates.

Dates in RavenDB are usually formatted in the following manner:

2015-01-15T00:41:16.6616631

This is done using the following date time format:

yyyy'-'MM'-'dd'T'HH':'mm':'ss.fffffff

This is pretty awesome. It generate readable dates that are lexicographically sorted. There is just one problem with that, this is really expensive to do. How expensive? Well, outputting 10 million dates using the following manner:

dateTime.ToString(Default.DateTimeFormatsToWrite, CultureInfo.InvariantCulture)

This takes 13.3 seconds, or just about 750 dates per millisecond. The costs here are partly the allocations, but mostly it is about the fact that the format provider needs to first parse the format specifier, then do quite a bit of work to get it working. And DateTime itself isn’t very cheap. The solution presented is ugly, but it works, and it is fast.

public unsafe static string GetDefaultRavenFormat(this DateTime dt, bool isUtc = false)
{
    string result = new string('Z', 27 + (isUtc ? 1 : 0));

    var ticks = dt.Ticks;

    // n = number of days since 1/1/0001
    int n = (int)(ticks / TicksPerDay);
    // y400 = number of whole 400-year periods since 1/1/0001
    int y400 = n / DaysPer400Years;
    // n = day number within 400-year period
    n -= y400 * DaysPer400Years;
    // y100 = number of whole 100-year periods within 400-year period
    int y100 = n / DaysPer100Years;
    // Last 100-year period has an extra day, so decrement result if 4
    if (y100 == 4) y100 = 3;
    // n = day number within 100-year period
    n -= y100 * DaysPer100Years;
    // y4 = number of whole 4-year periods within 100-year period
    int y4 = n / DaysPer4Years;
    // n = day number within 4-year period
    n -= y4 * DaysPer4Years;
    // y1 = number of whole years within 4-year period
    int y1 = n / DaysPerYear;
    // Last year has an extra day, so decrement result if 4
    if (y1 == 4) y1 = 3;
    // If year was requested, compute and return it
    var year = y400 * 400 + y100 * 100 + y4 * 4 + y1 + 1;

    // n = day number within year
    n -= y1 * DaysPerYear;
    // Leap year calculation looks different from IsLeapYear since y1, y4,
    // and y100 are relative to year 1, not year 0
    bool leapYear = y1 == 3 && (y4 != 24 || y100 == 3);
    int[] days = leapYear ? DaysToMonth366 : DaysToMonth365;
    // All months have less than 32 days, so n >> 5 is a good conservative
    // estimate for the month
    int month = n >> 5 + 1;
    // m = 1-based month number
    while (n >= days[month]) month++;
    // If month was requested, return it

    // Return 1-based day-of-month
    var day = n - days[month - 1] + 1;

    fixed (char* chars = result)
    {
        var v = _fourDigits[year];
        chars[0] = v[0];
        chars[1] = v[1];
        chars[2] = v[2];
        chars[3] = v[3];
        chars[4] = '-';
        v = _fourDigits[month];
        chars[5] = v[2];
        chars[5 + 1] = v[3];
        chars[7] = '-';
        v = _fourDigits[day];
        chars[8] = v[2];
        chars[8 + 1] = v[3];
        chars[10] = 'T';
        v = _fourDigits[(ticks / TicksPerHour) % 24];
        chars[11] = v[2];
        chars[11 + 1] = v[3];
        chars[13] = ':';
        v = _fourDigits[(ticks / TicksPerMinute) % 60];
        chars[14] = v[2];
        chars[14 + 1] = v[3];
        chars[16] = ':';
        v = _fourDigits[(ticks / TicksPerSecond) % 60];
        chars[17] = v[2];
        chars[17 + 1] = v[3];
        chars[19] = '.';

        long fraction = (ticks % 10000000);
        v = _fourDigits[fraction / 10000];
        chars[20] = v[1];
        chars[21] = v[2];
        chars[22] = v[3];

        fraction = fraction % 10000;

        v = _fourDigits[fraction];
        chars[23] = v[0];
        chars[24] = v[1];
        chars[25] = v[2];
        chars[26] = v[3];
    }

    return result;
}

We use the same general pattern that we used with etags as well, although here we are also doing a lot of work to figure out the right parts of the date. Note that we don’t have any allocations, and we again use the notion of a lookup table to all the pre-computed 4 digits number. That allows us to process 10,000,000 dates in just over 2 seconds (2,061 ms, to be exact). Or roughly 4,850 dates per millisecond. In other words, we are about 15% of the speed of the original implementation.

This code is ugly, in fact, the last few posts has contained pretty much ugly code, that is hard to understand. But it is significantly faster than the alternative, and what is even more important, those pieces of code are actually being used in RavenDB’s hot path. In other words, that means that we have actually seen significant performance improvement when introducing them to the codebase.

Excerpts from the RavenDB Performance team reportEtags and evil code, part II


In my previous post, I talked about how we improved the performance of Etag parsing from 5 etags/ms to 3,500 etags/ms. In this post, I want to talk about the exact opposite problem, how we take an Etag and turn it into a string. Here is the original code that we had:

public unsafe override string ToString()
{
    var sb = new StringBuilder(36);
    var buffer = stackalloc byte[8];

    *((long*)buffer) = restarts;

    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[7]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[6]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[5]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[4]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[3]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[2]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[1]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[0]]);
    sb.Append('-');

    *((long*)buffer) = changes;

    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[7]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[6]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[5]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[4]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[3]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[2]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[1]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[0]]);

    var etagAsString = sb.ToString();
    Debug.Assert(etagAsString.Length == 36); //prevent stupid bugs if something is refactored

    return etagAsString;
}

As you can see, we already optimized this a bit. It is using a string builder, it is using a lookup table to avoid costly byte to string. Note also that we use a stackalloc value, so there isn’t an actual allocation, but we are able to copy the values once, and then just directly access it. Which is cheaper than trying to do a lot of bit shifting.

So far so good. Running on 10 million Etags, this completes in 8.9 seconds. That is good, this gives us 1,125 Etags per milliseconds.

Here is the optimized version:

public unsafe override string ToString()
{
    var results = new string('-', 36);

    fixed (char* buf = results)
    {
        var buffer = stackalloc byte[8];
        *((long*)buffer) = restarts;
        var duget = GenericUtil.ByteToHexAsStringLookup[buffer[7]];
        buf[0] = duget[0];
        buf[1] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[6]];
        buf[2] = duget[0];
        buf[3] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[5]];
        buf[4] = duget[0];
        buf[5] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[4]];
        buf[6] = duget[0];
        buf[7] = duget[1];
        //buf[8] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[3]];
        buf[9] = duget[0];
        buf[10] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[2]];
        buf[11] = duget[0];
        buf[12] = duget[1];
        //buf[13] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[1]];
        buf[14] = duget[0];
        buf[15] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[0]];
        buf[16] = duget[0];
        buf[17] = duget[1];
        //buf[18] = '-';

        *((long*)buffer) = changes;

        duget = GenericUtil.ByteToHexAsStringLookup[buffer[7]];
        buf[19] = duget[0];
        buf[20] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[6]];
        buf[21] = duget[0];
        buf[22] = duget[1];
        //buf[23] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[5]];
        buf[24] = duget[0];
        buf[25] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[4]];
        buf[26] = duget[0];
        buf[27] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[3]];
        buf[28] = duget[0];
        buf[29] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[2]];
        buf[30] = duget[0];
        buf[31] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[1]];
        buf[32] = duget[0];
        buf[33] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[0]];
        buf[34] = duget[0];
        buf[35] = duget[1];

        return results;
    }
}

Note that here we don’t bother with a string builder, we directly manipulate the string. And we still use all the other tricks (the lookup table, the no allocation, the works). This code managed to get to 5.5 seconds for 10,000,000 etags, or roughly 1,800 etags per millisecond. Roughly 37.5% improvement to a pretty important piece of code.

Do you see anything else that we can do to reduce the cost even further?

Excerpts from the RavenDB Performance team reportEtags and evil code, Part I


As part of the performance work we have been doing, we focused on the Etag class we use as a performance hot spot. For this post, I’m going to talk about Etag.Parse(string) and what we did to improve its performance.

Etag is a core type in RavenDB, it is how we represent changes in the database, and we deal with it a lot. As such, it turned up as a performance hot spot in our profiling. This post is based mostly around Maxim’s work to improve the overall performance.

One of the things that we do with Etag is to take a string and turn that into an instance of an Etag. An Etag looks just like a Guid, but its structure has meaning that we care about. Here is what a typical Etag looks like:

01000000-0000-0005-0000-00000000012C

We send them over the wire as strings for a lot of purposes. Here is the initial code that we used for parsing Etags:

public static Etag Parse2(string str)
{
    if (string.IsNullOrEmpty(str))
        throw new ArgumentException("str cannot be empty or null");
    if (str.Length != 36)
        throw new ArgumentException(string.Format("str must be 36 characters (Etag::Parse is {0})", str));

    var buffer = new byte[16]
    {
        byte.Parse(str.Substring(16, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(14, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(11, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(9, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(6, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(4, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(2, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(0, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(34, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(32, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(30, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(28, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(26, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(24, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(21, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(19, 2), NumberStyles.HexNumber)
    };

    return new Etag
    {
        restarts = BitConverter.ToInt64(buffer, 0),
        changes = BitConverter.ToInt64(buffer, 8)
    };
}

You can note several things here. First, we are being kinda funky here with the order of parsing. This is because we read the string in Big Endian format, to allow direct comparisons of the bytes strings.

The other issue is the shear number of allocations and calls we are making. Because this is such a small code base, we created 10,000 etags and parsed them a million times .That took 17.3 seconds. For a throughput of roughly 5 Etags per millisecond.

Then we sat down and re-wrote it using a lot of somewhat nasty tricks.

private readonly static int[] _asciisOfHexToNum = CreateHexCharsToNumsTable();

private static int[] CreateHexCharsToNumsTable()
{
    var c = new int['z' + 1];
    for (var i = '0'; i <= '9'; i++)
    {
        c[i] = (char)(i - '0');
    }
    for (var i = 'A'; i <= 'Z'; i++)
    {
        c[i] = (char)((i - 'A') + 10);
    }
    for (var i = 'a'; i <= 'z'; i++)
    {
        c[i] = (char)((i - 'a') + 10);
    }

    return c;
}

public unsafe static Etag Parse(string str)
{
    if (str == null || str.Length != 36)
        throw new ArgumentException("str cannot be empty or null");

    fixed (char* input = str)
    {
        var etag = new Etag();
        int fst = ((byte)(_asciisOfHexToNum[input[0]] * 16 + _asciisOfHexToNum[input[1]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[2]] * 16 + _asciisOfHexToNum[input[3]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[4]] * 16 + _asciisOfHexToNum[input[5]])) << 8 |
            (byte)(_asciisOfHexToNum[input[6]] * 16 + _asciisOfHexToNum[input[7]]);
        int snd = ((byte)(_asciisOfHexToNum[input[9]] * 16 + _asciisOfHexToNum[input[10]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[11]] * 16 + _asciisOfHexToNum[input[12]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[14]] * 16 + _asciisOfHexToNum[input[15]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[16]] * 16 + _asciisOfHexToNum[input[17]]));
        etag.restarts = (uint)snd | ((long)fst << 32);


        fst = ((byte)(_asciisOfHexToNum[input[19]] * 16 + _asciisOfHexToNum[input[20]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[21]] * 16 + _asciisOfHexToNum[input[22]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[24]] * 16 + _asciisOfHexToNum[input[25]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[26]] * 16 + _asciisOfHexToNum[input[27]]));
        snd = ((byte)(_asciisOfHexToNum[input[28]] * 16 + _asciisOfHexToNum[input[29]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[30]] * 16 + _asciisOfHexToNum[input[31]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[32]] * 16 + _asciisOfHexToNum[input[33]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[34]] * 16 + _asciisOfHexToNum[input[35]]));
        etag.changes = (uint)snd | ((long)fst << 32);
        return etag;
    }
}

Here is what we did. We created a translation table, so every possible byte is pre calculated. That means that for the cost of actually parsing the string is basically doing a lookup into an array, which is constant type. There are no allocations, nor is there anything expensive going on. On the same machine the previous code took 17.3 to run, this code can process a million Etags in just 287 milliseconds. Or close to 3,500 Etag per milliseconds.

I leave the rate of improvement as an exercise for the reader. Also, can you think of any way in which we can improve this code even further? We weren’t able to figure out anything, but you never knows.

Excerpts from the RavenDB Performance team reportVoron vs. Esent


Another thing that turned up in the performance work was the Esent vs. Voron issue. We keep testing everything on both, and trying to see which one can outdo the other, fix a hotspot, then try again. When we run the YCSB benchmark we also compared between Esent vs. Voron as storage for our databases and we found that Voron was very good in read operation while Esent was slightly better in write operation. During the YCSB tests we found out one of the reason why Voron was a bit slower than Esent for writing, it was consuming 4 times the expected disk-space.

The reason for this high disk-space consumption was that the benchmark by default generates documents of exactly 1KB, with meta-data the actual size was 1.1KB. Voron internal implementation uses a B+ tree where the leafs are 4KB in size, 1KB was the threshold in which we decide not to save data to the leaf but to reference on it and save it on a new page. We ended up creating a new 4KB page to hold 1.1KB documents for each document that we saved. The benchmark actually hit the worst case scenario for our implementation, and caused us to use 4 times more disk space and write 4 times more data than we needed.  Changing this threshold reduce the disk-space consumption to the expected size, and gave Voron a nice boost.

We are also testing our software on a wide variety of systems, and with Voron specifically with run into an annoying issue. Voron is a write ahead log system, and we are very careful to write to the log in a very speedy manner. This is one of the ways in which we are getting really awesome speed for Voron. But when running on slow I/O system, and putting a lot of load on Voron, we started to see very large stalls after a while. Tracing the issue took a while, but eventually we figured out what was going on. Writing to the log is all well and good, but we need to also send the data to the actual data file at some point.

The way Voron does it, it batch a whole bunch of work, write it to the data file, then sync the data file to make sure it is actually persisted on disk. Usually, that isn’t really an issue. But on slow I/O, and especially under load, you get results like this:

Start to sync data file (8:59:52 AM). Written but unsynced data size 309 MB
FlushViewOfFile duration 00:00:13.3482163. FlushFileBuffers duration: 00:00:00.2800050.
End of data pager sync (9:00:05 AM). Duration: 00:00:13.7042229

Note that this is random write, because we may be doing writes to any part of the file, but that is still way too long. What was worse, and the reason we actually care is that we were doing that while holding the transaction lock.

We were able to re-design that part so even under slow I/O, we can take the lock for a very short amount of time, update the in memory data structure and then release the lock and spend some quality time gazing at our navel in peace while the I/O proceeded in its own pace, but now without blocking anyone else.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
  2. Special Offer (2):
    27 May 2015 - 29% discount for all our products
  3. RavenDB Sharding (3):
    22 May 2015 - Adding a new shard to an existing cluster, splitting the shard
  4. Challenge (45):
    28 Apr 2015 - What is the meaning of this change?
  5. Interview question (2):
    30 Mar 2015 - fix the index
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats