
Excerpts from the RavenDB Performance team report: Optimizing Memory Comparisons

Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such. In the previous post, after inspecting the decompiled source using ILSpy, we were able to uncover potential things we could do.

Getting rid of unwanted type conversions may seem like a small gain, but let’s take an example. Let’s say we are comparing 2 memory arrays of 16 bytes each and they are equal (our worst case scenario).

Just for the sake of simplification, assume the memory is aligned, so of the 3 potential causes the first 2 unwanted conversions are not needed. That leaves us with the main body as the only source of unwanted conversions.

image

Now this loop moves our pointer 4 bytes each time and causes 2 conversions per iteration. Therefore for a 16 byte array (a pretty average size) we have 4 iterations, for a grand total of 8 conversions. Assuming our idealized processor, at 0.33 ns per conversion instruction that is 2.64 ns, or roughly 3% of the total time per average call. Getting rid of that is easy, as the size of an unsigned int is a constant.

private const int sizeOfUint = sizeof(uint);

Therefore the final executable code will be:

image

Here we have 2 interesting side effects:

  • We no longer have the conversion, and the constant gets inlined instead of an indirection to a stack variable.
  • Almost every comparison against a constant that is a power of 2 (2^n) can be converted into a shift operation.

If the JIT is smart enough, this check can be compiled into a shift by 2 places and a test whether the result is bigger than 0, squeezing 4 instructions into 2 per while iteration (illustrated below).

You guessed right, the JIT is.
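To make that concrete, here is a tiny, self-contained illustration of the idea; this is a hypothetical example, not the actual RavenDB loop, it just shows why dividing by a power-of-two constant lets the JIT emit a shift:

private const int sizeOfUint = sizeof(uint); // 4

private static bool HasFullBatchLeft(int size)
{
    // With sizeOfUint inlined as the constant 4, the JIT can compile this check
    // as "(size >> 2) > 0": a shift and a compare instead of a full division.
    return size / sizeOfUint > 0;
}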

Excerpts from the RavenDB Performance team report: Optimizing Memory Compare/Copy Costs

Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such.

TLDR: Optimizing at this level is really hard. To achieve gains of 20%+ for Compare and from 200% to 6% in Copy (depending on the workload) we will need to dig very deep at the IL level.

Another area we looked deeply into is how we move and access the memory. This type of optimization work is especially relevant if you are using Voron to handle big workloads. With small databases the improvements can be felt, but where they shine is when dealing with multi-gigabyte databases or high-throughput key-value retrieve and put workloads (did anyone say Bulk-Inserts?).

Using FreeDB as in this previous post, we built an experiment which we could use to pinpoint the pain points, making sure we could also repro it every single time (no indexes, no extra calls around). Under the microscope, 2 pain points were evident when dealing with memory: comparing and moving memory around.

We usually compare memory straight from the mmaped unmanaged memory when looking for a particular document in Voron trees, and copy from and to Voron pages when storing and retrieving documents. These are very core operations for any storage engine, and Voron is not a special case. Before we started the optimization effort we already had a pretty optimized routine.

What this method does is the following (a sketch follows the list):

  • If the memory blocks have zero size, there is no doubt they are equal.
  • If the memory blocks are bigger than the size of a word (32 bits) we do a pre-compare over the aligned memory blocks (for performance) in order to quickly rule out the equal parts.
  • As we cannot use words to calculate the output (handling the endianness would cost us), we do a byte by byte comparison for the final check.
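A minimal sketch of a routine with that shape, using hypothetical names; this is an illustration of the steps above rather than the actual Voron implementation (unsafe code, so it needs the /unsafe compiler switch):

public static class MemoryUtils
{
    public static unsafe int Compare(byte* p1, byte* p2, int size)
    {
        if (size == 0)
            return 0; // zero sized blocks are trivially equal

        // Pre-compare in word sized (uint) batches; this only rules out equality
        // quickly, it does not compute the ordering.
        while (size >= sizeof(uint))
        {
            if (*(uint*)p1 != *(uint*)p2)
                break; // the difference is inside this batch, let the tail find it

            p1 += sizeof(uint);
            p2 += sizeof(uint);
            size -= sizeof(uint);
        }

        // Byte by byte for the final check, so we never have to deal with
        // endianness when computing the ordering result.
        for (; size > 0; size--, p1++, p2++)
        {
            int diff = *p1 - *p2;
            if (diff != 0)
                return diff;
        }

        return 0;
    }
}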

For our insert workload we were averaging roughly 97.5 nanoseconds per memory compare. To put that in context, if each assembler instruction could be executed in exactly 1 cycle (which usually is not true) then 3 instructions take an entire nanosecond, and therefore our average instruction budget is 291 instructions. Remember this idealized processor, we will use this same comparison later for more complex analysis.

Memory compares can be of different sizes, which is why controlling the environment is very important for this type of optimization work.

To deal with that we were using many tricks from the optimization book: from ensuring that memory alignment is optimal, to batching compares with bigger primitive sizes, to pointer arithmetic. At first sight this is the kind of method you wouldn’t optimize at all, it is pretty damn tight.

Ayende’s note – We have already done an optimization step on memory comparisons. We initially just shelled out to the native memcmp method, but the cost of doing a P/Invoke call ended up being noticeable, and we wrote the previously optimized version (and had several rounds of that) to alleviate that cost.

However, we took up the challenge because the payoff can be huge. For a very small bulk insert of 50,000 documents inserted into an empty database, we are talking in the ballpark of 5 million compares (yeah, you read that right). Even if we manage to squeeze 1% off, the sheer volume of calls will make it worthwhile. To achieve that we had to do the unthinkable, we had to resort to digging into the MSIL that method was generating. Armed with ILSpy we found out we may have a way to shave off some inefficiencies.

Here is what this looks like when we start actually putting the analysis into action. You can see the method code (after decompilation, so we can be closer to the IL) as well as the issues that were discovered in the process.

image

Because of the size of the method, the fastest way was to resort to using a C# decompile, even though we then matched it with the generated IL. The trick to using the C# decompiled version is that we must use a decompiler that is not too smart when dealing with the code. If the decompiler had understood the original code intention and acted upon it, we would never have spotted some of the optimizations at this level. For example, the last loop decompiled with JetBrains dotPeek would look like this:

image

Always keep an old version of a decompiler around, just in case you may need it.

Ayende’s note: In most cases, you can select the level of detail that a decompiler will give you. With Reflector, for example, you can select how deeply it will decompile things, but even so, doing a stupid decompilation can be very beneficial by showing us what is actually going on.

Understanding where the inefficiencies may be is one thing; being able to prove them is another matter. And we will tackle all of them in future posts.

We will also leave the memcpy analysis for later because it builds on the optimizations used in memcmp and also requires a deep analysis of the Buffer.Memcpy method already available in the .Net Framework (for internal use of course).

If what we did to the poor Etags was evil, you are now arriving at the gates of the underworld.

Ayende’s note: This is a pretty complex topic, and it goes on for quite a while. In order to maintain interest, and to avoid having people getting lost in the details, I broke it apart for several posts. In the meantime, given the details in this post, how would you suggest improving this?

Excerpts from the RavenDB Performance team report: Expensive headers, and cache effects

This one ended up being pretty obvious, in retrospect. We noticed in the profiler that we spent a lot of time working with headers. Now, RavenDB is using REST as the communication layer, so it is doing a lot with that, but we should be able to do better.

Then Tal dug into the actual implementation and found:

public string GetHeader(string key)
{
	if (InnerHeaders.Contains(key) == false)
		return null;
	return InnerHeaders.GetValues(key).FirstOrDefault();
}

public List<string> GetHeaders(string key)
{
	if (InnerHeaders.Contains(key) == false)
		return null;
	return InnerHeaders.GetValues(key).ToList();
}


public HttpHeaders InnerHeaders
{
	get
	{
		var headers = new Headers();
		foreach (var header in InnerRequest.Headers)
		{
			if (header.Value.Count() == 1)
				headers.Add(header.Key, header.Value.First());
			else
				headers.Add(header.Key, header.Value.ToList());
		}

		if (InnerRequest.Content == null)
			return headers;

		foreach (var header in InnerRequest.Content.Headers)
		{
			if (header.Value.Count() == 1)
				headers.Add(header.Key, header.Value.First());
			else
				headers.Add(header.Key, header.Value.ToList());
		}

		return headers;
	}
}

To be fair, this implementation was created very early on, and no one ever actually spent any time looking at it since (why would they? it worked, and quite well). The problem is the number of copies that we have, and the fact that to pull a single header, we have to copy all the headers, sometimes multiple times. We replaced this with code that wasn’t doing stupid stuff, and we couldn’t even find the cost of working with headers in the profiler any longer.
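For reference, a minimal sketch of the direction such a fix can take; this is hypothetical code, not the actual RavenDB implementation, and it assumes that InnerRequest is the underlying HttpRequestMessage and that System.Collections.Generic and System.Linq are in scope:

public string GetHeader(string key)
{
    IEnumerable<string> values;

    // Read straight from the request headers, then the content headers; no
    // intermediate collection is built, so pulling one header copies nothing.
    if (InnerRequest.Headers.TryGetValues(key, out values))
        return values.FirstOrDefault();

    if (InnerRequest.Content != null &&
        InnerRequest.Content.Headers.TryGetValues(key, out values))
        return values.FirstOrDefault();

    return null;
}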

But that brings up a really interesting question. How could we not know about this sort of thing? I mean, this isn’t the first time that we are doing a performance pass on the system. So how come we missed this?

The answer is that in this performance pass, we are doing something different. Usually we perf-test RavenDB as you would when using it in your own systems. But for the purpose of this suite of tests, and in order to find more stuff that we can optimize, we are actually working with a stripped down client: no caching, no attempt to optimize things across the entire board. In fact, we have put RavenDB in the worst possible situation, all new work and no chance to do any sort of optimization, and then we started seeing how all of those code paths that were rarely hit started to light up quite nicely.

Excerpts from the RavenDB Performance team report: The long tale of a lambda

This nugget was discovered by Tal, who was measuring write throughput and noticed that a lot of the time wasn’t being spent in the proper code path, but on something on the side that seemed… off.

prefetchingQueue.Aggregate(0, (x,c) => x + SelectSerializedSizeOnDiskIfNotNull(c)) > context.Configuration.AvailableMemoryForRaisingBatchSizeLimit

This piece of code is responsible for heuristics deep inside the bowels of RavenDB. In fact, it is the piece that decides whether we have enough memory to index directly from memory, or whether we need to start dumping stuff and pay the I/O cost of reading it back later.

As a result, this piece of code is called once for every document save. It was also quite wildly inefficient. The Aggregate implementation was pretty much as you imagined it, and it took three times as much time as actually processing the rest of the request. The underlying reason was that we kept doing a foreach and a delegate invocation on each and every call. The more documents we had coming in, the more work we had to do. Shlemiel the painter at his best.

We first saw a major improvement by just removing the Aggregate() call in favor of a purpose built function that summed all those details for us directly. Next, we changed things so that instead of doing O(N) work per request, we do O(1) work by doing this work one bit at a time and aggregating it on the fly. So whenever we add or remove something from the prefetching queue, we also make sure to add / remove it from a global tally.
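A hypothetical sketch of that running tally approach (these are not the actual RavenDB types, just the shape of the change; assumes System.Threading and System.Collections.Concurrent are in scope):

public class PrefetchingQueue
{
    private readonly ConcurrentQueue<PrefetchedDocument> _queue = new ConcurrentQueue<PrefetchedDocument>();
    private long _totalSizeOnDisk; // maintained on every add / remove

    public void Enqueue(PrefetchedDocument doc)
    {
        _queue.Enqueue(doc);
        Interlocked.Add(ref _totalSizeOnDisk, doc.SerializedSizeOnDisk);
    }

    public bool TryDequeue(out PrefetchedDocument doc)
    {
        if (_queue.TryDequeue(out doc) == false)
            return false;
        Interlocked.Add(ref _totalSizeOnDisk, -doc.SerializedSizeOnDisk);
        return true;
    }

    // The heuristic now reads a single field instead of walking the whole queue.
    public long TotalSizeOnDisk
    {
        get { return Interlocked.Read(ref _totalSizeOnDisk); }
    }
}

public class PrefetchedDocument
{
    public long SerializedSizeOnDisk { get; set; }
}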

Once that was done, we saw almost an 18% improvement in high write scenarios, because we weren’t busy just counting how much stuff we have in memory to figure out if we can put more things in memory.

I can’t emphasize enough how important it is that the work throughout was done using a profiler (in our case, the excellent dotTrace), because if dotTrace hadn’t pointed a big finger at this line of code, we would never have considered it to be problematic.

Excerpts from the RavenDB Performance team report: Dates take a lot of time

RavenDB uses a lot of dates, from the last modified metadata on a document to the timestamp of an index or when a query was started or… you get the point, lots and lots of dates.

Dates in RavenDB are usually formatted in the following manner:

2015-01-15T00:41:16.6616631

This is done using the following date time format:

yyyy'-'MM'-'dd'T'HH':'mm':'ss.fffffff

This is pretty awesome. It generates readable dates that are lexicographically sorted. There is just one problem: this is really expensive to do. How expensive? Well, outputting 10 million dates in the following manner:

dateTime.ToString(Default.DateTimeFormatsToWrite, CultureInfo.InvariantCulture)

This takes 13.3 seconds, or just about 750 dates per millisecond. The costs here are partly the allocations, but mostly it is about the fact that the format provider needs to first parse the format specifier, then do quite a bit of work to get it working. And DateTime itself isn’t very cheap. The solution presented is ugly, but it works, and it is fast.

public unsafe static string GetDefaultRavenFormat(this DateTime dt, bool isUtc = false)
{
    string result = new string('Z', 27 + (isUtc ? 1 : 0));

    var ticks = dt.Ticks;

    // n = number of days since 1/1/0001
    int n = (int)(ticks / TicksPerDay);
    // y400 = number of whole 400-year periods since 1/1/0001
    int y400 = n / DaysPer400Years;
    // n = day number within 400-year period
    n -= y400 * DaysPer400Years;
    // y100 = number of whole 100-year periods within 400-year period
    int y100 = n / DaysPer100Years;
    // Last 100-year period has an extra day, so decrement result if 4
    if (y100 == 4) y100 = 3;
    // n = day number within 100-year period
    n -= y100 * DaysPer100Years;
    // y4 = number of whole 4-year periods within 100-year period
    int y4 = n / DaysPer4Years;
    // n = day number within 4-year period
    n -= y4 * DaysPer4Years;
    // y1 = number of whole years within 4-year period
    int y1 = n / DaysPerYear;
    // Last year has an extra day, so decrement result if 4
    if (y1 == 4) y1 = 3;
    // If year was requested, compute and return it
    var year = y400 * 400 + y100 * 100 + y4 * 4 + y1 + 1;

    // n = day number within year
    n -= y1 * DaysPerYear;
    // Leap year calculation looks different from IsLeapYear since y1, y4,
    // and y100 are relative to year 1, not year 0
    bool leapYear = y1 == 3 && (y4 != 24 || y100 == 3);
    int[] days = leapYear ? DaysToMonth366 : DaysToMonth365;
    // All months have less than 32 days, so n >> 5 is a good conservative
    // estimate for the month
    int month = n >> 5 + 1;
    // m = 1-based month number
    while (n >= days[month]) month++;
    // If month was requested, return it

    // Return 1-based day-of-month
    var day = n - days[month - 1] + 1;

    fixed (char* chars = result)
    {
        var v = _fourDigits[year];
        chars[0] = v[0];
        chars[1] = v[1];
        chars[2] = v[2];
        chars[3] = v[3];
        chars[4] = '-';
        v = _fourDigits[month];
        chars[5] = v[2];
        chars[5 + 1] = v[3];
        chars[7] = '-';
        v = _fourDigits[day];
        chars[8] = v[2];
        chars[8 + 1] = v[3];
        chars[10] = 'T';
        v = _fourDigits[(ticks / TicksPerHour) % 24];
        chars[11] = v[2];
        chars[11 + 1] = v[3];
        chars[13] = ':';
        v = _fourDigits[(ticks / TicksPerMinute) % 60];
        chars[14] = v[2];
        chars[14 + 1] = v[3];
        chars[16] = ':';
        v = _fourDigits[(ticks / TicksPerSecond) % 60];
        chars[17] = v[2];
        chars[17 + 1] = v[3];
        chars[19] = '.';

        long fraction = (ticks % 10000000);
        v = _fourDigits[fraction / 10000];
        chars[20] = v[1];
        chars[21] = v[2];
        chars[22] = v[3];

        fraction = fraction % 10000;

        v = _fourDigits[fraction];
        chars[23] = v[0];
        chars[24] = v[1];
        chars[25] = v[2];
        chars[26] = v[3];
    }

    return result;
}

We use the same general pattern that we used with etags as well, although here we are also doing a lot of work to figure out the right parts of the date. Note that we don’t have any allocations, and we again use the notion of a lookup table of all the pre-computed 4 digit numbers. That allows us to process 10,000,000 dates in just over 2 seconds (2,061 ms, to be exact), or roughly 4,850 dates per millisecond. In other words, we run in about 15% of the time of the original implementation.
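For completeness, here is a guess at the kind of lookup table the method relies on; the actual _fourDigits table in RavenDB may be built differently (it could hold char arrays instead of strings, for example), but the idea is to pre-render every value from 0 to 9999 as 4 characters (assumes System.Globalization is in scope):

// Hypothetical construction of the lookup table used above.
private static readonly string[] _fourDigits = CreateFourDigitsTable();

private static string[] CreateFourDigitsTable()
{
    var table = new string[10000];
    for (int i = 0; i < 10000; i++)
        table[i] = i.ToString("D4", CultureInfo.InvariantCulture); // e.g. 42 -> "0042"
    return table;
}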

This code is ugly. In fact, the last few posts have contained pretty much ugly code that is hard to understand. But it is significantly faster than the alternative, and what is even more important, those pieces of code are actually being used in RavenDB’s hot path. In other words, we have actually seen significant performance improvement when introducing them to the codebase.

Excerpts from the RavenDB Performance team report: Etags and evil code, part II

In my previous post, I talked about how we improved the performance of Etag parsing from 5 etags/ms to 3,500 etags/ms. In this post, I want to talk about the exact opposite problem, how we take an Etag and turn it into a string. Here is the original code that we had:

public unsafe override string ToString()
{
    var sb = new StringBuilder(36);
    var buffer = stackalloc byte[8];

    *((long*)buffer) = restarts;

    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[7]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[6]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[5]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[4]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[3]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[2]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[1]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[0]]);
    sb.Append('-');

    *((long*)buffer) = changes;

    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[7]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[6]]);
    sb.Append('-');
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[5]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[4]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[3]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[2]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[1]]);
    sb.Append(GenericUtil.ByteToHexAsStringLookup[buffer[0]]);

    var etagAsString = sb.ToString();
    Debug.Assert(etagAsString.Length == 36); //prevent stupid bugs if something is refactored

    return etagAsString;
}

As you can see, we already optimized this a bit. It is using a string builder, and it is using a lookup table to avoid costly byte to string conversions. Note also that we use a stackalloc buffer, so there isn’t an actual heap allocation, and we are able to copy the values once and then just access them directly, which is cheaper than trying to do a lot of bit shifting.
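As an aside, here is a hypothetical sketch of what a byte-to-hex lookup table like GenericUtil.ByteToHexAsStringLookup could look like; the actual RavenDB table may differ, but the idea is to pre-render every byte value as its two character hex string:

// Hypothetical construction: every byte pre-rendered as two hex characters.
private static readonly string[] ByteToHexAsStringLookup = BuildByteToHexLookup();

private static string[] BuildByteToHexLookup()
{
    var lookup = new string[256];
    for (int i = 0; i < 256; i++)
        lookup[i] = i.ToString("X2"); // e.g. 10 -> "0A"
    return lookup;
}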

So far so good. Running on 10 million Etags, this completes in 8.9 seconds. That is good, this gives us 1,125 Etags per millisecond.

Here is the optimized version:

public unsafe override string ToString()
{
    var results = new string('-', 36);

    fixed (char* buf = results)
    {
        var buffer = stackalloc byte[8];
        *((long*)buffer) = restarts;
        var duget = GenericUtil.ByteToHexAsStringLookup[buffer[7]];
        buf[0] = duget[0];
        buf[1] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[6]];
        buf[2] = duget[0];
        buf[3] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[5]];
        buf[4] = duget[0];
        buf[5] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[4]];
        buf[6] = duget[0];
        buf[7] = duget[1];
        //buf[8] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[3]];
        buf[9] = duget[0];
        buf[10] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[2]];
        buf[11] = duget[0];
        buf[12] = duget[1];
        //buf[13] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[1]];
        buf[14] = duget[0];
        buf[15] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[0]];
        buf[16] = duget[0];
        buf[17] = duget[1];
        //buf[18] = '-';

        *((long*)buffer) = changes;

        duget = GenericUtil.ByteToHexAsStringLookup[buffer[7]];
        buf[19] = duget[0];
        buf[20] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[6]];
        buf[21] = duget[0];
        buf[22] = duget[1];
        //buf[23] = '-';
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[5]];
        buf[24] = duget[0];
        buf[25] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[4]];
        buf[26] = duget[0];
        buf[27] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[3]];
        buf[28] = duget[0];
        buf[29] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[2]];
        buf[30] = duget[0];
        buf[31] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[1]];
        buf[32] = duget[0];
        buf[33] = duget[1];
        duget = GenericUtil.ByteToHexAsStringLookup[buffer[0]];
        buf[34] = duget[0];
        buf[35] = duget[1];

        return results;
    }
}

Note that here we don’t bother with a string builder; we directly manipulate the string. And we still use all the other tricks (the lookup table, the no allocation, the works). This code managed to get to 5.5 seconds for 10,000,000 etags, or roughly 1,800 etags per millisecond. Roughly a 37.5% improvement to a pretty important piece of code.

Do you see anything else that we can do to reduce the cost even further?

Excerpts from the RavenDB Performance team report: Etags and evil code, Part I

As part of the performance work we have been doing, we focused on the Etag class, which turned up as a performance hot spot. For this post, I’m going to talk about Etag.Parse(string) and what we did to improve its performance.

Etag is a core type in RavenDB, it is how we represent changes in the database, and we deal with it a lot. As such, it turned up as a performance hot spot in our profiling. This post is based mostly around Maxim’s work to improve the overall performance.

One of the things that we do with Etag is to take a string and turn that into an instance of an Etag. An Etag looks just like a Guid, but its structure has meaning that we care about. Here is what a typical Etag looks like:

01000000-0000-0005-0000-00000000012C

We send them over the wire as strings for a lot of purposes. Here is the initial code that we used for parsing Etags:

public static Etag Parse2(string str)
{
    if (string.IsNullOrEmpty(str))
        throw new ArgumentException("str cannot be empty or null");
    if (str.Length != 36)
        throw new ArgumentException(string.Format("str must be 36 characters (Etag::Parse is {0})", str));

    var buffer = new byte[16]
    {
        byte.Parse(str.Substring(16, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(14, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(11, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(9, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(6, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(4, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(2, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(0, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(34, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(32, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(30, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(28, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(26, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(24, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(21, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(19, 2), NumberStyles.HexNumber)
    };

    return new Etag
    {
        restarts = BitConverter.ToInt64(buffer, 0),
        changes = BitConverter.ToInt64(buffer, 8)
    };
}

You can note several things here. First, we are being kinda funky here with the order of parsing. This is because we read the string in Big Endian format, to allow direct comparisons of the byte strings.

The other issue is the sheer number of allocations and calls we are making. Because this is such a small piece of code, we created 10,000 etags and parsed them a million times. That took 17.3 seconds, for a throughput of roughly 5 Etags per millisecond.

Then we sat down and re-wrote it using a lot of somewhat nasty tricks.

private readonly static int[] _asciisOfHexToNum = CreateHexCharsToNumsTable();

private static int[] CreateHexCharsToNumsTable()
{
    var c = new int['z' + 1];
    for (var i = '0'; i <= '9'; i++)
    {
        c[i] = (char)(i - '0');
    }
    for (var i = 'A'; i <= 'Z'; i++)
    {
        c[i] = (char)((i - 'A') + 10);
    }
    for (var i = 'a'; i <= 'z'; i++)
    {
        c[i] = (char)((i - 'a') + 10);
    }

    return c;
}

public unsafe static Etag Parse(string str)
{
    if (str == null || str.Length != 36)
        throw new ArgumentException("str cannot be empty or null");

    fixed (char* input = str)
    {
        var etag = new Etag();
        int fst = ((byte)(_asciisOfHexToNum[input[0]] * 16 + _asciisOfHexToNum[input[1]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[2]] * 16 + _asciisOfHexToNum[input[3]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[4]] * 16 + _asciisOfHexToNum[input[5]])) << 8 |
            (byte)(_asciisOfHexToNum[input[6]] * 16 + _asciisOfHexToNum[input[7]]);
        int snd = ((byte)(_asciisOfHexToNum[input[9]] * 16 + _asciisOfHexToNum[input[10]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[11]] * 16 + _asciisOfHexToNum[input[12]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[14]] * 16 + _asciisOfHexToNum[input[15]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[16]] * 16 + _asciisOfHexToNum[input[17]]));
        etag.restarts = (uint)snd | ((long)fst << 32);


        fst = ((byte)(_asciisOfHexToNum[input[19]] * 16 + _asciisOfHexToNum[input[20]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[21]] * 16 + _asciisOfHexToNum[input[22]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[24]] * 16 + _asciisOfHexToNum[input[25]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[26]] * 16 + _asciisOfHexToNum[input[27]]));
        snd = ((byte)(_asciisOfHexToNum[input[28]] * 16 + _asciisOfHexToNum[input[29]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[30]] * 16 + _asciisOfHexToNum[input[31]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[32]] * 16 + _asciisOfHexToNum[input[33]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[34]] * 16 + _asciisOfHexToNum[input[35]]));
        etag.changes = (uint)snd | ((long)fst << 32);
        return etag;
    }
}

Here is what we did. We created a translation table, so every possible byte is pre-calculated. That means that the cost of actually parsing the string is basically the cost of a lookup into an array, which is constant time. There are no allocations, nor is there anything expensive going on. On the same machine where the previous code took 17.3 seconds to run, this code can process a million Etags in just 287 milliseconds, or close to 3,500 Etags per millisecond.
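If you want to reproduce this kind of measurement, a hypothetical harness along the lines of the numbers quoted above would look something like this (the sample Etag is taken from earlier in the post):

public static void MeasureParseThroughput()
{
    const string sample = "01000000-0000-0005-0000-00000000012C";
    var sw = System.Diagnostics.Stopwatch.StartNew();

    // Parse the same string a million times, mirroring the measurement above.
    for (int i = 0; i < 1000000; i++)
        Etag.Parse(sample);

    sw.Stop();
    System.Console.WriteLine("{0:N0} etags/ms", 1000000 / sw.ElapsedMilliseconds);
}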

I leave the rate of improvement as an exercise for the reader. Also, can you think of any way in which we can improve this code even further? We weren’t able to figure out anything, but you never know.

Excerpts from the RavenDB Performance team report: Voron vs. Esent

Another thing that turned up in the performance work was the Esent vs. Voron issue. We keep testing everything on both, trying to see which one can outdo the other, fixing a hotspot, then trying again. When we ran the YCSB benchmark we also compared Esent and Voron as storage for our databases, and we found that Voron was very good in read operations while Esent was slightly better in write operations. During the YCSB tests we found out one of the reasons why Voron was a bit slower than Esent for writing: it was consuming 4 times the expected disk space.

The reason for this high disk space consumption was that the benchmark by default generates documents of exactly 1KB; with metadata the actual size was 1.1KB. Voron’s internal implementation uses a B+ tree where the leaves are 4KB in size, and 1KB was the threshold at which we decide not to save the data in the leaf itself but to keep a reference to it and save it on a new page. We ended up creating a new 4KB page to hold a 1.1KB document for each document that we saved. The benchmark actually hit the worst case scenario for our implementation, and caused us to use 4 times more disk space and write 4 times more data than we needed. Changing this threshold reduced the disk space consumption to the expected size, and gave Voron a nice boost.

We are also testing our software on a wide variety of systems, and with Voron specifically we ran into an annoying issue. Voron is a write ahead log system, and we are very careful to write to the log in a very speedy manner. This is one of the ways in which we are getting really awesome speed for Voron. But when running on a slow I/O system, and putting a lot of load on Voron, we started to see very large stalls after a while. Tracing the issue took a while, but eventually we figured out what was going on. Writing to the log is all well and good, but we need to also send the data to the actual data file at some point.

The way Voron does it, it batches a whole bunch of work, writes it to the data file, then syncs the data file to make sure it is actually persisted on disk. Usually, that isn’t really an issue. But on slow I/O, and especially under load, you get results like this:

Start to sync data file (8:59:52 AM). Written but unsynced data size 309 MB
FlushViewOfFile duration 00:00:13.3482163. FlushFileBuffers duration: 00:00:00.2800050.
End of data pager sync (9:00:05 AM). Duration: 00:00:13.7042229

Note that these are random writes, because we may be writing to any part of the file, but that is still way too long. What was worse, and the reason we actually cared, is that we were doing that while holding the transaction lock.

We were able to re-design that part so that even under slow I/O, we can take the lock for a very short amount of time, update the in memory data structure and then release the lock and spend some quality time gazing at our navel in peace while the I/O proceeds at its own pace, but now without blocking anyone else.

Excerpts from the RavenDB Performance team report: Routing

We have been quite busy lately doing all sort of chores. The kind of small features that you don’t really notice, but make a lot of difference over time. One of the ways that we did that was to assign a team for the sole purpose of improving the performance of RavenDB. This is done by writing a lot of stress tests, running under profilers and in general doing very evil things to the code to get it to work much faster. The following is part of the results that we have found so far, written mostly by Tal.

We decided to use the Yahoo! Cloud Serving Benchmark (AKA YCSB) for that purpose. While YCSB is great at creating stress on the target database servers, it does not offer a complex interface to stress complex scenarios, and seems to be targeting key-value solutions. YCSB basically allows you to control some parameters of the workload engine and the implementation of five basic operations: Insert, Update, Read, Scan and Delete of a single record, in a single table/collection.

Note: I think that this is because it focused (at least initially) primarily on the Big Table type of NoSQL solutions.

This very thin interface allows us to mainly stress document PUT/GET. The first thing we found is that running with indexes actually slowed us down, since we are “wasting” precious CPU time on indexing and not doing any querying. That is enough to show you the true face of benchmark miracles, since where have you ever seen a database that you didn’t wish to run a query on? But this is a topic for another post… let’s go back to performance.

The second thing we found was a hotspot in the route data retrieval mechanism of Web API. While Web API routing seems to work like magic, when you get to approximately four hundred endpoints you will start suffering from performance issues retrieving route data. In our case we were spending about 1ms per HTTP request just to retrieve the route data for the request. That may not seem like much, but for 1,500 requests that took 5 seconds to execute, 1.5 seconds were spent just on route data retrieval. Please, do not try to extrapolate the throughput of RavenDB from those numbers, since we ran this test on a weak machine, also under a profiling tool that slowed it down significantly.

Note: Effectively, for our use case, we were doing an O(400) operation on every request, just to find what code we needed to run to handle this request.

We decided to cache the routes for each request, but just straightforward caching wouldn’t really help in our case. The problem is that we have many unique requests that would fill the cache and then wouldn’t actually be of use.

Our major issue was requests that had a route that looked like databases/{databaseName}/docs/{*docId}, while we do want to cache routes such as databases/{databaseName}/stats. The problem is that we have both types of routes and they are used quite often, so we ended up with a greatest common route cache system. That means that /databases/Northwind/docs/Orders/order1 and /databases/Northwind/docs/Orders/order2 will go into the same bucket, which is /databases/Northwind/docs.

In that way, we still have to do some work per request, but now we don’t have to go through all four hundred routes to find the relevant one.
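A hypothetical sketch of that greatest common route cache (not the actual RavenDB code; the bucketing rule here is simplified to the /docs/ case described above):

using System;
using System.Collections.Concurrent;

public class RouteCache<TRouteData>
{
    private readonly ConcurrentDictionary<string, TRouteData> _cache =
        new ConcurrentDictionary<string, TRouteData>(StringComparer.OrdinalIgnoreCase);

    // Reduce "/databases/Northwind/docs/Orders/order1" to "/databases/Northwind/docs"
    // so that all document requests for a database share a single cache bucket.
    private static string GetBucket(string path)
    {
        var docsIndex = path.IndexOf("/docs/", StringComparison.OrdinalIgnoreCase);
        return docsIndex == -1 ? path : path.Substring(0, docsIndex + "/docs".Length);
    }

    public TRouteData GetOrAdd(string path, Func<string, TRouteData> resolveRoute)
    {
        // The expensive route resolution runs once per bucket, not once per unique URL.
        return _cache.GetOrAdd(GetBucket(path), _ => resolveRoute(path));
    }
}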

What was the performance impact, you wonder? We took off 90% of the route data retrieval time, that is, we spend 0.15 seconds instead of 1.5 seconds for those 1,500 requests, reducing the overall runtime to 3.65 seconds; that is 27% less run time.

Note: Again, don’t pay attention to the actual times, it is the rate of improvement that we are looking at.

As an aside, another performance improvement that we made with regards to routing was routing discovery. During startup, we scan the RavenDB assembly for controllers, and register the routes. That is all great, as long as you are doing that once. And indeed, in most production scenarios, you’ll only do that once per server start. That means that the few dozen milliseconds that you spend on this aren’t really worth thinking about. But as it turned out, there was one key scenario that was impacted by this. Tests! In tests, you start and stop RavenDB per test, and reducing this cost is very important to ensuring a pleasant experience. We were able to also optimize the route discovery process, and we were actually able to reduce our startup time from cold boot to below what we had in 2.5.


RavenDB–Indexing performance timeline

We get people asking us about the details of RavenDB indexes, why a certain index is busy, what is costly, etc. This is a pretty common set of questions, and while we were always able to give the customer a good answer, I was never truly happy with that. A lot of that was about understanding the system at a very deep level and being able to read data that is fairly obscure to an outsider. Like most things that aren’t obvious and easy to use, that bugged me, deeply. Therefore, we spent a few weeks adding internal metrics and exposing everything in a nice and easy fashion.

Let me show you the details from one of our production databases:

image

The Y axis is for different indexes. On the X axis, you have time. Note that the top part is for map operations and the bottom part is for reduce operations, and you can visually see that the map & reduce are actually running in parallel.

What is even better, you can visually see each individual indexing batch, let us focus on the long indexing batch in the middle:

image

As you can see, we are indexing 512 documents, but it took us a really long time to index them. Why is that? We can see that the reason this batch took so long is that the Orders/Stats and Products/Stats indexes took very long. Clicking on them, we can see:

image

 

The Orders/Stats index accepted 105 documents, and output 1,372 map values. And the costly part of this index was to actually commit those (along with the Products/Stats, which also output a similar amount) to the map storage.

Another slow index is exposed here:

image

In this case, we have an index that is big enough that we need to flush it to disk from RAM. That explains things.

I’m really excited about this feature.


Current status, Voron on Linux

We have been quiet for a while on that front, but that is just because we were heads down and working hard on it. The current status is that we are getting there.

Recently we have been fighting a SIGSEGV (segmentation fault) that appeared out of nowhere and kind of ruined our fun. To be rather more exact, we have been on that bug for over a month. We have been getting great help from Sine Nomine in figuring out the root cause, and today Neale Ferguson finally tracked down the final clues that gave us a simple reproduction.

Take the following code and run it on a Linux machine (in our case, Ubuntu 14.04 with Mono 3.10). It will crash with a really nasty error:

using System;
using Mono.Unix.Native;

class MainClass
{
    public unsafe static void Main(string[] args)
    {
        var f = Syscall.mmap(IntPtr.Zero, 1024 * 1024, MmapProts.PROT_WRITE | MmapProts.PROT_READ, MmapFlags.MAP_SHARED | MmapFlags.MAP_ANONYMOUS, -1, 0);

        var p = new byte*[]{ (byte*)f.ToPointer() };

        Syscall.munmap(f,1024*1024);

        Console.WriteLine(p.Length);

        p = null;

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}

The underlying reason, to the best of my understanding at the moment, is that the sgen GC implementation in Mono is inspecting the p array for its values. But instead of just treating this as an opaque value (because I wouldn’t expect the GC to follow a pointer reference), it is trying to access the value. Since by this time that value is no longer valid, it is accessing an invalid memory location, and dying horribly.

The good news is that once we had a simple repro, the Mono team was really quick on their feet and this has already been fixed.


Side by side indexes in RavenDB

 

One of the new features we just finished working on is the notion of side by side indexes. You can see how to access it on the image, but what is it?

image

As usual lately, this is part and parcel of our current focus on operations. The basic notion is simple. You have an index in production that you want to update, but just updating it on the fly isn’t going to work.

The problem is that updating the index forces a complete reindexing of the data, and in most production systems, we have enough documents that this can take a while. That means that during the indexing process, your application would see partial results, and may see that for a while. A common workaround was to add support in your application for changing the index name, so you would create a new index, wait for it to become non-stale, and then update the application configuration so you would query the new index.

With the side by side option, this is all handled for you. When you save an index with the side by side option, you get the following options:

image

Basically, we create the index under a temporary name and start indexing it. And you can choose when RavenDB will automatically make the switch (note that there is an OR condition between those, so any of them that match would work).

This gives you the option of changing an index with no interruption of service, because RavenDB will take care of all of that for you. You can even mark an index with “force to change via side-by-side”, which will protect it from accidental changes.


Data Subscriptions in RavenDB

Most of the time, RavenDB is being used for OLTP scenarios, and it is doing great work there. However, we have customers that use RavenDB as the source for data processing jobs. Those jobs call for processing all or most of the documents in the database.

A typical example would be an ordering system, where each document is an order, and we need to process the orders. You can certainly handle this right now, but that requires you to maintain quite a bit of state in order to do so efficiently. What we have done is to take this entire process and make this very simple to use. I know that this is a bit vague, but let me try showing you some code, and hopefully it will clear things up.

var id = store.Subscriptions.Create(new SubscriptionCriteria
{
    BelongsToCollection = "Orders",
    PropertiesMatch =
    {
        {"Valid", true}
    },
});

Here we create a subscription, along with its configuration. In this case: give me all the orders, but only those which are valid. We can also do a bunch more basic filters like that. The result of this operation is an id, which I can use to open the subscription.

Note that subscriptions are persistent and long lived, so you are expected to hold on to that id and make use of it. Using the subscription id, we can open the subscription:

IObservable<User> subscription = store.Subscriptions.Open<User>(id, new SubscriptionConnectionOptions
{
    IgnoreSubscribersErrors = false,
    BatchOptions = new SubscriptionBatchOptions()
    {
        AcknowledgmentTimeout = TimeSpan.FromMinutes(1),
        MaxDocCount = 1024 * 16,
        MaxSize = 1024 * 1024 * 4
    },
    ClientAliveNotificationInterval = TimeSpan.FromSeconds(10),
});

You can see that we specify some details about the running connection. In particular, we limit the size of a single batch, and the heartbeat intervals. I’ll touch on the error handling a bit later; first, let us see how we actually get the data. Because the subscription is an IObservable<RavenJObject>, you can just utilize Reactive Extensions and subscribe to the incoming stream as you would expect to. And it means that you’ll continue to get documents, even ones that were added after you opened the subscription.
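A hypothetical usage sketch, assuming the subscription variable from the snippet above, the Rx ObservableExtensions (with using System in scope), and that the User class has an Id property:

var handle = subscription.Subscribe(
    user =>
    {
        // Do the actual processing here; the server resends the document until
        // we acknowledge that it was handled successfully.
        Console.WriteLine("Processing user: " + user.Id);
    },
    error => Console.WriteLine("Subscriber error: " + error));

// Dispose the handle when you want to stop receiving documents:
// handle.Dispose();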

What is interesting is the error handling scenario. Data subscriptions will ensure that you’ll receive each document at least once, but what happens if there is an error? In that case, it depends on your configuration. In the code above, we don’t allow errors, so you’ll get the same document over and over until you have successfully processed it. If you set IgnoreSubscribersErrors to true, we’ll ignore the errors raised by subscribers.

The nice thing about this is that it works even in the presence of crashes. Behind the scenes, once you have successfully processed a document, we’ll send a confirmation to the server about it, so if there is a crash, we know we already processed it. If you crashed midway, we’ll just resend you the relevant document when you open the subscription again.


RavenDB Comics, the 1st draft

We were talking about comics at lunch. And the next day, Tal came to the office with the following.

image

Nitpicker corner: This is a doodle, not a finished product, yes, we know that there are typos.

I like this enough that we will probably be doing something significant with this. Opinions are welcome.

Do you have a RavenDB as a super hero idea? Or a Dilbert’s style skit for RavenDB?


Iterative index tryouts in RavenDB

The release of 3.0 means that we are free to focus on the greater features that we have in store for vNext, but that doesn’t mean that we aren’t still working on small features that are meant to make your life easier.

In this case, this feature is here to deal with index definition churn. Consider the case where you have a large database, and you want to define an index, but aren’t quite sure exactly what you want. This is especially the case when you are working with the more complex indexes. The problem is that creating the index would mean having to process all the data in your database. That is usually fine, because RavenDB can still operate normally under those conditions, but it also means that you have to wait for this to happen, check the results, make a change to the index, and so on.

Because of the way we are optimizing new index creation, we are trying to do as much as possible, as soon as possible. So by the time we expose partial results for the index, you have already been waiting for too long. This isn’t a problem for standard usage, because this behavior reduces the overall time for indexing a new index, but it is a problem when you are developing, because it increases the time to the first visible result.

Because of that, we have added this feature:

image

This will create a temporary test index:

image

This index is going to operate on a partial data set, instead of all the data, but aside from that, it is identical in all respects to any other index, and you can test it out using all the normal tools we offer. Once you have changed the index definition to your liking, you can make the index permanent, which actually goes into processing all documents.

This makes rapid index prototyping on a large dataset much easier.


Minimal incremental backups improvements in Voron

Voron uses a fairly standard journaling system to ensure ACID compliance. Whenever you commit a transaction, we write the changed pages to the log.

Let us take the simplest example, here is a very simple page:

image

Now we write the following code:

for (var i = 0; i < 3; i++)
{
    using (var tx = env.NewTransaction(TransactionFlags.ReadWrite))
    {
        var users = tx.ReadTree("users");
        users.Add("Users/" + NextUserId(), GetUserData());
        tx.Commit();
    }
}

After executing this code, the page is going to look like this:

image

But what is going to happen in the log? We have to commit each transaction separately, so what we end up is the following journal file:

image

The log contains 3 entries for the same page, which makes sense, because we changed that page in 3 different transactions. Now, let us say that we have a crash, and we need to apply the journal. We are going to scan through the journal, and only write Page #23 from tx #43. For the most part, we don’t care about the older versions of the page. That is all well and good, until incremental backups come into play.

The default incremental backup approach used by Voron is simplicity itself. Just create a stable copy of the journal files, and you are pretty much done, there isn’t anything else that you need to do. Restoring involves applying those transactions in the order they appear in the journal, and rainbows and unicorns and peace on Earth.

However, in many cases, the journal files contain a lot of outdated data. I mean, we really don’t care about the state of Page #23 in tx #42. This came into play when we started to consider how we’ll create Raft snapshots on an ongoing basis for large datasets. Just using incremental backup showed a lot of promise, but it also exposed the problem of the journal size.

That is when we introduced the notion of a minimal incremental backup. A minimal incremental backup is a lot more complex to create, but it actually restores in the exact same way as a normal incremental backup.

Conceptually, it is a very simple idea. Read the journals, but instead of just copying them as is, scan through them and find the latest version of all the pages that appear in the journal. In this case, we’ll have a Page #23 from tx #43. And then we generate a single transaction for all the latest versions of all the pages in the transactions we read.
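A hypothetical sketch of that collapse step (not the actual Voron code, just the shape of the idea; assumes System.Collections.Generic and System.Linq are in scope):

public class JournalPageWrite
{
    public long PageNumber;
    public long TransactionId;
    public byte[] PageData;
}

public static class MinimalBackup
{
    // Walk the journal entries in commit order; later writes overwrite earlier
    // ones, so after the scan we hold only the latest version of every page.
    public static ICollection<JournalPageWrite> CollapseJournal(IEnumerable<JournalPageWrite> journalInOrder)
    {
        var latestByPage = new Dictionary<long, JournalPageWrite>();
        foreach (var write in journalInOrder)
            latestByPage[write.PageNumber] = write;

        // These surviving pages are what gets written out as the single merged
        // transaction of the minimal incremental backup.
        return latestByPage.Values.OrderBy(w => w.PageNumber).ToList();
    }
}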

We tried it on the following code:

for (int xi = 0; xi < 100; xi++)
{
    using (var tx = envToSnapshot.NewTransaction(TransactionFlags.ReadWrite))
    {
        var tree = envToSnapshot.CreateTree(tx, "test");

        for (int i = 0; i < 1000; i++)
        {
            tree.Add("users/" + i, "john doe/" + i);
        }

        tx.Commit();
    }
}


This is an interesting experiment, because we are making modifications to the same keys (and hence, probably the same pages), multiple times. This also reflects a common scenario in which we have a high rate of updates.

Min incremental backup created an 8KB file to restore this, while the standard incremental backup created a 67KB file for the same purpose.

That doesn’t sound like much, until you realize that those are compressed files: the uncompressed sizes were 80KB for the minimal incremental backup and 1.57MB for the incremental backup. Restoring the min incremental backup is a lot more efficient. However, it ain’t all roses.

An incremental backup gives you the ability to replay transactions one at a time. A min incremental backup will merge transactions, so you get the same end result, but you can’t stop midway (for example, to do a partial rollback). Taking a min incremental backup is also more expensive: instead of doing primarily file I/O, we have to read the journals, understand them and output the set of pages that we actually care about.

For performance and ease of use, we limit the size of a merged transaction generated by a min incremental backup to about 512 MB. So if you have made changes to over 512MB of your data since the last backup, we’ll still generate a merged view of all the pages, but we’ll actually apply that across multiple transactions. The idea is to avoid trying to apply a very big transaction and consume all resources from the system in this manner.

Note that this feature was developed specifically to enable better behavior when snapshotting state machines for use within the Raft protocol. Because Raft uses snapshots to avoid having an infinite log, and because the Voron journal is effectively the log of all the disk changes made in the system, this was something that we had to do. Otherwise we couldn’t rely on incremental backups for snapshots (we would have just swapped the Raft log for the Voron journal, probably with no saving in space). That would have forced us to rely on full backups, and we don’t want to take a multi GB backup very often if we can potentially avoid it.


End of year discount for all our products

To celebrate the new year, we offer a 21% discount for all our products. This is available for the first 33 customers that use the coupon code: 0x21-celebrate-new-year

In previous years, we offered a similar number of uses for the coupon code, and they run out fast, so hurry up. This offer is valid for:

Happy Holidays and a great new year.

On a personal note, this marks the full release of all our product lines, and it took an incredible amount of work. I'm very pleased that we have been able to get the new version out there and in your hands, and to have you start making use of the features that we have been working on for so long.


Introducing Rachis: Raven’s Raft implementation

Rachis, def: Spinal column, also the distal part of the shaft of a feather that bears the web.

Rachis is the name we picked for RavenDB’s Raft implementation. Raft is a consensus protocol that can actually be understood without resorting to Greek philosophy. You can read all about it here (there is a very cool interactive visualization there). I would also like to thank Diego Ongaro for both the Raft paper and a lot of help while I tried to understand the finer points of it.

Why Raft?

Raft is a distributed consensus protocol. It allows you to reach an ordered set of operations across your entire cluster. This means that you can apply a set of operations on a state machine, and have the same final state machine in all nodes in the cluster. It is also drastically simpler to understand than Paxos, which is the better known alternative.

What is Rachis?

Well, it is a Raft implementation. To be rather more exact, it is a Raft implementation with the following features:

  • (Obviously) the ability to manage a distributed set of state machines and reliably commit updates to said state machines.
  • Dynamic topology (nodes can join and leave the cluster on the fly, including state sync).
  • Large state machines (snapshots, efficient transfers, non voting members).
  • ACID local log using Voron.
  • Support for in memory and persistent state machines.
  • Support for voting & non voting members.
  • A lot of small tweaks for best behavior in strange situations (forced step down, leader timeout and many more).

What are you going to use this for?

To reach a consensus, of course. More to the point, we have some really nice ideas about where this is going to allow us to do some really nice stuff. In particular, we want to use it as the backbone for the event and time series replication models.

But I’m getting ahead of myself. Before we do that, I want to build a simple reliable distributed service. We’ll call it Tail/Feather and it will be awesome, in a weekend project kind of way. I’ll post full details about this in my next post.

Where can I look at it?

The current version is here, note that you’ll need to pull Voron as well (from the ravendb repository) to get it working.

How does this work?

You can read the Raft paper and the full thesis, of course, but there are some subtleties that I had to work through (with great help from Diego), so it is worth going into a bit more detail.

Clusters are typically composed of an odd number of servers (3, 5 or 7), which can communicate freely with one another. The startup process for a cluster requires us to designate a single server as the seed. This is the only server that can become the cluster leader during the cluster bootstrap process. This is done to make sure that during startup, before we have had the chance to tell the servers about the cluster topology, they won’t consider themselves a single node cluster and start accepting requests before we add them to the cluster.

Only the seed server will become the leader, and all the others will wait for instructions. We can then let the seed server know about the other nodes in the cluster. It will initiate a join operation, which reaches out to the other nodes and sets up the appropriate cluster topology. At that point, all the other servers are on equal footing, and there is no longer any meaningful distinction between them. The notion of a seed node is only relevant for cluster bootstrap; once that is done, all servers have the same configuration, and there isn’t any difference between them.

Dynamically adding and removing nodes from the cluster

Removing a node from the cluster is a simple process. All we need to do is to update the cluster topology, and we are done. The removed server will get a notification to let it know that it has been disconnected from the cluster, and will move itself to a passive state (note that it is okay if it doesn’t get this notification, we are just being nice about it).

The process of adding a server is a bit more complex. Not only do we have to add a new node, we need to make sure that it has the same state as all other nodes in the cluster. In order to do that, we handle it in multiple stages. A node added to the cluster can be in one of three states: Voting (full member of the cluster, able to become a leader), Non Voting (just listening to what is going on, can’t be a leader), Promotable (getting up to speed with the cluster state). Non voting members are a unique case, they are there to enable some advanced scenarios (cross data center communication, as we currently envision it).

Promotable is a lot more interesting. Adding a node to an existing cluster can be a long process, especially if we are managing a lot of data. In order to handle that, we add the server to the promotable category, in which case we start to send it the state it needs to catch up with the rest of the cluster. Once it has caught up with the cluster (it has all the committed entries in the cluster), we will automatically move it to the voting members in the cluster.
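A hypothetical sketch of those membership states and the promotion rule, just to make the description concrete (these are not the actual Rachis types):

public enum NodeState
{
    Voting,      // full member, counted for quorum, can become the leader
    NonVoting,   // receives the log but never votes or leads (e.g. cross data center)
    Promotable   // still catching up with the cluster state
}

public class ClusterMember
{
    public NodeState State;
    public long LastAppliedIndex;

    // Once a promotable node has every committed entry, it becomes a voting member.
    public void TryPromote(long clusterCommitIndex)
    {
        if (State == NodeState.Promotable && LastAppliedIndex >= clusterCommitIndex)
            State = NodeState.Voting;
    }
}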

Note that it is fine for crashes to happen throughout this process. The cluster leader can crash during this, and we’ll recover and handle this properly.

Normal operations

During normal operations, there is going to be a leader that accepts all the requests for the cluster and handles committing them cluster wide. During those operations, you can spread reads across the members of the cluster, for better performance.
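As a rough sketch of what that routing looks like from the client’s point of view (the names are illustrative, not an actual API):

using System.Collections.Generic;

// Hypothetical sketch: writes always go to the leader, while reads can be
// spread across all cluster members.
public class ClusterRouter
{
    private readonly string _leaderUrl;
    private readonly IReadOnlyList<string> _allNodeUrls;
    private int _roundRobin;

    public ClusterRouter(string leaderUrl, IReadOnlyList<string> allNodeUrls)
    {
        _leaderUrl = leaderUrl;
        _allNodeUrls = allNodeUrls;
    }

    // All writes are committed cluster wide through the leader.
    public string UrlForWrite()
    {
        return _leaderUrl;
    }

    // Reads can go to any member, round robin for simple load spreading.
    public string UrlForRead()
    {
        var index = (_roundRobin++ & int.MaxValue) % _allNodeUrls.Count;
        return _allNodeUrls[index];
    }
}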

Now, if you don’t mind, I’m going to be writing Tail/Feather now, and see how long it takes.

The process of performance problem fixes with RavenDB

This post isn’t so much about this particular problem, but about the way we solved this.

We have a number of ways to track performance problems, but this is a good example. We can see that, for some reason, this test has failed because it took too long to run:

image

In order to handle that, I didn’t want to keep running the test itself; I don’t actually care that much about the test. What I wanted was to be able to investigate this independently.

To do that, I added:

image
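The exact line is in the screenshot; roughly, it is the kind of test hook shown below. This is a sketch from memory: the test name is invented, and NewDocumentStore / WaitForUserToContinueTheTest are the usual RavenDB test helpers as I understand them, so treat them as assumptions rather than the actual code.

// Hypothetical sketch of the debugging hook added to the failing test.
public class SlowTestInvestigation : RavenTestBase
{
    [Fact]
    public void Reproduce_the_slow_run()
    {
        using (var store = NewDocumentStore())
        {
            // ... run the scenario that was timing out ...

            // Pauses the test and keeps the embedded server alive, so the
            // studio can be opened against it and the data exported.
            WaitForUserToContinueTheTest(store);
        }
    }
}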

This opens the studio with all the data that we have for this test, which is great, since it means that we can export the data.

image

That done, we can import it into an instance that we control, and start testing the performance. In particular, we can run it under a profiler, to see what it is doing.

The underlying reason ended up being an issue with how we flush things to disk, which was easily fixed once we could narrow it down. The problem was just getting it to happen in a reproducible manner. This approach, being able to just stop midway through a test and capture the full state of the system, is invaluable in troubleshooting what is going on.

Beyond RavenDB 3.0: The future road map for RavenDB

We are pretty much done with RavenDB 3.0; we are waiting for fixes to internal apps we use to process orders and support customers, and then we can actually make a release. In the meantime, that means that we need to start looking beyond the 3.0 release. We have had a few people internally focused on post-3.0 work for the past few months, and we have a rough outline of what we have done there. Primarily we are talking about better distribution and storage models.

Storage models – the polyglot database

Under this umbrella we put dedicated database engines that support specific needs. We are talking about distributed counters (high scale-out, rapid throughput), time series, and an event store as the primary areas we are focused on. For example, the counters work is pretty much complete, but we didn’t have time to actually turn it into a fully mature product.

I talked about this several times in the past, so I’ll not get into too many details here.

Distribution models

We have been working on a Raft implementation for the past few months, and it is now at the stage where we are starting to integrate it into the rest of our software. Raft is planned to be the core replication protocol for the time series and event databases. But you are probably going to see it first as a topology super layer for RavenDB and RavenFS.

Distributed topology management

Replication support in RavenDB and RavenFS follows the multi-master model. You can write to any node, and your write will be distributed by the server to all the nodes. This has several advantages, in particular the fact that we can operate in a disconnected or partially disconnected manner, and that we need little coordination between clients to get everything working. It also has the disadvantage of allowing conflicts. In fact, if you are writing to multiple replicating nodes, and aren’t careful about how you are splitting writes, you are pretty much guaranteed to have conflicts. We have repeatedly heard that this is both a good thing and something that customers really don’t want to deal with.

It is a good thing because we don’t have data loss, it is a bad thing because if you aren’t ready to handle this, some of your data is inaccessible because of the conflict until it is resolved.

Because of that, we are considering implementing a server side topology management system. The actual replication mechanics are going to remain the same. But the difference is how we are going to decide how to work with it.

A cluster (in this case, a set of RavenDB servers and all the databases and file systems on them) is composed of cooperating nodes. The cluster is managed via Raft, which is used to store the topology information of the cluster. The topology includes each of the nodes in the system, as well as all of the databases and file systems on the cluster. The cluster will select a leader, and that leader will also be the primary node for writes for all databases. In other words, assume we have a 3 node cluster, and 5 databases in the cluster. All the databases are replicated to all three nodes, and a single node is going to serve as the write primary for all operations.
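As a rough illustration, the topology record stored via Raft might look something like the sketch below; the field names are mine, not the actual RavenDB types.

using System.Collections.Generic;

// Hypothetical shape of the topology record managed by Raft.
public class ClusterTopology
{
    public long Version;              // bumped on every topology change
    public string LeaderUrl;          // the single write primary for all databases
    public List<string> NodeUrls;     // all the servers in the cluster
    public List<string> Databases;    // databases replicated to every node
    public List<string> FileSystems;  // RavenFS file systems on the cluster
}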

During normal operations, clients will query any server for the replication topology (and cache it) every 5 minutes or so. If a node is down, we’ll switch over to an alternative node. If the leader is down, we’ll query all the other nodes to try to find out who the new leader is, then continue using that leader’s topology from then on. This gives us the advantage that a down server causes clients to switch over and stay switched, which avoids an operational hazard when you bring a downed node back up again.

Clients will include the topology version they have in all communication with the server. If the topology version doesn’t match, the server will return an error, and the client will query all the nodes it knows about to find the current topology version. It will always choose the latest topology version, and continue from there.
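Here is a minimal sketch of that client behavior, reusing the ClusterTopology shape from the earlier sketch; the names are illustrative and not the actual RavenDB client API.

using System;
using System.Linq;

// Hypothetical sketch: cache the topology, send its version with every
// request, and refresh from all known nodes when the server reports a mismatch.
public class TopologyAwareClient
{
    private static readonly TimeSpan RefreshInterval = TimeSpan.FromMinutes(5);

    private readonly Func<string, ClusterTopology> _fetchTopology; // asks a node for its topology
    private ClusterTopology _topology;
    private DateTime _lastRefresh;

    public TopologyAwareClient(ClusterTopology initial, Func<string, ClusterTopology> fetchTopology)
    {
        _topology = initial;
        _fetchTopology = fetchTopology;
        _lastRefresh = DateTime.UtcNow;
    }

    // Sent with every request so the server can detect a stale client.
    public long TopologyVersionHeader
    {
        get { return _topology.Version; }
    }

    // Called when the server reports a topology version mismatch: ask every
    // node we know about and adopt the latest topology we can find.
    public void OnTopologyMismatch()
    {
        var latest = _topology.NodeUrls
            .Select(url => _fetchTopology(url))
            .Where(t => t != null)
            .OrderByDescending(t => t.Version)
            .FirstOrDefault();

        if (latest != null)
            _topology = latest;
        _lastRefresh = DateTime.UtcNow;
    }

    // Periodic refresh: query any server for the topology every 5 minutes or so.
    public void MaybeRefresh(string anyNodeUrl)
    {
        if (DateTime.UtcNow - _lastRefresh < RefreshInterval)
            return;

        var latest = _fetchTopology(anyNodeUrl);
        if (latest != null && latest.Version > _topology.Version)
            _topology = latest;
        _lastRefresh = DateTime.UtcNow;
    }
}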

Note that there is still a chance for conflicts: a leader may become disconnected from the network without being aware of it and accept writes to the database, while another node takes over as the cluster leader and clients start writing to it. There is a gap where a conflict can occur, but it is a pretty small one, and we have good mechanisms to deal with conflicts anyway.

We are also thinking about exposing a system similar to the topology for clients directly. Basically, a small distributed and consistent key/value store. Mostly meant for configuration.

Thoughts?

RavenDB Wow! Features presentation

At Oredev, besides sitting in a booth and demoing why RavenDB is cool about one trillion times, I also gave a talk. I intended it to be a demo-packed 60 minutes, but then I realized that I only had 40 minutes for the entire thing.

The only thing to do was to speed things up; I think I breathed twice throughout the entire presentation. And I think it went great.

RAVENDB: WOW! FEATURES - THE THINGS THAT YOU DIDN'T KNOW THAT YOUR DATABASE CAN DO FOR YOU from Øredev Conference on Vimeo.

RavenDB 3.0 RTM!

RavenDB 3.0 is out and about!

RavenDB

It is available on our downloads page and on NuGet. You can read all about what is new with RavenDB 3.0 here.

This is a stable release, fully supported. It is the culmination of over a year and a half of work, a very large team and enough improvements to make you dance a jig.

You can play with the new version here, and all of our systems have been running on 3.0 for a while now, of course.

And with that, I’m exhausted, thrilled and very excited. Have fun playing with 3.0, and check by tomorrow to see some of the cool Wow features.

Open-mouthed smile
