Oren Eini, CEO of RavenDB, a NoSQL Open Source Document Database

time to read 7 min | 1350 words

As part of the performance work we have been doing, we focused on the Etag class, which showed up as a performance hot spot. For this post, I’m going to talk about Etag.Parse(string) and what we did to improve its performance.

Etag is a core type in RavenDB; it is how we represent changes in the database, and we deal with it a lot. As such, it turned up as a hot spot in our profiling. This post is based mostly on Maxim’s work to improve the overall performance.

One of the things that we do with Etag is to take a string and turn that into an instance of an Etag. An Etag looks just like a Guid, but its structure has meaning that we care about. Here is what a typical Etag looks like:

01000000-0000-0005-0000-00000000012C
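Before we get to the parsing code, a quick note on what it is filling in. The real Etag class has a lot more to it (comparisons, formatting and so on), but for the snippets below you can think of it as this minimal sketch:

public class Etag
{
    // Minimal sketch of the Etag type the parsing code below assumes.
    // Only the two 64-bit halves that the parsers fill in are shown here.
    internal long restarts; // the first 16 hex digits of the string
    internal long changes;  // the last 16 hex digits of the string

    // the Parse methods shown below are static members of this class
}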

We send them over the wire as strings for a lot of purposes. Here is the initial code that we used for parsing Etags:

public static Etag Parse2(string str)
{
    if (string.IsNullOrEmpty(str))
        throw new ArgumentException("str cannot be empty or null");
    if (str.Length != 36)
        throw new ArgumentException(string.Format("str must be 36 characters (Etag::Parse is {0})", str));

    var buffer = new byte[16]
    {
        byte.Parse(str.Substring(16, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(14, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(11, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(9, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(6, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(4, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(2, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(0, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(34, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(32, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(30, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(28, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(26, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(24, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(21, 2), NumberStyles.HexNumber),
        byte.Parse(str.Substring(19, 2), NumberStyles.HexNumber)
    };

    return new Etag
    {
        restarts = BitConverter.ToInt64(buffer, 0),
        changes = BitConverter.ToInt64(buffer, 8)
    };
}

You can note several things here. First, we are being kinda funky with the order of parsing. This is because we read the string in big endian format, to allow direct comparison of the resulting byte representations.

The other issue is the sheer number of allocations and calls we are making. Because this is such a small piece of code, we needed a lot of iterations to measure it: we created 10,000 etags and parsed them a million times. That took 17.3 seconds, for a throughput of roughly 58 Etags per millisecond.
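The measurement itself was nothing fancy. This isn’t our exact harness, but a minimal sketch of that kind of loop would look something like this (Guid.NewGuid() just gives us 36-character, etag-shaped strings):

var etags = new string[10000];
for (var i = 0; i < etags.Length; i++)
    etags[i] = Guid.NewGuid().ToString().ToUpperInvariant(); // 36 chars, same layout as an Etag

var sw = System.Diagnostics.Stopwatch.StartNew();
for (var i = 0; i < 1000 * 1000; i++)
    Etag.Parse2(etags[i % etags.Length]); // swap in Etag.Parse to measure the new version
sw.Stop();

Console.WriteLine("1,000,000 parses took {0:N0} ms", sw.ElapsedMilliseconds);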

Then we sat down and re-wrote it using a lot of somewhat nasty tricks.

private readonly static int[] _asciisOfHexToNum = CreateHexCharsToNumsTable();

private static int[] CreateHexCharsToNumsTable()
{
    // Lookup table indexed by the character code of a hex digit,
    // covering '0'-'9', 'A'-'Z' and 'a'-'z' (hence the size of 'z' + 1).
    var c = new int['z' + 1];
    for (var i = '0'; i <= '9'; i++)
    {
        c[i] = i - '0';
    }
    for (var i = 'A'; i <= 'Z'; i++)
    {
        c[i] = (i - 'A') + 10;
    }
    for (var i = 'a'; i <= 'z'; i++)
    {
        c[i] = (i - 'a') + 10;
    }

    return c;
}

public unsafe static Etag Parse(string str)
{
    if (str == null || str.Length != 36)
        throw new ArgumentException("str cannot be null and must be 36 characters");

    fixed (char* input = str)
    {
        // The '-' separators sit at positions 8, 13, 18 and 23, so the indexes
        // below simply skip over them. Each pair of hex chars becomes a byte
        // via two lookups in the precomputed table, no parsing and no allocations.
        var etag = new Etag();
        int fst = ((byte)(_asciisOfHexToNum[input[0]] * 16 + _asciisOfHexToNum[input[1]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[2]] * 16 + _asciisOfHexToNum[input[3]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[4]] * 16 + _asciisOfHexToNum[input[5]])) << 8 |
            (byte)(_asciisOfHexToNum[input[6]] * 16 + _asciisOfHexToNum[input[7]]);
        int snd = ((byte)(_asciisOfHexToNum[input[9]] * 16 + _asciisOfHexToNum[input[10]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[11]] * 16 + _asciisOfHexToNum[input[12]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[14]] * 16 + _asciisOfHexToNum[input[15]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[16]] * 16 + _asciisOfHexToNum[input[17]]));
        etag.restarts = (uint)snd | ((long)fst << 32);


        fst = ((byte)(_asciisOfHexToNum[input[19]] * 16 + _asciisOfHexToNum[input[20]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[21]] * 16 + _asciisOfHexToNum[input[22]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[24]] * 16 + _asciisOfHexToNum[input[25]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[26]] * 16 + _asciisOfHexToNum[input[27]]));
        snd = ((byte)(_asciisOfHexToNum[input[28]] * 16 + _asciisOfHexToNum[input[29]])) << 24 |
            ((byte)(_asciisOfHexToNum[input[30]] * 16 + _asciisOfHexToNum[input[31]])) << 16 |
            ((byte)(_asciisOfHexToNum[input[32]] * 16 + _asciisOfHexToNum[input[33]])) << 8 |
            ((byte)(_asciisOfHexToNum[input[34]] * 16 + _asciisOfHexToNum[input[35]]));
        etag.changes = (uint)snd | ((long)fst << 32);
        return etag;
    }
}

Here is what we did. We created a translation table, so the value of every possible hex character is precalculated. That means that the cost of actually parsing the string is basically a series of lookups into an array, which is constant time. There are no allocations, nor is there anything expensive going on. On the same machine where the previous code took 17.3 seconds to run, this code can process a million Etags in just 287 milliseconds, or close to 3,500 Etags per millisecond.

I leave the rate of improvement as an exercise for the reader. Also, can you think of any way in which we can improve this code even further? We weren’t able to figure out anything, but you never know.

time to read 3 min | 544 words

Another thing that turned up in the performance work was the Esent vs. Voron issue. We keep testing everything on both, trying to see which one can outdo the other, fixing a hotspot, then trying again. When we ran the YCSB benchmark we also compared Esent and Voron as storage for our databases, and we found that Voron was very good at read operations while Esent was slightly better at write operations. During the YCSB tests we found one of the reasons why Voron was a bit slower than Esent for writing: it was consuming 4 times the expected disk space.

The reason for this high disk space consumption was that the benchmark by default generates documents of exactly 1KB; with metadata, the actual size was 1.1KB. Voron’s internal implementation uses a B+ tree whose leaf pages are 4KB in size, and 1KB was the threshold at which we decide not to store the data in the leaf itself, but to keep only a reference in the leaf and save the value on its own page. We ended up creating a new 4KB page for each 1.1KB document that we saved. The benchmark hit the worst case scenario for our implementation, and caused us to use 4 times more disk space and write 4 times more data than we needed. Changing this threshold reduced the disk space consumption to the expected size, and gave Voron a nice boost.
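To make that concrete, here is a simplified sketch of the decision that was biting us; the names are made up for illustration, this is not Voron’s actual code:

const int PageSize = 4 * 1024;
const int OverflowThreshold = 1024; // the old threshold

static bool ShouldUseOverflowPage(int valueSizeInBytes)
{
    // Values above the threshold are not stored in the 4KB leaf itself;
    // the leaf keeps only a reference and the value gets its own page.
    return valueSizeInBytes > OverflowThreshold;
}

// A 1.1KB YCSB document (about 1,126 bytes) therefore always got a
// dedicated 4KB page: roughly 4x the disk space and 4x the writes.

Raising the threshold lets documents of that size live inside the leaf page itself, which is where the saving came from.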

We also test our software on a wide variety of systems, and with Voron specifically we ran into an annoying issue. Voron is a write ahead log system, and we are very careful to write to the log in a very speedy manner. This is one of the ways in which we are getting really awesome speed for Voron. But when running on a slow I/O system, and putting a lot of load on Voron, we started to see very large stalls after a while. Tracing the issue took a while, but eventually we figured out what was going on. Writing to the log is all well and good, but we need to also send the data to the actual data file at some point.

The way Voron does it, it batches a whole bunch of work, writes it to the data file, then syncs the data file to make sure it is actually persisted on disk. Usually, that isn’t really an issue. But on slow I/O, and especially under load, you get results like this:

Start to sync data file (8:59:52 AM). Written but unsynced data size 309 MB
FlushViewOfFile duration 00:00:13.3482163. FlushFileBuffers duration: 00:00:00.2800050.
End of data pager sync (9:00:05 AM). Duration: 00:00:13.7042229

Note that these are random writes, because we may be writing to any part of the file, but that is still way too long. What was worse, and the reason we actually care, is that we were doing that while holding the transaction lock.

We were able to redesign that part so that even under slow I/O, we take the lock for a very short amount of time, update the in-memory data structure, and then release the lock and spend some quality time gazing at our navel in peace while the I/O proceeds at its own pace, now without blocking anyone else.
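The shape of that change looks roughly like the following sketch; the class, fields and stream usage are made up for illustration, and the real Voron code is considerably more involved:

using System.Collections.Generic;
using System.IO;

// Illustrative sketch only: hold the lock just long enough to update the
// in-memory state, then do the slow writes and the sync outside of it.
class DataFileFlusher
{
    private readonly object _txLock = new object();
    private readonly List<byte[]> _pendingPages = new List<byte[]>();

    public void Flush(Stream dataFile)
    {
        byte[][] toWrite;

        lock (_txLock)
        {
            // short critical section: grab the pending pages and update the
            // in-memory bookkeeping; nothing slow happens while holding the lock
            toWrite = _pendingPages.ToArray();
            _pendingPages.Clear();
        }

        // the slow part (random writes plus the sync) now runs without the
        // transaction lock, so new transactions are not stalled behind it
        foreach (var page in toWrite)
            dataFile.Write(page, 0, page.Length);
        dataFile.Flush(); // the moral equivalent of FlushViewOfFile / FlushFileBuffers
    }
}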

time to read 4 min | 759 words

We have been quite busy lately doing all sorts of chores. The kind of small features that you don’t really notice, but that make a lot of difference over time. One of the ways that we did that was to assign a team for the sole purpose of improving the performance of RavenDB. This is done by writing a lot of stress tests, running under profilers, and in general doing very evil things to the code to get it to work much faster. The following is part of the results that we have found so far, written mostly by Tal.

We decided to use the Yahoo! Cloud Serving Benchmark (AKA YCSB) for that purpose. While YCSB is great at creating stress on the target database servers, it does not offer a complex interface to stress complex scenarios, and it seems to be targeting key value solutions. YCSB basically lets you control some parameters for the workload engine and the implementation of five basic operations: Insert, Update, Read, Scan and Delete of a single record, in a single table/collection.

Note: I think that this is because it focused (at least initially) primarily on the Big Table type of NoSQL solutions.

This very thin interface allows us mainly to stress PUT/GET of documents. The first thing we found is that running with indexes actually slowed us down, since we were “wasting” precious CPU time on indexing and not doing any querying. That is enough to show you the true face of benchmark miracles, since where have you ever seen a database that you didn’t wish to run a query upon? But that is a topic for another post… let’s go back to performance.

The second thing we found was a hotspot in the route data retrieval mechanism of Web API. While Web API routing seems to work like magic, when you get to approximately four hundred endpoints you will start suffering from performance issues retrieving route data. In our case we were spending about 1ms per HTTP request just to retrieve the route data for the request. That may not seem like much, but for 1,500 requests that took 5 seconds to execute, this amounted to 1.5 seconds just for route data retrieval. Please do not try to extrapolate the throughput of RavenDB from those numbers, since we ran this test on a weak machine, and under a profiling tool that slowed it down significantly.

Note: Effectively, for our use case, we were doing an O(400) operation on every request, just to find what code we needed to run to handle this request.

We decided to cache the route data for each request, but just straightforward caching wouldn’t really help in our case. The problem is that we have many unique requests, which would fill the cache with entries that wouldn’t actually be of use.

Our major issue was requests whose route looked like databases/{databaseName}/docs/{*docId}, while we do want to cache routes such as databases/{databaseName}/stats. The problem is that we have both types of routes and they are used quite often, so we ended up with a greatest common route cache system. That means that /databases/Northwind/docs/Orders/order1 and /databases/Northwind/docs/Orders/order2 will go into the same bucket, which is /databases/Northwind/docs.

In that way, we still have to do some work per request, but now we don’t have to go through all four hundred routes to find the relevant ones.
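A hedged sketch of that bucketing idea (the names are made up; the real code hooks into Web API’s route resolution rather than a static dictionary):

using System;
using System.Collections.Concurrent;

// Illustrative only: reduce the number of distinct cache keys by bucketing document routes.
static class RouteBuckets
{
    // cached route data per bucket; the value type is just a stand-in here
    public static readonly ConcurrentDictionary<string, object> Cache =
        new ConcurrentDictionary<string, object>();

    public static string GetCacheKey(string path)
    {
        // /databases/Northwind/docs/Orders/order1 and .../Orders/order2 both
        // map to /databases/Northwind/docs, so they share one cache entry,
        // while routes like /databases/Northwind/stats keep their own.
        var idx = path.IndexOf("/docs/", StringComparison.OrdinalIgnoreCase);
        return idx == -1 ? path : path.Substring(0, idx + "/docs".Length);
    }
}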

What was the performance impact, you wonder? We took off 90% of the route data retrieval time, that is, 0.15 seconds instead of 1.5 seconds for those 1,500 requests, reducing the overall runtime to 3.65 seconds, which is 27% less run time.

Note: Again, don’t pay attention to the actual times, it is the rate of improvement that we are looking at.

As an aside, another performance improvement that we made with regard to routing was route discovery. During startup, we scan the RavenDB assembly for controllers and register the routes. That is all great, as long as you are only doing it once. And indeed, in most production scenarios, you’ll only do that once per server start. That means that the few dozen milliseconds that you spend on this aren’t really worth thinking about. But as it turned out, there was one key scenario that was impacted by this. Tests! In tests, you start and stop RavenDB per test, and reducing this cost is very important to ensuring a pleasant experience. We were also able to optimize the route discovery process, and we actually reduced our cold boot startup time to below what we had in 2.5.

time to read 3 min | 437 words

We get people asking us about the details of RavenDB indexes: why a certain index is busy, what is costly, etc. This is a pretty common set of questions, and while we were always able to give the customer a good answer, I was never truly happy with that. A lot of it was about understanding the system at a very deep level and being able to read internal details that are fairly obscure to an outsider. Like most things that aren’t obvious and easy to use, that bugged me, deeply. Therefore, we spent a few weeks adding internal metrics and exposing everything in a nice and easy fashion.

Let me show you the details from one of our production databases:

[image: indexing activity over time in one of our production databases]

The Y axis shows the different indexes, and the X axis shows time. Note that the top part is for map operations and the bottom part is for reduce operations; you can visually see that the map and reduce are actually running in parallel.

What is even better, you can visually see each individual indexing batch. Let us focus on the long indexing batch in the middle:

[image: zooming in on the long indexing batch in the middle]

As you can see, we are indexing 512 documents, but it took us a really long time to index them. Why is that? We can see that the reason this batch took so long is that the Orders/Stats and Products/Stats indexes took a very long time. Clicking on them, we can see:

[image: details for the Orders/Stats index in this batch]

 

The Orders/Stats index accepted 105 documents and output 1,372 map values. And the costly part of this index was actually committing those (along with the Products/Stats output, which was of a similar size) to the map storage.

Another slow index is exposed here:

[image: details for another slow index]

In this case, we have an index that is big enough that we need to flush it to disk from RAM. That explains things.

I’m really excited about this feature.

time to read 3 min | 416 words

We have been quiet for a while on that front, but that is just because we were heads down and working hard on it. The current status is that we are getting there.

Recently we have been fighting a SIGSEGV (segmentation fault) that appeared out of nowhere and kind of ruined our fun. To be rather more exact, we have been on that bug for over a month. We have been getting great help from Sine Nomine in figuring out the root cause, and today Neale Ferguson finally tracked down the final clues that gave up a simple reproduction.

Take the following code and run it on a Linux machine (in our case, Ubuntu 14.04 with Mono 3.10). It crashes with a really nasty error:

// needs a reference to Mono.Posix for the Syscall / mmap wrappers
using System;
using Mono.Unix.Native;

class MainClass
{
    public unsafe static void Main(string[] args)
    {
        var f = Syscall.mmap(IntPtr.Zero, 1024 * 1024, MmapProts.PROT_WRITE | MmapProts.PROT_READ, MmapFlags.MAP_SHARED | MmapFlags.MAP_ANONYMOUS, -1, 0);

        var p = new byte*[]{ (byte*)f.ToPointer() };

        Syscall.munmap(f,1024*1024);

        Console.WriteLine(p.Length);

        p = null;

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}

The underlying reason, to the best of my understanding at the moment, is that the sgen GC implementation in Mono is inspecting the p array for its values. But instead of just treating each element as an opaque value (because I wouldn’t expect the GC to follow a pointer reference), it is trying to access the value it points to. Since by this time that value is no longer valid, it is accessing an invalid memory location, and dying horribly.

The good news is that once we had a simple repro, the Mono team was really quick on their feet, and this has already been fixed.

time to read 2 min | 335 words

 

One of the new features we just finished working on is the notion of side by side indexes. You can see how to access it in the image below, but what is it?

[image: accessing the side by side index option in the studio]

As usual lately, this is part and parcel of our current focus in operations. The basic notion is simple. You have an index in production that you want to update, but just updating it on the fly isn’t going to work.

The problem is that updating the index forces a complete reindexing of the data, and in most production systems we have enough documents that this can take a while. That means that during the reindexing process, your application would see partial results, and might see that for a while. A common workaround was to add support in your application for changing the index name: you would create a new index, wait for it to become non-stale, and then update the application configuration so you would query the new index.
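For context, that manual workaround looked roughly like the sketch below. The index name, the map and the polling loop are made up for illustration, and I’m assuming the 3.0-era client API (PutIndex, GetStatistics):

// assumes: using System.Linq; using System.Threading; using Raven.Abstractions.Indexing;
var newName = "Orders/Stats/V2"; // hypothetical replacement index name

store.DatabaseCommands.PutIndex(newName, new IndexDefinition
{
    Map = "from o in docs.Orders select new { o.Company, Count = 1 }"
});

// wait until the replacement index has caught up
while (store.DatabaseCommands.GetStatistics().StaleIndexes.Contains(newName))
    Thread.Sleep(500);

// only now update the application configuration to query the new name,
// then delete the old Orders/Stats index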

With the side by side option, this is all handled for you. When you save an index with the side by side option, you get the following options:

[image: the side by side replacement options]

Basically, we create the index under a temporary name and start indexing it, and you can choose when RavenDB will automatically make the switch (note that there is an OR condition between those options, so any one of them that matches will trigger the switch).

This gives you the option of changing an index with no interruption of service, because RavenDB will take care of all of that for you. You can even mark an index as “force to change via side-by-side”, which will protect it from accidental changes.

time to read 4 min | 680 words

Most of the time, RavenDB is being used for OLTP scenarios, and it is doing great work there. However, we have customers that use RavenDB as the source for data processing jobs. Those jobs call for processing all, or most, of the documents in the database.

A typical example would be an ordering system, where each document is an order, and we need to process the orders. You can certainly handle this right now, but that requires you to maintain quite a bit of state in order to do so efficiently. What we have done is to take this entire process and make it very simple to use. I know that this is a bit vague, but let me try showing you some code, and hopefully it will clear things up.

var id = store.Subscriptions.Create(new SubscriptionCriteria
{
    BelongsToCollection = "Orders",
    PropertiesMatch =
    {
        {"Valid", true}
    },
});

Here we create a subscription, along with its configuration. In this case: give me all the orders, but only those which are valid. We can also do a bunch more basic filters like that. The result of this operation is an id, which I can use to open the subscription.

Note that subscriptions are persistent and long lived, so you are expected to hold on to that id and make use of it. Using the subscription id, we can open the subscription:

IObservable<User> subscription = store.Subscriptions.Open<User>(id, new SubscriptionConnectionOptions
{
    IgnoreSubscribersErrors = false,
    BatchOptions = new SubscriptionBatchOptions()
    {
        AcknowledgmentTimeout = TimeSpan.FromMinutes(1),
        MaxDocCount = 1024 * 16,
        MaxSize = 1024 * 1024 * 4
    },
    ClientAliveNotificationInterval = TimeSpan.FromSeconds(10),
});

You can see that we specify some details about the running connection. In particular, we limit the size of a single batch and the heartbeat interval. I’ll touch on the error handling a bit later; first, let us see how we actually get the data. Because the subscription is an IObservable<T> (an IObservable<User> here), you can just utilize Reactive Extensions and subscribe to the incoming stream as you would expect to. And it means that you’ll continue to get documents, even for items that were added after you opened the subscription.
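For example, assuming the Rx ObservableExtensions are referenced, hooking up a handler is just a Subscribe call; the handler itself is made up for illustration:

// Minimal sketch of consuming the subscription with Reactive Extensions.
IDisposable processing = subscription.Subscribe(
    user => Console.WriteLine("Processing " + user),        // called once per matching document
    ex => Console.WriteLine("Subscription error: " + ex));  // see the error handling discussion below

// dispose to stop receiving documents on this connection
// processing.Dispose();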

What is interesting is the error handling scenario. Data subscriptions will ensure that you’ll receive each document at least once, but what happens if there is an error? In that case, it depends on your configuration. In the code above, we don’t allow errors, so you’ll get the same document over and over until you have successfully processed it. If you set IgnoreSubscribersErrors to true, we’ll ignore the errors raised by subscribers.

The nice thing about this is that it works even in the presence of crashes. Under the covers, once you have successfully processed a document, we’ll send a confirmation to the server about it, so if there is a crash, we know we already processed it. If you crashed midway, we’ll just resend you the relevant documents when you open the subscription again.
