Ayende @ Rahien

It's a girl

Interview question: fix the index

This is something that goes into the “what to ask a candidate” pile.

Given the following class:

public class Indexer
{
    private Dictionary<string, List<string>> terms = 
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    public void Index(string docId, string text)
    {
        var words = text.Split();
        foreach (var term in words)
        {
            List<string> val;
            if (terms.TryGetValue(term, out val) == false)
            {
                val = new List<string>();
                terms[term] = val;
            }
            val.Add(docId);
        }
    }

    public List<string> Query(string term)
    {
        List<string> val;
        terms.TryGetValue(term, out val);
        return val ?? new List<string>();
    }
}

This class has the following tests:

public class IndexTests
{
    [Fact]
    public void CanIndexAndQuery()
    {
        var index = new Indexer();
        index.Index("users/1", "Oren Eini");
        index.Index("users/2", "Hibernating Rhinos");

        Assert.Contains("users/1", index.Query("eini"));
        Assert.Contains("users/2", index.Query("rhinos"));
    }

    [Fact]
    public void CanUpdate()
    {
        var index = new Indexer();
        index.Index("users/1", "Oren Eini");
        //updating
        index.Index("users/1", "Ayende Rahien");

        Assert.Contains("users/1", index.Query("Rahien"));
        Assert.Empty(index.Query("eini"));
    }
}

The first test passes, but the second fails.

The task is to get the CanUpdate test to pass, while keeping memory utilization and CPU costs as small as possible. You can change the internal implementation of the Indexer as you see fit.

After CanUpdate is passing, implement a Delete(string docId) method.
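For reference, here is one hedged sketch of a possible direction (certainly not the only acceptable answer): keep a reverse map from each document id to the terms it was indexed under, so an update or a delete can first remove the stale entries. Query stays exactly as it was; the names below are illustrative.

// A sketch of one possible approach, not the only valid answer: keep a
// reverse map from docId to the terms it was indexed under, so an update
// or delete can cheaply remove the old entries before adding new ones.
private Dictionary<string, List<string>> terms =
    new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
private Dictionary<string, string[]> docs =
    new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase);

public void Index(string docId, string text)
{
    Delete(docId); // drop any previous version of this document

    var words = text.Split();
    foreach (var term in words)
    {
        List<string> val;
        if (terms.TryGetValue(term, out val) == false)
        {
            val = new List<string>();
            terms[term] = val;
        }
        val.Add(docId);
    }
    docs[docId] = words;
}

public void Delete(string docId)
{
    string[] oldTerms;
    if (docs.TryGetValue(docId, out oldTerms) == false)
        return;

    foreach (var term in oldTerms)
    {
        List<string> val;
        if (terms.TryGetValue(term, out val))
        {
            val.Remove(docId);
            if (val.Count == 0)
                terms.Remove(term); // avoid keeping dead terms around
        }
    }
    docs.Remove(docId);
}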


Timeouts, TCP and streaming operations

We got a bug report in the RavenDB mailing list that was interesting to figure out.  The code in question was:

int i = 0;
foreach(var product in GetAllProducts(session)) // GetAllProducts is implemented using streaming
{
  ++i;
  if (i > 1000)
  {
    i = 0;
    Thread.Sleep(1000);
  }
}

This code would cause a timeout error to occur after a while. The question is why? We can assume that this code is running in a console application, and it can take as long as it wants to process things.

And the server is not impacted by what the client is doing, so why do we have a timeout error here? The quick answer is that we are filling up the buffers.

GetAllProducts is using the RavenDB streaming API, which pushes the results of the query to the client as soon as we have anything. That lets us parallelize work on both server and client, and avoid having to hold everything in memory.

However, if the client isn't processing things fast enough, we run into an interesting problem. The server is sending the data to the client over TCP. The client machine will get the results, buffer them, and hand them to the client application. The client will read them from the TCP buffers, then do some work (in this case, just sleeping). Because the rate at which the client is processing items is much lower than the rate at which we are sending them, the TCP buffers become full.

At this point, the client machine is going to start dropping TCP packets. It doesn't have any more room to put the data in, and the server will send it again anyway. And that is what the server does, assuming that the packets were lost over the network. However, that will only hold up for a while, because if the client isn't going to recover quickly, the server will decide that it is down and close the TCP connection.

At this point, there isn't any more data coming from the server, so the client will catch up with the buffered data, and then wait for the server to send more. That isn't going to happen, because the server already considers the connection lost. And eventually the client will time out with an error.

A streaming operation requires us to process the results quickly enough to not jam the network.
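A hedged sketch of one way to cope when the per-item work is slow: page through the results with bounded queries instead of keeping a single stream open while we sleep. The names (documentStore, Product, pageSize) are illustrative assumptions, not the original code, and Skip-based paging has its own caveats if the data set changes underneath you.

const int pageSize = 1024;
int start = 0;
while (true)
{
    List<Product> page;
    using (var session = documentStore.OpenSession()) // fresh session per page
    {
        page = session.Query<Product>()
                      .Skip(start)
                      .Take(pageSize)
                      .ToList(); // the query is fully materialized before the slow work starts
    }

    if (page.Count == 0)
        break;

    foreach (var product in page)
    {
        // do the actual (slow) processing here
    }

    start += page.Count;
    Thread.Sleep(1000); // throttling between pages no longer stalls an open TCP stream
}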

RavenDB also has the notion of subscriptions. With those, we require an explicit confirmation from the client before sending the next batch, so a slow client isn't going to cause issues.


RavenDB in Siberia

A good chunk of the RavenDB Core Team is going to be in CodeFest in Novosibirsk this weekend, including yours truly.

We are going to be handing out a lot of cool stuff, and we got some really nice things to talk about. And yes, I’ll keep the suspense until I get to meet people face to face.


Merge related entities using Multi Map/Reduce

A question came up in the mailing list regarding searching across related entities. In particular, the scenario is the notion of a player and characters in an MMORPG game.

Here is what a Player document looks like:

{
  "Id": "players/bella@dona.self",
  "Name": "Bella Dona",
  "Billing": [ { ... }, { ... }],
  "Adult": false,
  "LastLogin": "2015-03-11"
}

And a player has multiple character documents:

{
  "Id": "characters/1234",
  "Name": "Black Dona",
  "Player": "players/bella@dona.self",
  "Race": "DarkElf",
  "Level": 24,
  "XP": 283831,
  "HP": 438,
  "Skills": [ { ... } , { ... } ]
}
{
  "Id": "characters/1321",
  "Name": "Blue Bell",
  "Player": "players/bella@dona.self",
  "Race": "Halfling",
  "Level": 2,
  "XP": 2831,
  "HP": 18,
  "Skills": [ { ... } , { ... } ]
}
{
  "Id": "characters/1143",
  "Name": "Brown Barber",
  "Player": "players/bella@dona.self",
  "Race": "WoodElf",
  "Level": 44,
  "XP": 983831,
  "HP": 718,
  "Skills": [ { ... } , { ... } ]
}

And what we want is an output like this:

{
    "Id" : "players/bella@dona.self",
    "Adult": false,
    "Characters" : [
        { "Id": "characters/1234",  "Name": "Black Dona" },
        { "Id": "characters/1321",  "Name": "Blue Bell" },
        { "Id": "characters/1143",  "Name": "Brown Barberl" },
    ]
}

Now, a really easy way to do that would be to issue two queries. One to find the player, and another to find its characters. That is actually the much preferred method to do this. But let us say that we need to do something that uses both document types.

Give me all the players who aren't adults but have a character over level 40, for example. In order to do that, we are going to use a multi map/reduce index to merge the two together. Here is what it is going to look like:

// map - Players

from player in docs.Players
select new 
{
  Player = player.Id,
  Adult = player.Adult,
  Characters = new object[0]
}

// map - Characters

from character in docs.Characters
select new
{
   character.Player,
   Adult = false,
   Characters = new [] 
   { 
     new { character.Id, character.Name }
   }
}

// reduce

from result in results
group result by result.Player into g
select new
{
   Player = g.Key,
   Adult = g.Any(x=>x.Adult),
   Characters = g.SelectMany(x=>x.Characters)
}

This gives you all the details, in a single place. And you can start working on queries from there.
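For illustration, here is a hedged sketch of how such an index might be defined from the .NET client using AbstractMultiMapIndexCreationTask. The Result class, the Player/Character document classes, and the namespace are assumptions for the sake of a self-contained example, not the exact code from the mailing list.

// Sketch only; class shapes are assumed from the JSON documents above.
using System.Linq;
using Raven.Client.Indexes;

public class Player { public string Id { get; set; } public string Name { get; set; } public bool Adult { get; set; } }
public class Character { public string Id { get; set; } public string Name { get; set; } public string Player { get; set; } public int Level { get; set; } }

public class Players_WithCharacters : AbstractMultiMapIndexCreationTask<Players_WithCharacters.Result>
{
    public class Result
    {
        public string Player { get; set; }
        public bool Adult { get; set; }
        public object[] Characters { get; set; }
    }

    public Players_WithCharacters()
    {
        AddMap<Player>(players => from player in players
                                  select new
                                  {
                                      Player = player.Id,
                                      Adult = player.Adult,
                                      Characters = new object[0]
                                  });

        AddMap<Character>(characters => from character in characters
                                        select new
                                        {
                                            character.Player,
                                            Adult = false,
                                            Characters = new[] { new { character.Id, character.Name } }
                                        });

        Reduce = results => from result in results
                            group result by result.Player into g
                            select new
                            {
                                Player = g.Key,
                                Adult = g.Any(x => x.Adult),
                                Characters = g.SelectMany(x => x.Characters)
                            };
    }
}

From there, queries can go through the index (for example, session.Query<Players_WithCharacters.Result, Players_WithCharacters>()), filtering on the reduced fields.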


Taking full dumps for big IIS apps

If your application is running on IIS, you are getting quite a lot for free. To start with, monitoring and management tools are right there out of the box. You are also getting some… other effects.

In particular, we had RavenDB running inside IIS exhibiting a set of performance problems on a couple of nodes (and just on those nodes). We suspected that this might be related to memory usage, and we wanted to take a full process dump so we could analyze it offline.

Whenever we tried doing that, however, the process would just restart. The problem was that to reproduce this we had to wait for a particular load pattern to happen after the database had been live for about 24 hours, so taking the dump at the right time was crucial. Initially we thought we had used the wrong command, or something like that. The maddening thing was that when we tried it on the same machine, using the same command, without the performance issue present, it just worked (and told us nothing).

Eventually we figured out that the problem was in IIS. Or, to be rather more exact, IIS was doing its job.

When the performance problem happened, the memory usage was high. We then needed to take a full process dump, which meant that we had to write a lot. IIS didn’t hear from the worker process during that time (since it was currently being dumped), and it killed it, creating a new one.

The solution was to ask IIS not to do that; the configuration is available in the advanced settings for the application pool. Note that just changing that would force IIS to restart the process, which was another major annoyance.

image


QCon London and In The Brain talk – Performance Optimizations in the wild

The RavenDB Core Team is going to be in the QCon London conference this week, so if you are there, stop by our booth, we got a lot of cool swag to give out and some really cool demos.

In addition to that, on Thursday I’m going to be giving an In The Brain talk about Performance Optimizations in the Wild, talking about the kind of performance work we have been doing recently.

The results of this work can be shown on the following graph:

image

Come to the talk to hear all about the details and what we did to get things working.


That ain’t going to take you anywhere

As part of our usual work routine, we field customer questions and inquiries. A pretty common one is to take a look at their system to make sure that they are making good use of RavenDB.

Usually, this involves going over the code with the team, and making some specific recommendations. Merge those indexes, re-model this bit to allow for this widget to operate more cleanly, etc.

Recently we had such a review in which what I ended up saying is: “Buy a bigger server, hope this works, and rewrite this from scratch as fast as possible”.

The really annoying thing is that someone who was quite talented has obviously spent a lot of time doing a lot of really complex things to end up where they are now. It strongly reminded me of this image:

image

At this point, you can have the best horse in the world, but the only thing that will happen if it runs is that you are going to be messed up.

What was so bad? Well, to start with, the application was designed to work with a dynamic data model. That is probably also why RavenDB was selected, since that is a great choice for dynamic data.

Then the designers sat down and created the following system of classes:

public class Table
{
	public Guid TableId {get;set;}
	public List<FieldInformation> Fields {get;set;}
	public List<Reference> References {get;set;}
	public List<Constraint> Constraints {get;set;}
}

public class FieldInformation
{
	public Guid FieldId {get;set;}
	public string Name {get;set;}
	public string Type {get;set;}
	public bool Required {get;set;}
}

public class Reference
{
	public Guid ReferenceId {get;set;}
	public string Field {get;set;}
	public Guid ReferencedTableId {get;set;}
}

public class Instance
{
	public Guid InstanceId {get;set;}
	public Guid TableId {get;set;}
	public List<Guid> References {get;set;}
	public List<FieldValue> Values {get;set;}
}

public class FieldValue
{
	public Guid FieldId {get;set;}
	public string Value {get;set;}
}

I'll let you draw your own conclusions about what the documents looked like, or just how many calls you needed to load a single entity instance.

For that matter, it wasn’t possible to query such a system directly, obviously, so they created a set of multi-map/reduce indexes that took this data and translated that into something resembling a real entity, then queried that.

But the number of documents, indexes and the sheer travesty going on meant that actually:

  • Saving something to RavenDB took a long time.
  • Querying was really complex.
  • The number of indexes was high.
  • Just figuring out what is going on in the system was nigh impossible without a map, a guide and a lot of good luck.

Just to cap things off, this is a .NET project, and in order to connect to RavenDB they used direct REST calls using HttpClient, blithely ignoring all the man-decades that were spent in creating a good client side experience and integration. For example, they made no use of Etags or Not-Modified-Since, so a lot of the things that RavenDB can do (even under such… hardship) to make things better weren't available, because the client code wouldn't cooperate.

I don’t generally say things like “throw this all away”, but there is no mid or long term approach that could possibly work here.

Lambda methods and implicit context

The C# compiler is lazy, which is usually a very good thing, but that can also give you some issues. We recently tracked down a memory usage issue to code that looked roughly like this.

var stats = new PerformatStats
{
    Size = largeData.Length
};
stats.OnCompletion += () => this.RecordCompletion(stats);

Write(stats, o =>
{
    var sp = new Stopwatch();
    foreach (var item in largeData)
    {
        sp.Restart();
        // do something with this
        stats.RecordOperationDuration(sp);
    }
});

On the surface, this looks good. We are only using largeData for a short while, right?

But behind the scene, something evil lurks. Here is what this actually is translated to by the compiler:

__DisplayClass3 cDisplayClass3_1 = new __DisplayClass3
{
    __this = this,
    largeData = largeData
};
cDisplayClass3_1.stats = new PerformatStats { Size = cDisplayClass3_1.largeData.Length };

cDisplayClass3_1.stats.OnCompletion += new Action(cDisplayClass3_1, __LambdaMethod1__);

Write(cDisplayClass3_1.stats, new Action(cDisplayClass3_1, __LambdaMethod2__));

You need to pay special attention to what is going on. We need to maintain the local state of the variables, so the compiler lifts the local variables into an object (called __DisplayClass3).

Creating spurious objects is something that we want to avoid, so the C# compiler says: “Oh, I have two lambdas in this call that need access to the local variables. Instead of creating two objects, I can create just a single one, and share it among both calls, thereby saving some space”.

Unfortunately for us, there is a slight issue here. The lifetime of the stats object is pretty long (we use it to report stats). But we also hold a reference to the completion delegate (we use that to report on stuff later on). Because the completion delegate holds the same lifted closure object, and that object holds the large data, we ended up holding a lot of stuff in memory far beyond the time it was useful.

The annoying thing is that it was pretty hard to figure out, because we were looking at the source code, and the lambda that we know is likely to be long running doesn't look like it is going to hold a reference to the largeData object.

Ouch.
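For illustration, here is a hedged sketch of one way to break the shared closure (the type and method names are the same illustrative ones used above, not the real code): subscribe the long-lived completion delegate from a separate method, so its generated display class only captures what it actually needs.

// Sketch: isolate what each lambda captures so that the long-lived completion
// delegate does not keep largeData alive. All names are illustrative.
public class ReportingSample
{
    public void Process(byte[] largeData)
    {
        var stats = new PerformatStats
        {
            Size = largeData.Length
        };
        RegisterCompletion(stats); // closure created in a method that never sees largeData

        Write(stats, o =>
        {
            // this closure captures largeData (and stats), but it is short lived
            var sp = new Stopwatch();
            foreach (var item in largeData)
            {
                sp.Restart();
                // do something with this
                stats.RecordOperationDuration(sp);
            }
        });
    }

    private void RegisterCompletion(PerformatStats stats)
    {
        // the compiler generates a separate display class for this method,
        // holding only 'stats' and 'this'
        stats.OnCompletion += () => this.RecordCompletion(stats);
    }
}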

Linux, Debts and Out Of Memory Killer

Imagine that you go to the bank, and ask for a $100,000 mortgage. The nice guy in the bank agrees to lend you the money, and since you need to pay that in 5 installments, you take $15,000 to the contractor, and leave the rest in the bank until it is needed. The bank is doing brisk business, and promises a lot of customers that they can get their mortgage from the bank. Since most of the mortgages are also taken in installments, the bank never actually has enough money to hand over to all the borrowers. But it makes do.

Until one dark day when you come to the bank and ask for the rest of the money, because it is time to install the kitchen cabinets, and you need to pay for that. The nice man in the bank tells you to wait a bit, and goes to see if they have any money. At this point, it would be embarrassing to tell you that they don't have any money to give you, because they over committed themselves. The nice man from the bank murders you and buries your body in the desert, to avoid you complaining that you didn't get the money that you were promised. Actually, the nice man might go ahead and kill someone else (robbing them in the process), and then give you their money. You go home happy to your blood stained kitchen cabinets.

That is how memory management works in Linux.

After this dramatic opening, let us get down to what is really going on. Linux has a major problem. Its process model means that it is stuck up a tree and the only way down is via free fall. Whenever a process wants to create another process, the standard method in Linux is to call fork() and then call execv() to execute the new binary. The problem here is what fork() does. It needs to copy the entire process state to the new process. That includes all memory, handles, registers, etc.

Let us assume that we have a process that allocated 1GB of memory for reading and writing, and then called fork(). The way things are set up, it is pretty cheap to create the new process, all we need to do is duplicate the kernel data structures and we are done. However, what happens to the memory that the process allocated? The fork() call requires that both processes will have access to that memory, and also that both of them may modify it. That means that we have a copy on write situation. Whenever one of the processes modifies the memory, it forces the OS to copy that piece of memory to another physical memory location and remap the virtual addresses.

This allows the developer to do some really cool stuff. Redis implemented its backup strategy via the fork() call. By forking and then dumping the in memory process state to disk it can get a consistent snapshot of the system with almost no code. It is the OS that is responsible for maintaining that invariant.

Speaking of invariants, it also means that there is absolutely no way that Linux can manage memory properly. If we have 2 GB of RAM on the machine, and we have a 1GB process that fork()-ed, what is going to happen? Well, it was promised 1 GB of RAM, and it got that. And it was also promised by fork() that both processes will be able to modify the full 1GB of RAM. If we also have some other processes taking memory (and assuming no swap for the moment), that pretty much means that someone is going to end up holding the dirty end of the stick.

Now, Linux has a configuration option that would prevent this (vm.overcommit_memory = 2, along with the overcommit ratio; that isn't really important, I'm including it here for the nitpickers, and yes, I'm aware that you can set oom_adj = –17 to protect yourself from this issue, that isn't the point). This tells Linux that it shouldn't over commit. In such cases, the fork() call would fail, and you'll be left with an effectively crippled system. So, we have the potential for a broken invariant. What is going to happen now?

Well, Linux promised you memory, and after exhausting all of the physical memory, it will start paging to the swap file. But that can be exhausted too. That is when the Out Of Memory Killer gets to play, and it takes an axe and starts choosing a likely candidate to be mercilessly murdered. The “nice” thing about this is that there is no control over that, and you might be a perfectly well behaved process that the OOM killer just doesn't like this Monday, so buh-bye!

Looking around, it seems that we aren’t the only one that had run head first into this issue. The Oracle recommendation is to set things up to panic and reboot the entire machine when this happens, and that seems… unproductive.

The problem is that as a database, we aren't really in control of how much we allocate, and we rely on the system to tell us when we do too much. Linux has no facility to do things like warn applications that memory is low, or even let us know by refusing to allocate more memory. Both are things that we already support, and both would be very helpful.

That is quite annoying.


Picture of the day, Rhino & Raven

We have an amateur photographer in the office, who likes to arrange the rhinos in the office (now exceeding 100, I think) in various ways.

I really like this picture.


Buffer Managers, production code and alternative implementations

We are porting RavenDB to Linux, and as such, we run into a lot of… interesting issues. Today we ran into a really annoying one.

We make use of the BufferManager class inside RavenDB to reduce memory allocations. On the .Net side of things, everything works just fine, and we never really had any issues with it.

On the Mono side of things, we started getting all sorts of weird errors. From ArgumentOutOfRangeException to NullReferenceException to just plain weird stuff. That was the time to dig in and look into what is going on.

On the .NET side of things, the BufferManager implementation is based on a selection criterion between large (more than 85Kb) and small buffers. For large buffers, there is a single large pool that is shared among all the users of the pool. For small buffers, the BufferManager uses a pool per active thread as well as a global pool, etc. In fact, looking at the code we see that it is really nice, and a lot of effort has been made to harden it and make it work nicely for many scenarios.

The Mono implementation, on the other hand, decides to blithely discard the API contract by ignoring the maximum buffer pool size, seemingly because “no user code is designed to cope with this”. Considering the fact that RavenDB is certainly designed to cope with that, I'm somewhat insulted, but it seems par for the course for Linux, where “memory is infinite until we kill you”* is the way to go.

But what is far worse is that this class is absolutely not thread safe. That was a lot of fun to discover. Considering that this piece of code is pretty central for the entire WCF stack, I’m not really sure how that worked. We ended up writing our own BufferManager impl for Mono, to avoid those issues.

* Yes, somewhat bitter here, I’ll admit. The next post will discuss this in detail.
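For illustration, here is a minimal sketch of the shape such a replacement can take (this is not our actual implementation): size-bucketed pools backed by concurrent collections, with a cap on how much each bucket retains.

// Minimal sketch of a thread-safe buffer manager: buffers are pooled in
// power-of-two size buckets, each bucket capped so the pool cannot grow
// without bound. This is an illustration, not RavenDB's actual code.
using System.Collections.Concurrent;

public class SimpleBufferManager
{
    private readonly ConcurrentDictionary<int, ConcurrentBag<byte[]>> pools =
        new ConcurrentDictionary<int, ConcurrentBag<byte[]>>();
    private const int MaxBuffersPerBucket = 64;

    public byte[] TakeBuffer(int size)
    {
        int bucket = RoundUpToPowerOfTwo(size);
        var pool = pools.GetOrAdd(bucket, _ => new ConcurrentBag<byte[]>());

        byte[] buffer;
        if (pool.TryTake(out buffer))
            return buffer;

        return new byte[bucket];
    }

    public void ReturnBuffer(byte[] buffer)
    {
        var pool = pools.GetOrAdd(buffer.Length, _ => new ConcurrentBag<byte[]>());
        if (pool.Count < MaxBuffersPerBucket) // approximate cap, races here are harmless
            pool.Add(buffer);
        // otherwise just drop the buffer and let the GC reclaim it
    }

    private static int RoundUpToPowerOfTwo(int size)
    {
        int bucket = 16;
        while (bucket < size)
            bucket *= 2;
        return bucket;
    }
}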

Long running async and memory fragmentation

We are working on performance a lot lately, but performance isn't just an issue of how fast you can do something, it is also an issue of how many resources we use while doing that. One of the things we noticed was that we were using more memory than we would like, even after we had freed the memory we were using. Digging into the memory usage, we found that the problem was that we were suffering from fragmentation inside the managed heap.

More to the point, this isn’t a large object heap fragmentation, but actually fragmentation in the standard heap. The underlying reason is that we are issuing a lot of outstanding async I/O requests, especially to serve things like the Changes() API, wait for incoming HTTP requests, etc.

Here is what this looks like inside dotProfiler.

image

As you can see, we are actually using almost no memory, but heap fragmentation is killing us in terms of memory usage.

Looking deeper, we see:

image

We suspect that the issue is that we have pinned instances that we sent to async I/O, and that matches what we have found elsewhere about this issue, but we aren't really sure how to deal with it.

Ideas are more than welcome.
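One commonly suggested mitigation (sketched here as an assumption, not something we have settled on) is to allocate the buffers handed to async I/O once, up front, as one large slab that lands in the large object heap, and hand out segments of it, so the pinned memory never sits in the middle of the compactable heap.

// Sketch: pre-allocate one large buffer (well above 85,000 bytes, so it goes
// to the LOH, which is not compacted anyway) and hand out fixed-size segments
// of it for async I/O, instead of pinning many small, short-lived byte arrays.
using System;
using System.Collections.Concurrent;

public class PinnedSegmentPool
{
    private const int SegmentSize = 4096;
    private const int SegmentCount = 256;

    private readonly byte[] slab = new byte[SegmentSize * SegmentCount];
    private readonly ConcurrentStack<int> freeOffsets = new ConcurrentStack<int>();

    public PinnedSegmentPool()
    {
        for (int i = 0; i < SegmentCount; i++)
            freeOffsets.Push(i * SegmentSize);
    }

    public ArraySegment<byte> Rent()
    {
        int offset;
        if (freeOffsets.TryPop(out offset) == false)
            throw new InvalidOperationException("Pool exhausted"); // or allocate another slab
        return new ArraySegment<byte>(slab, offset, SegmentSize);
    }

    public void Return(ArraySegment<byte> segment)
    {
        freeOffsets.Push(segment.Offset);
    }
}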

.NET Packaging mess

In the past few years, we had:

  • .NET Full
  • .NET Micro
  • .NET Client Profile
  • .NET Silverlight
  • .NET Portable Class Library
  • .NET WinRT
  • Core CLR
  • Core CLR (Cloud Optimized)*
  • MessingWithYa CLR

* Can’t care enough to figure out if this is the same as the previous one or not.

In each of those cases, they offered similar, but not identical API and options. That is completely ignoring the versioning side of things, where we have .NET 2.0 (1.0 finally died a while ago), .NET 3.5, .NET 4.0 and .NET 4.5. I don't think that something can be done about versioning, but the packaging issue is painful.

Here is a small example why:

image

In each case, we need to subtly tweak the system to accommodate the new packaging option. This is pure additional cost to the system, with zero net benefit. Each time that we have to do that, we add a whole new dimension to the testing and support matrix, leaving aside the fact that the complexity of the solution is increasing.

I wouldn’t mind it so much, if it weren’t for the fact that a lot of those are effectively drive-bys, it feels. Silverlight took a lot of effort, and it is dead. WinRT took a lot of effort, and it is effectively dead.

This adds a real cost in time and effort, and it is hurting the platform as a whole.

Now users are running into issues with the Core CLR not supporting stuff that we use. So we need to rip out MEF from some of our code, and implement it ourselves just to get things in the same place as before.

Excerpts from the RavenDB Performance team report: Optimizing Compare – The circle of life (a post-mortem)

Note, this post was written by Federico.

I want first to give special thanks to Thomas, Tobi, Alex and all the others that have made so many interesting comments and coded alternative approaches. It is a great thing to have so many people looking over your work and pointing out interesting facts that in the heat of the battle with the code we could easily overlook.

It has been just three weeks since I wrote this series of posts, and suddenly three weeks seems like a lot of time. There have been very interesting discussions regarding SSE, optimization, different approaches to optimize specific issues, and lowering measurement errors, that I was intending to write down in a follow up post. However, there was an earth shattering event this week that made the post I wanted to write a moot exercise.

Let’s first recap how we ended up optimizing compares.

At first we were using memcmp via P/Invoke. Everything was great for a while, until we eventually hit the wall and needed all the extra juice we could get. Back in mid-2014 we performed the first optimization round, where for the first time we introduced a fully managed, pointer-based solution. That solution served us well, until we started to optimize other paths and really stress RavenDB 3.0.

In the second round, which eventually spawned this series, we introduced branch-tables, bandwidth optimizations, and we had to resort to every trick in the book to convince the JIT to emit pretty specific assembler code.

And here we are. 3 days ago, on the 3rd of February, Microsoft released CoreCLR for everyone to dig into. For everyone interested in performance it is a one in a million opportunity. Being able to see how the JIT works, how high-performance battle tested code is written (not looking at it in assembler) is quite an experience.

In there, specifically in the guts of the mscorlib source, we noticed a couple of interesting comments, one of which said:

// The attributes on this method are chosen for best JIT performance.

// Please do not edit unless intentional.

Needless to say, that caught our attention. So we did what we usually do: just go and try it.

The initial results were “interesting” to say the least. Instantaneously we could achieve faster speed-ups for our copy routines (more on that in a future series --- now I feel lucky I hadn't written more than the introduction). In fact, the results were “so good” that we decided to give it a try for compare; if we could at least use the SSE optimized native routine for big memory compares we would be in a great spot.

The result of our new tests? Well, we were in for a big surprise.

We had the most optimized routine all along. The code in question:

[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl, SetLastError = false)]
[SuppressUnmanagedCodeSecurity]
[SecurityCritical]
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
public static extern int memcmp(byte* b1, byte* b2, int count);

We just needed to be a bit more permissive, aka disable P/Invoke security checks.

From the MSDN documentation for SuppressUnmanagedCodeSecurity:

This attribute can be applied to methods that want to call into native code without incurring the performance loss of a run-time security check when doing so. The stack walk performed when calling unmanaged code is omitted at run time, resulting in substantial performance savings.

Well, I must say substantial is an understatement. We are already running in full trust, and memcmp is a safe function by default, so no harm is done. The same cannot be said of memcpy, where we have to be extra careful. But that is another story.

  • Size: 2 Native (with checks): 353 Managed: 207 Native (no checks): 217 - Gain: -4,608297%
  • Size: 4 Native (with checks): 364 Managed: 201 Native (no checks): 225 - Gain: -10,66667%
  • Size: 8 Native (with checks): 354 Managed: 251 Native (no checks): 234 - Gain: 7,26496%
  • Size: 16 Native (with checks): 368 Managed: 275 Native (no checks): 240 - Gain: 14,58334%
  • Size: 32 Native (with checks): 426 Managed: 366 Native (no checks): 276 - Gain: 32,6087%
  • Size: 64 Native (with checks): 569 Managed: 447 Native (no checks): 384 - Gain: 16,40625%
  • Size: 128 Native (with checks): 748 Managed: 681 Native (no checks): 554 - Gain: 22,92418%
  • Size: 256 Native (with checks): 1331 Managed: 1232 Native (no checks): 1013 - Gain: 21,61895%
  • Size: 512 Native (with checks): 2430 Managed: 2552 Native (no checks): 1956 - Gain: 30,47035%
  • Size: 1024 Native (with checks): 4813 Managed: 4407 Native (no checks): 4196 - Gain: 5,028594%
  • Size: 2048 Native (with checks): 8202 Managed: 7088 Native (no checks): 6910 - Gain: 2,575982%
  • Size: 8192 Native (with checks): 30238 Managed: 23645 Native (no checks): 23490 - Gain: 0,6598592%
  • Size: 16384 Native (with checks): 59099 Managed: 44292 Native (no checks): 44041 - Gain: 0,5699277%

The gains are undeniable, especially where it matters the most for us (16 – 64 bytes). As you can see, the managed optimizations are really good; at the 2048 level and up we are able to compete with the native version (accounting for the P/Invoke, that is).

The .Compare() code ended up looking like this:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int Compare(byte* bpx, byte* bpy, int n)
{
    switch (n)
    {
        case 0: return 0;
        case 1: return *bpx - *bpy;
        case 2:
            {
                int v = *bpx - *bpy;
                if (v != 0)
                    return v;
 
                bpx++;
                bpy++;
 
                return *bpx - *bpy;
            }
             
        default: return StdLib.memcmp(bpx, bpy, n);
    }
}
 

With these results:

  • Size: 2 - Managed: 184 Native (no checks): 238
  • Size: 4 - Managed: 284 Native (no checks): 280
  • Size: 8 - Managed: 264 Native (no checks): 257
  • Size: 16 - Managed: 302 Native (no checks): 295
  • Size: 32 - Managed: 324 Native (no checks): 313

The bottom line here is, until we are allowed to generate SSE2/AVX instructions in .NET I doubt we will be able to do a better job, but we are eagerly looking forward to the opportunity to try it out when RyuJIT is released.  *

* We could try to write our own optimized routine in C++/assembler (if that is even possible), but that is out of scope at the moment (we have bigger fish to fry, performance-wise).

Cool index batch stats

I talked about this in the past, but since we just used it to diagnose a real live problem, I can't help but do that again; take a look:

image

I especially like how it shows the actual costs, including using parallel computing.


Crunching the numbers

I’m running the numbers, and I didn’t really believe them. So I put them in a chart, and I’m still not sure that I trust them.

Here you can see our income per month for the past year. Yes, July was pretty bad with everyone on vacation and no one is buying.

March 14 was pretty good, with people buying RavenConf tickets and a lot of excitement around that. September and October were great. In fact, October was our best month ever.

And then we get to February 2015. In other words, this month. In other words, this month where we still have 10 more days to make sales.

image

I'm pretty happy. But I wish that I had hard numbers on why.

Excerpts from the RavenDB Performance team report: JSON & Structs in Voron

What seems to be a pretty set-in-stone rule of thumb is that when building a foolproof system, you will also find a better fool. In this case, the fool was us.

In particular, during profiling of our systems, we realized that we did something heinous, something so bad that it deserves a true face palm:

In essence, we spent all this time to create a really fast key/value store with Voron. We spent man-years like crazy to get everything working just so… and then came the time to actually move RavenDB to use Voron. That was… a big task, easily as big as writing Voron in the first place. The requirements RavenDB makes of its storage engine are anything but trivial. But we got it working. And it worked well. Initial performance testing showed that we were quite close to the performance of Esent, within 10% – 15%, which was good enough for us to go ahead with.

Get things working, get them working right and only then get them working fast.

So we spent a lot of time making sure that things worked, and worked right.

And then the time came to figure out if we could make things faster. Remember, Voron is heavily based around the idea of memory mapping and directly accessing values at very little cost. This is important, because we want to use Voron for a whole host of stuff. That said, in RavenDB we use it to store JSON documents, so we obviously store the JSON in it. There is also quite a lot of bookkeeping information that we keep track of.

In particular, let us focus on the indexing stats. We need to keep track of the last indexed etag per index, when it was changed, number of successful indexing attempts, number of failed indexing attempts, etc.

Here is how we updated those stats:

image

As you can imagine, a lot of the time was actually spent just serializing and deserializing the data back &amp; forth from JSON. In fact, a lot of time was spent doing just that.

Now, indexing status needs to be queried on every single query, so that was a high priority for us to fix. We did this by not storing the data as JSON, but storing the data in its raw form directly into Voron.

Voron gives us a memory range of size X, we put all the relevant data into that memory, and we pull the data directly from the raw memory. No need for a serialization/deserialization step.

There is a minor issue about the use of strings. Using a fixed size structure (like the indexing stats) is really easy, we just read/write to that memory. When dealing with variable length data (such as strings), you have a harder time of it, but we dealt with that as well, by defining a proper way to encode string lengths and read them back.
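To make the idea concrete, here is a minimal sketch (assumed field names and layout, not RavenDB's actual code) of writing a fixed size stats structure plus one length-prefixed string directly into raw memory:

// Minimal sketch, assuming we are handed a pointer to a large-enough buffer.
// Field names and layout are illustrative only.
using System;
using System.Runtime.InteropServices;
using System.Text;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct IndexingStatsHeader
{
    public long LastIndexedEtag;
    public long LastIndexedAtTicks;
    public int SuccessfulAttempts;
    public int FailedAttempts;
}

public static unsafe class StatsWriter
{
    public static void Write(byte* dest, IndexingStatsHeader stats, string indexName)
    {
        // Fixed size part: copy the struct bytes directly, no serialization step.
        *(IndexingStatsHeader*)dest = stats;
        dest += sizeof(IndexingStatsHeader);

        // Variable size part: a length prefix, then the UTF-8 bytes of the string.
        var nameBytes = Encoding.UTF8.GetBytes(indexName);
        *(int*)dest = nameBytes.Length;
        dest += sizeof(int);
        Marshal.Copy(nameBytes, 0, (IntPtr)dest, nameBytes.Length);
    }

    public static string ReadName(byte* src)
    {
        // Skip the fixed size header, read the length prefix, then decode the string.
        src += sizeof(IndexingStatsHeader);
        int len = *(int*)src;
        src += sizeof(int);

        var nameBytes = new byte[len];
        Marshal.Copy((IntPtr)src, nameBytes, 0, len);
        return Encoding.UTF8.GetString(nameBytes);
    }
}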

The results, by the way:

image

That means that we got close to 4 times faster! That is certainly going to help us down the road.

Improving the RavenDB Experience: Help is right where you need it

One of the things that we work very hard on is the whole experience of the product. Not just creating a good database, or writing great code, but also having good documentation, to explain exactly what is going on here.

We have now taken it a bit further, and we incorporate help documents in pretty much every section of the studio, so you can get relevant docs about what you are looking at just by clicking the big help button.

image

 

It ain’t Clippy, so it won’t try to read your mind, but we do hope that it will end up being useful.


RavenDB on Linux, first steps…

Early days, but we are now at the point where we are capable of running RavenDB itself on Linux.

unnamed

That was the easy part, now we have to figure out all of our Windows dependencies and switch them with a Linux equivalent, find out where we are doing things like case insensitive file names, etc, etc, etc.

The current work is going to be pretty much a lot of “let us get all the tests running on Linux”, which is expected to last quite a while. After that, we'll be able to make sure that we run in a manner consistent with a Linux admin's expectations, and then… world domination*!

* Insert maniacal sound track here.


The RavenDB Comics

The new flyers just arrived, and we also have a new comic to go with them. We are going to give them out at QCon London on March 4 – 5. I have a really good feeling about this. And I really like the comic.

Nitpicker: Yes, I know it isn’t readable at this resolution. We’ll post this online a few weeks after the QCon conference, come see us there!


Excerpts from the RavenDB Performance team report: Facets of information, Part II

In my previous post, I outlined the performance problem with facets, in this one, I want to discuss how we solved it.

For the purpose of discussion, we’ll deal with the following data model:

{ "Id": "phones/1", "Unlocked": false, "Brand": "Apple", ... }
{ "Id": "phones/2", "Unlocked": true,  "Brand": "LG", ... }
{ "Id": "phones/3", "Unlocked": false, "Brand": "Samsung", ... }
{ "Id": "phones/4", "Unlocked": true,  "Brand": "Apple", ... }
{ "Id": "phones/5", "Unlocked": false, "Brand": "Samsung", ... }

In order to handle that, we have to understand how Lucene actually works behind the scenes. Everything in Lucene is based on the notion of a document id. The result of a query is a set of document ids, and the internal structures refer to document ids on a regular basis.

So, if the result of a query is actually a set of document ids (Int32), we can represent them as a simple HashSet<int>. So far, so good. Now, we want to do a facet by the brand. How is Lucene actually storing that information?

There are two data structures of interest. The TermsEnum, which looks like this:

  • Brand:Apple – 2
  • Brand:LG – 1
  • Brand:Samsung – 2

This shows the number of documents per term. This is almost exactly what we would have wanted, except that this is global information, relevant for all the data. If our query is for unlocked phones only, we cannot use this data to answer the query.

For that, we need to use the TermsDocs, which looks like this:

  • Brand:Apple – [1,4]
  • Brand:LG – [2]
  • Brand:Samsung – [3,5]

So we have the term, and the document ids that contained that term.

Note that this is a very high level view of how this actually works, and it ignores a lot of stuff that isn’t actually relevant for what we need to understand here.

The important bit is that we have a good way to get all the document matches for a particular term. Because we can do that, it gives us an interesting option. Instead of trying to calculate things on a case by case basis, we changed things around. We used the information inside the Lucene index to create the following structure:

{
	"Brand": {
		"Apple": [1,4],
		"LG": [2],
		"Samsung": [3,5]
	}
}

There is caching and other obvious stuff also involved, naturally, but those aren’t important for the purpose of this talk.

Now, let us get back to our query and the required facets. We want:

Give me all the unlocked phones, and show me the facet for Brand.

A query in Lucene is going to return the following results:

[1,3,5]

Looking at the data like that, I think you can figure out where we are going with this. Here is the general algorithm that we have for answering facets.

var docIds = ExecuteQuery();
var results = {};
foreach(var facet in facets) {
	results[facet] = {};
	var termsForFacet = terms[facet];
	foreach(var term in termsForFacet){
		results[facet][term] = CountIntersection(docIds, term.DocumentsForTerm);
	}
}

The actual cost for calculating the facet is down to a mere set intersection, so the end result is that the performance of facets in the scenarios we tested has improved by more than an order of magnitude.

I’m skipping over a lot of stuff, obviously. With facets we deal with distinct queries, ranges, dynamic aggregation, etc. But we were able to shove all of them into this style of work, and the results certainly justify the work.

Oh, and for the normal case that we had already worked so hard to optimize, we are seeing a 50% improvement there as well, so I'm pretty happy.

Excerpts from the RavenDB Performance team report: Facets of information, Part I

Facets are annoying. Mostly because they tend to be used relatively rarely, but when they are used, they are used a lot, and in many interesting ways. Because they are fairly rarely used, let me explain what they are first.

We want to show the users all the recent phones. We can do this using the following query:

from phone in session.Query<Phone>() 
orderby phone.ReleaseDate descending
select phone;

So far, so good. But we want to do better than just show a potentially big list to the user. So we start faceting the results. We pull some interesting pieces of data from the results, and allow the user to quickly narrow down the details. Typical facets for a phone query would be:

  • Brand
  • Screen Size
  • Camera
  • Cost

Now, we can certainly give the option to search for them individually, but then you end up with something like this:

That is neither friendly nor very useful. So UX designers came up with this idea, instead:

image

This gives the user a much easier way to look at the large data sets, and quickly narrow down the search to just the things they want. It also gives the user some indication about the actual data. I can see how changing the facet selection would likely change the number of results I have. This is a great way to offer data exploration capabilities. And in any kind of data repository, this is crucial. Shopping sites are a natural choice, but content/document management systems, job sites, information centers are all commonly making use of this feature.

I’m sorry for the long intro, but I feel that this feature needs to be placed in context, to explain its history and the various iterations it has gone through.

We first got this feature in Sep 2011. It was a pull request from Matt Warren, apparently based on some code that I spiked. I am intentionally going that far back, because that is the first time we see this code, and it was implemented in the simplest possible manner that could possibly work, to showcase this feature and to make sure that it works end to end. The underlying implementation looked something like this:

private void HandleTermsFacet(string index, Facet facet, IndexQuery indexQuery, IndexSearcher currentIndexSearcher, 
	Dictionary<string, IEnumerable<FacetValue>> results)
{
	var terms = database.ExecuteGetTermsQuery(index,
		facet.Name,null,
		database.Configuration.MaxPageSize);
	var termResults = new List<FacetValue>();
	foreach (var term in terms)
	{
		var baseQuery = database.IndexStorage.GetLuceneQuery(index, indexQuery);
		var termQuery = new TermQuery(new Term(facet.Name, term));

		var joinedQuery = new BooleanQuery();
		joinedQuery.Add(baseQuery, BooleanClause.Occur.MUST);
		joinedQuery.Add(termQuery, BooleanClause.Occur.MUST);

		var topDocs = currentIndexSearcher.Search(joinedQuery, 1);

		if(topDocs.totalHits > 0)
			termResults.Add(new FacetValue
			{
				Count = topDocs.totalHits,
				Range = term
			});
	}

	results[facet.Name] = termResults;
}

In other words, the overall logic was something like this:

  • Get a faceted query, with a list of facets.
  • For each facet:
    • Get all the terms for the field (if this is Operating System, then that would be: Android, iOS, BlackBerry, Windows Phone)
    • For each of those terms, execute the original query and add a clause that limits the results to the current term and value.

As you can imagine, this has some… performance issues. The more facets we had, and the more terms per facet, the costlier this got. Surprisingly enough, this code lasted almost unmodified for about 10 months, and the next major change was simply to process all of this in parallel.

Shortly thereafter, we had more changes, moving from this trivial model to a much more efficient one, where we are executing the query once per facet (PR from Michael Weber, by the way), instead of once per facet/term. Dynamic aggregation in RavenDB is also done through facets, and that added a much more common code path, so more changes started happening to the facets code. In particular, we started handling caching, and we ended up reducing the run time by a lot by restructuring how we were handling things so we would only execute a single query, then use low level index access to get all the other information we needed in an optimized fashion.

We got more customer usage samples, from the guys who had 70,000 facet queries to the people who wanted facets over multiple millions of results. All of that meant that we put a lot of thought into how we could reduce the cost of faceted queries, and over time things became more and more complex, but much faster. In particular, by ordering the sequence of operations for computing the query, adding prefetching support and in general greatly reducing the amount of work we needed to do to just get the query working.

The end result was that the facet code looked something like this:

IndexedTerms.ReadEntriesForFields(currentState, fieldsToRead, allCollector.Documents,
    (term, docId) =>
    {
        List<Facet> facets;
        if (!sortedFacets.TryGetValue(term.Field, out facets))
            return;
        
        foreach (var facet in facets)
        {
            switch (facet.Mode)
            {
                case FacetMode.Default:
                    var facetValues = facetsByName.GetOrAdd(facet.DisplayName);
                    FacetValue existing;
                    if (facetValues.TryGetValue(term.Text, out existing) == false)
                    {
                        existing = new FacetValue
                        {
                            Range = GetRangeName(term)
                        };
                        facetValues[term.Text] = existing;
                    }
                    ApplyFacetValueHit(existing, facet, docId, null, currentState, fieldsCrc);
                    break;
                case FacetMode.Ranges:
                    // removed for brevity
                    break;
                default:
                    throw new ArgumentOutOfRangeException();
            }
        }

    });

There is a lot of stuff going on around this code, but this is where we actually computed the facets. Note that this basically calls the lambda once per every matching term for the specified fields for every document in the query.

And the performance was pretty good for our test cases; we ran this through several customers who were heavy facet users, and they saw a marked improvement in the performance of queries. But we recently got a customer who mentioned in passing that facet queries were expensive. As in, the 5 – 10 second range per query. That raised some red flags; we consider requests that take more than 100ms to be suspicious, so we asked for more information, and we realized that he was using facets slightly differently from most of our customers.

Usually, facets are used to… well, facet the data on a search. Typical performance issues with facets in the past were with a large number of facets on a search result (I wasn't kidding with the 70,000 facets). Because most of the customer feedback we got was around that, we focused on this issue primarily, to the point where even stuff like that was behaving well. This customer, however, was using facets not for searching, but as a way to manage the data itself. That meant that they were using a lot of facets on the entire data set.

So instead of having many facets on a medium-sized number of documents, they were using a small number of facets, but on the entire data set. Profiling this resulted in a few calls being highlighted:

  • GetOrAdd
  • TryGetValue

They are very cheap calls, on their own, but this piece of code was called literally tens of millions of times for this particular scenario. We spent some time with a profiler and started playing with the code, trying to see if we can remove the costly calls and optimize this code path.

The answer was… no. In fact, everything we did was in the range of a 5% – 7% change (in both directions, mind).

This sucked, because going back to the user and saying: “that is the way it is” didn't sit well with us. We got together, closed the shutters, turned off the light, and solely by the light of the monitors, we did a major hacking session. We completely changed the way we were handling facets, not just to avoid the problem with the calls to TryGetValue or GetOrAdd, but to change direction to a much more efficient mode.

This post is getting pretty long, so I’ll discuss this in my next post. For now, can you suggest options for improvements?

Figure this code?

I have just finished writing this code, and it occurred to me that on its own, it makes very little sense.

private static int RedactedFunctionName(List<int> a, List<int> b)
{
    List<int> n,m;
    if (a.Count > b.Count)
    {
        n = a;
        m = b;
    }
    else
    {
        n = b;
        m = a;
    }

    int nSize = n.Count;
    int mSize = m.Count;

    double o1 = nSize + mSize;
    double o2 = nSize * Math.Log(mSize, 2);

    int result = 0;
    if (o1 < o2)
    {
        int mi = 0, ni = 0;
        while (mi < mSize && ni < nSize)
        {
            if (n[ni] > m[mi])
            {
                mi++;
            }
            else if (n[ni] < m[mi])
            {
                ni++;
            }
            else
            {
                ni++;
                mi++;
                result++;
            }
        }
    }
    else
    {
        for (int i = 0; i < mSize; i++)
        {
            if (n.BinarySearch(m[i]) >= 0)
                result++;
        }
    }
    return result;
}

Can you figure out what this function does, why is it behaving in this manner, and what preconditions it expects?
