Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969


Implementing low level trie: Part II

time to read 1 min | 172 words

I was asked recently why I’m “burning” my interview questions by posting them on the blog. There are actually several reasons for that:

  1. If a candidate reads my blog and is able to produce high quality code based on this, well… that is pretty much the job description right there.
  2. We just finished another recruitment round, and we aren’t planning another one for at least 4 – 6 months.
  3. The fact that I’m posting the answers to a specific question doesn’t mean that the subject matter is closed.

For example, let us take this question & answer. Note that this is approachable because it is just standard .NET code.

However, the code as posted contains a bug. It is a small one, and all the unit tests pass, but it results in slightly inefficient behavior. Exposing it with a unit test is relatively easy, but going from the failing test back to the root cause and then fixing it calls for an interesting investigative technique and is a good show of skill.

Implementing low level trie: Part I

time to read 5 min | 826 words

A few months ago I asked this question, and since then I’ve been asking it of some candidates. This is my 2nd tier question, and answering it correctly is going to give you quite a few brownie points.

The idea is to implement this interface:
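The exact interface is in the original post; roughly, it maps string keys to long values and can round trip through a stream. A sketch of the shape (the member names here are assumptions, not the original’s):

using System.IO;

public interface ITrie
{
    // persistence, which is trivial given the 32 KB buffer constraint below
    void Save(Stream output);
    void Load(Stream input);

    // the actual trie operations
    bool TryWrite(string key, long value);
    bool TryRead(string key, out long value);
    bool TryDelete(string key);
}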

But the only field that you can put in the implementation is a 32 KB byte array. See the original post for the full details.

Let us get rid of the easy stuff first. Since the only value I can keep is a 32 KB byte array, saving and loading the values is trivial:

We can just save and load it directly; there is nothing much to do except validate that the size matches on load.
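Something along these lines would do (a sketch under the stated constraint; the field name is an assumption):

using System.IO;

private readonly byte[] _buffer = new byte[32 * 1024];

public void Save(Stream output)
{
    // the buffer is the entire state, so we can write it out as-is
    output.Write(_buffer, 0, _buffer.Length);
}

public void Load(Stream input)
{
    // read exactly 32 KB, validating that the stream isn't truncated
    var read = 0;
    while (read < _buffer.Length)
    {
        var n = input.Read(_buffer, read, _buffer.Length - read);
        if (n == 0)
            throw new InvalidDataException("Expected a 32 KB trie buffer");
        read += n;
    }
}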

Now, let us see what we need to do. Here is an example of a trie:

Typically, you would implement that using nested dictionaries, but that isn’t going to work given this exercise’s limitations.

So let us consider a more primitive alternative. We need to store the following information about each node in the trie. We start by defining the following node:

This wouldn’t work within the limits we have, but we are going to build this up in pieces, so it is an important step in understanding how we go about it.
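The node listing itself isn’t reproduced in this text; based on the description, it is roughly this shape (field names assumed):

public class TrieNode
{
    public string Key;          // the full key up to this node
    public long? Value;         // null for internal nodes that hold no value
    public TrieNode[] Children; // kept sorted, so we can binary search it
}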

We are going to store the data in the following format:

[image: the trie storage format]

So on each level, we are able to jump to the next stage and check its values. Finding the right entry is simply a matter of doing a binary search on the children array.

The full code listing is linked from the bottom of the post, if you want to see it in context.

And here it is when stored as an object tree. Note that each child stores the full key, not just the part it owns, because that makes it easier to visualize.

[image: the trie stored as an object tree]

Let us look at how we are building this kind of trie. We’ll start with two pretty simple helper methods.

There isn’t much to see here: FindDifference finds the first difference between two strings from a given offset, and FindMatch does a standard binary search on the array. The only special thing is that we return the match position even on failure; we need that in order to know where to put the next entry. This will become clear when we look at the Insert method.
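The actual listing is linked at the bottom of the post; a hedged sketch of the two helpers, reusing the TrieNode shape from above, might look like this:

static int FindDifference(string a, string b, int offset)
{
    // index of the first character (at or after offset) where the strings diverge
    var end = Math.Min(a.Length, b.Length);
    while (offset < end && a[offset] == b[offset])
        offset++;
    return offset;
}

static bool FindMatch(TrieNode[] children, string key, out int position)
{
    // standard binary search over the sorted children array
    int lo = 0, hi = children.Length - 1;
    while (lo <= hi)
    {
        var mid = (lo + hi) / 2;
        var cmp = string.CompareOrdinal(children[mid].Key, key);
        if (cmp == 0)
        {
            position = mid;
            return true;
        }
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    position = lo; // even on a miss, this is where the entry would go
    return false;
}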

This is quite a bit of code, but it divides neatly into three separate cases. In the first case, we reach an empty section in the tree, so we can just create a new leaf and call it a day. This is why we use a ref parameter here: it allows us to mutate the given argument, which can be either the root or the children array of a nested node.

If there are already values at this level, we check if we need to go into the next level (creating the new level if necessary).

If there isn’t a value at this level that starts with the current prefix, we just add the new entry at this level. A simplified skeleton of these three cases is sketched below.
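This is not the original listing (that is linked at the bottom of the post), just a simplified skeleton of the three cases, with the node splitting of the middle case elided:

static void Insert(ref TrieNode[] children, string key, long value)
{
    if (children == null || children.Length == 0)
    {
        // case 1: an empty section, create a new leaf; the 'ref' lets us
        // publish the new array into the root or into a parent node
        children = new[] { new TrieNode { Key = key, Value = value } };
        return;
    }
    if (FindMatch(children, key, out var pos))
    {
        children[pos].Value = value; // exact match, just overwrite
        return;
    }
    // case 2 (elided here): an existing child shares a prefix with the key,
    // so we split that node if needed and recurse into the next level
    // case 3: nothing at this level starts with our prefix, so we add the
    // entry here, keeping the array sorted for the binary search
    var updated = new TrieNode[children.Length + 1];
    Array.Copy(children, 0, updated, 0, pos);
    updated[pos] = new TrieNode { Key = key, Value = value };
    Array.Copy(children, pos, updated, pos + 1, children.Length - pos);
    children = updated; // a single reference swap, which readers see atomically
}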

Pretty simple, once you break it down. The fun part is that, as written, this is actually safe for multi threaded use: a single writer can make modifications at the same time that multiple readers read, without any need for synchronization.

Reading from the trie given this structure is pretty simple:

We simply use the same logic as before, but since we make no changes, it is much simpler to work with.
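Again a hedged sketch rather than the original listing, reusing FindMatch from above:

static bool TryRead(TrieNode[] children, string key, out long value)
{
    value = 0;
    while (children != null && children.Length > 0)
    {
        if (FindMatch(children, key, out var pos))
        {
            if (children[pos].Value == null)
                return false; // an internal node with no value stored
            value = children[pos].Value.Value;
            return true;
        }
        // on a miss, only the entry just before the insertion point can be
        // a prefix of our key, since the array is sorted
        if (pos == 0 || !key.StartsWith(children[pos - 1].Key, StringComparison.Ordinal))
            return false;
        children = children[pos - 1].Children; // descend a level
    }
    return false;
}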

The last bit that we still need to handle is deletes. Here is how that is done:

That is basically the inverse of Insert, but it needs to handle patching up the trie on the way back up again.

And this is pretty much it, right?

Except… that this is not what I started with. Where is the byte array? We are allocating memory here like crazy and all we did was implement a pretty standard trie without the use of dictionaries.

What is the point?

Well, now that we have this implementation, we have a good understanding of what is actually required of us, and we can take this code and move it to the array. But that will be in the next post.

In the meantime, you can read the entire code in one go here.

A different view on creation

time to read 2 min | 262 words

The following code shows several increasingly complex ways to create a shared instance of an object:
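The snippet itself isn’t reproduced in this text; the following is a hedged reconstruction of the five variants under discussion (RequestExecuter, the second dictionary, and the method names are assumptions):

using System;
using System.Collections.Concurrent;

private readonly ConcurrentDictionary<string, RequestExecuter> _requestExecuters =
    new ConcurrentDictionary<string, RequestExecuter>();
private readonly ConcurrentDictionary<string, Lazy<RequestExecuter>> _lazyRequestExecuters =
    new ConcurrentDictionary<string, Lazy<RequestExecuter>>();

// 1: allocates (and usually discards) a RequestExecuter on every call
public RequestExecuter GetRequestExecuter1(string url)
{
    return _requestExecuters.GetOrAdd(url, new RequestExecuter(url));
}

// 2: the lambda captures 'url', so a delegate is allocated per call, and
// under concurrent load two RequestExecuters may be created, one discarded
public RequestExecuter GetRequestExecuter2(string url)
{
    return _requestExecuters.GetOrAdd(url, u => new RequestExecuter(url));
}

// 3: no capture, so the compiler caches a single delegate instance
public RequestExecuter GetRequestExecuter3(string url)
{
    return _requestExecuters.GetOrAdd(url, u => new RequestExecuter(u));
}

// 4: Lazy removes the creation race, but a new Lazy is allocated per call
public RequestExecuter GetRequestExecuter4(string url)
{
    return _lazyRequestExecuters.GetOrAdd(url,
        new Lazy<RequestExecuter>(() => new RequestExecuter(url))).Value;
}

// 5: a cached delegate plus Lazy, so no allocations on the common path
private static readonly Func<string, Lazy<RequestExecuter>> CreateExecuter =
    u => new Lazy<RequestExecuter>(() => new RequestExecuter(u));

public RequestExecuter GetRequestExecuter5(string url)
{
    return _lazyRequestExecuters.GetOrAdd(url, CreateExecuter).Value;
}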

What are the differences? The _requestExecuters field is a concurrent dictionary, and the major difference is the kind of allocations that happen on each call.

In the first scenario, we’ll create a new RequestExecuter each time that we call this line. We’ll still use only a single instance, of course, but we still create (and discard) an instance per call.

In the second scenario, we are passing a delegate, so we’ll only create the RequestExecuter once. Or so it seems. The problem is that under concurrent load, it is possible that we’ll have two RequestExecuters created, only one of which will be used. If we have any unmanaged resources in the RequestExecuter, that can cause a leak. Another issue is that we are using the method parameter, which forces the compiler to capture it, and allocate a new delegate instance per call.

The third scenario is the same as the second one, but we aren’t capturing the parameter, so the compiler will not need to create a new delegate instance per call.

The fourth one is using a lazy value. This way, we avoid the race in creating the RequestExecuter, but we still create a new Lazy instance per call.

And the fifth one is using a lazy instance and a cached delegate version, so there are no extra allocations there. There is still the race to create the lazy instance, but that should happen rarely, and it isn’t holding any expensive resources.

How timers work in the CLR

time to read 3 min | 490 words

One of the coolest things about the CoreCLR being open sourced is that I can trawl through the source code and read random parts of the framework. One of the reasons to do this is to be able to understand the implementation concerns, not just the design, which then allows us to produce much better output.

In this case, I was investigating a hunch, and I found myself deep inside the code that runs the timers in .NET. The relevant code is here, and it is clearly commented as well as quite nice to read.

I’m just going to summarize a few interesting things I found in the code.

There is actually only one single real timer for the entire .NET process. I started out thinking this is handled via CreateTimerQueueTimer on Windows, but I couldn’t find a Linux implementation. Reading the code, the CLR actually implements this directly via this code. Simplified, it does the following:

[image: a simplified outline of the timer loop]
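In a loose C# paraphrase (the real code is unmanaged; the helper names here are illustrative, not the CLR’s):

static void TimerThread()
{
    while (true)
    {
        // a single int32 millisecond timeout for the whole process, which is
        // where the overflow concern with very long timers comes from
        int timeout = TimeToNextTimerInMs();
        WaitForTimerChangeOrTimeout(timeout);
        // the actual callbacks are queued to the thread pool, not run here
        ThreadPool.QueueUserWorkItem(_ => FireNextTimers());
    }
}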

This has some interesting implications. It means that timers are all going to be fired from the same thread, at the same time (not quite true, see below), and that there is likely going to be a problem with very long timers (a timer for three months from now will overflow int32, for example).

The list of timers is held in a linked list, and every time it is awakened, it runs through the list, finding the timer to trigger, and the next time to be triggered. The code in this code path is called with only a single timer, which is then used in the managed code for actually implementing the managed timers. It is important to note that actually running the timer callback is done by queuing that on the thread pool, not executing it on the timer thread.

On the managed side, there are some interesting comments explaining the expected usage and data structures. There are two common cases: one is the use of timeouts, which are typically discarded before they actually fire, and the other is recurring timers, which tend to fire once in a long while. So the code favors adding / removing timers over actually finding which ones need to be executed.

Another thing to note is that adding / removing / changing / iterating over timers is protected by a single lock. Every time the unmanaged timer wakes, it queues a callback on the thread pool, and then FireNextTimers is called, which iterates over all the timers and queues those that are due to be executed on the thread pool.

This behavior is interesting, because it has some impact on commonly used cases. But I’ll discuss that in my next post.

Concurrent max value

time to read 2 min | 324 words

For a feature in RavenDB, I need to figure out the maximum number of outputs per document an index has. Now, indexing runs in multiple threads at the same time, so we ended up with the following code:

var actualIndexOutput = maxActualIndexOutput;
if (actualIndexOutput < numberOfAlreadyProducedOutputs)
{
    // okay, now let's verify that this is indeed the case, in a thread safe manner,
    // this way, most of the time we don't need volatile reads, and other sync operations
    // in the code ensure we don't have too stale a view on the data (besides, a stale view
    // has to mean a smaller number, which we then verify).
    actualIndexOutput = Thread.VolatileRead(ref maxActualIndexOutput);
    while (actualIndexOutput < numberOfAlreadyProducedOutputs)
    {
        // if it changed, we don't care, it is just another max, and another thread probably
        // set it for us, so we only retry if this is still smaller
        actualIndexOutput = Interlocked.CompareExchange(
            ref maxActualIndexOutput,
            numberOfAlreadyProducedOutputs,
            actualIndexOutput);
    }
}

The basic idea is that this code path is hit a lot, once per document indexed per index. And it also needs to be thread safe, so we first do an unsafe check, then a thread safe operation.

The idea is that we’ll quickly arrive at the actual max number of index outputs, without having to pay the price of volatile reads or thread synchronization on most calls.
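For reference, the same pattern extracted into a standalone helper (a sketch, not the RavenDB code; Volatile.Read is the modern spelling of Thread.VolatileRead):

using System.Threading;

static void InterlockedMax(ref int max, int candidate)
{
    var current = Volatile.Read(ref max);
    while (candidate > current)
    {
        var previous = Interlocked.CompareExchange(ref max, candidate, current);
        if (previous == current)
            break; // we won the race and stored the new max
        current = previous; // lost the race, re-check against the newer value
    }
}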

Production postmortem: The case of the intransigent new database

time to read 3 min | 480 words

A customer called to tell us that they had a problem with RavenDB. As part of their process for handling new customers, they would create a new database, set up indexes, and then direct all the queries for that customer to that database.

Unfortunately, this system that has worked so well in development died a horrible death in production. But, and this was strange, only for new customers, and only in the create new customer stage. The problem was:

  • The user would create a new database in RavenDB. This just creates a db record and its location on disk. It doesn’t actually initialize the database.
  • On the first request, we initialize the db, creating it if needed. The first request will wait until this happens, then proceed.
  • On their production systems, that first request (which they used to create the indexes they require) would time out with an error.

Somehow, the creation of a new database would take way too long.

The first thought we had was they are creating the database on a path of an already existing database, maybe a big one that had a long initialization period, or maybe one that required recovery. But the customer validated that they were creating the database on an empty folder.

We looked at the logs, and they just showed long stretches of time where there was no activity. In fact, we saw a single method call to open the database that took over 15 seconds to run. Except that on a new database, this method just creates a bunch of files to start things out and normally returns really quickly.

That is the point that led us to suspect that the issue was environmental. Luckily, as the result of many such calls, RavenDB comes with a pretty basic I/O test tool. I asked the customer to run this on their production system, and I got the following:

[image: the I/O test results]

And now everything was clear. They were running on an I/O constrained system (a cloud machine), and they were running into an interesting problem. When RavenDB creates a database, it pre-allocates some files for its transactional journal.

Those files are 64MB in size, and the total write for a new Esent RavenDB database with the default configuration is just over 65MB. If your write throughput is less than 1MB/sec sustained, that will be problematic: at that rate, just creating the database takes more than a minute, which is more than enough for the first request to time out.

I let the customer know about the configuration option to take less space at startup (Esent RavenDB databases can go as low as 5MB, Voron RavenDB starts at 256KB), but I also gave them a hearty recommendation to make sure that their I/O rates improved, because this isn’t going to be the only case where slow I/O will kill them.

Exercise, learning and sharpening skills

time to read 2 min | 235 words

The major portion of the RavenDB core team is co-located in our main office in Israel. That means that we get to do things like work together, throw ideas off one another, etc.

But one of the things that I found stifling in other workplaces is a primary focus on the current work. I mean, the work is important, but one of the best ways to stagnate is to keep doing the same thing over and over again. Sure, you are doing great and can spin that yarn very well, but there is always that power loom that is going to show up any minute. So we try to do some skill sharpening.

There are a bunch of ways we do that. We try to have a lecture every week (given by one of our devs). The topics so far range from graph theory to distributed gossip algorithms to new frameworks and features that we deal with to the structure of the CPU and memory architecture.

We also have mini debuggatons. I’m not sure if the name fits, but basically, we present a particular problem, split into pairs, and try to come up with a root cause and a solution. This post is being written while everyone else in the office is busy looking at WinDBG and hunting down a memory leak; in the meantime, I’m posting to the blog and fixing bugs.

Troubleshooting, when F5 debugging can’t help you

time to read 6 min | 1054 words

You might have noticed that we have been doing a lot of work on the operational side of things, to make sure that we give you as good a story as possible with regards to the care & feeding of RavenDB. This post isn’t about that, though. This post is about your applications and systems, and how you are going to react when !@)(*#!@(* happens.

In particular, the question is what do you do when this happens?

This situation can crop up in many disguises. For example, you might be seeing high memory usage in production, or experiencing growing CPU usage over time, or seeing request times go up, or any of a hundred and one different production issues that make for a hell of a night (somehow, they almost always happen at nighttime).

Here is how people usually think about it.

The first thing to do is to understand what is going on. About the hardest thing to handle in these situations is when we have an issue (high memory, high CPU, etc.) and no idea why. Usually all the effort is spent just figuring out the what and the why. The problem with this troubleshooting process is that it is very easy to jump to conclusions and end up with an utterly wrong hypothesis. Then you have to go through the rest of the steps just to realize it isn’t right.

So the first thing that we need to do is gather information. And this post is primarily about the various ways that you can do that. In RavenDB, we have actually spent a lot of time exposing information to the outside world, so we’ll have an easier time figuring out what is going on. But I’m going to assume that you don’t have that.

The be-all, end-all tool for this kind of error is WinDBG. This is the low level tool that gives you access to pretty much anything you could want. It is also very archaic and not very friendly at all. The good thing about it is that you can load a dump into it. A dump is a capture of the process state at a particular point in time. It gives you the ability to see the entire memory contents and all the threads. It is an essential tool, but also the last one I want to use, because it is pretty hard to do so. Dump files can be very big, multiple GB are very common, because they contain the full memory dump of the process. There are also mini dumps, which are easier to work with but don’t contain the memory contents, so you can look at the threads, but not at the data.

The .NET Memory Profiler is another great tool for figuring things out. It isn’t usually so good for production analysis, because it uses the Profiler API to figure things out, but it has a wonderful feature of loading dump files (ironically, it can’t handle very large dump files because of memory issues) and gives you a much nicer view of what is going on there.

For high CPU situations, I like to know what is actually going on. And looking at the stack traces is a great way to do that. WinDBG can help here (take a few mini dumps a few seconds apart), but again, that isn’t so nice to use.

Stack Dump is a tool that takes away a lot of the pain of having to deal with that: it just outputs all the thread information, and we have used it successfully in the past to figure out what is going on.

For general performance issues (“requests are slow”), we need to figure out where the slowness actually is. We have had reports that run the gamut from “things are slow, client machine is loaded” to “things are slow, the network QoS settings throttle us”. I like to start by using Fiddler to figure those things out. In particular, the statistics window is very helpful:

[image: the Fiddler statistics window]

The obvious things are the bytes sent & bytes received. We have had a few cases where a customer was actually sending 100s of MB in either or both directions, and was surprised it took some time. If those values are fine, you want to look at the actual performance listing. In particular, look at things like the TCP/IP connect time, the time from the client sending the request to the server starting to process it, etc.

If you find that the problem is actually at the network layer, you might not be able to handle it immediately. You might need to go a level or two lower and look at the actual TCP traffic. This is where something like Wireshark comes into play, and it is useful for figuring out whether you have specific errors at that level (for example, a bad connection that causes a lot of packet loss will impact performance, even though things will still work).

Other tools that are very important include Resource Monitor, Process Explorer and Process Monitor. Those give you a lot of information about what your application is actually doing.

Once you have all of that information, you can form a hypothesis and try to test it.

If you own the application in question, the best way to improve your chances of figuring out what is going on is to add logging. Lots & lots of logging. In production, having the logs to support what is going on is crucial. I usually have several levels of logging: first, the traffic in / out of my system; next, the actual system operations, especially anything that happens in the background; and finally, the debug/trace endpoints that expose internal state and allow you to tweak various things at runtime.

Having a good working knowledge of how to properly utilize the above mentioned tools is very important, and should be considered much more useful than learning a new API or a language feature.

Async event loops in C#

time to read 4 min | 656 words

I’m designing a new component, and I want to reduce the amount of complexity involved in dealing with it. This is a networked component, and after designing several such, I wanted to remove one area of complexity, which is the use of explicitly concurrent code. Because of that, I decided to go with the following architecture:

[image: the network reader feeding an in memory queue, drained by a single threaded event loop]

The network code is just reading messages from the network and putting them in an in memory queue. Then we have a single threaded event loop that simply goes over the queue and processes those messages.

All of the code that actually processes messages is single threaded, which makes it oh so much easier to work with.

Now, I can do this quite easily with a BlockingCollection<T>, which is how I have usually done this sort of thing so far. It is simple, robust and easy to understand. It also ties down a full thread for the event loop, which can be a shame if you don’t get a lot of messages.
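For contrast, a minimal sketch of the BlockingCollection<T> version (ProcessMessage is the same assumed handler used below), which pins a dedicated thread for the loop:

using System.Collections.Concurrent;
using System.Threading;

static void EventLoop(BlockingCollection<object> queue, CancellationToken token)
{
    // GetConsumingEnumerable blocks the loop's thread while waiting for messages
    foreach (var msg in queue.GetConsumingEnumerable(token))
        ProcessMessage(msg);
}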

So I decided to experiment with async approaches. In particular, using the BufferBlock<T> from the TPL Dataflow assembly.

I came up with the following code:

var q = new BufferBlock<object>(new DataflowBlockOptions
{
    CancellationToken = cts.Token,
});

This just creates the buffer block, but the nice thing here is that I can set up a “global” cancellation token for all operations on it. The problem is that this actually generates bad exceptions (InvalidOperationException, instead of TaskCanceledException). Well, I’m not sure if bad is the right term, but it isn’t the one I would expect here, at least. If you pass a cancellation token directly to the method, you get the behavior I expected.

At any rate, the code for the event loop now looks like this:

private static async Task EventLoop(BufferBlock<object> bufferBlock, CancellationToken cancellationToken)
{
    while (true)
    {
        object msg;
        try
        {
            msg = await bufferBlock.ReceiveAsync(TimeSpan.FromSeconds(3), cancellationToken);
        }
        catch (TimeoutException)
        {
            NoMessagesInTimeout();
            continue;
        }
        catch (Exception)
        {
            break;
        }
        ProcessMessage(msg);
    }
}

And that is pretty much it. We have a good way to handle timeouts and process messages, we don’t take up a thread, and we can be easily cancelled. I still need to run this through a lot more testing, in particular to verify that this doesn’t cause issues when we need to debug this sort of system, but it looks promising.
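To make the flow concrete, here is a minimal sketch of how the pieces might be wired together (ReadMessageFromNetwork is a stand-in for the real network code):

var cts = new CancellationTokenSource();
var q = new BufferBlock<object>(new DataflowBlockOptions
{
    CancellationToken = cts.Token,
});

var loop = EventLoop(q, cts.Token); // all processing is single threaded

q.Post(ReadMessageFromNetwork()); // the network reader just posts into the queue

cts.Cancel(); // shuts the loop down
await loop;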

DSLs in Boo, and a look back

time to read 2 min | 268 words

About 6 years ago, I started writing the DSLs in Boo book. It came out in 2010, and today I got an email saying that it is now officially out of print. It was never a hugely popular book, so I’m not really surprised, but it really got me thinking.

I got to build several DSLs for production during the time I was writing this book, but afterward, I pretty much pivoted hard to RavenDB and haven’t done much with DSLs since. However, the knowledge acquired while writing this book has actually been quite helpful when writing RavenDB itself.

I’m not talking about the design aspects of writing a DSL, or the business decisions that are involved with that, although those are certainly factors. I’m talking about the actual technical details of working with a language, a parser, etc.

In fact, you won’t see it, probably, but RavenDB indexes and transformers are actually DSLs, and they use a lot of the techniques that I talk about in the book. We start with something that looks like C# code, but what ends up running is actually something far different. The Linq provider, too, relies heavily on those same techniques. We show you one thing but actually do something quite different under the covers.

It is interesting to see how the actual design of RavenDB was influenced by what my own history and the choices I made in various places. If I wasn’t well versed with abusing a language, I would probably have to go with something like CouchDB’s views, for example.
