Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969




Reducing the cost of occasionally needed information

time to read 3 min | 486 words

Consider the case where you need to call a function and, based on the result of the function and some other information, do some work. That work requires additional information that can only be computed by calling the function.

Sounds complex? Let us look at the actual code. I tried to make it simple, but keep the same spirit.
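
Something along these lines, as a simplified sketch (the types and helper names are illustrative, not the actual Voron code):

```csharp
// Simplified sketch only; Page, Slice and the helpers are illustrative.
public void Insert(Slice key, Slice value)
{
    var page = FindPageFor(key, out int posInPage);

    if (posInPage != -1)                   // the key is already in the tree
    {
        page.SetValueAt(posInPage, value); // just update it in place
        return;
    }

    if (page.HasSpaceFor(key, value))      // new key, and the page has room
    {
        page.Add(key, value);
        return;
    }

    // The interesting case: the page is full, so we need to split it, which
    // requires modifying the parent page (and possibly its parent, and so on).
    SplitPage(page, key, value);
}
```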

This code has three different code paths. We are inserting a value that is already in the tree, so we update it and return. We insert a new value, but the page has enough space, so we add it and return.

Or, the interesting case for us, we need to split the page, which requires modifying the parent page (which may require splitting that page, and so on). So the question is, how do we get the parent page?

This is important because we have already found the parent page; we had to go through it to find the actual page we returned from FindPageFor. So ideally we would just return the list of parent pages whenever we call FindPageFor.

The problem is that read scenarios, which by far outweigh the write operations, never need this value. What is more, even write operations only need it occasionally. On the other hand, when we do need it, we have already done all the work needed to just give it to you, and it would be a waste to do it all over again.

We started the process (this particular story spans 4 years or so) by adding an optional parameter. If you passed a non-null value, we would fill it with the relevant details. So far, so good: reads send null, but all the writes had to send a value, even though only a few of them actually used it.

The next step was to move the code to use an out lambda. In other words, when you call us, we also give you a delegate that you can call, which will give you the details you need. This way, you only pay the price when you actually need it.

It looked like this:
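
Roughly, in sketch form (illustrative names):

```csharp
// Sketch: FindPageFor also hands back a delegate that can build the ancestry
// information on demand, so callers only pay for it when they actually split.
public Page FindPageFor(Slice key, out Func<TreeCursor> getCursor)
{
    var page = SearchTree(key); // walk down from the root, as before

    // The lambda captures 'this' and 'key'; only if it is invoked do we go
    // and materialize the list of parent pages.
    getCursor = () => BuildCursorFor(key);
    return page;
}
```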

However, that led to a different problem. While we only paid the price of calling the lambda when we needed it, we still paid the price of allocating the lambda itself.

The next step was to create two overloads, one used only for reads and one for writes. The write overload lets us pass a struct as an out parameter. The key here is that because it is a struct there is no allocation, and the method is written so that the values previously captured by the lambda are put on the struct instead. Then, if we actually need to generate the value, we can do so without any additional allocations.
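
A sketch of the write-path overload (illustrative names again):

```csharp
// The struct holds what the lambda used to capture, so there is no allocation
// unless the caller actually asks for the cursor.
public struct FindPageResult
{
    internal Tree Tree;
    internal Slice Key;

    public TreeCursor CreateCursor() => Tree.BuildCursorFor(Key);
}

public Page FindPageFor(Slice key, out FindPageResult result) // write path
{
    result = new FindPageResult { Tree = this, Key = key };
    return SearchTree(key);
}
```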

And this post is now about six times longer than the actual optimization itself.

Dynamic compression acceleration

time to read 3 min | 523 words

After talking about what specifying LZ4 acceleration does, let us get down and talk about how we use it.

In our benchmark, we ran into a problem. Our benchmark hardware is too fast. I'm testing on a disk that can deliver 250,000 IOPS and can write over 1GB/sec. On the other hand, if I'm running on Amazon, the maximum IOPS that I can pay for is 20,000. Azure seems to have a limit of 5,000 IOPS per disk. A standard SSD will give you about 75,000 IOPS, while an HDD will typically have IOPS in the low hundreds. I'm using IOPS because it is an easy single metric to compare, but the same applies to disk bandwidth, write speed, etc.

That means that we are actually testing on hardware that is likely to be better than what we'll typically run on. Which means that we need to be careful about what kind of optimizations we bring in. It would be easy to optimize for a specific machine, at the cost of worse performance in the general case.

Case in point: for many cases, the cost of actually writing to disk on that particular machine is low enough that it isn't worth putting a lot of effort into compressing the data. That isn't the case all the time, though; for example, if we are applying pressure on the I/O system, or if we have some other stuff going on that will impact our writes.

On that particular machine, however, it got to the point where we are seeing higher compression times than write times, which is pretty ridiculous. We aren’t actually saving anything by doing that. But instead of tailoring a solution to a single machine, we decided to go in a bit of a roundabout way. When we start, our initial assumption is that we are in a machine with higher I/O cost than CPU cost, which will cause us to put more effort into compressing the data to reduce the amount of writes we have to make.

On the other hand, after we start writing, we can start measuring the relative costs, and adjust accordingly. In other words, based on how expensive it is to compress the data versus writing the data to disk, we can dynamically adjust the cost of compression, until we hit the sweet spot for the machine and environment that we are running on. The beauty in this behavior is that it will automatically adjust to pressures, so if someone is running a backup on this disk, and slowing us down, we’ll learn about it and increase the compression rate to compensate. Once the I/O pressure is off, we can relax our compression ratio to reduce total resource consumed.
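
A sketch of what such a feedback loop can look like (illustrative, not the actual RavenDB code):

```csharp
using System;

// Measures how long each batch spent compressing vs. writing, and nudges the
// LZ4 acceleration factor accordingly. Illustrative only.
public class CompressionAccelerationTuner
{
    private int _acceleration = 1; // 1 = maximum compression effort

    public int Acceleration => _acceleration;

    public void RecordBatch(TimeSpan compressionTime, TimeSpan writeTime)
    {
        if (compressionTime > writeTime)
        {
            // CPU is the bottleneck: compress less aggressively (faster, larger output).
            _acceleration = Math.Min(_acceleration + 1, 64);
        }
        else if (compressionTime.Ticks < writeTime.Ticks / 2)
        {
            // I/O is the bottleneck (e.g. a backup is hammering the disk):
            // spend more CPU to shrink the amount we have to write.
            _acceleration = Math.Max(_acceleration - 1, 1);
        }
    }
}
```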

On our benchmark machine, we have managed to import the full StackOverflow dataset, just over 18 million documents totaling over 50 GB in size, in under 8 minutes, reducing our overall benchmark time by almost 10%.

How does LZ4 acceleration work?

time to read 3 min | 403 words

LZ4 has an interesting feature, acceleration. It allows you to modify the compression ratio (and the corresponding compression speed). This is quite interesting for several scenarios. In particular, while a higher compression rate is almost always good, you want to take into account the transfer speed as well. For example, if I'm interested in writing to the disk, and the disk write rate is 400 MB / sec, it isn't worth it to use the default acceleration level (which can produce about 385MB / sec), and I can reduce the compression effort so that the speed of compression will not dominate my writes.

You can read more about it here.

We started playing with this feature today, and that brought the question, what does this actually do?

This is a really good question, but before I can answer it, we need to understand how compression works. Here is a very simple illustration of the concept.
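
In sketch form (a deliberately naive toy, not LZ4 itself):

```csharp
using System;
using System.Collections.Generic;

public static class ToyCompressor
{
    // A hash table maps the current 4 bytes to the last position where we saw
    // them; on a match we emit a back reference, otherwise a literal.
    public static List<string> Compress(byte[] input)
    {
        var output = new List<string>();            // human readable "tokens"
        var lastSeen = new Dictionary<uint, int>(); // 4-byte value -> last position

        for (int i = 0; i + 4 <= input.Length; )    // trailing bytes omitted for brevity
        {
            uint current = BitConverter.ToUInt32(input, i);

            if (lastSeen.TryGetValue(current, out int previous) && i - previous <= 64 * 1024)
            {
                // Found the same 4 bytes within the last 64KB: see how far the
                // match extends and write it out as a back reference.
                int len = 0;
                while (i + len < input.Length && input[previous + len] == input[i + len])
                    len++;

                output.Add($"copy {len} bytes from {i - previous} bytes back");
                i += len;
            }
            else
            {
                // No match: remember this position and move one byte forward.
                // (Acceleration makes this step grow instead of staying at 1.)
                output.Add($"literal 0x{input[i]:x2}");
                lastSeen[current] = i;
                i += 1;
            }
        }
        return output;
    }
}
```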

We created a very stupid, size-constrained hash table that maps from the current 4 bytes to the previous instance where we saw those 4 bytes. When we find a match, we check to see how much of a match we have, and then write it out as a back reference.

Note that if we don’t find a match, we update the last match position for this value, and move one byte forward, to see if there is a match in that location. This way, we are scanning the past 64KB to see if there is a match. This is meant to be a very rough approximation to how compression works, so don’t get too hung up on the details.

The key point with acceleration is that it impacts what we’ll do if we didn’t find a match. Instead of moving by one byte, we are going to skip by more than that. The actual logic is in here, and if I’m reading this correctly, it will probe the data to compress in increasingly wider gaps until it finds a match that it can use to reduce the output size.

What acceleration does is tell it to jump in even wider increments as it searches for a match. This reduces the number of potential matches it finds, but also significantly reduces the amount of work that LZ4 needs to do comparing the data stream, which is how it both accelerates the speed and reduces the compression ratio.

Performance optimizations: One step forward, ten steps back

time to read 3 min | 547 words

As we continuously optimized more and more of our code, we kept seeing faster and faster benchmarks. In fact, the more we optimized, the faster we became. One would think that there is some sort of correlation there.

However, that is a mere theory that can be disproven, as this story will demonstrate.

When optimizing, you eventually expect to get into the land of diminishing returns. But something very strange happened: we made a few changes, each of which should have brought our speed up by a significant percentage. We had the micro benchmarks to prove that this was the case, and we were even able to see that the code was running much faster than before, but the overall benchmark time kept growing, and we started seeing higher and higher stalls in the process.

That… sucked. Especially because we couldn’t figure out what was going on. Every single metric we could see was saying that we should be seeing higher speed, our disk usage went up, our CPU usage went up a bit, we increased our memory buffers from 32 MB to 1GB, and every single indication we had told us that we are faster on a per operation basis. But the entire thing just started slowing down more and more.

Frustratingly, there was nothing we could really sink our teeth into. The system would just go into stalls and do nothing. We got to the point it looked like we broke the operating system, but nothing helped, stuff just didn’t want to work. It looked like we were waiting for I/O, but tracing at the syscall level showed that we were getting much faster response from the hardware than we saw in the application. Somewhere, stuff was getting lost.

Eventually we managed to track it down to the offending line:

[Image: the offending line of code]

So this is pretty obvious, right? We are waiting, so we are slow. But this line is called from a completely different part of the code, and it isn’t blocking anything else in the code path that is suffering from stalls. The key here is that this line is called from:

[Image: the code that calls it]

Still fine, right? We threw that into the thread pool, so it is fine to wait. But…

[Image: the I/O completion callback]

The line above is responsible for releasing our threads when the I/O operation has completed. Note that it needs to run on the thread pool as well, but because we are now much faster, we have a lot of threads stuck in the call to SyncEnvironment. That overloaded the thread pool and meant that the notification that we can proceed would come very late. We missed it in all of our profiling because we didn’t look at that code path at all, since it was obviously unrelated to the issue at hand.
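
In sketch form, the overall shape of the problem (illustrative names, not the actual code):

```csharp
using System.Threading;

public class FlushScheduler
{
    private readonly ManualResetEventSlim _ioCompleted = new ManualResetEventSlim(false);

    public void QueueSync()
    {
        // Many of these get queued as we get faster...
        ThreadPool.QueueUserWorkItem(_ => SyncEnvironment());
    }

    private void SyncEnvironment()
    {
        // The "offending line": blocks a thread pool thread until the I/O is done.
        _ioCompleted.Wait();
        // ... continue once the data is safely on disk ...
    }

    // Runs when the async write finishes -- but it also needs a thread pool
    // thread, and it is queued behind all the blocked SyncEnvironment calls,
    // so the signal arrives very late and everything stalls.
    public void OnWriteCompleted()
    {
        ThreadPool.QueueUserWorkItem(_ => _ioCompleted.Set());
    }
}
```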

Single roundtrip authentication

time to read 5 min | 851 words

One of the things that we did in RavenDB 4.0 was support running on Linux, which meant giving up on Windows Authentication. To the nitpickers among you, I’m aware that we can do LDAP authentication, and we can do Windows Auth over HTTP/S on Linux. Let us say that given the expected results, all I can wish you is that you’ll have to debug / deploy / support such a system for real.

Giving up on Windows Authentication in general, even on Windows, is something that we did with great relief. Nothing like getting an urgent call from a customer and trying to figure out why a certain user isn’t authenticated, and trying to untangle the trust relationships between different domains, forests, the DMZ and a whole host of other stuff that has nothing to do with us. I hate those kinds of cases.

In fact, I think that my new ill wish phrase will become “may you get a 2 AM call to support Kerberos authentication problems on a Linux server in the DMZ to a nested Windows domain”. But that might be going too far and damage my Karma.

That led us to API Key authentication. For that, we want to be able to authenticate against the server and get a token, which we can then use in future requests. An additional benefit is that by building our own system we actually own the entire thing and can support it much better. A side issue here is that we need to support / maintain security critical code, which I’m not so happy about. But owning this also gives me the option of doing things in a more optimized fashion. And in this case, we want to handle authentication in as few network roundtrips as possible, ideally one.

That is a bit challenging, since on the first request, we know nothing about the server. I actually implemented a couple of attempts to do so, but pretty much all of them were vulnerable to some degree after we did security analysis on them. You can look here for some of the details, so we gave that up in favor of mostly single round trip authentication.

The very first time the client starts, it will use a well known endpoint to get the server public key. When we need to authenticate, we’ll generate our own key pair and use it and the server public key to encrypt a hash of the secret key with the client’s public key, then send our own public key and the  encrypted data to the server for authentication. In return, the server will validate the client based on the hash of the password and the client public key, generate an authentication token, encrypt that with the client’s public key and send it back.
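
A rough sketch of the client side of that exchange (the Crypto helpers are placeholders for whatever primitives are actually used, not a real API):

```csharp
// Sketch of the message flow only -- not the actual RavenDB protocol code.
public AuthToken Authenticate(string apiKeySecret, byte[] serverPublicKey)
{
    // 1. Generate a key pair for this client.
    var (clientPublicKey, clientPrivateKey) = Crypto.GenerateKeyPair();

    // 2. Hash the secret together with the client public key, and encrypt the
    //    hash so only the server (holder of the matching private key) can read it.
    byte[] proof = Crypto.Hash(apiKeySecret, clientPublicKey);
    byte[] encryptedProof = Crypto.EncryptFor(serverPublicKey, proof);

    // 3. Single request: our public key plus the encrypted proof.
    var response = SendAuthRequest(clientPublicKey, encryptedProof);

    // 4. The server validated the hash and sent back a token encrypted with
    //    our public key; only this client can decrypt it.
    byte[] tokenBytes = Crypto.DecryptWith(clientPrivateKey, response.EncryptedToken);
    return AuthToken.Parse(tokenBytes);
}
```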

Now, this is a bit complex, I’ll admit, and it is utterly unnecessary if the users are using HTTPS, as they should. However, there are actually quite a few deployments where HTTPS isn’t used, mostly inside organizations where they choose to deploy over HTTP to avoid the complexity of cert management / update / etc. Our goal is to prevent leakage of the secret key over the network even in the case that the admin didn’t setup things securely. Note that in this case (not using HTTPS), anyone listening on the network is going to be able to grab the authentication token (which is valid for about half an hour), but that is out of scope for this discussion.

We can assume that communication between client & server is not snoopable, given that they are using public keys to exchange the data. We also ensure that the size of the data is always the same, so there is no information leakage there. The return trip with the authentication token is also encrypted (with the client’s public key), so it gets the same protection.

A bad actor can pretend to be a server and fool a client into sending the authentication request using the bad actor’s public key, instead of the server’s. This is possible because we don’t have trust chains. The result is that the bad actor now has the hash of the password with the public key of the client (which is also known). So the bad guy can try to do some offline password guessing; we mitigate that by using secret keys that are a minimum of 192 bits and generated using a cryptographically secure RNG.

The bad actor can of course send the data as is to the server, which will authenticate it normally, but since the server uses the client’s public key both for the hash validation and for the encryption of the reply, the bad actor wouldn’t be able to get the token.

On the client side, after the first time that the client hits the server, it will be able to cache the server public key, and any future authentication token refresh can be done in a single roundtrip.

RavenDB 4.0 Features: Document Versioning

time to read 3 min | 487 words

After focusing for so long on the low level and performance of RavenDB 4.0, you might be forgiven if you think that we are neglecting real meaningful features in 4.0. That couldn’t be further from the truth; the problem is that it is actually quite hard to talk about some features meaningfully before they have been through a complete round of implementing, testing, wiring for UI and UX, and then beating them with a stick to see what falls out.

Document versioning in RavenDB isn’t new. It has been a feature for several releases now. In RavenDB 4.0 we have decided to reimagine it completely. One of the strong points of the RavenDB design from the get go was that triggers allowed you to build new features into the product without making core changes. That allowed us to quickly add new features, but could cause issues with feature intersections.

With RavenDB 4.0, versions are now part of the core code. That means that they get special treatment. Here is how it looks in the UI:

[Image: revisions of a document in the RavenDB Studio UI]

You can switch between versions of the document and see it as it was at various points in time.

Externally, this is pretty much all there is to it. Here is what this will look like when you are looking at a revision.

[Image: viewing a single revision in the Studio]

In terms of the client API, there is little change:
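
For illustration, loading the revisions of a document from the client looks roughly like this (given an open document store; the exact method names may differ between RavenDB 4.x releases, so check the documentation for your version):

```csharp
using (var session = store.OpenSession())
{
    // Work with the current document as usual.
    var order = session.Load<Order>("orders/1");

    // Ask for the stored revisions of that document (most recent first).
    // Method name/location is approximate.
    var revisions = session.Advanced.Revisions.GetFor<Order>("orders/1");
}
```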

So what did change?

Revisions are now stored separately from documents. Previously we stored them as special documents, which led to some problems (indexes had to read them just to see that they weren’t interested in them, for example), and we had to special case a lot of stuff that is supposed to work on documents so that it wouldn’t apply to revisions. Making them their own unique thing simplifies everything significantly. We have also made sure that revisions can properly flow with replication, which prevents the potential for versioning conflicts that were tricky to solve. The downside of this is that some desirable operations on revisions (loading a revision with an include, for example) will no longer work.

The whole thing is also significantly faster and cheaper to work with.  A major change we also did was to ensure that versioning is always on, so you don’t have to specify it at the DB creation time. Instead, you can configure it whenever you want, without having to add a bundle at runtime and risk unsupported operations.

That means that if at some point you want to start versioning your documents, you can just flip a switch, and it will immediately start versioning all changes for you.

Command line usability

time to read 2 min | 312 words

I hate the new “dotnet test” command. I don’t think that anyone ever thought about looking at the output it generates for real projects.

For example, here is a section from the log output of our fast tests:

[Image: “dotnet test” output for our fast tests]

There is so much crap here, including duplicate information and a whole bunch of mess that it is very hard to find relevant information. For example, there is a failing test here. How long will it take you to find it?

Another important aspect for us is the fact that this will actually run the tests in the same process. If you have something that will crash the test process, you’ll never get to see what is going on. Here is what a crash due to a stack overflow looks like when using “dotnet test”:

[Image: “dotnet test” output after a stack overflow crash]

As a result, we moved to dotnet xunit, which is a much better test runner.

[Image: “dotnet xunit” output]

We get color coding, including red for failing tests, so we don’t have to hunt them.

What is more important, it will not hide crucial information from us because it feels like it. If there is a crash, we can actually see what happened.

[Image: “dotnet xunit” output showing the crash details]

I know it sounds trivial, but “dotnet test” doesn’t have it.

The cost of allocating memory and cheating like crazy for best performance

time to read 5 min | 813 words

Allocating memory is expensive. It is expensive in managed code, and it is expensive in native code, it is just expensive. There are many ways to alleviate those costs, but at some point, any generic allocation scheme is going to run to ground with the harsh reality of the cost of bookkeeping.

Many managed programs are actually much faster than their native counterparts precisely because of that. Managed runtimes have more information about the allocations, and can defer / batch and optimize memory usage accordingly. For example, in .NET the amount of code that you would typically run for allocation is minuscule. Effectively a pointer bump and that is it. And the garbage collector can do really good work on collecting those allocations if they are short lived, and the generational nature means that we typically have far less cost of memory bookkeeping than in most native applications, and when we do, we can optimize things by doing it all at once.

All of which is great, except that there is still a cost to it, a decidedly non trivial one if you are writing high performance software. So what is the answer?

Put simply, you cheat, and you cheat like mad. The issue is with the problem statement, if any generic allocation scheme is problematic, throw the generic portion out the window and specialize like an insect.

With RavenDB, we realized quite early on that the cost of memory allocation was too prohibitive to allow us to use it for most common tasks. So we manage our own memory, but we do that with a pattern. All operations are always done inside a scope, and that scope defines the boundary of the operation. An operation can be something like writing a single document, or answering a query, etc. The idea is that this is (usually) short lived and well scoped. And that turns out to be key.

We use the notion of context and context pool. A context is just a holder for a bunch of (native) memory that we get from the OS and holds on to. And when we start an operation, we grab a context from the pool and start using it. Whenever we need to grab some memory, we go to the context and ask it for some.

The code that typically runs in this case is:
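
In sketch form (illustrative class and helper names, not the actual implementation):

```csharp
using System;

// The context holds a block of native memory and hands out pieces of it by
// bumping a pointer; the only managed allocation is the small
// AllocatedMemoryData wrapper.
public unsafe class AllocatedMemoryData
{
    public byte* Address;
    public int SizeInBytes;
}

public unsafe partial class OperationContext
{
    private byte* _ptrStart;
    private byte* _ptrCurrent;
    private byte* _ptrEnd;

    public AllocatedMemoryData GetMemory(int size)
    {
        if (_ptrCurrent + size > _ptrEnd)
            GrowArena(size); // illustrative: grab another native block from the OS

        var allocation = new AllocatedMemoryData
        {
            Address = _ptrCurrent,
            SizeInBytes = size
        };
        _ptrCurrent += size; // the actual "allocation" is just a pointer bump
        return allocation;
    }

    // Called when the context goes back to the pool: everything allocated
    // during the operation becomes reusable with a single pointer move.
    public void Reset() => _ptrCurrent = _ptrStart;

    private void GrowArena(int size)
    {
        // Left out of the sketch: allocate a new block (e.g. Marshal.AllocHGlobal)
        // and adjust _ptrStart / _ptrCurrent / _ptrEnd accordingly.
        throw new NotImplementedException();
    }
}
```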

Amusingly enough, we are actually doing a managed allocation as part of our unmanaged allocation. This is fine, because we typically allocate things in the ~KB range, so we don’t have a lot of individual AllocatedMemoryData instances to manage.

But the key here is that when the operation scope is over, we return the context to the pool, and when we do, we reset _ptrCurrent to its initial value. So all the memory we used during that operation is available to be used again very cheaply.

Now, this is actually almost the entire story. As it turns out, this works very well for a lot of scenarios, but in some cases we have a context open for a long while (bulk insert, indexing, etc.). That means that an operation would go through quite a lot of memory if we didn’t have some way to recover it on the fly. And we do. When we are done with the memory, we return it to the context. Now, this is usually where things start to get hairy. But we made sure that there is an O(1) cost for both allocation & deallocation in the context.

When you return memory to the context, we check if this is the top most section, and if so, we just reduce _ptrCurrent by the amount of memory that was used. But there are times when we return memory that isn’t on the top of the context, what then? Instead of leaving this memory unused until the next context reset, we’re adding that to a linked list. But this linked list is actually stored in the memory that was freed. This looks like this:
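
Sketched out, continuing the illustrative OperationContext from above:

```csharp
using System;

// Continuing the sketch (not the real implementation): the freed segment
// itself becomes the free-list node, so tracking freed memory costs nothing.
public unsafe partial class OperationContext
{
    // Head of the free list per size bucket; the "next" pointers live inside
    // the freed memory itself, so the list needs no extra allocations.
    private readonly byte*[] _freed = new byte*[32];

    private static int BucketFor(int size) => Math.Min(size / 1024, 31); // illustrative bucketing

    public void Return(AllocatedMemoryData allocation)
    {
        if (allocation.Address + allocation.SizeInBytes == _ptrCurrent)
        {
            _ptrCurrent = allocation.Address; // top of the arena: just rewind
            return;
        }

        int bucket = BucketFor(allocation.SizeInBytes);
        *(byte**)allocation.Address = _freed[bucket]; // store previous head in the freed memory
        _freed[bucket] = allocation.Address;
    }
}
```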

The _freed field is an array that holds the most recently freed segment for each size. So when we allocate, we can do:
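
Again as a sketch:

```csharp
// Continuing the sketch: allocation first checks the free list for that size,
// and only falls back to the pointer-bump path when it is empty.
public unsafe partial class OperationContext
{
    public AllocatedMemoryData Allocate(int size)
    {
        int bucket = BucketFor(size);
        byte* head = _freed[bucket];
        if (head != null)
        {
            _freed[bucket] = *(byte**)head; // pop: the next pointer was stored in the freed memory
            return new AllocatedMemoryData { Address = head, SizeInBytes = size };
        }

        return GetMemory(size); // the cheap pointer-bump path shown earlier
    }
}
```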

And if we don’t find returned memory, we’ll go to the cheap-alloc code above and bump the pointer. So during the lifetime of the context, we get pretty cheap memory reuse, and we don’t worry too much about it because the context reset will clear everything anyway.

A side benefit of this is that because the context goes to the pool, we are not going to release that memory, so when the next request comes, we’ll have relatively hot memory right there & available. And, of course, all of it is going to be with high locality.

Trying to live without ReSharper in Visual Studio 2017

time to read 5 min | 944 words

This is an experiment that is doomed to fail, but given that I just set up a new VS 2017, I decided to see how it would feel to run it without ReSharper and how it impacts my workflow. Please note that this is very much not an unbiased review. I have been using ReSharper for my day to day coding for over a decade, and the workflow it enables is deeply rooted in how I work. I’m going to ignore any differences in key bindings, as irritating as that can be, in favor of just looking at different features.

So far, I spent a couple of days trying to work on VS 2017 without ReSharper. It has been quite frustrating, but I was able to sort of limp along. I most certainly felt the lack.

My hope was that I would be able to see the promised performance improvements without it, and then consider whether it is worth it. That wasn’t the case.

[Image: Visual Studio 2017 hanging, with ReSharper not installed]

As you can see, ReSharper is not installed, but I managed to get VS into a hang several times. It seems to happen with NuGet, or when trying to use the Test Explorer and a few times when I was trying to edit code while the solution was compiling.

Without any meaningful order, here are the things that I really felt the lack of.

  • Go to definition with automatic decompile is something that I apparently use a lot more than I expected. It helps me figure out what I can expect from the method that I’m looking at, even when it is not our code that I’m looking at.
  • Refactor method ignoring whitespace lets me just write a statement and it becomes a method name. This is actually quite nice.
  • Quick docs in R# is very nice, that is, the ability to hit Ctrl+Q and get the docs for a method is something that I seem to be using a lot. This is important because I can quickly check the docs (most often, what conditions it has for returning, or specific arguments). The key here is that I don’t need to leave my current context. I can Ctrl+Q, peek at the docs, and then move on.

[Image: the R# Quick Docs popup]

  • Extract variable isn’t there, and a lot of the refactorings that I’m used to either aren’t there or are hardly accessible.
  • IntelliSense is also a lot less intelligent. Being able to write a method and just Ctrl+Space all the parameters because R# can fill the context is very useful.
  • Ctrl+N (symbol search) is also a LOT more useful. I’m familiar with Ctrl+; to search solution explorer, but the features don’t compare. With R#, I get live feedback, which means that I don’t have to remember nearly as much about the symbol that I’m looking for. With VS, on the other hand, I have to type it out and hit enter to see the results. There is also an issue with the presentation: solution explorer is a really poor model for it.
    [Images: symbol search in VS solution explorer vs. in R#]

    There is a lot of wasted space in the VS model versus the R# model.
  • Update: I have since learned about VS' Ctrl+, feature, and that seems much nicer, and it also does auto peeking, which I like.

In general, to be honest, R# feels smarter (remember, I’m biased and likely work to the strengths of R#). But another aspect here is that with R#, I rarely have to leave my current context, pretty much everything is available immediately from where I am.

Even something as simple as the search above. With R#, this shows up in the middle of the screen, with VS, that is all the way at the right, so I need to move my eyes to track it. The same is pretty much true for everything else. Reference search in R# shows where I’m looking at right now, and with VS, it shows in a window in the bottom. Refactoring options in VS show up in the top right, and it is easy to miss them completely, R# put them in the front, along with what you are working on right now.

I’m going to install R# for VS 2017 shortly, and then I’ll be able to compare the speed, but I’m pretty sure that I’m not going to be very happy with that. Then again, once it is loaded, I haven’t noticed R# + 2015 being much worse than 2017 without Resharper.

Note that I’m doing this during my usual work, on a solution with 55 projects and 820 KLOC.

Update

I have tried R# & VS 2017 for a couple of days now, and I can tell that aside from the project open times (which are absolutely atrocious with R#), I’m not seeing anything major performance wise.

That said, project open times also apply when switching between branches, and that is a major PITA.

Of course, I’m guessing R# is really popular, because:

[Image: Visual Studio reporting ReSharper as the extension slowing it down]

I can guess someone was tired of “visual studio is slow” when it is someone else’s code, and they wrote this to point the blame on the relevant extension so the bug report would go to the appropriate people.

The bug in the platform: Partial HTTP requests & early response

time to read 3 min | 408 words

Some of our tests were randomly hanging on the Linux machine. This was a cause for concern, obviously, so we started investigating that.

The problem was that we couldn’t really reproduce this except by running our full code, in which case it would hang. It clearly seemed timing related, because it wouldn’t always hang, but it was pretty consistent.  We resorted to sprinkling printf throughout our code to figure out what was going on, and we found that a particular request would get to the server, the server would respond to it, and then the client would hang.

Looking at TCP traces, we could clearly see that the response was generated by the server and sent to the client, but something was funky there. We reported it and tried to figure out what was going on in the meantime.

We finally managed to get an isolated reproduction:
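
A minimal sketch of the same shape (not the original repro): a Kestrel app that responds with an error before reading the request body; a libcurl client posting a large chunked body to it can then hang:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;

public class Program
{
    public static void Main()
    {
        var host = new WebHostBuilder()
            .UseKestrel()
            .Configure(app => app.Run(async ctx =>
            {
                // Reject the request *before* reading its body. A libcurl
                // client that is still sending a chunked body will stop
                // sending when it sees this, and both sides end up waiting
                // on each other.
                ctx.Response.StatusCode = 400;
                await ctx.Response.WriteAsync("rejected");
            }))
            .Build();

        host.Run();
    }
}
```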

As a side note, the response from the Core FX & Kestrel guys is absolutely amazing. In a short while, it was determined to be an issue in Kestrel and fixed.

The details of the issue are really interesting.

  1. The client (libcurl) starts sending the request.
  2. Kestrel sends an error response.
  3. The client sees the error response and stops sending the request (this is a libcurl behavior), but does NOT close the sending side of the connection (HTTP clients typically wait until they receive the entire response before closing the connection, because most HTTP servers treat a mid-request FIN like a connection reset).
  4. After the application runs, Kestrel tries to consume the rest of the request body from the client. But the client has stopped sending the request, so Kestrel hangs here.
  5. The client hangs waiting for the chunked terminator from the server, which is only sent after it consumes the rest of the request body (we'll work on improving this behavior).

This behavior is specific to libcurl; the WinHTTP implementation sends the full body of the request before processing any of the response.

As an aside, this difference in behavior is actually pretty interesting to know, because we need those low level details when designing interactions. For example, if I send a long request to the server, and the server wants to tell me that it is not going to accept it, being able to know about it early is very advantageous.

