Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


I was wrong, reflecting on the .NET design choices

time to read 3 min | 480 words

I have been re-thinking some of my previous positions with regards to development, and it appears that I have been quite wrong in the past.

In particular, I’m talking about things like:

Note that those posts are parts of a much larger discussion, and both are close to a decade old. They aren’t really relevant anymore, I think, but it still bugs me, and I wanted to outline my current thinking on the matter.

C# is non-virtual by default, while Java is virtual by default. That seems like a minor distinction, but it has huge implications. It means that proxying / mocking / runtime subclassing is a lot easier with Java than with C#. In fact, a lot of frameworks that were ported from Java rely on this heavily, and that made it much harder to use them in C#. The most common one being NHibernate, which was one of the chief frustrations that I kept running into.
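To make the difference concrete, here is a small illustrative snippet (the class and its members are made up):

    // In C#, a runtime-generated subclass (proxy / mock) can only intercept
    // members that were explicitly marked virtual.
    public class Customer
    {
        public string Name { get; set; }              // non-virtual: a proxy cannot override it
        public virtual decimal Discount { get; set; } // opted in, so an NHibernate-style proxy can intercept it
    }

    // The equivalent Java class would have every method overridable by
    // default, so a generated subclass can intercept all of its calls.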

However, given that I’m working on a database engine now, not on business software, I can see a whole different world of constraints. In particular, a virtual method call is significantly more expensive than a direct call, and that adds up quite quickly. One of the things that we routinely do is try to de-virtualize method calls using various tricks, and we are eagerly waiting for .NET Core 2.0 with its de-virtualization support in the JIT (we have already started writing code to take advantage of it).
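As an example of the kind of trick involved, here is a sketch (with made-up names, not the actual RavenDB code) of passing behavior as a generic struct constrained to an interface, which lets the JIT emit a direct, inlinable call instead of a virtual dispatch:

    public interface IWriteStrategy
    {
        void Write(byte[] buffer);
    }

    public struct DirectWrite : IWriteStrategy
    {
        public void Write(byte[] buffer) { /* write to the journal, etc. */ }
    }

    public static class Writer
    {
        // For a struct type argument the JIT compiles a dedicated version of
        // this method, so strategy.Write() is a direct call rather than an
        // interface (virtual) dispatch, and it can be inlined.
        public static void Flush<TStrategy>(TStrategy strategy, byte[] buffer)
            where TStrategy : struct, IWriteStrategy
        {
            strategy.Write(buffer);
        }
    }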

Another issue is that my approach to software design has significantly changed. Where I would previously do a lot of inheritance and explicit design patterns, I’m now far more inclined to use composition instead. I’m also marking very clear boundaries between My Code and Client Code. In My Code, I don’t try to maintain encapsulation or hide state, whereas with stuff that is expected to be used externally, that is very much the case. But that gives a very different feel to the API and usage patterns that we handle.

This also relates to abstract classes vs. interfaces, and why you should care. As a consumer, unless you are busy doing some mocking or some such, you likely don’t, but as a library author, it matters a lot to the amount of flexibility you get.
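A quick illustration of why the library author cares (the names are made up, and default interface methods did not exist at the time):

    public abstract class StorageBase
    {
        public abstract void Store(object entity);

        // Added in a later version: existing subclasses keep compiling,
        // because they simply inherit the default implementation.
        public virtual void Store(object entity, string id) => Store(entity);
    }

    public interface IStorage
    {
        void Store(object entity);

        // Adding 'void Store(object entity, string id);' here would break
        // every existing implementation of this interface.
    }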

I think that a lot of this has to do with my viewpoint, not just as an open source author, but as someone who runs a project where customers are using us for years on end, and they really don’t want us to make any changes that would impact their code. That leads to a lot more emphasis on backward compatibility (source, binary & behavior), and if you mess it up, you get ricochets from people who pay you money because their job is now harder.

A tricky bit of code

time to read 1 min | 111 words

I ran into the following bit of code while doing a code review on a pull request:
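The original snippet was an image; here is a sketch of the same kind of code, with made-up names:

    public class DisposeItemsCommand : CommandBase
    {
        public override void Execute<T>(T item)
        {
            // T has no visible constraint here, and Dispose() is not a
            // member of Object, yet this compiles just fine.
            item.Dispose();
        }
    }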

This was very strange, because the code appeared to compile properly, but it shouldn’t. I mean, look at it. The generic parameter is not constrained, and I don’t have any extension methods on Object that can apply here, so why would this compile?

The secret was in the base class:
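Again, a sketch rather than the actual code:

    using System;

    public abstract class CommandBase
    {
        // The constraint is declared once, on the abstract method. An
        // override is not allowed to restate it; it inherits it, which is
        // why the derived method above looks unconstrained but is not.
        public abstract void Execute<T>(T item) where T : IDisposable;
    }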

Basically, we specified the constraint on the abstract method, and then inherited it, which was really confusing to me until I figured it out.

You can’t do the same with interfaces, although explicit interface implementation does allow it.

Overloading the Windows Kernel and locking a machine with RavenDB benchmarks

time to read 4 min | 781 words

While benchmarking RavenDB, we have run into several instances where the entire machine would freeze for a long duration, resulting in utter non-responsiveness.

This has been quite frustrating to us, since a frozen machine makes it kinda hard to figure out what is going on. But we finally figured it out, and all the details are right here in the screenshot.

[image: screenshot of the benchmark run, showing disk activity on drives C and D]

What you can see is us running our current benchmark, importing the entire StackOverflow dataset into RavenDB. Drive C is the system drive, and drive D is the data drive that we are using to test RavenDB’s performance.

Drive D is actually a throwaway SSD. That is, an SSD that we use purely for benchmarking and not for real work. Given the amount of workout we give the drive, we expect it to die eventually, so we don’t want to trust it with any other data.

At any rate, you can see that due to a different issue entirely, we are now seeing data syncs in excess of 8.5 GB. So basically, we wrote 8.55 GB of data very quickly into a memory mapped file, and then called fsync. At the same time, we started increasing our scratch buffer usage, because calling fsync on 8.55 GB can take a while. Scratch buffers are a really interesting thing; they were born because of Linux’s crazy OOM design, and are basically a way for us to avoid paging. Instead of allocating memory on the heap like normal, which would then subject us to paging, we allocate a file on disk (marked as temporary & delete-on-close) and then mmap the file. That gives us a way to guarantee that Linux will always have space to page out any of our memory.
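The mechanics look roughly like this (a minimal sketch with made-up names; the real code does a lot more):

    using System.IO;
    using System.IO.MemoryMappedFiles;

    public static class ScratchBuffer
    {
        // Allocate "memory" that is backed by its own temporary file, so the
        // OS always has somewhere to page it out to, instead of fighting over
        // the page file or waking up the OOM killer.
        public static MemoryMappedFile Allocate(string path, long sizeInBytes)
        {
            var stream = new FileStream(
                path,
                FileMode.Create,
                FileAccess.ReadWrite,
                FileShare.None,
                bufferSize: 4096,
                FileOptions.DeleteOnClose); // temporary: gone when we close it

            return MemoryMappedFile.CreateFromFile(
                stream,
                mapName: null,
                capacity: sizeInBytes,
                MemoryMappedFileAccess.ReadWrite,
                HandleInheritability.None,
                leaveOpen: false);
        }
    }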

This also has the advantage of making it very clear how much scratch memory we are currently using, and on Azure / AWS machines, it is easier to place all of those scratch files on the fast temp local storage for better performance.

So we have a very large fsync going on, and a large amount of memory mapped files, and a lot of activity (that modify some of those files) and memory pressure.

That forces the kernel to evict some pages from memory to disk, to free something up. And under normal conditions, it would do just that. But here we run into a wrinkle: the memory we want to evict belongs to a memory mapped file, so the natural thing to do would be to write it back to its original file. This is actually what we expect the kernel to do for us, and while for scratch files this is usually a waste, for the data file this is exactly the behavior that we want. But that is beside the point.

Look at the image above: we are supposed to be using only drive D, so why is C so busy? I’m not really sure, but I have a hypothesis.

Because we are currently running a very large fsync, I think that drive D is not currently processing any additional write requests. The “write a page to disk” operation has pretty strict runtime requirements; it can’t just wait for the I/O to return whenever that might be. Consider the fact that you can open a memory mapped file over a network drive; I think it is very reasonable that the kernel has a timeout mechanism for this kind of I/O. When the pager sees that it can’t write to the original file fast enough, it shrugs and writes those pages to the local page file instead.

This turns an otherwise very bad situation (very long wait / crash) into a manageable situation. However, with the amount of work we put on the system, that effectively forces us to do heavy paging (on the order of GBs), and that in turn leads us to a machine that appears to be locked up due to all the paging. So the fallback error handling is actually causing this issue by trying to recover, at least that is what I think.

When examining this, I wondered if this can be considered a DoS vulnerability, and after careful consideration, I don’t believe so. This issue involves using a lot of memory to cause enough paging to slow everything down, the fact that we are getting there in a somewhat novel way isn’t opening us to anything that wasn’t there already.

Sometimes it really IS not our fault

time to read 1 min | 150 words

So we got an emergency support call during the Passover holiday, and as you can imagine, it was a strange one. Our investigation of the error basically boiled down to (cutting out a lot of effort in between): “This can’t be happening.”

I hate this kind of answer, because it usually means that we are missing something. Usually that can be a strange error code, some race condition or just something strange about the environment.

While we were working the problem, the customer came back with, “Oh, we found the issue. A memory unit went rogue, and the firmware wasn’t able to catch it.” When they updated the firmware, it apparently caught it immediately.

So I guess we can close this support incident. :-)

RavenDB Bootcamp

time to read 1 min | 105 words


We have RavenDB Bootcamp ready to go. If you want to learn about RavenDB, we have an 18-part series that takes you through working with RavenDB in easily digestible pieces.

You can either go through them all or register to get them once a day via email so it doesn’t take too much time all at once.

They are available in our docs, and it is a great way to learn RavenDB from nothing.

Reducing the cost of occasionally needed information

time to read 3 min | 486 words

Consider the case where you need to call a function and, based on the result of the function and some other information, do some work. That work requires additional information that can only be computed by calling the function.

Sounds complex? Let us look at the actual code; I tried to make it simple, but keep the same spirit.
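The actual code was shown as an image; here is a sketch in the same spirit (all the names are made up):

    public void Insert(Slice key, Slice value)
    {
        var page = FindPageFor(key);

        if (page.TryGetPosition(key, out int pos))
        {
            // Path 1: the value is already in the tree, update it in place.
            page.SetValue(pos, value);
            return;
        }

        if (page.HasSpaceFor(key, value))
        {
            // Path 2: a new value, and the page has enough space for it.
            page.Add(key, value);
            return;
        }

        // Path 3 (the interesting one): the page is full, so we need to
        // split it, which means modifying the parent page as well (which
        // may in turn need to be split, and so on).
        SplitPageAndInsert(page, key, value);
    }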

This code has three different code paths. We are inserting a value that is already in the tree, so we update it and return. We insert a new value, but the page has enough space, so we add it and return.

Or, the interesting case for us, we need to split the page, which requires modifying the parent page (which may require splitting that page, etc.). So the question is, how do we get the parent page?

This is important because we already found the parent page; we had to go through it to find the actual page we returned from FindPageFor. So ideally we would just return the list of parent pages whenever we call FindPageFor.

The problem is that read scenarios, which by far outweigh the write operations, never need this value. What is more, even write operations only need it occasionally. On the other hand, when we do need it, we have already done all the work needed to just give it to you, and it would be a waste to do it all over again.

We started the process (this particular story spans 4 years or so) by adding an optional parameter. If you passed a non-null value, we would fill it with the relevant details. So far, so good: reads would send null, but all the writes had to send it, and only a few of them actually had to use it.

The next step was to move the code to use an out lambda. In other words, when you called us, we would also give you a delegate that you could call, which would give you the details you need. This way, you only pay the price when you actually need it.

It looked like this:
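The original was an image; a rough approximation of its shape:

    // The delegate is only invoked when the caller actually needs the
    // ancestor pages, so readers never pay for building the cursor...
    public Page FindPageFor(Slice key, out Func<List<Page>> getCursor)
    {
        var page = SearchTree(key);

        // ...but the closure over 'key' (and 'this') is still allocated
        // on every single call, whether it is ever used or not.
        getCursor = () => BuildCursorFor(key);
        return page;
    }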

However, that led to a different problem: while we only paid the price of calling the lambda when we needed it, we still paid the price of allocating the lambda itself.

The next step was to create two overloads, one which is used only for reads, and one for writes. The write one allows us to pass a struct as an out parameter. The key here is that because it is a struct, there is no allocation, and the method is written so that the values that were previously captured by the lambda are placed on the struct. Then, if we need to actually generate the value, we do that, but with no additional allocations.
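Roughly (again with made-up names), the write-path overload looks something like this:

    // The state that the lambda used to capture now lives in a struct that
    // is passed out by value: no heap allocation unless the caller actually
    // asks for the cursor to be built.
    public struct CursorBuilder
    {
        private readonly Tree _tree;
        private readonly Slice _key;

        public CursorBuilder(Tree tree, Slice key)
        {
            _tree = tree;
            _key = key;
        }

        public List<Page> Build() => _tree.BuildCursorFor(_key);
    }

    public Page FindPageFor(Slice key, out CursorBuilder cursorBuilder)
    {
        var page = SearchTree(key);
        cursorBuilder = new CursorBuilder(this, key); // struct copy, no allocation
        return page;
    }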

And this post is now about six times longer than the actual optimization itself.

Dynamic compression acceleration

time to read 3 min | 523 words

After talking about what specifying LZ4 acceleration does, let us get down and talk about how we use it.

In our benchmark, we ran into a problem. Our benchmark hardware is too fast. I’m testing on a disk that can deliver 250,000 IOPS and can write over 1 GB/sec. On the other hand, if I’m running on Amazon, the maximum IOPS that I can pay for is 20,000. Azure seems to have a limit of 5,000 IOPS per disk. A standard SSD will give you about 75,000 IOPS, while an HDD will typically have IOPS in the low hundreds. I’m using IOPS because it is an easy single metric to compare, but the same applies to disk bandwidth, write speed, etc.

That means that we are actually testing on hardware that is likely to be better than what we’ll typically run on. Which means that we need to be careful about what kind of optimizations we bring in. It would be easy to optimize for a specific machine, at the cost of worse performance in the general case.

Case in point: for many cases, the cost of actually writing to disk on that particular machine is low enough that it isn’t worth putting a lot of effort into compressing the data. That isn’t the case all the time, though; for example, if we are applying pressure on the I/O system, or if we have some other stuff going on that will impact our writes.

On that particular machine, however, it got to the point where we were seeing higher compression times than write times, which is pretty ridiculous. We weren’t actually saving anything by doing that. But instead of tailoring a solution to a single machine, we decided to go in a bit of a roundabout way. When we start, our initial assumption is that we are on a machine with higher I/O cost than CPU cost, which will cause us to put more effort into compressing the data to reduce the amount of writes we have to make.

On the other hand, after we start writing, we can start measuring the relative costs and adjust accordingly. In other words, based on how expensive it is to compress the data versus writing the data to disk, we can dynamically adjust the cost of compression until we hit the sweet spot for the machine and environment that we are running on. The beauty of this behavior is that it will automatically adjust to pressures, so if someone is running a backup on this disk and slowing us down, we’ll learn about it and increase the compression rate to compensate. Once the I/O pressure is off, we can relax our compression ratio to reduce the total resources consumed.
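A minimal sketch of that feedback loop (not the actual implementation):

    // Nudge the LZ4 acceleration level based on the relative cost of
    // compressing versus writing in the last batch.
    private int _acceleration = 1; // 1 = default; higher = faster, less compression

    private void AdjustAcceleration(TimeSpan compressionTime, TimeSpan writeTime)
    {
        if (compressionTime > writeTime && _acceleration < 64)
        {
            // The CPU is the bottleneck: spend less effort on compression.
            _acceleration *= 2;
        }
        else if (compressionTime.Ticks < writeTime.Ticks / 2 && _acceleration > 1)
        {
            // The disk is the bottleneck: we can afford better compression.
            _acceleration /= 2;
        }
    }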

On our benchmark machine, we have managed to get the import of the full StackOverflow dataset, just over 18 million documents, totaling over 50 GB in size in under 8 minutes, reducing our overall benchmark time by almost 10%.

How does LZ4 acceleration work?

time to read 3 min | 403 words

LZ4 has an interesting feature: acceleration. It allows you to modify the compression ratio (and the corresponding compression speed). This is quite interesting for several scenarios. In particular, while a higher compression rate is almost always good, you want to take the transfer speed into account as well. For example, if I’m interested in writing to the disk, and the disk write rate is 400 MB/sec, it isn’t worth it to use the default acceleration level (which can produce about 385 MB/sec), and I can reduce that so the speed of compression will not dominate my writes.

You can read more about it here.

We started playing with this feature today, and that brought up the question: what does this actually do?

This is a really good question, but before I can answer it, we need to understand how compression works. Here is a very simple illustration of the concept.
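Something along these lines (purely illustrative, not real LZ4 code; the helper methods are made up):

    // A size-constrained hash table mapping "the 4 bytes at this position"
    // to the last place we saw them; matches become back references.
    int[] lastSeen = new int[1 << 16];

    for (int i = 0; i + 4 <= input.Length; )
    {
        int bucket = Hash4Bytes(input, i) & 0xFFFF;
        int candidate = lastSeen[bucket];

        if (candidate != 0 && i - candidate <= 64 * 1024 &&
            SameFourBytes(input, candidate, i))
        {
            // Found a match: see how far it extends and emit a back reference.
            int length = ExtendMatch(input, candidate, i);
            EmitBackReference(distance: i - candidate, length: length);
            i += length;
        }
        else
        {
            // No match: remember this position and move one byte forward.
            EmitLiteral(input[i]);
            lastSeen[bucket] = i;
            i += 1;
        }
    }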

We created a very stupid, size-constrained hash table that maps from the current 4 bytes to the previous instance where we saw those 4 bytes. When we find a match, we check to see how much of a match we have, and then write it out as a back reference.

Note that if we don’t find a match, we update the last match position for this value and move one byte forward, to see if there is a match in that location. This way, we are scanning the past 64KB to see if there is a match. This is meant to be a very rough approximation of how compression works; don’t get too hung up on the details.

The key point with acceleration is that it impacts what we’ll do if we didn’t find a match. Instead of moving by one byte, we are going to skip by more than that. The actual logic is in here, and if I’m reading this correctly, it will probe the data to compress in increasingly wider gaps until it finds a match that it can use to reduce the output size.

What acceleration does is tell it to jump in even wider increments as it searches for a match. This reduces the number of potential matches it finds, but also significantly reduces the amount of work that LZ4 needs to do comparing the data stream, hence how it both accelerates the speed and reduces the compression ratio.
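From my reading of the reference source, the skipping looks roughly like this (constants and names are approximate, not authoritative):

    // Every failed probe bumps a counter; every 2^SkipTrigger failures the
    // step grows by one byte. Starting the counter at
    // (acceleration << SkipTrigger) makes higher acceleration values widen
    // the jumps sooner.
    const int SkipTrigger = 6;
    int searchMatchNb = acceleration << SkipTrigger;
    int position = start;

    while (!TryFindMatch(position, out var match))
    {
        int step = searchMatchNb++ >> SkipTrigger;
        position += step;
    }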

Performance optimizations: One step forward, ten steps back

time to read 3 min | 547 words

As we continuously optimized more and more of our code, we kept seeing faster and faster benchmarks. In fact, the more we optimized, the faster we became. One would think that there is some sort of correlation there.

However, that is a mere theory that can be disproven, as this story will demonstrate.

When optimizing, you eventually expect to get into the land of diminishing returns. But something very strange happened: we made a few changes, each of which should have brought our speed up by a significant percentage; we had the micro benchmarks to prove that this was the case, and we were even able to see that the code was running much faster than before. Yet the overall benchmark time kept growing, and we started seeing higher and higher stalls in the process.

That… sucked. Especially because we couldn’t figure out what was going on. Every single metric we could see was saying that we should be seeing higher speed: our disk usage went up, our CPU usage went up a bit, we increased our memory buffers from 32 MB to 1 GB, and every single indication we had told us that we were faster on a per-operation basis. But the entire thing just started slowing down more and more.

Frustratingly, there was nothing we could really sink our teeth into. The system would just go into stalls and do nothing. We got to the point where it looked like we broke the operating system, but nothing helped, stuff just didn’t want to work. It looked like we were waiting for I/O, but tracing at the syscall level showed that we were getting much faster responses from the hardware than we saw in the application. Somewhere, stuff was getting lost.

Eventually we managed to track it down to the offending line:

[image: the offending line, a blocking wait]

So this is pretty obvious, right? We are waiting, so we are slow. But this line is called from a completely different part of the code, and it isn’t blocking anything else in the code path that is suffering from stalls. The key here is that this line is called from:

[image: the caller, which queues the work onto the thread pool]

Still fine, right? We threw that into the thread pool, so it is fine to wait. But…

[image: the continuation that releases the waiting threads]

The line above is responsible for releasing our threads when the I/O operation has completed. Note that it needs to run on the thread pool as well, but because we are now much faster, we now have a lot of threads that are stuck in the call to SyncEnvironment. That overloaded the thread pool, and meant that the notification that we can proceed would come very late. We missed it in all of our profiling because we didn’t look at that code path at all, since it was obviously unrelated to the issue at hand.
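Putting the pieces together, the shape of the problem looked something like this (a sketch with made-up names, not the actual code):

    // Queued onto the thread pool from elsewhere in the code base:
    Task.Run(() => SyncEnvironment(env));

    static void SyncEnvironment(StorageEnvironment env)
    {
        var done = new ManualResetEventSlim();

        // The I/O completion below is also dispatched on the thread pool...
        env.QueueFsync(onCompleted: () => done.Set());

        // ...but by now most pool threads are parked right here waiting, so
        // the Set() gets scheduled very late: the faster we became, the more
        // waiters piled up, and the longer the stalls grew.
        done.Wait();
    }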

Single roundtrip authentication

time to read 5 min | 851 words

One of the things that we did in RavenDB 4.0 was support running on Linux, which meant giving up on Windows Authentication. To the nitpickers among you, I’m aware that we can do LDAP authentication, and we can do Windows Auth over HTTP/S on Linux. Let us say that given the expected results, all I can wish you is that you’ll have to debug / deploy / support such a system for real.

Giving up on Windows Authentication in general, even on Windows, is something that we did with great relief. Nothing like getting an urgent call from a customer and trying to figure out why a certain user isn’t authenticated, and trying to untangle the trust relationships between different domains, forests, DMZs and a whole host of other stuff that has utterly nothing to do with us. I hate those kinds of cases.

In fact, I think that my new ill wish phrase will become “may you get a 2 AM call to support Kerberos authentication problems on a Linux server in the DMZ to a nested Windows domain”. But that might be going too far and damage my karma.

That led us to API key authentication. And for that, we want to be able to authenticate against the server and get a token, which we can then use in future requests. An additional benefit is that by building our own system we are able to actually own the entire thing and support it much better. A side issue here is that we need to support / maintain security critical code, which I’m not so happy about. But owning this also gives me the option of doing things in a more optimized fashion. And in this case, we want to handle authentication in as few network roundtrips as possible, ideally one.

That is a bit challenging, since on the first request, we know nothing about the server. I actually implemented a couple of attempts to do so, but pretty much all of them were vulnerable to some degree after we did security analysis on them. You can look here for some of the details, so we gave that up in favor of mostly single round trip authentication.

The very first time the client starts, it will use a well known endpoint to get the server’s public key. When we need to authenticate, we’ll generate our own key pair and use it, together with the server’s public key, to encrypt a hash of the secret key and the client’s public key, then send our own public key and the encrypted data to the server for authentication. In return, the server will validate the client based on the hash of the password and the client’s public key, generate an authentication token, encrypt that with the client’s public key and send it back.
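In rough pseudo-code (the Crypto helpers and message shapes here are assumptions for illustration, not RavenDB’s actual API):

    // 1. Once per client lifetime: fetch the server public key from the
    //    well known endpoint (it can be cached afterwards).
    byte[] serverPublicKey = await GetServerPublicKeyAsync(serverUrl);

    // 2. Per authentication: generate an ephemeral client key pair.
    var (clientPublicKey, clientSecretKey) = Crypto.GenerateKeyPair();

    // 3. Hash the secret key together with the client public key, and
    //    encrypt that hash so that only the server can read it.
    byte[] proof = Crypto.Hash(apiSecretKey, clientPublicKey);
    byte[] sealedProof = Crypto.Encrypt(proof, serverPublicKey, clientSecretKey);

    // 4. The single roundtrip: the server validates the proof, generates an
    //    authentication token, encrypts it with clientPublicKey and returns
    //    it, so only this client can decrypt it.
    byte[] encryptedToken = await AuthenticateAsync(serverUrl, apiKeyName, clientPublicKey, sealedProof);
    string authToken = Crypto.DecryptToken(encryptedToken, clientSecretKey, serverPublicKey);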

Now, this is a bit complex, I’ll admit, and it is utterly unnecessary if the users are using HTTPS, as they should. However, there are actually quite a few deployments where HTTPS isn’t used, mostly inside organizations where they choose to deploy over HTTP to avoid the complexity of cert management / update / etc. Our goal is to prevent leakage of the secret key over the network even in the case that the admin didn’t setup things securely. Note that in this case (not using HTTPS), anyone listening on the network is going to be able to grab the authentication token (which is valid for about half an hour), but that is out of scope for this discussion.

We can assume that communication between client & server is not snoopable, given that they are using public keys to exchange the data. We also ensure that the size of the data is always the same, so there is no information leakage there. The return trip with the authentication token is also

A bad actor can pretend to be a server and fool a client into sending the authentication request using the bad actor’s public key instead of the server’s. This is possible because we don’t have trust chains. The result is that the bad actor now has the hash of the password with the public key of the client (which is also known). So the bad guy can try to do some offline password guessing; we mitigate that by using secret keys that are a minimum of 192 bits, generated using a cryptographically secure RNG.

The bad actor can of course send the data as-is to the server, which will authenticate it normally, but since it uses the client’s public key both for the hash validation and for the encryption of the reply, the bad actor wouldn’t be able to get the token.

On the client side, after the first time that the client hits the server, it will be able to cache the server public key, and any future authentication token refresh can be done in a single roundtrip.
