Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969




De-virtualization in CoreCLR: Part I

time to read 5 min | 852 words

I mentioned that we are doing some work to enable de-virtualization of our code, as well as getting ready for CoreCLR changes that will get the JIT to do more de-virtualization.

I was asked (by Maayan) about this, more specifically:

How does manual de-virtualization work? AFAIK, the compiler always emits a CallVirt instruction for non-static method calls, regardless of whether the method is virtual or not (and regardless of whether the class is sealed or not). Are you extending the C# compiler and overriding the emission code? Are you re-JITting the code at runtime (as profilers do using the profiler API)?

And the answer is that Maayan is correct, C# is defined (for various reasons related to the ECMA approval process over 15 years ago) to always dispatch methods using CallVirt. But CallVirt is an IL instruction, not an assembly one.

Here is some code for the various call types, as well as the relevant assembly generated for each of them (CoreCLR 1.1, X64, Release). Don’t worry about the assembly, I have detailed explanations below for each part.

This is a pretty simple (non-inlined) set of methods that are just there to make sure that you can see what ends up actually running. The actual methods just end up calling Console.WriteLine, but that is about it.
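
The original code isn't reproduced here, but a minimal sketch of the kind of setup being described (illustrative names, not the actual benchmark code) would look something like this:

    using System;
    using System.Runtime.CompilerServices;

    public interface IRunner
    {
        void Run();
    }

    public class InterfaceRunner : IRunner
    {
        // Called through an IRunner-typed variable: interface dispatch.
        [MethodImpl(MethodImplOptions.NoInlining)]
        public void Run() => Console.WriteLine("interface");
    }

    public class VirtualRunner
    {
        // Called through a VirtualRunner-typed variable: virtual dispatch
        // via the method table.
        [MethodImpl(MethodImplOptions.NoInlining)]
        public virtual void Run() => Console.WriteLine("virtual");
    }

    public struct StructRunner
    {
        // The target is known statically, so this is a direct call.
        [MethodImpl(MethodImplOptions.NoInlining)]
        public void Run() => Console.WriteLine("struct");
    }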

Now, let us inspect each of those behaviors one at a time. First, let us look at how an interface dispatch works.

If you want the gory details, you can look for those  in the Book of the Runtime. But basically, we are jumping to a code location that will find the proper code that we need to execute. Note that there is still additional work there to actually route the method to the appropriate place, which is hidden from us by the runtime, JIT, etc.

Notice the funny cmp instruction there? It is there to force a dereference of the address by the CPU, which will cause it to raise a trap if the address is invalid; if the address is 0, the CLR will convert that into a NullReferenceException.

Now, what about a virtual method call? That is actually simpler, since it is similar to how I learned it when I worked with C++.

Basically, the object starts with a method table pointer; we dereference that, and then follow another pointer to the specific method we want to invoke. Part of the reason that virtual calls are expensive is exactly this: we have to jump around in memory a lot, which means that we both have to issue more instructions and, more to the point, touch a lot more memory, which can cause cache lines to be evicted, forcing us to stall.

What about a struct method call? To make things easier, I made sure that it couldn’t be inlined, which generated:

In this case, I’m using an empty struct, so it is no bigger than a pointer, which is why you can see it being passed around as if it were just a pointer. If I had a bigger struct, I would see very different code.

Why is the code simpler when the struct is bigger? Well, the answer is quite simple, we are looking at the actual method that was called with this parameter, but the job of actually sending the parameter is done by the caller, so we aren’t seeing it here.

What is going on is that as long as your struct is empty or has a single field that is 4 / 8 bytes long, the CLR can optimize that to be a regular parameter, making it effectively free. In such cases, you can see the struct being passed around in registers (the struct itself, since it is copied, not the address to it).

However, if you have a struct that is composed of multiple fields, that requires us to copy each field to the stack before the call, which for large structs can take a bit.
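
To make the contrast concrete, here is a hedged sketch of the two kinds of structs being compared (illustrative, not the structs actually measured):

    using System.Runtime.CompilerServices;

    // Empty / single-field structs: small enough to travel in a register.
    public struct SingleField
    {
        public long Value;

        [MethodImpl(MethodImplOptions.NoInlining)]
        public long Get() => Value;
    }

    // Multiple fields: the caller has to copy each field before the call.
    public struct MultiField
    {
        public long A, B, C;

        [MethodImpl(MethodImplOptions.NoInlining)]
        public long Sum() => A + B + C;
    }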

I mentioned that this was done with a struct method that we disabled inlining for; what happens if we allow inlining (the default)?

This looks completely different than anything we have seen so far. And in fact, it is. What we are seeing here is the result of inlining the struct method invocation. Because the compiler was able to figure out what the end target of the call is, and because it is small enough to be inlined, we can skip calling the method entirely and just run the code directly.

As it turns out, this can have a dramatic effect on performance (in both directions, mind), and it is something that you need to carefully consider when you analyze your application's performance.

But the short of it is, the fewer jumps and dereferences we have, the better it is for us. And you can see the various methods (pun intended) that the CLR uses to dispatch them. In my next post, I’ll talk about how we can make use of this behavior.

Thread pool starvation? Just add another thread

time to read 2 min | 388 words

One of the nastier edge cases with TaskCompletionSource is that you can attach a continuation to it that will run synchronously. You can avoid that to a certain extent by using RunContinuationsAsynchronously, and that works, but under load, it can still be problematic.
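
For reference, this is the option in question; a minimal sketch of creating a completion source that won't run continuations inline on the completing thread:

    using System.Threading.Tasks;

    // Without this flag, whoever calls tcs.SetResult() may end up running
    // the awaiting continuation synchronously on its own thread.
    var tcs = new TaskCompletionSource<int>(
        TaskCreationOptions.RunContinuationsAsynchronously);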

In particular, consider the case where we have a task with:

  1. Do computation
  2. Enqueue a task to be completed by a different thread (getting a Task back)
  3. Continue computation until done
  4. Wait for previous operation to complete
  5. Go to 1

Even when we avoid running the continuation in sync mode, this still results in an issue. In particular, running an asynchronous continuation isn’t magic; it still needs a thread to run on, and that will typically be a thread pool thread.

But if all the thread pool threads are busy doing the work above, the notification may have to wait until one of them finishes its computation and pulls more work from the thread pool queue, eventually reaching the notification that we are ready to continue. In other words, we may suffer from jitter, where a task is waiting for an already completed async operation (having given up its thread), but it doesn’t know the operation completed because there wasn’t any available thread to run the continuation.

We resolve this by adding a dedicated thread, which simply waits for those notifications and runs only them. Because those are typically very short, and there aren’t that many of them, it means that we can process them very quickly. In order to prevent stalls on that thread, we use what I think is a pretty nifty trick.

We are registering the event twice, once on our dedicated thread, and once on the normal thread pool. If somehow the dedicated thread is too busy, the thread pool (and its auto growth) will handle it, but most of the time, the dedicated thread can catch it and run it.

And adding a basically noop task to the thread pool isn’t going to generate any pressure on the thread pool if there is no load, and if there is load, it will encourage it to grow faster, which is what we want.
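
The actual implementation is linked below; a rough sketch of the double-registration idea, assuming completions arrive as plain callbacks (names and structure are illustrative, not the real code):

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    public static class CompletionDispatcher
    {
        private static readonly BlockingCollection<Action> Notifications =
            new BlockingCollection<Action>();

        static CompletionDispatcher()
        {
            // Dedicated thread that does nothing but run the (short)
            // completion notifications, so they never wait behind
            // long-running work items.
            new Thread(() =>
            {
                foreach (var notification in Notifications.GetConsumingEnumerable())
                    notification();
            })
            { IsBackground = true, Name = "Completion dispatcher" }.Start();
        }

        public static void Enqueue(Action completion)
        {
            int done = 0;
            // Whichever runner gets here first wins; the other call is a no-op.
            Action once = () =>
            {
                if (Interlocked.Exchange(ref done, 1) == 0)
                    completion();
            };

            // Register twice: once on the dedicated thread, once on the
            // thread pool. Under load the extra (mostly no-op) item also
            // nudges the pool to grow.
            Notifications.Add(once);
            ThreadPool.QueueUserWorkItem(_ => once());
        }
    }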

If you care to look at the code, it is here.

I was wrong, reflecting on the .NET design choices

time to read 3 min | 480 words

I have been re-thinking some of my previous positions with regard to development, and it appears that I have been quite wrong in the past.

In particular, I’m talking about a couple of much older posts. Those posts are part of a much larger discussion, and both are close to a decade old. They aren’t really relevant anymore, I think, but it still bugs me, and I wanted to outline my current thinking on the matter.

C# is non virtual by default, while Java is virtual by default. That seems like a minor distinction, but it has huge implications. It means that proxying / mocking / runtime subclassing is a lot easier with Java than with C#. In fact, a lot of frameworks that were ported from Java rely on this heavily, and that made it much harder to use them in C#. The most common one being NHibernate, which was one of the chief frustrations that I kept running into.

However, given that I’m working on a database engine now, not on business software, I can see a whole different world of constraints. In particular, a virtual method call is significantly more expensive than a direct call, and that adds up quite quickly. One of the things that we routinely do is try to de-virtualize method calls using various tricks, and we are eagerly waiting for .NET Core 2.0 with de-virtualization support in the JIT (we have already started writing code to take advantage of it).
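
As an illustrative example (not the actual RavenDB code), this is the sort of shape that lets the JIT do the work for you:

    public abstract class Reader
    {
        public abstract int Read();
    }

    // A sealed type has exactly one possible target for Read(), so a JIT
    // that supports de-virtualization can turn calls through a
    // SealedReader-typed variable into direct calls (and even inline them).
    public sealed class SealedReader : Reader
    {
        private int _i;
        public override int Read() => _i++;
    }

Using the concrete (or sealed) type in the hot path instead of the abstraction is, roughly, the manual version of the same trick.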

Another issue is that my approach to software design has significantly changed. Where I would previously do a lot of inheritance and explicit design patterns, I’m far more motivated toward using composition instead. I’m also marking very clear boundaries between My Code and Client Code. In My Code, I don’t try to maintain encapsulation or hide state, whereas with stuff that is expected to be used externally, that is very much the case. But that gives a very different feel to the API and the usage patterns that we handle.

This also relates to abstract classes vs. interfaces, and why you should care. As a consumer, unless you are busy doing some mocking or some such, you likely don’t; but as a library author, that matters a lot to the amount of flexibility you get.
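
A hedged example of why it matters to a library author: an abstract class can usually grow a member with a default implementation without breaking existing subclasses, while adding a member to an interface breaks every implementer (default interface methods didn't exist at the time):

    public abstract class Command
    {
        public abstract void Execute();

        // Added later: existing subclasses keep working because they
        // inherit this default behavior.
        public virtual void Undo() { }
    }

    public interface ICommand
    {
        void Execute();
        // Adding Undo() here would break every existing implementer.
    }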

I think that a lot of this has to do with my viewpoint, not just as an Open Source author, but as someone who runs a project where customers are using us for years on end, and they really don’t want us to make any changes that would impact their code. That leads to a lot more emphasis on backward compatibility (source, binary & behavior), and if you mess it up, you get ricochets from people who pay you money because their job is now harder.

Single roundtrip authentication

time to read 5 min | 851 words

One of the things that we did in RavenDB 4.0 was support running on Linux, which meant giving up on Windows Authentication. To the nitpickers among you, I’m aware that we can do LDAP authentication, and we can do Windows Auth over HTTP/S on Linux. Let us say that given the expected results, all I can wish you is that you’ll have to debug / deploy / support such a system for real.

Giving up on Windows Authentication in general, even on Windows, is something that we did with great relief. Nothing like getting an urgent call from a customer, trying to figure out why a certain user isn’t authenticated, and trying to untangle the trust relationships between different domains, forests, DMZs and a whole host of other stuff that has utterly nothing to do with us. I hate those kinds of cases.

In fact, I think that my new ill wish phrase will become “may you get a 2 AM call to support Kerberos authentication problems on a Linux server in the DMZ to a nested Windows domain”. But that might be going too far and damage my Karma.

That led us to API key authentication. For that, we want to be able to authenticate against the server and get a token, which we can then use in future requests. An additional benefit is that by building our own system we are able to actually own the entire thing and support it much better. A side issue here is that we need to support / maintain security critical code, which I’m not so happy about. But owning this also gives me the option of doing things in a more optimized fashion. And in this case, we want to handle authentication in as few network roundtrips as possible, ideally one.

That is a bit challenging, since on the first request, we know nothing about the server. I actually implemented a couple of attempts to do so, but pretty much all of them turned out to be vulnerable to some degree once we did a security analysis on them (you can look here for some of the details), so we gave that up in favor of a mostly single roundtrip authentication.

The very first time the client starts, it will use a well known endpoint to get the server’s public key. When we need to authenticate, we’ll generate our own key pair and use the server’s public key to encrypt a hash of the secret combined with the client’s public key, then send our own public key and the encrypted data to the server for authentication. In return, the server will validate the client based on that hash and the client’s public key, generate an authentication token, encrypt that with the client’s public key and send it back.
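
This is not the actual RavenDB implementation (which uses a different crypto stack); as a rough sketch of the client side of that exchange, using RSA and SHA256 from System.Security.Cryptography purely as stand-ins:

    using System;
    using System.Linq;
    using System.Security.Cryptography;

    // Illustrative only: RSA and SHA256 stand in for "public key encryption"
    // and "hash"; the real implementation uses a different crypto library.
    public sealed class ApiKeyAuthClient : IDisposable
    {
        private readonly RSA _clientKeys = RSA.Create(2048);

        // Step 1: build the single authentication request.
        public (byte[] ClientPublicKey, byte[] Payload) BuildRequest(
            byte[] serverPublicKey, byte[] apiKeySecret)
        {
            byte[] clientPublicKey = _clientKeys.ExportRSAPublicKey();

            // Hash of the secret tied to this client's public key, so a
            // relayed copy is useless without the matching private key.
            byte[] hash;
            using (var sha = SHA256.Create())
                hash = sha.ComputeHash(apiKeySecret.Concat(clientPublicKey).ToArray());

            using var serverKey = RSA.Create();
            serverKey.ImportRSAPublicKey(serverPublicKey, out _);
            byte[] payload = serverKey.Encrypt(hash, RSAEncryptionPadding.OaepSHA256);

            return (clientPublicKey, payload);
        }

        // Step 2: the server replies with a token encrypted to our public key.
        public byte[] DecryptToken(byte[] encryptedToken) =>
            _clientKeys.Decrypt(encryptedToken, RSAEncryptionPadding.OaepSHA256);

        public void Dispose() => _clientKeys.Dispose();
    }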

Now, this is a bit complex, I’ll admit, and it is utterly unnecessary if the users are using HTTPS, as they should. However, there are actually quite a few deployments where HTTPS isn’t used, mostly inside organizations that choose to deploy over HTTP to avoid the complexity of cert management / updates / etc. Our goal is to prevent leakage of the secret key over the network even in the case that the admin didn’t set things up securely. Note that in this case (not using HTTPS), anyone listening on the network is going to be able to grab the authentication token (which is valid for about half an hour), but that is out of scope for this discussion.

We can assume that the communication between client & server is not snoopable, given that we are using public keys to exchange the data. We also ensure that the size of the data is always the same, so there is no information leakage there. The return trip with the authentication token is also encrypted with the client’s public key.

A bad actor can pretend to be a server and fool a client into sending the authentication request using the bad actor’s public key instead of the server’s. This is possible because we don’t have trust chains. The result is that the bad actor now has the hash of the secret combined with the public key of the client (which is also known). So the bad guy can try to do some offline password guessing; we mitigate that by using secret keys that are a minimum of 192 bits and generated using a cryptographically secure RNG.

The bad actor can of course send the data as-is to the server, which will authenticate it normally, but since the server uses the client’s public key both for the hash validation and for the encryption of the reply, the bad actor wouldn’t be able to get the token.

On the client side, after the first time that the client hits the server, it will be able to cache the server public key, and any future authentication token refresh can be done in a single roundtrip.

RavenDB 4.0 Features: Document Versioning

time to read 3 min | 487 words

After focusing for so long on the low level and performance of RavenDB 4.0, you might be forgiven for thinking that we are neglecting real, meaningful features in 4.0. That couldn’t be further from the truth; the problem is that it is actually quite hard to talk about some features meaningfully before they have been through a complete round of implementing, testing, wiring up the UI and UX, and then beating them with a stick to see what falls out.

Document versioning in RavenDB isn’t new. It has been a feature for several releases now. In RavenDB 4.0 we have decided to reimagine it completely. One of the strong points of the RavenDB design from the get go was that triggers allowed you to build new features into the product without making core changes. That allowed quickly adding new features, but could cause issues with feature intersections.

With RavenDB 4.0, versions are now part of the core code. That means that they get special treatment. In the UI, you can switch between versions of the document and see it as it was at various points in time.

Externally, this is pretty much all there is to it, whether you are looking at the current document or at one of its revisions.

In terms of the client API, there is little change:
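
The snippet itself isn't reproduced here; as a rough sketch of the shape of the API (assuming an existing DocumentStore named store and a User class, and noting that the exact member names may differ from what ships), reading the revisions of a document looks something like this:

    using (var session = store.OpenSession())
    {
        // Load the current document as usual.
        var user = session.Load<User>("users/1");

        // Ask for the stored revisions of that document.
        var revisions = session.Advanced.Revisions.GetFor<User>("users/1");
    }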

So what did change?

Revisions are now stored separately from documents. Previously we stored them as special documents, which led to some problems (indexes had to read them just to see that they weren’t interested in them, for example), and we had to special case a lot of stuff that is supposed to work on documents so it wouldn’t apply to revisions. Making them their own unique thing simplifies everything significantly. We have also made sure that revisions can properly flow with replication, and prevented the potential for versioning conflicts, which were tricky to solve. The downside of this is that desirable operations on revisions (load a revision with include, for example) will no longer work.

The whole thing is also significantly faster and cheaper to work with. A major change we also made was to ensure that versioning support is always available, so you don’t have to specify it at DB creation time. Instead, you can configure it whenever you want, without having to add a bundle at runtime and risk unsupported operations.

That means that if at some point you want to start versioning your documents, you can just flip a switch, and it will immediately start versioning all changes for you.

Emoji Encoding: A new style for binary encoding for the web

time to read 4 min | 604 words

Computers think in binary, and you would have thought that sending binary data around would be pretty easy. But that turns out to be a completely non trivial task. The problem is those pesky humans and needing to interface with them.

For example, if I need to send some binary data over email, I can either do that as an attachment, with a high probability of at least a few people never getting it, or I can encode it somehow. Typical choices are Base64 encoding for the low tech, and barcodes / QR codes and the like. For the fancy among us, we can try to go with Base85 and other such things. That is pretty standard, but it really has a lot of limitations. Base64 will increase the size of the data by about a third, and it is case sensitive, so it is hard to get right if you need to actually look at it and not just copy/paste it. It is also limited to plain old ASCII, for compatibility reasons that don’t make a lot of sense in today’s world.

I have been thinking about this for a long time, because we need to send binary data (license information) in text, and we also need it to look good and be nicely formatted.

After a lot of thought and experimentation, I’m proud to announce a new form of encoding: the Emoji Encoder, available currently for .NET, but soon to be available for Ruby, Python, Go, Node.JS, Ember.js, React.JS and maybe jQuery.

The idea for this innovation came to me because of the following observations:

  • Emojis are becoming much more important in any textual conversation (to the point where people will say an emoji). That means that we can rely on them for the long term, which is very important for storage technology.
  • Trying to read meaning from emojis being sent is clearly impossible, as anyone taking a peek at a text conversation between two teenage girls can say. (Although they appear to have a hidden meaning, if she sent the red heel and not the blue heel emoji that apparently means something.)
  • Because emojis are so relevant, they can be sent anywhere a normal text would go, including email, social media, printing, etc.
  • There are a lot of emojis, allowing us to overcome the bloat of Base64 and its friends by dedicating a single emoji to each byte in a 1:1 mapping.

That means that in terms of characters, Emoji Encoding is a net win. Consider the following equivalent information:

  • I5xy4dT9Qyjp7DKwuVI6y95EwlDeO/NBeiuc3GJ5Mjo= <—45 characters
  • ℹ⤴⚫✔⭕㊗◀☔➖✂♥⛵✖♍❤⛵✅✏ℹ⛲✂ <—33 characters
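
A minimal sketch of the idea (not the actual Emoji Encoder implementation), assuming a fixed table of 256 distinct emoji code points, one per byte value:

    using System.Linq;

    public static class EmojiEncoderSketch
    {
        // 256 distinct emoji, one per possible byte value. Here we simply
        // take a contiguous range of emoji code points as a stand-in table.
        private static readonly string[] Table =
            Enumerable.Range(0x1F400, 256)
                      .Select(char.ConvertFromUtf32)
                      .ToArray();

        public static string Encode(byte[] data) =>
            string.Concat(data.Select(b => Table[b]));

        public static byte[] Decode(string text)
        {
            // Every emoji in this table is outside the BMP, so each one
            // takes two UTF-16 chars (a surrogate pair).
            var bytes = new byte[text.Length / 2];
            for (int i = 0; i < bytes.Length; i++)
            {
                int codePoint = char.ConvertToUtf32(text, i * 2);
                bytes[i] = (byte)(codePoint - 0x1F400);
            }
            return bytes;
        }
    }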

That is quite important when dealing with constrained textual formats, such as twitter.

There are other advantages. This data is actually a 256 bit key for use in encryption. And you can actually show it to a user and have a reasonably good chance that they will be able to tell it apart from something else. It relies on the ability of humans to recognize shapes, but it will be very hard for them to actually tell someone your key. There has been a lot of research around such things, and while it isn’t a primary motivation for us, it is a very nice perk.

I mentioned that a key interest for us is the usage in licensing code, so a license email will now contain a block of emoji instead of a wall of Base64 text.

I think that in addition to being pretty, it is also going to bring a smile to people’s faces, so the Emoji Encoder is a win all around.

Big data work on 32 bits

time to read 7 min | 1326 words

Ever put on a pair of shoes that were just a bit too small for you? You think that it would be fine, but then time goes by, and it pinches. And then it hurts, and then you just can’t walk any longer.

That is how it feels to work with any reasonable amount of data (even low hundreds of MB) in 32 bits. The virtual address space is so limited that it is incredibly easy to just run out due to address space fragmentation and fail, and there is very little that you can actually do to recover from such a scenario. We have been working on making RavenDB 4.0 work reliably in 32 bits mode*, and it has been PITA after PITA. In particular, Voron was quite explicitly designed for an environment where it is hard to impossible to run out of address space, and our typical modus operandi is to map the entire data file (which can be hundreds of GB in size) as a single continuous memory region.

* And yes, I can’t imagine how much of a PITA it was to run in 16 bits. When I programmed on those sorts of platforms, I never actually used anything that got to needing oh so much memory (I was in middle school at the time, IIRC).

That allows us to do all sorts of tricks and optimizations, and it is utterly impossible to do in 32 bits. In fact, on most 32 bits systems, after the process has been running for a while, just doing a single 256MB allocation is probably going to fail. You might have that much memory free, but you don’t have it in a continuous run. So we had to make drastic changes, and at the same time, 32 bits is an important requirement, but mostly it is on the sidelines. I want to support it, and I’m willing to put in the effort to get there, but I’m not willing to pay for it if it costs me when running outside of 32 bits mode.

In other words, anything that would have a drastic impact on the “normal” mode of running in 64 bits was out, and we didn’t want to re-architect the whole thing. Luckily, we already had the place to put the different behavior. In Voron, the Pager is the entity that is responsible for mapping the file to memory and handing out pointers into the file based on the page number asked for. That meant that we had a well defined interface:

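The actual interface isn't reproduced here; conceptually (illustrative names, not the real Voron types), the contract the rest of the engine relies on is roughly:

    // Placeholder for the transaction that scopes the returned pointers.
    public sealed class Transaction { }

    public unsafe interface IPager
    {
        // Hand out a pointer into the data file for the requested page,
        // valid only for the duration of the given transaction.
        byte* AcquirePagePointer(Transaction tx, long pageNumber);

        // Grow the underlying file as the data grows.
        void AllocateMorePages(long newLengthInBytes);
    }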

Because the file can grow, and the pager might hold multiple overlapping maps to the same file, we only allow pointers to the data file in the scope of a transaction. That ended up being a very good thing, since this enabled us to build a pager that could work in 32 bits mode. Instead of mapping the entire file, we’ll map just the pages that are required, and only for the duration of the transaction, after which we can unmap everything.  The idea is that we’ll only have the data that we are actually using mapped, so we’ll save a lot of address space.

That was the theory, at least, in practice, we run into several interesting issues.

  • We want to reduce the number of system calls we make, and the allocation granularity in Windows is 64KB, so we always need to map at least that much, instead of mapping one page at a time.
  • A value may actually be longer than 64KB, or not be aligned on a 64KB boundary.
  • Transactions are concurrent, and typically need to access the same areas (the top branches in the B+Trees we use, for example).

The end result is that we map in multiples of 64KB (on both Windows & Linux), and we check, based on the actual data, whether we need to remap because the value is longer than what is allocated in the current block. We are also sharing all the maps among all the concurrent transactions, to reduce the total amount of virtual address space we are using. This is a bit hard to do concurrently & safely, so we are racing it. In most cases, we’ll have a single mapping, but it is fine to map the same section twice from different transactions (there can only ever be a single write transaction, so all the others are just reading, and never from the same memory that the write tx is writing to). The alternative would have been to use more invasive locking, and the performance cost isn’t worth it.
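
A hedged sketch of the mapping arithmetic described above (illustrative, not the actual pager code): given a requested page range, compute the 64KB-aligned chunk to map:

    public static class MapRange
    {
        private const long AllocationGranularity = 64 * 1024; // Windows granularity
        private const int PageSize = 8 * 1024;                // assumed page size

        // Returns the file offset and size of the chunk we would map for
        // the requested pages, rounded to the 64KB allocation granularity.
        public static (long Offset, long Size) For(long pageNumber, int numberOfPages)
        {
            long start = pageNumber * PageSize;
            long end = start + (long)numberOfPages * PageSize;

            long mapStart = (start / AllocationGranularity) * AllocationGranularity;
            long mapEnd = ((end + AllocationGranularity - 1) / AllocationGranularity)
                          * AllocationGranularity;

            return (mapStart, mapEnd - mapStart);
        }
    }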

Once we got this working, we were most of the way there, but there were still a lot of optimizations that we had to reverse. For example, in 64 bits mode, it makes a lot of sense to try to pre-allocate data in advance to maintain locality, and as the data we dealt with became larger, so would our pre-allocations. But those would end up being around 32MB of data that we pre-allocate (so we need to prepare and touch all of it), and under load, it was actually one of the more common causes of failures due to fragmentation of the address space, because it would allocate (1MB, 2MB, 4MB, 8MB, 16MB, 32MB) and we had many such operations cutting the address space into fine bits.

Another location that caused us problems was indexes, where we allocated a 16MB range for a bloom filter. That made perfect sense as an optimization in 64 bits, but in 32 bits mode that is a huge chunk of my address space that I’m giving up. In both cases, we drastically reduced the sizes involved (to 64KB in the case of the bloom filter, and a max pre-allocation of 1 MB).

Another thing that was actually in our favor is that our memory usage is already policed quite carefully. We didn’t do that intentionally for 32 bits, we did that because memory allocation is so expensive; but because we rarely ask for memory from the OS, and because we typically reuse buffers, we were pretty much already there in terms of proper behavior of the system in 32 bits mode. The CLR is also pretty good about allocating in large sections from the OS and being able to move memory around to avoid internal fragmentation.

Overall, it works, we have tested that on the Stack Overflow dataset, and it is quite amazing to see it (you can see it chugging along, using anything between 300 MB – 600 MB as it is inserting 60+ GB of data). It is obviously less efficient than the 64 bits version, but given that we are so much faster in 4.0, I don’t think that anyone will actually notice, and if they do… Well, that is why we have hardware built in this decade.

There were other places where we had to take the 32 bits nature of the system into account, and in which we actively undermined previous optimizations. Instead of allowing transactions to merge as much as possible to reduce I/O, we placed a hard limit on how much data can be accessed or modified by a transaction before we force it to commit. The same goes for indexing: instead of letting batches run as long as we usually would like, address space considerations force us to reduce the batch size to account for how much address space is used in the current batch, and close it early (freeing up the address space for other uses).

The overall process is very unpleasant, because things that I would consider to be obvious and easy are suddenly so much harder than I would expect them to be.

Attachments, RavenFS and scoping out the market

time to read 3 min | 490 words

RavenFS is a pretty cool technology. It was designed to handle both very large files over geographically distributed environments and large numbers of small files in a single datacenter. It has some really cool features, such as the ability to run metadata searches, delta replication, etc. And yet, pretty much all our customers are using it primarily as a way to handle a small set of binaries, typically strongly related to their documents. We also got a lot of feedback / worries from customers about the deprecation of attachments.

This post is intended to lay out some of our thoughts regarding this feature. And the idea is that we are going with the market. We are going to merge RavenFS back into RavenDB.

Instead of having files with metadata, we’ll reverse things: you’ll have documents with attachments. Let us consider the simplest example that I could conceive: users and profile pictures.

You are going to store the user’s information in the “users/1” document. And then you need to store the profile pic somewhere. You’ll be able to do that by pushing it into RavenDB as an attachment. An attachment is always going to be tied to a specific document, and if the document is deleted, all its attachments will also be deleted. So in this case, we’ll have a “profile.png” attachment on “users/1”.

Of course, you don’t have just a single profile picture, you also have a thumbnail of that. So after the user has uploaded their pic and you attached that to the user’s document, we’ll have an offline process to generate the thumbnail and attach that as well to the document.
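
Nothing here was final at the time of writing, so consider this a purely hypothetical sketch of what the client-side flow could look like (store, pictureStream, the User class and the method names are all illustrative, not a committed API):

    using (var session = store.OpenSession())
    {
        var user = session.Load<User>("users/1");

        // Attach the uploaded picture to the user's document.
        session.Advanced.Attachments.Store(user, "profile.png", pictureStream, "image/png");

        session.SaveChanges();
    }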

Documents will have a metadata flag that indicates whether they have attachments, and if they do, the metadata will contain the list of attachments they have. So loading the document will be enough to let you peek at all its attachments; however, loading an attachment will be a separate operation. You’ll always be able to access an attachment directly, naturally. Attachments won’t have metadata or the ability to search them; instead, you can define your indexes on documents, as you normally do, and go from there to the attachments you desire.

Adding / deleting / modifying an attachment will also update the etag of the document it is attached to (since it updates the document metadata). The attachments will receive the same etag as their document at the time of modification, and will be replicated in the same manner. Obviously, only new attachments will be replicated whenever the document is updated. A conflict on an attachment is also a conflict on the document, and will be resolved however the document conflict is resolved.

Because attachments reside in the same location as documents, we can now have a transaction that spans both a document and attachment (not necessarily to the same document, mind), which will make things easier on our users.

Analyzing RavenDB 4.0 with NDepend

time to read 4 min | 648 words

I have a… difficult relationship with NDepend. On the one hand, it is an invaluable tool that provides you with good insight into your project. On the other hand, I don’t always agree with its recommendations and I have to teach it what I consider valuable. In my mind, it is the kind of tool that you reach for when you need the Expert Mode.

We use it for exploring our API surface area and seeing that it makes sense, to validate breaking changes, and in general to get a detailed view into troublesome spots in the codebase.

For example, it pointed out one method as having too many parameters. I can’t say that I disagree, and it is a prime candidate for refactoring to make it simpler.

On the other hand, it didn’t get to be this way by accident; it was added to slowly, over time. Each new feature or bug fix needed just a bit more, and this grew and grew. At this point, however, this is extremely stable code that rarely changes, and modifying it will take a lot of work.

Just having tools around doesn’t mean that you get to turn off your head.

One important thing is that if you are considering NDepend, you probably want to start doing so very early in the project lifecycle. One of the things that I find most interesting in the new version is the notion of estimated debt, which it reports for both RavenDB 4.0 and RavenDB 3.5. This is composed from the estimated amount of effort that you would need to put in to fix things.

The really interesting part is that it does a pretty good job of finding troublesome locations even when you don’t have history / coverage information, as with the list it produced for RavenDB 3.5.

I find it highly amusing to see this, mostly because the client side implementation is more complex (and more frequently modified).

Digging a little bit deeper gives us more detail.

And this is where I start arguing with the tool. Or, to be more exact, I have more information than NDepend does about this usage. AsyncServerClient is how the entire client talks to the server, so it isn’t cohesive and it is certainly too big, but that is pretty much by design.

The details about boxing / unboxing, however, are much more interesting, and this is where you can start doing a lot of interesting things. In particular, you can customize NDepend to give you a lot of details and enforce rules about your codebase.

For example, given our recent need, we can list all the big structures in our code.

You can do that for exploring, or you can add that as a rule (warnif).
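
For example, a CQLinq rule along these lines (a sketch; pick whatever threshold makes sense for you) would flag any struct whose instances grow past a given size:

    // <Name>Big structs</Name>
    warnif count > 0
    from t in Application.Types
    where t.IsStructure && t.SizeOfInst > 32
    orderby t.SizeOfInst descending
    select new { t, t.SizeOfInst }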

Low level Voron optimizations: The page size bump

time to read 5 min | 864 words

Explaining the usage of pages seems to be one of those things that is either hit or miss for me. Either people just get it, or they struggle with the concept. I have written extensively on this particular topic, so I’ll refer you to that post for the details on what exactly pages in a database are.

Voron is currently using 4KB pages. That is pretty much the default setting, since everything else also works in units of 4KB. That means that we play nice with requirements for alignment, CPU page sizes, etc. However, 4KB is pretty small, and that leads to trees with higher depth. And the depth of the tree is one of the major concerns for database performance (the deeper the tree, the more I/O we have to do).

We previously tested using different page sizes (8KB, 16KB and 32KB), and we saw that our performance decreased as a result. That was surprising and completely contrary to our expectations. But a short investigation revealed what the problem was. Whenever you modify a value, you dirty up the entire page. That means that we would need to write that entire page back to storage (which means making a bigger write to the journal, then applying a bigger write to the data file, etc).

In effect, when increasing the page size to 8KB, we also doubled the amount of I/O that we had to deal with. That was a while ago, and we recently implemented journal diffing as a way to reduce the amount of unnecessary data that we write to disk. A side effect of that is that we no longer have a 1:1 correlation between a dirty page and a full page write to disk. That opened up the path to increasing the page size. There is still an O(PageSize) cost to doing the actual diffing, of course, but that is a memory to memory cost and negligible compared to the saved I/O.

Actually making the change was both harder and easier than expected. The hard part was that we had to do a major refactoring to split a shared value. Both the journal and the rest of Voron used the notion of Page Size. But while we want the page size of Voron to change, we didn’t want the journal write size to change. That led to a lot of frustration where we had to go over the entire codebase, look at each value, and figure out whether it meant writes to the journal, or pages as they are used in the rest of Voron. I’ve got another post scheduled talking about how you can generate intentional compilation errors to make this kind of change easy to figure out.

Once we were past the journal issue, the rest was mostly dealing with places that made silent assumptions on the page size. That can be anything from “the max value we allow here is 512 (because we need to fit at least so many entries in)” to tests that wrote 1,000 values and expected the resulting B+Tree to be of a certain depth.

The results are encouraging, and we can see them mostly in the system behavior with very large data sets. Those used to generate very deep trees, and this change reduces their depth significantly. To give some context, let us assume that we can fit 100 entries per page using 4KB pages.

That means that if we have as little as 2.5 million entries, we’ll have (in the ideal case):

  • 1 root page holding 3 entries
  • 3 branch pages holding 250 entries
  • 25,000 leaf pages holding the 2.5 million entries

With 8 KB pages, we’ll have:

  • 1 root page holding 63 entries
  • 12,500 leaf pages holding the 2.5 million entries

That is a reduction of a full level. The nice thing about B+Trees is that in both cases, the branch pages are very few and usually reside in main memory already, so you aren’t directly paying for their I/O.

What we are paying for is the search on them.

The cost of searching the 4KB tree is:

  • O(log2 of 3) for searching the root page
  • O(log2 of 100) for searching the relevant branch page
  • O(log2 of 100) for searching the leaf page

In other words, about 16 operations. For the 8 KB page, that would be:

  • O(log2 of 63) for searching the root page
  • O(log2 of 200) for searching the leaf page

It comes to 14 operations, which doesn’t seem like a lot, but a lot of our time goes into key comparisons, so anything helps.
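
To put the same back-of-the-envelope arithmetic into code (using the idealized fan-outs above, not a measurement):

    using System;

    public static class BTreeCostSketch
    {
        // Rough number of key comparisons for a search down the tree,
        // given the number of entries in each page along the path.
        public static double Comparisons(params int[] entriesPerLevel)
        {
            double total = 0;
            foreach (var entries in entriesPerLevel)
                total += Math.Log(entries, 2);
            return total;
        }
    }

    // 4KB pages: Comparisons(3, 100, 100) ~= 15
    // 8KB pages: Comparisons(63, 200)     ~= 14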

However, note that I said that the situation above was the ideal one; this can only happen if the data was inserted sequentially, which it usually isn’t. Page splits can cause the tree depth to increase very easily (in fact, that is one of the core reasons why non sequential keys are so strongly discouraged in pretty much all databases).

But the larger page size allows us to pack many more entries into a single page, and that also reduces the risk of page splits significantly.
