Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,481 | Comments: 47,769

filter by tags archive

RavenDB 4.0 Unsung heroesThe design of the security error flow

time to read 2 min | 351 words

recipe-575434_640This is again a feature that very few people will even notice exist, but a lot of time, effort and thinking went into building. How should RavenDB handle a case when a user make a request that it is not authorize to make. In particular, we need to consider the case of a user pointing the browser to a server or database that they aren’t authorized to see or without having the x509 certificate properly registered.

To understand the problem we need to figure out what the default experience will be like, and if we require a client certificate to connect to RavenDB, and the client does not provide it, by default the response is some variation of just closing the TCP connection. That result in the client getting an error that looks like this:

TCP connection closed unexpectedly

That is not conductive for a good error experience and will typically cause a user to spend a lot of time trying to figure out what the network problem is, while everything is working just fine, the server just doesn’t want to talk to the user.

The problem is that at the TLS level, there isn’t really a good way to give back some meaningful error. We are too low level, all we can do is just terminate the connection.

Instead of doing that, RavenDB will accept the connection, regardless of whatever it has a valid certificate (or even any certificate) and pass the connection to one level up in the chain. At that point, we can check whatever the certificate is valid and if it isn’t (or if it doesn’t have the permissions to do what we want it to do we can use the protocol’s own mechanism to report errors.

With HTTP, that means we can return a 403 error to the user, including an explanation on why we rejected the connection (no certificate, expired certificate, certificate doesn’t have the right permissions, etc). This make things much easier when you need to troubleshoot permissions issues.

RavenDB 4.0 Unsung HeroesThe indexing threads

time to read 3 min | 473 words

wire-33134_640A major goal in RavenDB 4.0 is to eliminate as much as possible complexity from the codebase. One of the ways we did that is to simplify thread management. In RavenDB 3.0 we used the .NET thread pool and in RavenDB 3.5 we implemented our own thread pool to optimize indexing based on our understanding of how indexing are used. This works, is quite fast and handles things nicely as long as everything works. When things stop working, we get into a whole different story.

A slow index can impact the entire system, for example, so we had to write code to handle that, and noisy indexing neighbors can impact overall indexing performance  and tracking costs when the indexing work is interleaved is anything but trivial. And all the indexing code must be thread safe, of course.

Because of that, we decided we are going to dramatically simplify our lives. An index is going to use a single dedicated thread, always. That means that each index gets their own thread and are only able to interfere with their own work. It also means that we can have much better tracking of what is going on in the system. Here are some stats from the live system.

image

And here is another:

image

What this means is that we have fantastically detailed view of what each index is doing, in terms of CPU, memory and even I/O utilization is needed. We can also now define fine grained priorities for each index:

image

The indexing code itself can now assume that it single threaded, which free a lot of complications and in general make things easier to follow.

There is the worry that a user might want to run 100 indexes per database and 100 databases on the same server, resulting in a thousand of indexing threads. But given that this is not a recommended configuration and given that we tested it and it works (not ideal and not fun, but works), I’m fine with this, especially given the other alternative that we have today, that all these indexes will fight over the same limited number of threads and stall indexing globally.

The end result is that thread per index allow us to have fine grained control over the indexing priorities, account for memory and CPU costs as well simplify the code and improve the overall performance significantly. A win all around, in my book.

RavenDB 4.0 Unsung HeroesIndexing related data

time to read 3 min | 506 words

treasure-map-153425_640RavenDB is a non relational database, which means that you typically don’t model documents as having strong relations. A core design principle for modeling documents is that they should be independent, isolated and coherent, or more specifically,

  • Independent – meaning a document should have its own separate existence from any other documents.
  • Isolated – meaning a document can change independently from other documents.
  • Coherent – meaning a document should be legible on its own, without referencing other documents.

That said, even when following proper modeling procedures there are still cases when you want to search a document by its relationship. For example, you might want to search for all for all the employees whose manage name is John, and you don’t care if this is John Doe or John Smith for some reason.

RavenDB allows you to handle this scenario by using LoadDocument during the index phase. That creates a relationship between the two documents and ensures that whenever the referenced document is updated, the referencing documents will be re-indexed to catch up to the new details. It is quite an elegant feature, even if I say so myself, and I’m really proud of it.

It is also the source of much abuse in the wild. If you don’t model properly, it is often easy to paper over that using LoadDocument in the indexes.

The problem is that in RavenDB 3.x an update to a document that was referenced using LoadDocument was also required to touch all of the referencing documents. This slowed down writes, which is something that we generally try really hard to avoid and could also caused availability issues if there were enough referencing documents (as in, all of them, which happened more frequently then you might think).

With RavenDB 4.0, we knew that we had to do better. We did this by completely changing how we are handling LoadDocument tracking. Instead of having to re-index all the relevant values globally, we are now tracking them on a per index basis. In each index, we track the relevant references on a per collection basis, and as part of the indexes we’ll check if there has been any updates to any of the documents that we have referenced. If we do have an document that has a lot of referencing documents, it will still take some time to re-index all of them, but that cost is now limited to just the index in question.

You can still create an index and slow it down in this manner, but the pay to play model is much nicer and there is no affect on the write speed for documents and no general impact on the whole database, which is pretty sweet. The only way you would ever run into this feature is if you run into this problem in 3.x and try to avoid it, which is now not necessary for the same reason (although the same modeling concerns apply).

RavenDB 4.0 Unsung HeroesMap/reduce

time to read 6 min | 1060 words

Computing_In_MorningOne of the hardest things that we did in RavenDB 4.0 would probably go completely unnoticed by users. We completely re-wrote how RavenDB is processing map/reduce queries. One of my popular blog posts is still a Visual Explanation to Map/Reduce, and it still does a pretty good job of explaining what map/reduce is.

The map/reduce code in RavenDB 3.x is one of the more fragile things that we have, require you to maintain in your head several completely different states that a particular reduction can be in and how they transition between states. Currently, there are probably two guys* who still understand how it works and one guy that is still able to find bugs in the implementation. It is also not as fast as we wished it would be.

So with RavenDB 4.0 we set out to build it from scratch, based in no small part on the fact that we had also written our storage engine for 4.0 and was able to take full advantage of that. You can read about the early design in this blog post, but I’m going to do a quick recap and explain how it works now.

The first stage in map/reduce is… well, the map. We run over the documents and extract the key portions we’ll need for the next part. We then immediately apply the reduce on each of the results independently. This give us the final map/reduce results for a single document. More to the point, this also tells us what is the reduce key for the results is. The reduce key is the value that the index grouped on.

We store all of the items with the same reduce key together. And here is where its get interesting. Up until a certain point, we just store all of the values for a particular reduce key as an embedded value inside a B+Tree. That means that whenever any of the values changes, we can add that value to the appropriate location and reduce all the matching values in one go. This works quite well until the total size of all the values exceed about 4KB or so.

At this point, we can’t store the entire thing as an embedded value and we move all the values for that reduce key to its own dedicated B+Tree. This means that we start with a single 8KB page and fill it up, then split it, and so on. But there is a catch. The results of a map/reduce operation tend to be extremely similar to one another. At a minimum, they share the same properties and the same reduce key. That means that we would end up storing a lot of duplicate information. To resolve that, we also apply recursive compression. Whenever a page nears 8KB in size, we will compress all the results stored in that page as a single unit. This tend to have great compression rate and can allow us to store up to 64KB of uncompressed data in a single page.

When adding items to a map/reduce index, we apply an optimization so it looks like:

results = reduce(results, newResults);

Basically, we can utilize the recursive nature of reduce to optimize things for the append only path.

When you delete or update documents and results change or are removed, things are more complex. We handle that by running a re-reduce on the results. Now, as long as the number of results is small (this depend on the size of your data, but typically up to a thousand or so) we’ll just run the reduce over the entire result set. Because the data is always held in a single location, this means that it is extremely efficient in terms of memory access and the tradeoff between computation and storage leans heavily to the size of just recomputing things from scratch.

When we have too many results (the total uncompressed size exceeds 64KB) we start splitting the B+Tree and adding a level to the three. At this point, the cost of updating a value is now the cost of updating a leaf page and the reduce operation on the root page. When we have more data still,  we will get yet another level, and so on.

The (rough) numbers are:

  • Up to 64KB (roughly 1000 results) – 1 reduce for the entire dataset
  • Up to 16 MB – 2 reduces (1 for up to 1000 results, 1 for up to 254 results)
  • Up to 4 GB – 3 reduces (1 for up to 1000 results, 2 for up to 254 results each)
  • Up to 1 TB  - 4 reduces (1 for up to 1000 results, 3 for up to 254 results each)
  • I think you get how it works now, right? The next level up is 1 to 248 TB and will requite 5 reduces.

These numbers is if your reduce data is very small, in the order of a few dozen byes. If you have large data, this means that the tree will expand faster, and you’ll get less reduces at the first level.

Note that at the first level, if there is only an addition (new document, basically), we can process that as a single operation between two values and then proceed upward as the depth of the tree requires.There are also optimizations in place if we have multiple updates to the same reduce key, in that case, we can first apply all the updates, then do the reduce once for all of them in one shot.

And all of that is completely invisible to the users, unless you want to peek inside, which is possible using the Map/Reduce visualizer:

image

This can give you insight deep into the guts of how RavenDB is handling map/reduce operations.

The current status is that map/reduce indexing are actually faster than normal indexes, because they are almost all our code, while a large portion of the normal indexing cost is with Lucene.

* That is an exaggeration, there is one guy that know how it works. Okay, okay, I’ll admit that we can dive into the code and figure out what is going on, but it takes quite a bit of time if there is a significant issue there.

RavenDB 4.0 Unsung HeroesField compression

time to read 3 min | 498 words

I have been talking a lot about major features and making things visible and all sort of really cool things. What I haven’t been talking about is a lot of the work that has gone into the backend and all the stuff that isn’t sexy and bright. You probably don’t really care how the piping system in your house work, at least until the toilet doesn’t flush. A lot of the things that we did with RavenDB 4.0 is to look at all the pain points that we have run into and try to resolve them. This series of posts is meant to expose some of these hidden features. If we did our job right, you will never even know that these features exists, they are that good.

In RavenDB 3.x we had a feature called Document Compression. This allowed a user to save significant amount of space by having the documents stored in a compressed form on disk. If you had large documents, you could typically see significant space savings from enabling this feature. With RavenDB 4.0, we removed it completely. The reason is that we need to store documents in a way that allow us to load them and work with them in their raw form without any additional work. This is key for many optimizations that apply to RavenDB 4.0.

However, that doesn’t mean that we gave up on compression entirely. Instead of compressing the whole document, which would require us to decompress any time that we wanted to do something to it, we selectively compress individual fields. Typically, large documents are large because they have either a few very large fields or a collection that contain many items. The blittable format used by RavenDB handles this in two ways. First, we don’t need to repeat field names every time, we store this once per document and we can compress large field values on the fly.

Take this blog for instance, a lot of the data inside it is actually stored in large text fields (blog posts, comments, etc). That means that when stored in RavenDB 4.0, we can take advantage of the field compression and reduce the amount of space we use. At the same time, because we are only compressing selected fields, it means that we can still work with the document natively. A trivial example would be to pull the recent blog post titles. we can fetch just these values (and since they are pretty small already, they wouldn’t be compressed) directly, and not have to touch the large text field that is the actual post contents.

Here is what this looks like in RavenDB 4.0 when I’m looking at the internal storage breakdown for all documents.

image

Even though I have been writing for over a decade, I don’t have enough posts yet to make a statistically meaningful difference, the total database sizes for both are 128MB.

FUTURE POSTS

  1. Complex Linq queries in RavenDB 4.0 - 9 hours from now
  2. Cost centers, revenue centers and politics in a cross organization world - about one day from now
  3. Giving Demeter PTSD - 2 days from now
  4. PR Review: Code has cost, justify it - 3 days from now
  5. PR Review: Beware the things you can’t see - 6 days from now

And 2 more posts are pending...

There are posts all the way to Oct 25, 2017

RECENT SERIES

  1. PR Review (7):
    10 Aug 2017 - Errors, errors and more errors
  2. RavenDB 4.0 (15):
    13 Oct 2017 - Interlocked distributed operations
  3. re (21):
    10 Oct 2017 - Entity Framework Core performance tuning–Part III
  4. RavenDB 4.0 Unsung Heroes (5):
    05 Oct 2017 - The design of the security error flow
  5. Writing SSL Proxy (2):
    27 Sep 2017 - Part II, delegating authentication
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats