Ayende @ Rahien



RavenDB 4.0: Interlocked distributed operations

time to read 3 min | 440 words

We couldn't make unique constraints work in RavenDB 4.0 in a way that made sense for distributed operations; there were just too many hurdles at that level of abstraction. The problem, in essence, boils down to having to do an atomic operation in a distributed environment. When we need to do this in a multi-threaded environment, we can rely on interlocked operations to help us. In the same manner, RavenDB offers the notion of interlocked distributed operations.

Let us take a look at what this looks like, shall we?
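Here is a minimal sketch of the kind of call involved, using the .NET client's PutCompareExchangeValueOperation (the store setup and the exact shape of the result object are assumptions, inferred from the output below):

using (var store = new DocumentStore
{
    Urls = new[] { "http://localhost:8080" },
    Database = "Northwind"
}.Initialize())
{
    // Index 0 means: only succeed if the key does not exist yet.
    var result = store.Operations.Send(
        new PutCompareExchangeValueOperation<string>("ayende", "users/1", 0));

    Console.WriteLine(
        $"Success: {result.Successful}, Val: {result.Value}, Index: {result.Index}");
}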

 

The output of this code would be:

Success: True, Val: users/1, Index: 13

In other words, we were able to atomically set the value of “ayende” to “users/1”. At the most basic level, this gives us the ability to create unique constraints, because we are able to reserve values for particular documents, and it gives us a much more explicit manner in which to do so. If a user wants to change their username, we first try to grab the new name, change the username and then release the old username. And we have the ability to recover if there are errors midway.

This design is modeled after the Interlocked operations, and is meant to be used in the same manner. You submit such an operation to the cluster, which will run it through the Raft state machine. This means that you can rely on RavenDB's own distributed state machine to get reliable distributed operations, including the full consistency that this provides. The key arguments here are the name of the value and the index used. If the index you provide matches the index in the state machine, we'll replace the value in an atomic fashion. Otherwise, the operation will fail and you'll get the current value and index back, so you can decide whether you want to try again.

The example with the unique constraints is just the tip of the iceberg. You can also use this yourself for things like distributed locking, by registering an owner for a particular lock key and ensuring that everyone who needs the lock will race to acquire it. Note that the value we use here can be anything, including complex objects. In other words, you can do things like set a lock descriptor that includes timeout information, the owner, etc.
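For example, a distributed lock along those lines might look roughly like this (a sketch using the same operation; the LockDescriptor class and the lock key are made up for illustration):

public class LockDescriptor
{
    public string Owner { get; set; }
    public DateTime AcquiredAtUtc { get; set; }
    public TimeSpan Timeout { get; set; }
}

// Everyone races to register their descriptor at index 0; only one caller can win.
var attempt = store.Operations.Send(
    new PutCompareExchangeValueOperation<LockDescriptor>(
        "locks/orders-import",
        new LockDescriptor
        {
            Owner = Environment.MachineName,
            AcquiredAtUtc = DateTime.UtcNow,
            Timeout = TimeSpan.FromMinutes(5)
        },
        0));

if (attempt.Successful == false)
{
    // Someone else holds the lock; attempt.Value tells us who and for how long.
}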

The interface here is pretty small: in addition to PutCompareExchangeValueOperation there is just GetCompareExchangeValueOperation. But it is enough for you to be able to lean on RavenDB's distributed state machine in your own application.

RavenDB 4.0: Node.js client is now in beta

time to read 1 min | 164 words

I'm happy to announce that the RavenDB node.js client is now publicly available in beta. Following our Python client (and obviously the .NET one), this is the newest client for RavenDB on the block, with additional clients for the JVM, Go and Ruby quickly reaching a critical stage.

Here is some code using it (I'm using async/await here, but the RavenDB node.js client supports any Node.js version from 6.0 and up):

[code sample screenshot]

And here is how you do some basic CRUD:

[code sample screenshot]

A full sample app can be found here:

You can just get the code and then run:

npm run serve

And you'll get a server running against the public live instance.

Unique constraints didn’t make the cut for RavenDB 4.0

time to read 3 min | 504 words

Unique Constraints is a bundle in RavenDB 3.x that allowed you to… well, define unique constraints. Here is the classic example:

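It looked roughly like this (a sketch from memory of the attribute-based usage in the 3.x Raven.Client.UniqueConstraints package; details may differ slightly):

public class User
{
    public string Id { get; set; }

    [UniqueConstraint]
    public string Username { get; set; }

    [UniqueConstraint]
    public string Email { get; set; }
}

// Loading a document by one of its unique constraint values:
var user = session.LoadByUniqueConstraint<User>(x => x.Username, "ayende");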

It was always somewhat awkward to use (you had to mess around with configuration on both the server and the client side), but it worked pretty well. As long as you were running on a single server.

With RavenDB 4.0, we put a lot more emphasis on running in a cluster, and when the time came to discuss how we were going to handle unique constraints in 4.0, it was very obvious that this was going to be neither easy nor performant.

The key here is that in a distributed, multi-master database, there is no really good way to provide a unique constraint. Imagine the following database topology:

[diagram: a cluster with nodes A, B and C]

Now, what will happen if we create two different User documents with the same username on node C? It is easy for node C to detect and reject this, right?

But what happens if we create one document on node B and another on node A? This is now a much harder problem to deal with. And this is without even getting into the harder aspects of how to deal with failure conditions.

The main problem here is that it is very likely that by the point we've discovered that we have a duplicate, actions were already taken based on that information, which is generally not a good thing.

As a result, we were left in somewhat of a lurch. We didn't want to have a feature that would work only on a single node, or contain a hidden broken invariant. The way to handle this properly in a distributed environment is to use some form of consensus algorithm. And what do you know, we actually have a consensus algorithm implementation at hand: the Raft protocol that is used to manage the RavenDB cluster.

That, however, led to a different problem. The process of using a unique constraint would now be broken into two distinct parts: first, we would verify that the value is indeed unique, then we would save the document. This can lead to issues if there is a failure just between these two operations, and it puts a heavy burden on the system to always check the unique constraint across the cluster on every update.

The interesting thing about unique constraints is that they rarely change once created. And if they do, they are typically part of a very explicit workflow. That isn't something that is easy to handle without a lot of context. Therefore, we decided that we couldn't reliably implement them and dropped the feature.

However… reliable and atomic distributed operations are now on the table, and they allow you to achieve the same thing, and usually in a far better manner. The full details will be in the next post.

re: Entity Framework Core performance tuning – Part III

time to read 1 min | 166 words

I mentioned in the previous post that I'll take the opportunity to show off some interesting queries. The application itself is here, and you can see how the UI looks in the following screenshot:

[screenshot: the application UI]

I decided to see what would be the best way to come up with the information we need for this kind of query. Here is what I got.

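Roughly along these lines (a much simplified sketch of a "select new" style projection; the Book and Author class shapes and the exact fields used in the original are assumptions):

var page = session.Query<Book>()
    .OrderByDescending(b => b.PublicationDate)
    .Take(100)
    .Select(b => new
    {
        b.Title,
        b.Price,
        b.PublicationDate,
        // Pull a related document's data into the projection on the server side.
        AuthorName = RavenQuery.Load<Author>(b.AuthorId).Name
    })
    .ToList();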

This is using a select object style to get a complex projection back from the server. Here are the results:

[screenshot: the query results]

As you can see, we are able to get all the data we want, in a format that is well suited to just sending directly to the UI with very little work and with tremendous speed.

re: Entity Framework Core performance tuning – Part II

time to read 4 min | 701 words

After looking at this post detailing how to optimize data queries in EF Core, I obviously decided that I needed to test how RavenDB handles the same load.

To make things fair, I tested this on my laptop, running in battery mode. The size of the data isn't that much, only 100,000 books and half a million reviews, so I decided to increase that by an order of magnitude.

[screenshot: the enlarged dataset]

The actual queries we make from the application are pretty simple and static. We can sort by votes / publication date / price (ascending/descending) and we can filter by number of votes and the publication year.

[screenshot: the application's sort and filter options]

This means that we don't have an explosion of querying options, which simplifies the kind of work we are doing. To make things simple for myself, I kept the same model of books / authors and reviews as separate collections. This isn't the best model for a document database, but it allows us to compare apples to apples between the work the EF Core based solution and the RavenDB solution need to do.

A major cost in Jon's solution is the need to aggregate the reviews for a book (so the average review score can be computed). In the end, the only way to get the required solution was to manually calculate the review average for each book and store the computation in the book. We'll discuss this a bit more in a few minutes; for now, I want to turn our eyes toward the simplest possible query on this page: getting 100 books sorted by the book id.

Because we aren't running on the same machine, it is hard to make direct parallels, but on Jon's machine he got 80 ms for this kind of query on 100,000 books. When increasing the data to half a million books, the query time rose to 150 ms. Running the same query gives us the results instantly (zero ms). Querying and sorting by the title, for example, gives us the results in 19 ms for a page size of 100 books.

Now, let us look at the major complexity for this system: sorting and filtering by the number of votes. This is hard because the reviews are stored separately from the books. With EF Core, there is a need to join between the tables, which is quite expensive and eventually led Jon to take upon himself the task of manually maintaining the values. With RavenDB, we can use a map/reduce index to handle this all for us. More specifically, we are going to use a multi map/reduce index.

Here is what the index definition looks like:

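Something along these lines (a sketch; the class and property names are assumptions, but the shape matches the description below):

public class Books_ByVotes : AbstractMultiMapIndexCreationTask<Books_ByVotes.Result>
{
    public class Result
    {
        public string BookId { get; set; }
        public int NumberOfVotes { get; set; }
        public int SumOfVotes { get; set; }
        public int PublicationYear { get; set; }
    }

    public Books_ByVotes()
    {
        AddMap<Book>(books =>
            from book in books
            select new Result
            {
                BookId = book.Id,
                NumberOfVotes = 0,
                SumOfVotes = 0,
                PublicationYear = book.PublicationDate.Year
            });

        AddMap<BookReview>(reviews =>
            from review in reviews
            select new Result
            {
                BookId = review.BookId,
                NumberOfVotes = 1,
                SumOfVotes = review.NumStars,
                PublicationYear = 0
            });

        // Both maps produce the same shape, so they can be reduced together.
        Reduce = results =>
            from result in results
            group result by result.BookId into g
            select new Result
            {
                BookId = g.Key,
                NumberOfVotes = g.Sum(x => x.NumberOfVotes),
                SumOfVotes = g.Sum(x => x.SumOfVotes),
                PublicationYear = g.Max(x => x.PublicationYear)
            };
    }
}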

We map the results from both the Books and the BookReviews into the same shape, and then reduce them together into the final output, which contains the relevant aggregation.

Now, let us do some queries, shall we? Here we are querying over the entire dataset (an order of magnitude larger than the EF Core sample set), filtering by the publication date and ordering by the computed votes average. Here we get the first 100 items, and you can see that there are 289,753 total results:

[screenshot: the query and its results]

One very interesting feature of this query is that we are asking to include the book document for the results. This is handled after the query (so no need to do a join to the entire 289K+ results), and we are able to get everything we want in a very simple fashion.
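In client code, a query like that might look roughly like this (a sketch against the index sketched above; the exact fields and ordering used in the original are assumptions):

var results = session.Query<Books_ByVotes.Result, Books_ByVotes>()
    .Include(x => x.BookId)                    // bring the book documents along
    .Where(x => x.PublicationYear >= 2000)
    .OrderByDescending(x => x.NumberOfVotes)
    .Take(100)
    .ToList();

foreach (var item in results)
{
    // Already loaded into the session by the Include, so no extra server call.
    var book = session.Load<Book>(item.BookId);
}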

Oh, and the total time? 17 ms. Compared to the 80ms result for EF with 1/10 of the data size. That is pretty nice (and yes, different machines, hard to compare, etc).

I’ll probably have another post on this topic, showing off some of the cool things that you can do with RavenDB and queries.

RavenDB 4.0 Unsung Heroes: The design of the security error flow

time to read 2 min | 351 words

This is again a feature that very few people will even notice exists, but a lot of time, effort and thinking went into building it. How should RavenDB handle the case when a user makes a request that they are not authorized to make? In particular, we need to consider the case of a user pointing the browser at a server or database that they aren't authorized to see, or without having the X509 certificate properly registered.

To understand the problem, we need to figure out what the default experience is like. If we require a client certificate to connect to RavenDB and the client does not provide it, by default the response is some variation of just closing the TCP connection. That results in the client getting an error that looks like this:

TCP connection closed unexpectedly

That is not conducive to a good error experience and will typically cause a user to spend a lot of time trying to figure out what the network problem is, while everything is working just fine; the server just doesn't want to talk to the user.

The problem is that at the TLS level, there isn't really a good way to give back a meaningful error. We are too low level; all we can do is just terminate the connection.

Instead of doing that, RavenDB will accept the connection, regardless of whether it has a valid certificate (or even any certificate), and pass the connection one level up in the chain. At that point, we can check whether the certificate is valid, and if it isn't (or if it doesn't have the permissions to do what we want it to do) we can use the protocol's own mechanism to report errors.

With HTTP, that means we can return a 403 error to the user, including an explanation of why we rejected the connection (no certificate, expired certificate, certificate doesn't have the right permissions, etc). This makes things much easier when you need to troubleshoot permissions issues.
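To make the idea concrete, here is a conceptual sketch (not RavenDB's actual code) of "accept the connection, reject at the HTTP layer", expressed as ASP.NET Core middleware; IsAuthorized is a made-up helper:

app.Use(async (context, next) =>
{
    // The TLS handshake already succeeded, even without a valid client
    // certificate, so we can explain the rejection at the HTTP level.
    var certificate = await context.Connection.GetClientCertificateAsync();
    if (certificate == null || IsAuthorized(certificate, context.Request.Path) == false)
    {
        context.Response.StatusCode = StatusCodes.Status403Forbidden;
        await context.Response.WriteAsync(
            "Forbidden: missing, expired or insufficiently privileged client certificate.");
        return;
    }

    await next();
});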

RavenDB 4.0 Unsung Heroes: The indexing threads

time to read 3 min | 473 words

A major goal in RavenDB 4.0 is to eliminate as much complexity as possible from the codebase. One of the ways we did that is to simplify thread management. In RavenDB 3.0 we used the .NET thread pool, and in RavenDB 3.5 we implemented our own thread pool to optimize indexing based on our understanding of how indexes are used. This works, is quite fast and handles things nicely as long as everything works. When things stop working, we get into a whole different story.

A slow index can impact the entire system, for example, so we had to write code to handle that; noisy indexing neighbors can impact overall indexing performance; and tracking costs when the indexing work is interleaved is anything but trivial. And all the indexing code must be thread safe, of course.

Because of that, we decided we were going to dramatically simplify our lives. An index is going to use a single dedicated thread, always. That means that each index gets its own thread and is only able to interfere with its own work. It also means that we can have much better tracking of what is going on in the system. Here are some stats from the live system:

[screenshot: per-index thread stats]

And here is another:

[screenshot: per-index thread stats]

What this means is that we have a fantastically detailed view of what each index is doing in terms of CPU, memory and even I/O utilization. We can also now define fine-grained priorities for each index:

[screenshot: setting an index's priority]

The indexing code itself can now assume that it is single threaded, which removes a lot of complications and in general makes things easier to follow.
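Conceptually, the setup is no more involved than this (a sketch, not the actual RavenDB code; IndexingLoop stands in for the real indexing work):

var thread = new Thread(() => IndexingLoop(index))
{
    Name = $"Indexing of {index.Name}",   // shows up nicely in thread views
    IsBackground = true,
    Priority = ThreadPriority.BelowNormal // the per-index priority knob
};
thread.Start();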

There is the worry that a user might want to run 100 indexes per database and 100 databases on the same server, resulting in thousands of indexing threads. But given that this is not a recommended configuration, and given that we tested it and it works (not ideal and not fun, but works), I'm fine with this, especially given the alternative that we have today: all these indexes fighting over the same limited number of threads and stalling indexing globally.

The end result is that a thread per index allows us to have fine-grained control over the indexing priorities and account for memory and CPU costs, as well as simplify the code and improve the overall performance significantly. A win all around, in my book.

RavenDB 4.0 Unsung Heroes: Indexing related data

time to read 3 min | 506 words

RavenDB is a non-relational database, which means that you typically don't model documents as having strong relations. A core design principle for modeling documents is that they should be independent, isolated and coherent, or more specifically:

  • Independent – meaning a document should have its own separate existence from any other documents.
  • Isolated – meaning a document can change independently from other documents.
  • Coherent – meaning a document should be legible on its own, without referencing other documents.

That said, even when following proper modeling procedures there are still cases when you want to search for a document by its relationships. For example, you might want to search for all the employees whose manager's name is John, and you don't care whether this is John Doe or John Smith for some reason.

RavenDB allows you to handle this scenario by using LoadDocument during the index phase. That creates a relationship between the two documents and ensures that whenever the referenced document is updated, the referencing documents will be re-indexed to catch up to the new details. It is quite an elegant feature, even if I say so myself, and I’m really proud of it.
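For the manager example above, such an index might look roughly like this (a sketch; the Employee class shape is an assumption):

public class Employees_ByManagerName : AbstractIndexCreationTask<Employee>
{
    public class Result
    {
        public string ManagerName { get; set; }
    }

    public Employees_ByManagerName()
    {
        Map = employees =>
            from employee in employees
            // LoadDocument creates the tracked relationship between the documents.
            let manager = LoadDocument<Employee>(employee.ReportsTo)
            select new
            {
                ManagerName = manager.Name
            };
    }
}

// All employees whose manager is named John:
var reports = session.Query<Employees_ByManagerName.Result, Employees_ByManagerName>()
    .Where(x => x.ManagerName == "John")
    .OfType<Employee>()
    .ToList();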

It is also the source of much abuse in the wild. If you don’t model properly, it is often easy to paper over that using LoadDocument in the indexes.

The problem is that in RavenDB 3.x, an update to a document that was referenced using LoadDocument also required touching all of the referencing documents. This slowed down writes, which is something that we generally try really hard to avoid, and it could also cause availability issues if there were enough referencing documents (as in, all of them, which happened more frequently than you might think).

With RavenDB 4.0, we knew that we had to do better. We did this by completely changing how we handle LoadDocument tracking. Instead of having to re-index all the relevant values globally, we now track them on a per-index basis. In each index, we track the relevant references on a per-collection basis, and as part of the indexing we'll check whether there have been any updates to any of the documents that we have referenced. If we do have a document that has a lot of referencing documents, it will still take some time to re-index all of them, but that cost is now limited to just the index in question.

You can still create an index and slow it down in this manner, but the pay-to-play model is much nicer: there is no effect on the write speed for documents and no general impact on the whole database, which is pretty sweet. The only way you would ever run into this feature is if you ran into this problem in 3.x and tried to avoid it, which is now not necessary (although the same modeling concerns apply).

RavenDB 4.0 Unsung Heroes: Map/reduce

time to read 6 min | 1060 words

One of the hardest things that we did in RavenDB 4.0 will probably go completely unnoticed by users. We completely rewrote how RavenDB processes map/reduce queries. One of my popular blog posts is still a Visual Explanation to Map/Reduce, and it still does a pretty good job of explaining what map/reduce is.

The map/reduce code in RavenDB 3.x is one of the more fragile things that we have, requiring you to maintain in your head several completely different states that a particular reduction can be in and how they transition between them. Currently, there are probably two guys* who still understand how it works and one guy that is still able to find bugs in the implementation. It is also not as fast as we wished it would be.

So with RavenDB 4.0 we set out to build it from scratch, based in no small part on the fact that we had also written our own storage engine for 4.0 and were able to take full advantage of that. You can read about the early design in this blog post, but I'm going to do a quick recap and explain how it works now.

The first stage in map/reduce is… well, the map. We run over the documents and extract the key portions we'll need for the next part. We then immediately apply the reduce on each of the results independently. This gives us the final map/reduce results for a single document. More to the point, this also tells us what the reduce key for the results is. The reduce key is the value that the index groups on.

We store all of the items with the same reduce key together. And here is where it gets interesting. Up until a certain point, we just store all of the values for a particular reduce key as an embedded value inside a B+Tree. That means that whenever any of the values changes, we can add that value to the appropriate location and reduce all the matching values in one go. This works quite well until the total size of all the values exceeds about 4KB or so.

At this point, we can't store the entire thing as an embedded value, and we move all the values for that reduce key to their own dedicated B+Tree. This means that we start with a single 8KB page and fill it up, then split it, and so on. But there is a catch. The results of a map/reduce operation tend to be extremely similar to one another. At a minimum, they share the same properties and the same reduce key. That means that we would end up storing a lot of duplicate information. To resolve that, we also apply recursive compression. Whenever a page nears 8KB in size, we will compress all the results stored in that page as a single unit. This tends to have a great compression rate and can allow us to store up to 64KB of uncompressed data in a single page.

When adding items to a map/reduce index, we apply an optimization so it looks like:

results = reduce(results, newResults);

Basically, we can utilize the recursive nature of reduce to optimize things for the append-only path.

When you delete or update documents and results change or are removed, things are more complex. We handle that by running a re-reduce on the results. Now, as long as the number of results is small (this depends on the size of your data, but typically up to a thousand or so) we'll just run the reduce over the entire result set. Because the data is always held in a single location, this means that it is extremely efficient in terms of memory access, and the tradeoff between computation and storage leans heavily to the side of just recomputing things from scratch.

When we have too many results (the total uncompressed size exceeds 64KB) we start splitting the B+Tree and adding a level to the tree. At this point, the cost of updating a value is now the cost of updating a leaf page plus the reduce operation on the root page. When we have more data still, we will get yet another level, and so on.

The (rough) numbers are:

  • Up to 64KB (roughly 1000 results) – 1 reduce for the entire dataset
  • Up to 16 MB – 2 reduces (1 for up to 1000 results, 1 for up to 254 results)
  • Up to 4 GB – 3 reduces (1 for up to 1000 results, 2 for up to 254 results each)
  • Up to 1 TB – 4 reduces (1 for up to 1000 results, 3 for up to 254 results each)
  • I think you get how it works now, right? The next level up is 1 to 248 TB and will require 5 reduces (see the rough arithmetic below).
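That progression is just the fanout math (an approximation, assuming ~64KB of uncompressed results per leaf page and a branch fanout of about 254 entries per page, per the numbers above):

const long LeafBytes = 64 * 1024;        // 1 reduce, roughly 1,000 small results
const long Level2 = LeafBytes * 254;     // ~16 MB -> 2 reduces
const long Level3 = Level2 * 254;        // ~4 GB  -> 3 reduces
const long Level4 = Level3 * 254;        // ~1 TB  -> 4 reduces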

These numbers are for the case where your reduce data is very small, on the order of a few dozen bytes. If you have larger data, the tree will expand faster, and you'll get fewer results per reduce at the first level.

Note that at the first level, if there is only an addition (a new document, basically), we can process that as a single operation between two values and then proceed upward as the depth of the tree requires. There are also optimizations in place if we have multiple updates to the same reduce key; in that case, we can first apply all the updates, then do the reduce once for all of them in one shot.

And all of that is completely invisible to the users, unless you want to peek inside, which is possible using the Map/Reduce visualizer:

[screenshot: the Map/Reduce visualizer]

This can give you insight deep into the guts of how RavenDB is handling map/reduce operations.

The current status is that map/reduce indexes are actually faster than normal indexes, because they are almost entirely our own code, while a large portion of the normal indexing cost is in Lucene.

* That is an exaggeration; there is one guy who knows how it works. Okay, okay, I'll admit that we can dive into the code and figure out what is going on, but it takes quite a bit of time if there is a significant issue there.

RavenDB 4.0 nightly builds are now available

time to read 2 min | 245 words

With the RC release out of the way, we are starting on a much faster cadence of fixes and user-visible changes as we get ready for the release.

In order to allow users to report issues and have them resolved as soon as possible, we now publish our nightly builds.

The nightly release is literally just whatever we have at the top of the branch at the time of the release. A nightly release goes through the following release cycle:

  • It compiles
  • Release it!

In other words, a nightly should be used only in a development environment, where you are fine with the database deciding that names must be "Green Jane", burping all over your data, or investigating how hot we can make your CPU.

More seriously, nightlies are a way to keep up with what we are doing, and their stability is directly related to what we are currently doing. As we come closer to the release, the nightly builds' stability is going to improve, but there are no safeguards there.

It means that the typical turnaround for most issues can be as low as 24 hours (and it gives me back the ability to say, "thanks for the bug report, fixed and it will be available tonight"). All other releases remain with the same level of testing and preparedness.
