Ayende @ Rahien


The design of RavenDB 4.0: You can’t see the map/reduce from all the trees

time to read 6 min | 1009 words

Map/Reduce is a core part of RavenDB, one of the earliest features that we implemented and something that we have worked to improve many times. You can read my original blog post about them. In the current codebase, Map/Reduce is also one of the scariest pieces of code, and one of the most fragile. The good thing is that we have a really high number of tests around it. The sad thing is that it takes a long time to make any modification there to stick without breaking something else.

It is a complex topic from the get go, and building a performant version of it was not trivial, so all put together, this is one of the more complex pieces of code in RavenDB. I’m not going to go into the details of how it currently works; it is that complex.

That complexity really bugged me. This is an area of the code that only a few of us had approached, and usually with various levels of dread. We spent so much time simplifying things in RavenDB 4.0 that I couldn’t really abide the concept of having to bring this complexity over and having to live with it. But we couldn’t find a better way to do it.

Until I realized that I was thinking about this at totally the wrong level. The way to handle it was to take the level of abstraction down, way down. Again, being able to control the storage at a very low level helps a lot.

I have talked before about B+Trees, and this time, we are going to use their structure directly. Here is how it works. Let us assume that we have the following map/reduce index:
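
Such an index looks roughly like the sketch below, written in RavenDB's usual AbstractIndexCreationTask style. The class, collection and property names (Products_Totals, Lines, PricePerUnit and so on) are illustrative assumptions, not the original sample:

public class Products_Totals : AbstractIndexCreationTask<Order, Products_Totals.Result>
{
    public class Result
    {
        public string Product { get; set; }
        public decimal Total { get; set; }
    }

    public Products_Totals()
    {
        // Map: emit one entry per order line, keyed by the product it refers to.
        Map = orders => from order in orders
                        from line in order.Lines
                        select new Result
                        {
                            Product = line.Product,
                            Total = line.Quantity * line.PricePerUnit
                        };

        // Reduce: group the mapped entries by product and sum their totals.
        Reduce = results => from result in results
                            group result by result.Product into g
                            select new Result
                            {
                                Product = g.Key,
                                Total = g.Sum(x => x.Total)
                            };
    }
}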

This operates on orders, and gives us the total purchases per product. Now, for the actual implementation. We first run the map function over a set of orders, giving us the mapped results, which we’ll store in a B+Tree. A B+Tree is a key/value structure, and previously we used composite keys made up of the reduce key, the document key, and a whole bunch of other stuff. That led to a whole lot of complications.

Now, we are going to have a separate tree per reduce key. Confusing? Let’s see if I can explain. Each reduce key (a reduce key is the thing that you group by, in this case the product id) is going to have its own dedicated tree. And the content of that tree is going to use the document id as the key and the mapped result for that particular reduce key as the value. So far, that isn’t really impressive. We changed things around: instead of having a single tree for all mapped results, we are going to have a tree per reduce key.

But here is where it gets interesting. B+Trees are… well, trees. That means that they are hierarchical in nature. What we did was ask the tree to let us know about all the changed pages. Consider the image below, which shows the status of the database after indexing a few orders, which have line items for products/17 and products/10.

[Image: the per-reduce-key B+Trees after indexing a few orders with line items for products/17 and products/10]

We are now going to update orders/2. So the next thing that we need to do is run the map over the updated order, giving us new entries for products/17 and products/10, each of them in a different tree. Because we can ask the tree what pages have changed, we can now do the following (sketched in code right after the list):

  • For each changed page:
    • calculate the new value of the page.
    • mark the parent page as changed
  • Repeat until there are no changed pages.
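
Here is a minimal, self-contained sketch of that loop, assuming a toy Page class and summation as the reduce function. It only illustrates the bottom-up propagation; it is not RavenDB’s actual internals:

using System.Collections.Generic;
using System.Linq;

// Illustrative stand-in for a B+Tree page: leaves hold mapped values,
// branches hold child pages, and each page caches its reduced value.
class Page
{
    public Page Parent;
    public List<long> Values = new List<long>();   // leaf pages: mapped results
    public List<Page> Children = new List<Page>(); // branch pages: child pages
    public long ComputedReduceValue;
}

static class IncrementalReduce
{
    public static void Run(IEnumerable<Page> changedLeafPages)
    {
        var pending = new Queue<Page>(changedLeafPages);
        while (pending.Count > 0)
        {
            var page = pending.Dequeue();

            // Re-reduce this page: leaves reduce their own entries, branches
            // reduce the already-computed values of their children.
            page.ComputedReduceValue = page.Children.Count == 0
                ? page.Values.Sum()
                : page.Children.Sum(c => c.ComputedReduceValue);

            // Mark the parent as changed, and repeat until we reach the root.
            if (page.Parent != null && pending.Contains(page.Parent) == false)
                pending.Enqueue(page.Parent);
        }
    }
}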

Let us see this in detail. In Page 14, which is the (small) tree for products/10, we don’t really have much to do. We just need to run the reduce over all the entries in the page, and we have the final result. Things are more interesting with products/17.

Here, we updated Page 31 during the map process. When we get to the reduce, we discover that, which means that we need to re-reduce Page 31. But that isn’t the end. Now we also need to update things upward; in our case, we need to update Page 13. You might have noticed the list on the side: it holds the computed reduce value for each page. When we are done updating Page 31, we go to its parent, Page 13. We take the computed values for Pages 31, 44 and 32 and reduce those, giving us the final value for Page 13. At which point, we go up yet again, and reduce Page 13 and Page 33 together, resulting in the final value.

All in all, this is pretty simple to explain, was very easy to implement, and is going to handle the complexities of map/reduce in a dramatically better fashion.

In the example above, I’m showing it working with very few entries per page. In practice, we are usually talking about roughly 200 entries per page, so once a reduce key has more than 200 entries, we start going through the multi-step process.

Our go-to example for map/reduce was the US census. 310 million people, more or less, and we want to build a count of how many people per state. That gives us California, with roughly 40 million people in it. Adding a new person to California using this system will result in a B+Tree that has a depth of 5, and re-reducing after an update means walking a single leaf-to-root path of about 5 pages with roughly 200 entries each, so it takes less than a thousand operations to re-compute. And that is for an update / delete operation.

If we have a new value, we can skip the whole thing, and re-update the map/reduce in about 6 operations. And the entire codebase is easy to read, and we expect it to be much faster as well.

RavenDB 3.5 beta is out!

time to read 2 min | 216 words

RavenDB 3.5 has been in development for over two years, and we finally get to see it taking its first steps out of the gate.

You can read about all the goodies in this blog post series, and we have more features being unveiled each day, until the RavenDB Conference next week.

You can get build 35100 right now, and start playing with the bits.

What does this mean? Beta means just that: the software is pretty much ready, we have all the features we want, and while there are still a bunch of pending tasks (mostly in the UI and outer layers), the software is ready for actual use.

We expect the beta period to last about six to eight weeks, during which we’ll finish the remaining work and run it through the usual wringer (longevity, performance, stabilization, etc.) before going to RC and finally RTM. We are currently expecting the RTM release in the July/August time frame.

I’ll be talking about this release a lot at the RavenDB Conference.

We’ll be updating the website and the documentation for the next few weeks, as part of the release cycle, but I wanted to get the bits into your hands as soon as possible.

RavenDB 3.5 whirl wind tour: Deeper insights to indexing

time to read 2 min | 303 words

The indexing process in RavenDB is not trivial; it is composed of many small steps that need to happen and coordinate with one another to reach the end goal.

A lot of the complexity involved is related to concurrent usage, parallelization of I/O and computation and a whole bunch of other stuff like that. In RavenDB 3.0 we added support for visualizing all of that, so you can see what is going on, and can tell what is taking so long.

With RavenDB 3.5, we extended that from just looking at the indexing of documents to looking into the other parts of the indexing process.

[Image: indexing performance timeline showing prefetching cycles alongside indexing cycles]

In this image, you can see the green rectangles representing prefetching cycles of documents from the disk, happening alongside indexing cycles (represented as the colored rectangles), with visible concurrent work and separate stages.

The good thing about this is that it makes it easy to see whether there is any waste in the indexing process. If there is a gap, we can see it and investigate why it happens.

Another change we made was automatic detection of corrupted Lucene indexes (a typical case: if you run out of disk space, it is possible that the Lucene index will be corrupted, and after disk space is restored, it is often not recoverable by Lucene). We can now automatically detect that scenario and apply the right action (trying to recover from a previous instance, or resetting the index).

As a reminder, we have the RavenDB Conference in Texas in a few months, which would be an excellent opportunity to see RavenDB 3.5 in all its glory.


The design of RavenDB 4.0: Separation of indexes and documents

time to read 4 min | 629 words

In my last post on the topic, I discussed physically separating documents of different collections. This post is about the same concept, but applied at a much higher level. In RavenDB, along with the actual indexing data, we also need to keep track of quite a few details: what we last indexed, what our current state is, any errors that happened, referenced documents, etc. For map/reduce indexes, we have quite a bit more data that we need to work with: all the intermediate results of the map/reduce process, along with bookkeeping information about how to efficiently reduce additional values, etc.

All of that information is stored in the same set of files as the documents themselves. As far as the user is concerned, this is mostly relevant when we need to delete an index. Because on large databases the deletion of a big index can take a while, this was an operational issue. In RavenDB 3.0 we changed things so index deletion would be async, which improved matters significantly. But on large databases with many indexes, that still got us into problems.

Because all the indexes were using the same underlying storage, the number of values that we had to track was high, proportional to the number of indexes and the number of documents they indexed. That means that in a particular database with a hundred million documents and three map/reduce indexes, we had to keep track of over half a billion entries. B+Trees are really amazing creatures, but one of their downsides is that once they get to a certain size, they slow down as the cost of traversing the tree becomes very high.

In relational terms, we put all the indexing data into a single table, and had an IndexId column to distinguish between the different records. And once the table got big enough, we had issues.

One of the design decisions we made in the build-up to RavenDB 4.0 was to remove multi-threaded behavior inside Voron, which led to an interesting problem with having everything in the same Voron storage: we wouldn’t be able to index and accept new documents at the same time (I’ll have another post about this design decision).

The single-threaded nature and the problems with index deletion have led us toward an interesting decision. A RavenDB database isn’t actually composed of a single Voron storage. It is composed of multiple storages, each of them operating independently of the others.

The first one, obviously, is for the documents. But each of the indexes now has its own Voron storage. That means that they are totally independent from one another, which leads to a few interesting implications:

  • Deleting an index is as simple as shutting down the indexing for this index and then deleting the Voron directory from the file system (see the sketch after this list).
  • Each index has its own independent data structures, so having multiple big indexes isn’t going to cause us to pay the price of all of them together.
  • Because each index has a dedicated thread, we aren’t going to see any complex coordination between multiple actors needing to use the same Voron storage.
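
To make the first point concrete, here is a rough sketch of what index deletion collapses to under this design. The directory layout and the StopIndexingThread helper are assumptions for illustration, not RavenDB’s actual code:

using System.IO;

static class IndexCleanup
{
    // Assumed layout: each index owns its own Voron directory under
    // <database>/Indexes/<index name>.
    public static void DeleteIndex(string databasePath, string indexName)
    {
        StopIndexingThread(indexName); // stop only this index's dedicated thread

        // All of the index's state (including its Lucene data, stored in Voron)
        // lives in this one directory, so a recursive delete removes it all.
        var indexDir = Path.Combine(databasePath, "Indexes", indexName);
        Directory.Delete(indexDir, recursive: true);
    }

    static void StopIndexingThread(string indexName)
    {
        // Elided in this sketch: signal the index's thread to exit and wait for it.
    }
}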

This is important, because in RavenDB 4.0, we are also storing the actual Lucene index inside the Voron storage, so the amount of work that we now require it to deal with is much higher. By splitting it along index lines, we have saved ourselves a whole bunch of headaches on how to manage them properly.

As a reminder, we have the RavenDB Conference in Texas shortly, which would be an excellent opportunity to discuss RavenDB 4.0 and see what we already have done.


RavenDB 3.5 whirl wind tour: Digging deep into the internals

time to read 3 min | 491 words

So far I talked mostly about the visible parts of the stuff that we did in RavenDB 3.5, stuff that has a user interface and is actually easy to talk about. In this post, I'm going to dive a bit into the stuff that goes in the core, which no one usually notices except us, except when it breaks.

RavenDB Lucene Parser

A frequent cause for complaint with RavenDB is the fact that the Lucene Query Parser relies on exceptions for control flow. That means that if you are debugging a test that is using RavenDB, you are likely to be stopped by a LookAheadSuccessException during debugging. This is handled internally, but the default VS configuration will stop on all exceptions, which has caused more than a single person to assume that there is actually some error and post a question to the mailing list.

But the reason we decided to implement our own parser wasn't the issue of exceptions. It was performance and stability. RavenDB doesn't actually use the Lucene syntax for queries, we have extended it in several ways (for example, the @in<Group>: (Relatives, Friends) syntax). Those extensions to the syntax were implemented primarily as pre and post processing over the raw query string using regular expressions. And you know the saying about that. Under profiling, it turned out that a significant amount of time was spent in that processing, and in particular, in those regexes.

So we wrote our own parser for the extended query language. That gives us an extremely efficient parser, no exceptions during parsing, and a formal grammar that we can stick to. If you care, you can read the full grammar here.

Explicit thread management for indexing

In RavenDB 3.0, we rely on the standard .NET ThreadPool for index execution, and this has led to some interesting issues related to thread starvation, especially when you have many concurrent requests that take up threads. The fact that the .NET ThreadPool has a staggered growth pattern also has an impact here, in terms of how much we can actually scale out.

By creating our own thread pool, dedicated to our own stuff, we are able to do things that you can't do with the global thread pool. For example, we can respond to CPU pressure by reducing the priority of the indexing thread pool threads, so we'll prefer to process requests rather than do background work. We also get more predictable behavior around indexing batches, and can abandon an index midway through a batch to ensure liveness for the entire indexing process.
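
A bare-bones sketch of such a dedicated pool is shown below, just to illustrate the priority trick. The class shape, naming and sizing are assumptions for illustration, not RavenDB’s actual thread pool:

using System;
using System.Collections.Concurrent;
using System.Threading;

class IndexingThreadPool
{
    private readonly BlockingCollection<Action> _work = new BlockingCollection<Action>();
    private readonly Thread[] _threads;

    public IndexingThreadPool(int threadCount)
    {
        _threads = new Thread[threadCount];
        for (var i = 0; i < threadCount; i++)
        {
            _threads[i] = new Thread(Worker)
            {
                IsBackground = true,
                Name = "Indexing #" + i,
                // Indexing is background work by default, so requests win.
                Priority = ThreadPriority.BelowNormal
            };
            _threads[i].Start();
        }
    }

    public void Enqueue(Action indexingBatch) => _work.Add(indexingBatch);

    // Under CPU pressure, push indexing even further down so that request
    // processing threads get the CPU first.
    public void ReducePriority()
    {
        foreach (var thread in _threads)
            thread.Priority = ThreadPriority.Lowest;
    }

    private void Worker()
    {
        foreach (var batch in _work.GetConsumingEnumerable())
            batch();
    }
}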

And what is almost as important, the fact that we have our own thread pool for indexing means that we can now much more easily report on it and monitor it. Which makes our lives much easier in production.

As a reminder, we have the RavenDB Conference in Texas in a few months, which would be an excellent opportunity to see RavenDB 3.5 in all its glory.


The design of RavenDB 4.0: Voron has a one track mind

time to read 3 min | 471 words


When we started to build Voron, we based some of its behavior around how LevelDB works. While the storage details are heavily influenced by LMDB, we did take a few things from LevelDB. In particular, the idea of transaction merging.

You can read about our implementation in our design notes for Voron from 2013. The idea is that even though you have a single writer, you can prepare the transaction separately, then submit the transaction to be processed by a dedicated thread. This thread will merge all pending transaction requests into a single physical transaction and allow us to parallelize some of the work, amortizing the cost of going to disk across multiple concurrent transactions.
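
For readers who have not seen the technique, here is a rough, self-contained sketch of transaction merging (not Voron’s actual code): callers hand a prepared write to a queue, and one dedicated thread applies everything that is pending inside a single physical transaction:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

class TransactionMerger
{
    private readonly BlockingCollection<(Action<WriteTransaction> Work, ManualResetEventSlim Done)> _pending =
        new BlockingCollection<(Action<WriteTransaction> Work, ManualResetEventSlim Done)>();

    public TransactionMerger()
    {
        new Thread(MergeLoop) { IsBackground = true, Name = "Tx Merger" }.Start();
    }

    // Callers prepare their work up front, then wait for the merged commit.
    public void Write(Action<WriteTransaction> work)
    {
        var done = new ManualResetEventSlim();
        _pending.Add((work, done));
        done.Wait();
    }

    private void MergeLoop()
    {
        while (true)
        {
            // Take one request, then drain whatever else is already queued.
            var batch = new List<(Action<WriteTransaction> Work, ManualResetEventSlim Done)> { _pending.Take() };
            while (_pending.TryTake(out var more))
                batch.Add(more);

            using (var tx = new WriteTransaction()) // stand-in for the single write tx
            {
                foreach (var request in batch)
                    request.Work(tx);               // apply all pending requests
                tx.Commit();                        // one disk flush amortized over the batch
            }

            foreach (var request in batch)
                request.Done.Set();                 // release the waiting callers
        }
    }
}

// Stand-in type, just to keep the sketch self-contained.
class WriteTransaction : IDisposable
{
    public void Commit() { }
    public void Dispose() { }
}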

This is how Voron is running now, and it was a feature that I, personally, was very excited about.  And in RavenDB 4.0, we killed this feature.

If this is such a great and exciting feature, why kill it and go to a single writer only mode?

There are actually several distinct reasons, each of them serving as a big black mark against this feature.

The first strike against this feature is that it results in much higher memory usage and more copying of data. Whenever we need to create a transaction, we have to write all the data into a temporary buffer, which is then sent to the transaction merger. This results in memory hanging around longer, higher allocations, and double copying of the data.

The second strike against this feature is that it results in unpredictable behavior. Because transactions are merged on a first come / first served basis, small differences in the execution of transactions can dramatically change the order of operations that is actually committed. Usually it doesn’t matter, but if we need to track down a particular issue, that is really important. Having a single writer means that we have very predictable behavior.

The third strike against this feature is that it leads to concurrency-aware code. Because you are going to submit a transaction to be processed, there are potentially other transactions that can change the data that you rely on. We have ways to handle that, by requesting optimistic concurrency checks, but this ends up being quite complex to manage properly.

The fourth strike against this feature is that the major reason it was needed was that we wanted to be able to parallelize the work of indexing and documents, and it was meant to handle just that. But the re-shaping of the index storage and document storage means that we now have separate Voron storages for the documents and for each index, so we still have this ability, but we were able to remove this code and reduce our complexity significantly.

RavenDB 3.5 whirl wind tour: I'll have the 3+1 goodies to go, please

time to read 5 min | 820 words

I spoke a lot about relatively large features, such as Clustering, Global Configuration and the Admin Console. But RavenDB 3.5 represents a lot of time and effort from a pretty large team. Some of those changes you'll probably not notice: additional endpoints, better heuristics, stuff like that. And some of them are small, tiny features that get lost in the crowd. This post is about them.

Sharding

RavenDB has always had a customizable sharding strategy; while the default sharding strategy is pretty smart, you can customize it based on your own knowledge of your system. Unfortunately, we had one problem with that. While you could customize it, when you did so, it was pretty much an all or nothing approach. Instead of RavenDB analyzing the query, extracting the relevant servers to use for the query and then only hitting them, you got the query and had to do all the hard work yourself.

In RavenDB 3.5, we changed things so you have multiple levels of customizing this behavior. You can still take over everything, if you want to, or you can let RavenDB do all the work, give you the list of servers that it thinks should be included in the query, and then apply your own logic. This is especially important in scenarios where you split a server, so the data that used to be on "RVN-A" is now located on "RVN-A" and on "RVN-G". RavenDB analyzes the query, does its thing and ends up saying, "I think that I need to go to RVN-A", and then you can detect that and simply say: "Whenever you want to go to RVN-A, also go to RVN-G". So the complexity threshold is much lower.
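
As a hedged illustration (the hook name and shape here are my own, not the actual RavenDB 3.5 sharding API), the kind of logic you end up writing at that level looks roughly like this:

using System.Collections.Generic;

static class ShardAdjustment
{
    // Post-process the list of servers RavenDB picked for a query.
    public static IEnumerable<string> AdjustShards(IEnumerable<string> serversChosenByRaven)
    {
        foreach (var server in serversChosenByRaven)
        {
            yield return server;

            // The data that used to live only on RVN-A was split, so whenever
            // RavenDB decides to go to RVN-A, also go to RVN-G.
            if (server == "RVN-A")
                yield return "RVN-G";
        }
    }
}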

Nicer delete by index

The next feature is another tiny thing. RavenDB supports the ability to delete all records matching a query. But in RavenDB 3.0, you had to construct the query yourself (typically you would call ToString() on an existing query) and the API was a bit awkward. In RavenDB 3.5, you can now do the following:

await session.Advanced.DeleteByIndexAsync<Person, Person_ByAge>(x => x.Age < 35);

And this will delete all those youngsters from the database. Easy, simple, and pretty obvious.

I did mention that this was the post about the stuff that typically goes under the radar.

Query timings and costs

This nice feature appears in the Traffic Watch window. When you are looking into queries, you'll now not only get all the relevant details about the query, but will also be able to see the actual cost of serving it.

[Image: Traffic Watch showing the details, timings and costs of a query]

In this case, we are seeing a very expensive query, even though it is a pretty simple one. Just looking at this information, I can tell you why that is.

Look at the number of results: we are getting 22 out of 1.4 million. The problem is that Lucene doesn't know which ones to return, it needs to rank them by how well they match the query. In this case, they all match the query equally well, so we waste some time sorting the results only to get the same end result.

Explain replication behavior

Occasionally we get a user that complains that after setting up replication, some documents aren't being replicated. This is usually by design, because of some configuration or behavior, but getting to the root of it can take a non-trivial amount of time.

In order to help that, we added a debug endpoint and a UI screen to handle just that:

[Image: the replication explanation screen in the studio]

Now you can explicitly ask: why did you skip replication for this document? RavenDB will run through its logic tree and tell you exactly what the reason is. No more needing to read the logs and find exactly the right message; you can just ask, and get an immediate answer back.

Patch that tells you how much work was done

This wasn't meant to be in this post; as you can see from the title, this is a tiny, stupid little feature. But I utterly forgot that we had it, and when I saw it I just loved it.

The premise is very simple: you are running a patch operation against a database, which can take a while. RavenDB will now report to you what the current state of the process is, so you can easily see that it is progressing, and not just wait until you get the "good / no good" message at the end.

[Image: patch operation progress report in the studio]

As a reminder, we have the RavenDB Conference in Texas shortly, which would be an excellent opportunity to see RavenDB 3.5 in all its glory.


RavenDB 3.5 Whirlwind tour: I need to be free to explore my data

time to read 1 min | 164 words

This feature is primarily about giving operations / developers the ability to just say: "I don't care about performance, I don't care about cost, I need to find something out, and I wanna do this quickly".

For those cases, you can use Data Exploration:

[Image: the Data Exploration screen in the studio]

This allows you to write Linq statements that are processed on the server with full access to the entire data set. Note that this has the option of running for a very long time, so we also provide a way to limit the cost based on time and the number of documents to process.
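
For instance, the kind of ad-hoc statement you might run there looks something like the query below (the collection and property names are assumptions based on the standard Northwind-style sample data, not an exact screenshot of the feature):

// Find the most expensive orders for a single company, processed server-side.
from order in orders
where order.Company == "companies/1" && order.Freight > 50
orderby order.Freight descending
select new { order.Id, order.Company, order.Freight }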

And, of course, because those kinds of tasks are almost always generated at the behest of a business analyst, we can get the results as an Excel file and dump the data in someone else's lap to deal with.

The design of RavenDB 4.0: Physically segregating collections

time to read 5 min | 915 words

When we started writing RavenDB, the idea of a collection was this notion of “just a way to say that those documents are roughly similar”. We were deep in the schemaless nature of the system, and it made very little sense at the time to split different documents apart. By having all documents in the same location (and by that, I mean that they were all effectively stored in the same physical format, and the only way to tell the difference between a User document and an Order document was by reading their metadata), we were able to do some really cool things. Indexes could operate over multiple collections easily, replication was simple, exporting documents was a very natural operation, etc.

Over time, we learned by experience that most of the time, documents in separate collections are truly separate. They are processed differently, behave differently, and users expect to be able to operate on them differently. This is mostly visible when users have a large database and try to define an index on a small collection, and are surprised when it can take a while to index. The fact that we need to go over all the documents (because we can’t tell them apart before we read them) is not something that is in the mental model for most users.

We have worked around most of that by utilizing our own indexing structure. The Raven/DocumentsByEntityName index is used to do quite a lot. For example, we are often able to optimize the “small collection, new index” scenario using Raven/DocumentsByEntityName, deletion / patching of collections through the studio uses it, etc.

In RavenDB 4.0 we decided to see what it would take to properly segregate collections. As it turned out, this is actually quite hard to do, because of the etag property.

The etag property of RavenDB goes basically like this: Etags are numbers (128 bits in RavenDB up to 3.x, 64 bits in RavenDB 4.0 and onward) that are always increasing, and each document change will result in a higher etag being generated. You can ask RavenDB to give you all documents since a particular etag, and by continually doing this, you’ll get all documents in the database, including all updates.
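
A sketch of what “continually asking since the last etag” looks like for a consumer such as replication or export follows; the GetDocumentsAfter and Process calls are stand-ins for illustration, not an actual API:

// Pull everything the database has seen since our last etag, in batches,
// until we catch up. Since etags only ever increase, the last etag we saw
// is all the state this consumer needs to keep.
long lastEtag = 0;
while (true)
{
    var batch = GetDocumentsAfter(lastEtag, pageSize: 1024); // hypothetical helper
    if (batch.Count == 0)
        break;                    // caught up, nothing newer than lastEtag

    foreach (var doc in batch)
    {
        Process(doc);             // replicate / export / index the document
        lastEtag = doc.Etag;      // remember how far we got
    }
}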

This property is the key to quite a lot of stuff internally: replication, indexing, subscriptions, exports, the works.

But by putting documents in separate physical locations, we won’t have an easy way to scan through all of them. In RavenDB 3.0, we effectively have an index of [Etag, Document], and the process of getting all documents after a particular etag is extremely simple and cheap. But if we segregate collections, we’ll need to merge the information from multiple locations, which is non-trivial and has a complexity of O(N log N).

There is also the issue of the document keys namespace, which is global (so you can’t have a User with the document key “users/1” and an Order with the document key “users/1”).

Remember the previous post about choosing Voron as our storage engine for RavenDB 4.0? This is one of the reasons why. Because we control the storage layer, we were able to come up with an elegant solution to the problem.

Each of our collections is going to be stored in a separate physical structure, with its own location on disk, its own indexes, etc. But at the same time, all of those separate collections are also going to share a pair of indexes (document key and document etag). In this manner, we effectively index the document etag twice. First it is indexed in the global index, along with all the documents in the database, regardless of which collection they are in. This index is used for replication, exports, etc. And it is indexed again in a per collection index, which is what we’ll use for indexing, patching by collection, etc. In the same manner, the document key index is going to allow us to look up documents by their key without needing to know what collection they are in.
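
Conceptually, every document write now touches three storage indexes. A toy sketch, with dictionaries standing in for Voron trees and all names invented for illustration:

using System.Collections.Generic;

class CollectionAwareDocumentStore
{
    // etag -> storage id, across all collections (replication, exports, ...)
    private readonly SortedDictionary<long, long> _globalEtags = new SortedDictionary<long, long>();

    // collection name -> (etag -> storage id), used for indexing, patching by collection, ...
    private readonly Dictionary<string, SortedDictionary<long, long>> _collectionEtags =
        new Dictionary<string, SortedDictionary<long, long>>();

    // document key -> storage id; global, so keys stay unique across collections
    private readonly Dictionary<string, long> _keys = new Dictionary<string, long>();

    private long _lastEtag;

    public void Put(string collection, string key, long storageId)
    {
        var etag = ++_lastEtag;                 // etags only ever increase

        _globalEtags[etag] = storageId;         // indexed once globally...

        if (!_collectionEtags.TryGetValue(collection, out var perCollection))
            _collectionEtags[collection] = perCollection = new SortedDictionary<long, long>();
        perCollection[etag] = storageId;        // ...and once per collection

        _keys[key] = storageId;                 // O(1) lookup by document key

        // A real implementation would also remove the document's previous etag
        // entries on update; that is elided in this sketch.
    }
}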

Remember, those indexes actually store an id that gives us O(1) access to the data, which means that processing them is going to be incredibly cheap.

This has a bunch of additional advantages. To start with, it means that we can drop the Raven/DocumentsByEntityName index, it is no longer required, since all its functions are now handled by those internal storage indexes.

Loaded terminology term: Indexes in RavenDB can refer either to the indexes that users define and are familiar with (such as the good ol' Raven/DocumentsByEntityName) or to storage indexes, which are internal structures inside the RavenDB engine and aren’t exposed externally.

That has the nice benefit of making sure that all collection data is now transactional and is updated as part of the write transaction.

So we had to implement a relatively strange internal structure to support this segregation. But aside from the fact that “collections are physically separated from one another”, what does this actually give us?

Well, it makes certain tasks that work on a per collection basis, such as indexing, subscriptions, patching, etc., much easier. You don’t need to scan all documents and filter out the stuff that isn’t relevant. Instead, you can just iterate over the entire result set directly. And that has its own advantages. Because we are storing documents in separate internal structures per collection, there is a much stronger chance that documents in the same collection will reside near one another on disk. Which is going to increase performance, and opens up some interesting optimization opportunities.

RavenDB 3.5 whirl wind tour: I’ll find who is taking my I/O bandwidth and they SHALL pay

time to read 2 min | 326 words

I previously mentioned that a large part of what we need to do as a database is to actively manage our resources. Things like CPU usage and memory are relatively easy to manage (to a certain extent), but one of the things that we trip over again and again is the issue of I/O.

Whether it is a cloud-based system with the I/O rate of an old IBM XT picked up from a garage after a flood, users that pack literally hundreds of extremely active databases onto the same physical storage medium, or a user that is reading the entire database (through subscriptions) on every page load, I/O is not something that you can ever have enough of. We spend an incredible amount of time trying to reduce our I/O costs, and still we run into issues.

So we decided to approach it from the other side. RavenDB 3.5 now packages Raven.Monitor.exe, which is capable of monitoring the actual I/O and pinpointing who is to blame, live. Here is what this looks like after a one minute run on a database that is currently having data imported, plus some minor indexing.

[Image: Raven.Monitor breakdown of I/O usage after a one minute run]

The idea is that we can use this ability to do two things. We can find out who is consuming the I/O on the system, and even narrow down to exactly what is consuming it, but we can also use it to find out how many resources a particular database is using, and tell based on that whether we are doing a good job of utilizing the hardware.

As a reminder, we have the RavenDB Conference in Texas in a few months, which would be an excellent opportunity to see RavenDB 3.5 in all its glory.

