What is new in RavenDB 3.0: Indexing enhancements

Sep 17 2014

We talked previously about the kind of improvements we have in RavenDB 3.0 for the indexing backend. In this post, I want to go over a few features that are much more visible.

Attachment indexing. This is a feature that I am not so hot about, mostly because we want to move all attachment usages to RavenFS. But in the meantime, you can reference the contents of an attachment during indexing. That lets you do things like store large text data in an attachment, but still make it available to the indexes. That said, there is no tracking of the attachment, so if it changes, the document that refers to it won’t be re-indexed. But for the common case where both the attachments and the documents are always changed together, that can be a pretty nice thing to have.

Optimized new index creation. In RavenDB 2.5, creating a new index would force us to go over all of the documents in the database, not just the documents in the collections that the index covers. In many cases, that surprised users, because they expected there to be some sort of physical separation between the collections. In RavenDB 3.0, we changed things so that creating a new index on a small collection (by default, fewer than 131,072 items) only touches the documents that belong to the collections covered by that index. This alone represents a pretty significant change in the way we are processing indexes.

In practice, this means that creating a new index on a small collection completes much more rapidly. For example, I reset an index on a production instance that covers about 7,583 documents out of 19,191. RavenDB was able to index that in just 690 ms, out of the roughly 3 seconds that the whole index reset took.

What about the cases where we have new indexes on large collections? In 2.5, we would do round robin indexing between the new index and the existing ones. The problem was that 2.5 was biased toward the new index. That meant that it was busy indexing the new stuff, while the existing indexes (which you are actually using) took longer to run. Another problem was that in 2.5 creating a new index would effectively poison a lot of performance heuristics. Those were built on the assumption that all indexes run pretty much in tandem. And when we had one or more that weren’t doing so… well, that caused things to be more expensive.

In 3.0, we have changed how this works. We’ll have separate performance optimization pipelines for each group of indexes, based on the group’s rough indexing position. That lets us take advantage of batching many indexes together. We are also not going to try to interleave the indexes (running first the new index and then the existing ones). Instead, we’ll be running all of them in parallel, to reduce stalls and to shorten the time it takes for everything to catch up.

This uses our scheduling engine to ensure that we aren’t actually overloading the machine with computation work (concurrent indexing) or memory (number of items to index at once). I’m very proud of what we have done here, and even though this is actually a backend feature, it is too important to get lost in the minutiae of all the other backend indexing changes we talked about in my previous post.

Explicit Cartesian/fanout indexing. A Cartesian index (we usually call them fanout indexes) is an index that outputs multiple index entries per document. Here is an example of such an index:

from postComment in docs.PostComments
from comment in postComment.Comments
where comment.IsSpam == false
select new {
    CreatedAt = comment.CreatedAt,
    CommentId = comment.Id,
    PostCommentsId = postComment.__document_id,
    PostId = postComment.Post.Id,
    PostPublishAt = postComment.Post.PublishAt
}

For a large post, with a lot of comments, we are going to get an entry per comment. That means that a single document can generate hundreds of index entries. Now, in this case, that is actually what I want, so that is fine.

But there is a problem here. RavenDB has no way of knowing upfront how many index entries a document will generate. That makes it very hard to reserve the appropriate amount of memory, and it is possible to get into situations where we simply run out of memory. In RavenDB 3.0, we have added explicit instructions for this. An index has a budget: by default, each document is allowed to output up to 15 entries. If it tries to output more than 15 entries, indexing of that document is aborted, and it won’t be indexed by this index.

You can override this option either globally or on an index-by-index basis, to increase the number of index entries per document that are allowed for an index. Old indexes will have a limit of 16,384 items, to avoid breaking existing indexes.

The reasoning behind this is that there are only two options. Either you didn’t specify a value, in which case we are limited to the default 15 index entries per document, or you did specify what you believe is the maximum number of index entries outputted per document, in which case we can take advantage of that when doing capacity planning for memory during indexing.
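To make the budget rule concrete, here is a minimal sketch in plain LINQ to Objects. It is not RavenDB code: the `Comment` and `PostComments` classes, the `FanoutBudgetSketch` helper and its `TryIndex` method are all hypothetical names made up for illustration, and the way the abort is modeled (discard everything the document produced) is my assumption based on the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for the PostComments document (not RavenDB types).
public class Comment
{
    public int Id;
    public bool IsSpam;
}

public class PostComments
{
    public string DocumentId = "";
    public List<Comment> Comments = new List<Comment>();
}

public static class FanoutBudgetSketch
{
    // Same shape as the fanout index above: one output per non-spam comment.
    public static IEnumerable<object> Map(PostComments doc)
    {
        return from comment in doc.Comments
               where comment.IsSpam == false
               select (object)new { CommentId = comment.Id, PostCommentsId = doc.DocumentId };
    }

    // Assumed behavior: a document may output up to maxOutputsPerDocument entries.
    // The moment it tries to output one more, indexing of that document is aborted
    // and nothing from it ends up in the index.
    public static bool TryIndex(PostComments doc, int maxOutputsPerDocument, out List<object> entries)
    {
        entries = new List<object>();
        foreach (var entry in Map(doc))
        {
            if (entries.Count == maxOutputsPerDocument)
            {
                entries.Clear(); // budget exceeded, drop what we collected
                return false;
            }
            entries.Add(entry);
        }
        return true;
    }
}
```

With a budget of 15, a document with 15 non-spam comments is indexed in full, while a document with 16 is skipped entirely. Passing a larger budget (say 1,024) lets the second document through, which mirrors what overriding the limit for a single index is meant to achieve.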

Simpler auto indexes. This feature is closely related to the previous one. Let us say that we want to find all users that have an admin role and have an unexpired credit card. We do that using the following query:

var q = from u in session.Query<User>()
        where u.Roles.Any(x=>x.Name == "Admin") && u.CreditCards.Any(x=>x.Expired == false)
        select u;

In RavenDB 2.5, we would generate the following index to answer this query:

from doc in docs.Users
from docCreditCardsItem in ((IEnumerable<dynamic>)doc.CreditCards).DefaultIfEmpty()
from docRolesItem in ((IEnumerable<dynamic>)doc.Roles).DefaultIfEmpty()
select new {
    CreditCards_Expired = docCreditCardsItem.Expired,
    Roles_Name = docRolesItem.Name
}

And in RavenDB 3.0 we generate this:

from doc in docs.Users
select new {
    CreditCards_Expired = (
        from docCreditCardsItem in ((IEnumerable<dynamic>)doc.CreditCards).DefaultIfEmpty()
        select docCreditCardsItem.Expired).ToArray(),
    Roles_Name = (
        from docRolesItem in ((IEnumerable<dynamic>)doc.Roles).DefaultIfEmpty()
        select docRolesItem.Name).ToArray()
}

Note the difference between the two. RavenDB 2.5 would generate multiple index entries per document, while RavenDB 3.0 generates just one. What is worse is that 2.5 would generate a Cartesian product, so the number of index entries outputted in 2.5 would be the number of roles for a user times the number of credit cards they have. In RavenDB 3.0, we have just one entry, and the overall cost is much reduced. It was a big change, but I think it was well worth it, considering the alternative.
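If you want to see the numbers for yourself, the two shapes can be reproduced with plain LINQ to Objects. This is only a sketch: the `Role`, `CreditCard` and `User` classes and the `AutoIndexShapeSketch` helper are hypothetical in-memory types I made up for the example, not anything from RavenDB, and the code only counts how many entries each shape would produce.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for the User document (not RavenDB types).
public class Role { public string Name = ""; }
public class CreditCard { public bool Expired; }

public class User
{
    public List<Role> Roles = new List<Role>();
    public List<CreditCard> CreditCards = new List<CreditCard>();
}

public static class AutoIndexShapeSketch
{
    // 2.5 shape: two nested from clauses, so one entry per (credit card, role) pair.
    public static int EntriesIn25Shape(User doc)
    {
        var entries =
            from card in doc.CreditCards.DefaultIfEmpty()
            from role in doc.Roles.DefaultIfEmpty()
            select new
            {
                CreditCards_Expired = card == null ? (bool?)null : card.Expired,
                Roles_Name = role == null ? null : role.Name
            };
        return entries.Count();
    }

    // 3.0 shape: a single entry per document, with each field holding an array.
    public static int EntriesIn30Shape(User doc)
    {
        var entries =
            from d in new[] { doc }
            select new
            {
                CreditCards_Expired = d.CreditCards.Select(c => c.Expired).ToArray(),
                Roles_Name = d.Roles.Select(r => r.Name).ToArray()
            };
        return entries.Count();
    }
}
```

For a user with 3 roles and 4 credit cards, `EntriesIn25Shape` returns 12 and `EntriesIn30Shape` returns 1. Add one more role and the 2.5 shape grows to 16 entries, while the 3.0 shape stays at a single entry.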

In my next post, I’ll talk about the other side of indexing, queries. Hang on, we still have a lot to go through.

Tags:

raven

Oren Eini

CEO of RavenDB