Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.


The RavenDB indexing process: Optimization–Getting documents from disk


As I noted in my previous post, we have done major optimizations for RavenDB. One of the areas where we improved the performance was reading the documents from the disk for indexing.

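In pseudocode, the indexing loop looks like this:

while database_is_running:
  # indexes that have not yet seen the latest document changes
  stale = find_stale_indexes()
  # the point up to which those indexes are already current
  lastIndexedEtag = find_last_indexed_etag(stale)
  # the next batch of documents modified after that point
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)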

As it turned out, we had a major optimization opportunity here, because of the way the data is actually structured on disk. In simple terms, we have an on-disk index that lists the documents in the order in which they were updated, and then we have the actual documents themselves, which may be anywhere on the disk.

Instead of loading the documents in the order in which they were modified, we decided to try something different. We first query the index for the information we need to locate each document on disk, then sort those locations into the optimal access pattern, to reduce disk movement and keep the reads as sequential as possible. Once the documents are in memory, we sort them by their last update time again.
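The read pattern described above can be sketched in a few lines of Python. This is only an illustration, not RavenDB's code: `IndexEntry`, `Document`, `read_at` and the integer `disk_offset` are hypothetical names, with the offset standing in for whatever position information the storage engine actually exposes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class IndexEntry:
    etag: int         # position of the document in update order
    disk_offset: int  # hypothetical stand-in for the on-disk location


@dataclass(frozen=True)
class Document:
    etag: int
    body: str


def get_documents_since(
    index: List[IndexEntry],
    read_at: Callable[[int], Document],
    last_indexed_etag: int,
    batch_size: int,
) -> List[Document]:
    # 1. Walk the index (assumed to be sorted by etag) and collect only
    #    the locations of the next batch; no documents are read yet.
    batch = [e for e in index if e.etag > last_indexed_etag][:batch_size]
    # 2. Reorder the batch by disk position so the reads are as
    #    sequential as possible.
    by_position = sorted(batch, key=lambda e: e.disk_offset)
    # 3. Read the documents in that disk-friendly order.
    loaded = [read_at(e.disk_offset) for e in by_position]
    # 4. Back in memory, restore the update order the indexer expects.
    return sorted(loaded, key=lambda d: d.etag)


# Tiny in-memory "disk": offset -> document, deliberately scattered.
disk = {
    900: Document(1, "a"),
    100: Document(2, "b"),
    500: Document(3, "c"),
    300: Document(4, "d"),
}
index = [IndexEntry(d.etag, off) for off, d in sorted(disk.items(), key=lambda kv: kv[1].etag)]

reads: List[int] = []  # records the order in which offsets are touched


def read_at(offset: int) -> Document:
    reads.append(offset)
    return disk[offset]


docs = get_documents_since(index, read_at, last_indexed_etag=1, batch_size=3)
```

In this toy run the offsets are touched in ascending order, while `docs` still comes back ordered by etag, so the rest of the indexing loop does not need to know that the reads were reordered.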

This seems to be a perfectly obvious thing to do, assuming that you are aware of such things, but it is actually something that is very easy not to notice. The end result is quite promising, and it contributed to the more than 7x improvement in indexing performance that we saw.

But surprisingly, it wasn’t the major factor. I’ll discuss a huge perf boost in this area tomorrow.

More posts in "The RavenDB indexing process" series:

  1. (24 Apr 2012) Optimization–Tuning? Why, we have auto tuning
  2. (23 Apr 2012) Optimization–Getting documents from disk
  3. (20 Apr 2012) Optimization–De-parallelizing work
  4. (19 Apr 2012) Optimization–Parallelizing work
  5. (18 Apr 2012) Optimization

Comments

Scooletz

Speaking about the access pattern. Is it connected with a storage engine implementation? What about Esent based? Do you retrieve some metadata from it? If so, you could elaborate this topic more.

It's getting more and more interesting. I wish you'd publish it as a single, long post! :P

Ayende Rahien

Scooletz, We are talking about how we are doing this on Esent, yes. The relevant code is here: https://github.com/ayende/ravendb/blob/master/Raven.Storage.Esent/StorageActions/OptimizedIndexReader.cs

Scooletz

Correct me if I'm getting it wrong, but all Raven does is search through a given index and add the bookmarks' buffers to a collection for each doc matching the criteria. Once you've got all of them, you query Esent for the docs in the order of the sorted buffers returned by Esent, right?

Ayende Rahien

Scooletz, Yes, that is the general idea.

Jerry

It would be a nice addition to these posts if you posted a link to the actual change set. I really like reading these types of posts. Thanks!
