Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,970 | Comments: 44,505

filter by tags archive

The RavenDB indexing processOptimization–Getting documents from disk


As I noted in my previous post, we have done major optimizations for RavenDB. One of the areas where we improved the performance was reading the documents from the disk for indexing.

In Pseudo Code, it looks like this:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
  

As it turned out, we had a major optimization option here, because of the way the data is actually structured on disk. In simple terms, we have an on disk index that lists the documents in the order in which they were updated, and then we have the actual documents themselves, which may be anywhere on the disk.

Instead of loading the documents in the orders in which they were modified, we decided to try something different. We first query the information we need to find the document on disk from the index, then we sort them based on the optimal access pattern, to reduce disk movement and ensure that we have as sequential reads as possible. Then we take those results in memory and sort them based on their last update time again.

This seems to be a perfectly obvious thing to do, assuming that you are aware of such things, but it is actually something that is very easy not to notice. The end result is quite promising, and it contributed to the 7+ times improvements in perf that we had for indexing costs.

But surprisingly, it wasn’t the major factor, I’ll discuss a huge perf boost in this area tomorrow.

More posts in "The RavenDB indexing process" series:

  1. (24 Apr 2012) Optimization–Tuning? Why, we have auto tuning
  2. (23 Apr 2012) Optimization–Getting documents from disk
  3. (20 Apr 2012) Optimization–De-parallelizing work
  4. (19 Apr 2012) Optimization–Parallelizing work
  5. (18 Apr 2012) Optimization

Comments

Scooletz

Speaking about the access pattern. Is it connected with a storage engine implementation? What about Esent based? Do you retrieve some metadata from it? If so, you could elaborate this topic more.

It's getting more and more interesting. I wish you'd publish it as a single, long post! :P

Ayende Rahien

Scooletz, We are talking about how we are doing this on Esent, yes. The relevant code is here: https://github.com/ayende/ravendb/blob/master/Raven.Storage.Esent/StorageActions/OptimizedIndexReader.cs

Scooletz

Correct me if I'm getting it wrong, but all what Raven does is searching through a given index and adding bookmarks' buffers to the collection for each doc matching criteria. Once you've got all of them, you query esent for docs in order of sorted buffers returned by Esent, right?

Ayende Rahien

Scooletz, Yes, that is the general idea.

Jerry

It would be a nice addition to these posts if you posted a link to the actual change set. I really like reading these types of posts. Thanks!

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Production postmortem (5):
    29 Jul 2015 - The evil licensing code
  2. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
  3. API Design (7):
    20 Jul 2015 - We’ll let the users sort it out
  4. What is new in RavenDB 3.5 (3):
    15 Jul 2015 - Exploring data in the dark
  5. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats