Ayende @ Rahien

The RavenDB indexing process: Optimization–Getting documents from disk

As I noted in my previous post, we have done major optimizations for RavenDB. One of the areas where we improved the performance was reading the documents from the disk for indexing.

In pseudocode, the indexing loop looks like this:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
  

As it turned out, we had a major optimization opportunity here, because of the way the data is actually structured on disk. In simple terms, we have an on-disk index that lists the documents in the order in which they were updated, and then we have the actual documents themselves, which may be anywhere on the disk.

Instead of loading the documents in the order in which they were modified, we decided to try something different. We first query the index for the information we need to find each document on disk, then we sort those entries into the optimal access pattern, to reduce disk movement and keep the reads as sequential as possible. Once the documents are loaded, we sort the results in memory by their last update time again.
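To make the idea concrete, here is a minimal sketch in Python. It is not the actual RavenDB code (that lives in the Esent storage layer); the names IndexEntry, offset, read_at and get_documents_since_sorted are made up for illustration, and a plain integer offset stands in for whatever the storage engine really uses to locate a document.

```python
from typing import Callable, List, NamedTuple, Tuple


class IndexEntry(NamedTuple):
    etag: int    # position in update order (higher means updated later)
    offset: int  # hypothetical location of the document on disk


def get_documents_since_sorted(
    index: List[IndexEntry],
    read_at: Callable[[int], str],
    last_indexed_etag: int,
    batch_size: int,
) -> List[Tuple[int, str]]:
    """Load one batch of documents using disk-friendly read ordering."""
    # 1. Scan the index only: pick the next batch in update (etag) order.
    newer = [entry for entry in index if entry.etag > last_indexed_etag]
    batch = sorted(newer, key=lambda entry: entry.etag)[:batch_size]

    # 2. Sort the batch by on-disk location so the reads are as
    #    sequential as possible.
    by_location = sorted(batch, key=lambda entry: entry.offset)

    # 3. Read the documents in that order.
    loaded = [(entry.etag, read_at(entry.offset)) for entry in by_location]

    # 4. Sort the results in memory back into update order.
    loaded.sort(key=lambda pair: pair[0])
    return loaded
```

The batch is chosen before the sort by location, so the set of documents returned is the same as with the naive approach; only the order in which they are read from disk changes.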

This seems like a perfectly obvious thing to do, assuming that you are aware of such things, but it is actually very easy to miss. The end result is quite promising, and it contributed to the more than 7x improvement in indexing performance that we saw.

But surprisingly, it wasn’t the major factor. I’ll discuss a huge perf boost in this area tomorrow.

Comments

04/23/2012 10:17 AM by Scooletz

Speaking about the access pattern. Is it connected with a storage engine implementation? What about Esent based? Do you retrieve some metadata from it? If so, you could elaborate this topic more.

It's getting more and more interesting. I wish you'd publish it as a single, long post! :P

04/23/2012 10:18 AM by Ayende Rahien

Scooletz, We are talking about how we are doing this on Esent, yes. The relevant code is here: https://github.com/ayende/ravendb/blob/master/Raven.Storage.Esent/StorageActions/OptimizedIndexReader.cs

04/23/2012 12:38 PM by Scooletz

Correct me if I'm getting it wrong, but all what Raven does is searching through a given index and adding bookmarks' buffers to the collection for each doc matching criteria. Once you've got all of them, you query esent for docs in order of sorted buffers returned by Esent, right?

04/23/2012 01:48 PM by Ayende Rahien

Scooletz, Yes, that is the general idea.

04/24/2012 01:37 AM by Jerry

It would be a nice addition to these posts if you posted a link to the actual change set. I really like reading these types of posts. Thanks!
