The RavenDB indexing processOptimization–Getting documents from disk

time to read 2 min | 327 words

As I noted in my previous post, we have done major optimizations for RavenDB. One of the areas where we improved the performance was reading the documents from the disk for indexing.

In Pseudo Code, it looks like this:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)

As it turned out, we had a major optimization option here, because of the way the data is actually structured on disk. In simple terms, we have an on disk index that lists the documents in the order in which they were updated, and then we have the actual documents themselves, which may be anywhere on the disk.

Instead of loading the documents in the orders in which they were modified, we decided to try something different. We first query the information we need to find the document on disk from the index, then we sort them based on the optimal access pattern, to reduce disk movement and ensure that we have as sequential reads as possible. Then we take those results in memory and sort them based on their last update time again.

This seems to be a perfectly obvious thing to do, assuming that you are aware of such things, but it is actually something that is very easy not to notice. The end result is quite promising, and it contributed to the 7+ times improvements in perf that we had for indexing costs.

But surprisingly, it wasn’t the major factor, I’ll discuss a huge perf boost in this area tomorrow.

More posts in "The RavenDB indexing process" series:

  1. (24 Apr 2012) Optimization–Tuning? Why, we have auto tuning
  2. (23 Apr 2012) Optimization–Getting documents from disk
  3. (20 Apr 2012) Optimization–De-parallelizing work
  4. (19 Apr 2012) Optimization–Parallelizing work
  5. (18 Apr 2012) Optimization