The RavenDB indexing process: Optimization–Getting documents from disk
As I noted in my previous post, we have done major optimizations for RavenDB. One of the areas where we improved the performance was reading the documents from the disk for indexing.
In pseudocode, it looks like this:
while database_is_running:
    stale = find_stale_indexes()
    lastIndexedEtag = find_last_indexed_etag(stale)
    docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
As it turned out, we had a major optimization option here, because of the way the data is actually structured on disk. In simple terms, we have an on-disk index that lists the documents in the order in which they were updated, and then we have the actual documents themselves, which may be anywhere on the disk.
Instead of loading the documents in the order in which they were modified, we decided to try something different. We first query the index for the information we need to locate each document on disk, then we sort those entries by the optimal access pattern, to reduce disk movement and keep the reads as sequential as possible. Once the documents are in memory, we sort them by their last update time again.
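To make the two sorts concrete, here is a minimal sketch in Python. It is not the RavenDB code: IndexEntry, disk_offset, load_batch and read_at are hypothetical names, and a plain integer offset stands in for whatever locator the storage engine really hands back. The point is only the shape of the technique: read in disk order, then restore update order in memory.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class IndexEntry:
    """One row of a hypothetical 'documents by update order' index."""
    etag: int         # update order: a higher value means a later update
    disk_offset: int  # assumed stand-in for the document's location on disk
    key: str


def load_batch(entries: List[IndexEntry],
               read_at: Callable[[int], str]) -> List[str]:
    """Read documents in disk order, then return them in update order."""
    # Step 1: visit the documents in the order they sit on disk,
    # so that the reads are as sequential as possible.
    by_position = sorted(entries, key=lambda e: e.disk_offset)
    loaded = [(entry, read_at(entry.disk_offset)) for entry in by_position]

    # Step 2: everything is in memory now, so restoring the update
    # order is a cheap in-memory sort.
    loaded.sort(key=lambda pair: pair[0].etag)
    return [doc for _, doc in loaded]


# Toy "disk": offset -> document body. Update order is a, b, c,
# but the bodies are scattered across the disk.
disk = {900: "doc-a", 100: "doc-b", 500: "doc-c"}
entries = [
    IndexEntry(etag=1, disk_offset=900, key="a"),
    IndexEntry(etag=2, disk_offset=100, key="b"),
    IndexEntry(etag=3, disk_offset=500, key="c"),
]

read_order: List[int] = []


def read_at(offset: int) -> str:
    read_order.append(offset)  # record the access pattern
    return disk[offset]


docs = load_batch(entries, read_at)
```

In this toy run the reads walk the disk front to back (offsets 100, 500, 900), while the caller still receives the documents in update order (doc-a, doc-b, doc-c), which is what the indexing loop expects.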
This seems to be a perfectly obvious thing to do, assuming that you are aware of such things, but it is actually very easy to overlook. The end result is quite promising, and it contributed to the 7+ times improvement in indexing performance that we saw.
Surprisingly, though, it wasn’t the major factor. I’ll discuss a huge perf boost in this area tomorrow.
More posts in "The RavenDB indexing process" series:
- (24 Apr 2012) Optimization–Tuning? Why, we have auto tuning
- (23 Apr 2012) Optimization–Getting documents from disk
- (20 Apr 2012) Optimization–De-parallelizing work
- (19 Apr 2012) Optimization–Parallelizing work
- (18 Apr 2012) Optimization
Comments
Speaking of the access pattern: is it tied to the storage engine implementation? What about the Esent-based one? Do you retrieve some metadata from it? If so, you could elaborate on this topic more.
It's getting more and more interesting. I wish you'd publish it as a single, long post! :P
Scooletz, We are talking about how we are doing this on Esent, yes. The relevant code is here: https://github.com/ayende/ravendb/blob/master/Raven.Storage.Esent/StorageActions/OptimizedIndexReader.cs
Correct me if I'm wrong, but all Raven does is search through a given index and add the bookmark buffer of each doc matching the criteria to a collection. Once you've got all of them, you query Esent for the docs in the order of the sorted buffers it returned, right?
Scooletz, Yes, that is the general idea.
It would be a nice addition to these posts if you posted a link to the actual change set. I really like reading these types of posts. Thanks!