The RavenDB indexing processOptimization

time to read 3 min | 422 words

The actual process done by RavenDB to index documents is a fairly complex one. In order to understand what exactly happened, I decided to break it apart to pseudo code.

It looks something like this:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
  
  filtered_docs = execute_read_filters(docs_to_index)
  
  indexing_work = []
  
  for index in stale:
    
    index_docs = select_matching_docs(index, filtered_docs)
    
    if index_docs.empty:
      set_indexed(index, lastIndexedEtag)
    else
      indexing_work.add(index, index_docs)
      
  for work in indexing_work:
  
     work.index(work.index_docs)

And now let me show you the areas in which we did some perf work:

while database_is_running:
  stale = find_stale_indexes()
  lastIndexedEtag = find_last_indexed_etag(stale)
  docs_to_index = get_documents_since(lastIndexedEtag, batch_size)
  
  filtered_docs = execute_read_filters(docs_to_index)
  
  indexing_work = []
  
  for index in stale:
    
    index_docs = select_matching_docs(index, filtered_docs)
    
    if index_docs.empty:
      set_indexed(index, lastIndexedEtag)
    else
      indexing_work.add(index, index_docs)
      
  for work in indexing_work:
  
     work.index(work.index_docs)

All of which gives us a major boost in the system performance. I’ll discuss each part of that work in detail, don’t worry Winking smile

More posts in "The RavenDB indexing process" series:

  1. (24 Apr 2012) Optimization–Tuning? Why, we have auto tuning
  2. (23 Apr 2012) Optimization–Getting documents from disk
  3. (20 Apr 2012) Optimization–De-parallelizing work
  4. (19 Apr 2012) Optimization–Parallelizing work
  5. (18 Apr 2012) Optimization