Ayende @ Rahien

RavenDB indexing optimizations, Step II–Pre Fetching

Getting deeper into our indexing optimization routines: when we last left off, we had the following system:

image

This was good because it could predictively decide when to increase the batch size, and it smoothed over spikes easily. But note where the costs are.

The next step was this:

image

Pre fetching, basically. What we noticed is that we were spending a lot of time just loading the data from the disk, and we changed our behavior to allow us to load things while we are indexing. So on the next indexing batch, we will usually find all of the data we needed already in memory and ready to rock.
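To make the idea concrete, here is a minimal sketch of overlapping the load of the next batch with the indexing of the current one. It is not RavenDB's actual code: `load_batch`, `index_batch`, and `index_all` are hypothetical names, and an in-memory list stands in for the on-disk document store.

```python
from concurrent.futures import ThreadPoolExecutor


def load_batch(store, start, size):
    """Hypothetical stand-in for the disk read: up to `size` docs from `start`."""
    return store[start:start + size]


def index_batch(index, docs):
    """Hypothetical stand-in for the indexing work: record each document id."""
    for doc in docs:
        index.append(doc["id"])


def index_all(store, batch_size):
    index = []
    with ThreadPoolExecutor(max_workers=1) as io:
        # Start the first load; from here on, one load is always in flight.
        pending = io.submit(load_batch, store, 0, batch_size)
        position = 0
        while True:
            docs = pending.result()  # usually already in memory by now
            if not docs:
                break
            position += len(docs)
            # Begin loading the next batch before indexing the current one,
            # so the disk read overlaps with the indexing work.
            pending = io.submit(load_batch, store, position, batch_size)
            index_batch(index, docs)
    return index


documents = [{"id": i} for i in range(10)]
indexed = index_all(documents, batch_size=4)
```

The single worker thread keeps at most one batch loading ahead of the indexer, which is the simplest form of the overlap described above.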

This gave us a pretty big boost in how fast we can index things, but we aren't done yet. In order to make this feature viable, we had to do a lot of work. For starters, we had to make sure we wouldn't take too much memory, that we wouldn't impact other aspects of the database, etc. Interesting work all around, even if I am just focusing on the high-level optimizations. There is still a fairly obvious optimization waiting for us, but I'll discuss that in the next post.
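One way to keep prefetching from taking too much memory is to give the prefetched documents a fixed byte budget. The sketch below is an assumption about how such a limit could look, not a description of what RavenDB does; `PrefetchBuffer`, `try_add`, and `take_batch` are hypothetical names.

```python
from collections import deque


class PrefetchBuffer:
    """Holds prefetched documents, refusing new ones past a byte budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.docs = deque()

    def try_add(self, doc: bytes) -> bool:
        # Reject the document if it would push us over the budget; the
        # loader is expected to pause and retry after the indexer drains.
        if self.used_bytes + len(doc) > self.max_bytes:
            return False
        self.docs.append(doc)
        self.used_bytes += len(doc)
        return True

    def take_batch(self, max_docs):
        # Hand documents to the indexer in arrival order, freeing budget.
        batch = []
        while self.docs and len(batch) < max_docs:
            doc = self.docs.popleft()
            self.used_bytes -= len(doc)
            batch.append(doc)
        return batch


buffer = PrefetchBuffer(max_bytes=10)
accepted = [buffer.try_add(b"abcd"), buffer.try_add(b"efgh"), buffer.try_add(b"ijkl")]
batch = buffer.take_batch(max_docs=1)
accepted_after_drain = buffer.try_add(b"ijkl")
```

With a 10-byte budget, the third 4-byte document is refused until the indexer has taken one out, at which point there is room again.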

Posted By: Ayende Rahien

Comments

Rafal
12/13/2012 01:35 PM

I wonder why you have to load any data at all. If the docs have just been inserted or modified, they should be in memory, so you can index them without any loading. Maybe you should index the most recently modified document first and catch up with the remaining ones later? This way the 'hottest' document would be indexed first, without any additional loading cost.

Rafal
12/13/2012 02:04 PM

.... and the cache wouldn't be polluted with older documents loaded there just for indexing.

Chris
12/13/2012 06:45 PM

@Rafal

You would have to also be mindful of "starvation" of the older documents. If you have a steady stream of new documents coming in, eventually you have to just say "enough guys, I've got to go back and get these other documents in."

Rafal
12/13/2012 08:08 PM

Oops, my response disappeared somehow. So, let's try again:

1. If your indexing can't keep up with the rate of modifications and there's starvation, then it doesn't matter how you order documents for indexing. You won't be able to index them anyway, and some will always 'starve'.
2. But if you start with the wrong order and you have to load documents because they are not in the cache, then you pay a double performance penalty: the cost of loading the data and the even greater cost of throwing away already cached documents.
3. IMHO, in normal operation you should never have to load documents to be indexed. They should always be already in the cache. So I'm not sure why Ayende is talking about the cost of loading documents. Maybe this applies to batch processing or the initial data load.

Matt Warren
12/14/2012 09:43 AM

@Rafal

Take a look at the post in the queue. It's titled "RavenDB indexing optimizations, Step III–Skipping the disk altogether", so I think it'll answer some of your questions.

Ayende Rahien
12/17/2012 08:58 AM

Rafal, Consider what happens when you have existing data in the database and you add an index. You don't have all of the previously created documents in memory. Also, indexing by most recently modified means that you run into a LOT of issues with just tracking what you indexed and what you didn't. Especially when you add the notion of updates during indexing.

Ayende Rahien
12/17/2012 09:00 AM

Rafal, Docs loaded for indexes are not actually cached. And we have steps in place to avoid starvation: we move to higher and higher batch sizes, optimizing our IO throughput along the way.

And I am talking about things like adding an index, or what happens after a restart, etc.

Rafal
12/17/2012 10:27 AM

Thanks for the explanation, Ayende. In case anyone thought so, I'm not nitpicking, just curious about how Raven manages its resources during periods of high load.

And another question: what is your idea for monitoring Raven's performance? I'm talking about automated, continuous collection of key performance data, like number of updates/sec, number of docs indexed/sec, cache size/hit ratio, indexing lag, number of sessions, transactions, Esent performance, memory, etc. I've recently been quite busy with monitoring application and server performance in the Windows ecosystem and was wondering how Raven does these things, compared for example to MS SQL. And btw, I have had some pretty nice results using NLog for collecting performance data, which might be useful for RavenDB too.

Ayende Rahien
12/17/2012 10:37 AM

Rafal, We have several ways of doing that. We expose a number of performance counters, and we also provide the /admin/stats and /databases/DB_NAME/stats endpoints, which expose a lot of details about the internal structure of how RavenDB works.
