The RavenDB indexing process: Optimization–Tuning? Why, we have auto tuning
The final aspect of RavenDB’s 7x jump in indexing performance is that we made it freakishly smart.
During standard operation, most indexes only update when new information comes in, and we are usually talking about a small number of documents for every indexing run. The problem is what happens when you have a sudden outpouring of documents into RavenDB? For example, during a nightly ETL batch, or when you suddenly have a flood of users doing write operations.
The problem here is that we actually have to balance a lot of variables at the same time:
- The number of documents that we have to index*.
- The current memory utilization**.
- How many cores do I have available to do the index work with?
- How much time do I have to do this?
Basically, the idea goes like this: if I have a small batch size, I am able to index more quickly, ensuring that we have fresher results. If I have a big batch size, I am able to index more documents per run, and my overall indexing time goes down.
There is a non trivial cost associated with every indexing run, so reducing the number of indexing runs is good; but the more documents I shove into a single run, the more memory I will use, and the more time it will take before the results are visible to users.
* It is non trivial because there is no easy way for us to even know how many documents we have left to index (finding out is costly).
** Memory utilization is hard to figure out in a managed world. I don’t actually have a way to know how much memory I am using for indexing and how much for other things, and there is no real way to say “free the memory from the last indexing run”, or even to estimate how much memory that run took.
What we decided on is to start from a very small indexing batch size (low hundreds) and see what is actually going on live. If we see that we have more documents to index than the current batch size, we slowly double the size of the batch. Slowly, because bigger batches require more memory, and we also have to take into account current utilization, memory usage, and a bunch of other factors as well. We also go the other way around, reducing the indexing batch size on demand based on how much work we have to do right now.
We also provide an upper limit, because at some point it makes more sense to run the big batch we have and make those indexing results visible than to try to do everything all at once.
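To make the shape of this concrete, here is a minimal sketch of that feedback loop. This is an illustration only, not RavenDB’s actual code: the names and constants are made up, and a real implementation would also weigh memory usage and available cores, as described above.

```csharp
using System;

// A minimal sketch of adaptive batch sizing; NOT RavenDB's actual code.
// The batch doubles while indexing keeps falling behind, and shrinks
// again when the incoming rate drops, always staying within fixed bounds.
public class AdaptiveBatchSizer
{
    private const int MinBatchSize = 256;       // start in the "low hundreds"
    private const int MaxBatchSize = 16 * 1024; // hard upper limit

    private int currentBatchSize = MinBatchSize;

    public int CurrentBatchSize
    {
        get { return currentBatchSize; }
    }

    // Call after each indexing run, with the number of documents that
    // were actually indexed and whether more work is already waiting.
    public void RecordRun(int documentsIndexed, bool moreDocumentsPending)
    {
        if (moreDocumentsPending && documentsIndexed >= currentBatchSize)
        {
            // We filled the whole batch and are still behind:
            // double the batch size so the next run can catch up.
            currentBatchSize = Math.Min(currentBatchSize * 2, MaxBatchSize);
        }
        else if (documentsIndexed < currentBatchSize / 2)
        {
            // The batch was mostly empty: fall back toward small,
            // low-latency batches for fresher results.
            currentBatchSize = Math.Max(currentBatchSize / 2, MinBatchSize);
        }
    }
}
```

The asymmetry in the sketch is deliberate: growing only when a full batch still leaves work pending avoids ballooning memory on a brief spike, while shrinking on mostly empty batches brings latency back down quickly once the flood ends.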
The fun part in all of this is that once we found the appropriate algorithm, RavenDB automatically adjusts itself based on real production load. If you have a low update rate, it will favor small indexing batches and immediately index new documents. However, if you suddenly have a spike in traffic and the update rate goes up, RavenDB will adjust the indexing batch size so it can keep up with your rate.
We have done some (read: a huge amount of) testing of this new optimization, and it turns out that under a slow update frequency, we are seeing an average of 15 – 25 ms between a document update and it showing up in the indexes. That is pretty good, but what happens when we have data just pouring in?
We tested this with 3 million documents and 3 indexes. It turns out that under this scenario, where we are trying to shove data into RavenDB as fast as it can accept it, we do see an increase in indexing latency. Under those conditions, latency rose all the way to 1.5 seconds.
This is actually something that I am very happy about, because we were able to automatically adjust to the changing conditions and still index things at a reasonable rate (note that under this scenario, the batch size was usually 8,000 – 16,000 documents, vs. the 128 – 256 that it is normally).
Because we were able to adjust the batch size on the fly, we could handle sustained writes at this rate with no interruption in service and no real need for users to even think about it. Exactly what the RavenDB philosophy calls for.
More posts in "The RavenDB indexing process" series:
- (24 Apr 2012) Optimization–Tuning? Why, we have auto tuning
- (23 Apr 2012) Optimization–Getting documents from disk
- (20 Apr 2012) Optimization–De-parallelizing work
- (19 Apr 2012) Optimization–Parallelizing work
- (18 Apr 2012) Optimization
Comments
Auto-tuning is extremely convenient. Lacking it is likely one of the factors making MySQL users unhappy and driving the NoSQL shift.
The same can be applied (up to a given time limit) to querying a given resource for changes, or any other heavy work. If, for instance, a message was found in a queue, pop another one quite quickly the second time; if one isn't found, extend the period. I think this idea is behind the usage of Azure Queues in NServiceBus.
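Roughly like this (just a sketch; the names and constants are made up):

```csharp
using System;

// Illustrative sketch of adaptive queue polling; names and constants
// are hypothetical. Poll again quickly while messages keep arriving,
// back off exponentially while the queue stays empty.
public class AdaptivePoller
{
    private static readonly TimeSpan MinDelay = TimeSpan.FromMilliseconds(50);
    private static readonly TimeSpan MaxDelay = TimeSpan.FromSeconds(30);

    private TimeSpan delay = MinDelay;

    // Returns how long to wait before the next poll, given whether
    // the last poll actually found a message.
    public TimeSpan NextDelay(bool messageFound)
    {
        if (messageFound)
        {
            delay = MinDelay; // work is flowing: poll again quickly
        }
        else
        {
            var doubled = TimeSpan.FromTicks(delay.Ticks * 2);
            delay = doubled < MaxDelay ? doubled : MaxDelay; // extend the period
        }
        return delay;
    }
}
```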
Dynamically adjusting the batch size reminds me of process prioritization in OSes. The basic idea being if a process uses all of its time slice, increase its time slice (up to some maximum). If the process does not use its time slice, decrease the time slice (down to some minimum). Usually, the increase and decrease are asymmetrical so that the algorithm can find an appropriate balance between the current workload and the historical workload.
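Something like this, hypothetically (multiplicative increase, additive decrease, both clamped):

```csharp
using System;

// Hypothetical sketch of the asymmetric time-slice adjustment:
// grow the slice quickly when it is fully used, shrink it slowly
// when it is not, within fixed bounds.
public class TimeSliceTuner
{
    private const int MinSliceMs = 5;
    private const int MaxSliceMs = 200;

    private int sliceMs = 20;

    public int SliceMs
    {
        get { return sliceMs; }
    }

    public void OnQuantumEnd(bool usedFullSlice)
    {
        if (usedFullSlice)
            sliceMs = Math.Min(sliceMs * 2, MaxSliceMs); // busy: grow fast
        else
            sliceMs = Math.Max(sliceMs - 5, MinSliceMs); // idle: shrink slowly
    }
}
```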
I am also impressed that with a roughly 4–6 order-of-magnitude increase in document rate, there is only a 2 order-of-magnitude increase in index latency.
Any chance you could elaborate on the specifics of the 3 million document test? How many bulk inserts/updates over what period of time, and so on? Thank you.
Janivz, we loaded all 3 million records as soon as we could, using inserts. We didn't test updates, since the process is the same in all respects.
Ayende, do you have any plans for a RavenDB driver for node.js?
There are actually a few already out there: https://github.com/mattdaly/node-ravendb https://github.com/csainty/node-raven
See also: http://groups.google.com/group/ravendb/browse_thread/thread/501ed1b4a0b3e380?pli=1