Rob’s Sprint: Idly indexing
During Rob Ashton’s visit to our secret lair, we did some work on hard problems. One of those problems was the issue of index prioritization. As I have discussed before, this isn’t really easy to do, because of the IO costs associated with not indexing properly.
With Rob’s help, we have defined the following:
- An auto index can be set to idle if it hasn’t been queried for a time.
- An index can be forced to be idle by the user.
- An index that was automatically set to idle will be set to normal on its first query.
What are the implications of that? An idle index will not be indexed by RavenDB during the normal course of things. Only when the database is idle for a period of time (by default, about 10 minutes with no writes) will we actually get it indexing.
Idle indexes will continue indexing as long as there is no other activity that requires their resources. When such activity shows up, they will complete their current run and go back to waiting for the database to become idle again.
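To make the rules above concrete, here is a minimal sketch in Python. It is not RavenDB's code, just a model of the behavior described so far. The names (`Priority`, `Index`, `indexes_to_run`) and the `AUTO_IDLE_AFTER` threshold are assumptions made for illustration; the post only says "for a time". The 10 minute figure is the default mentioned above.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    NORMAL = "normal"
    IDLE = "idle"                # set automatically, woken by a query
    FORCED_IDLE = "forced idle"  # set by the user, stays idle


# Hypothetical threshold: the post does not say how long "a time" is.
AUTO_IDLE_AFTER = 60 * 60   # seconds without a query
# From the post: about 10 minutes with no writes by default.
DB_IDLE_AFTER = 10 * 60     # seconds without a write


@dataclass
class Index:
    name: str
    is_auto: bool
    last_query: float
    priority: Priority = Priority.NORMAL

    def on_query(self, now: float) -> None:
        """Record a query; an automatically idled index returns to normal."""
        self.last_query = now
        if self.priority is Priority.IDLE:
            self.priority = Priority.NORMAL

    def demote_if_unused(self, now: float) -> None:
        """Only auto indexes are idled automatically."""
        unused_for = now - self.last_query
        if self.is_auto and self.priority is Priority.NORMAL and unused_for >= AUTO_IDLE_AFTER:
            self.priority = Priority.IDLE


def indexes_to_run(indexes, now: float, last_write: float):
    """Normal indexes always run; idle ones only once the database is idle."""
    db_idle = now - last_write >= DB_IDLE_AFTER
    return [i.name for i in indexes if i.priority is Priority.NORMAL or db_idle]
```

In this model a forced idle index ignores queries when deciding its priority, which matches the rule that only automatically idled indexes are promoted on their first query.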
But wait, there is more. In addition to introducing the notion of idle indexes, we have also created another two types of indexes. The first is pretty obvious: the disabled index will use no system resources and will never take part in indexing. This is mostly there so you can manually shut down a single index. For example, maybe it is a very expensive one and you want to stop it while you are doing an import.
More interesting, however, is the concept of an abandoned index. Even idle indexes can take some system resources, so we have added another level beyond that: an abandoned index is one that hasn’t been queried in 72 hours. At that point, RavenDB is going to avoid indexing it even during idle periods. It will still get indexed, but only if enough time has passed since the last time it was indexed.
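The abandoned rule can be sketched the same way. Again, this is a model in Python and not RavenDB's implementation. The 72 hours come from the text above; `ABANDONED_REINDEX_GAP` is a made-up value, since the post only says "long enough", and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# From the post: no queries for 72 hours means the index is abandoned.
ABANDON_AFTER = timedelta(hours=72)
# Hypothetical value: the post only says "a long enough time".
ABANDONED_REINDEX_GAP = timedelta(hours=3)


def is_abandoned(last_query: datetime, now: datetime) -> bool:
    return now - last_query >= ABANDON_AFTER


def should_index_when_db_idle(last_query: datetime, last_indexed: datetime, now: datetime) -> bool:
    """Decide whether an idle index gets a run during a database idle period.

    A merely idle index always runs. An abandoned one runs only if it
    has not been indexed for at least ABANDONED_REINDEX_GAP.
    """
    if not is_abandoned(last_query, now):
        return True
    return now - last_indexed >= ABANDONED_REINDEX_GAP
```

The point of the extra gap is that an abandoned index still catches up eventually, but it stops competing for resources on every idle period.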
Next, we will discuss why this feature was a crucial step on the way to killing temporary indexes.
More posts in "Rob’s Sprint" series:
- (08 Mar 2013) The cost of getting data from LevelDB
- (07 Mar 2013) Result Transformers
- (06 Mar 2013) Query optimizer jumped a grade
- (05 Mar 2013) Faster index creation
- (04 Mar 2013) Indexes and the death of temporary indexes
- (28 Feb 2013) Idly indexing
Comments
Cool,
When will you push it into the unstable branch so we can test it out?
"an abandoned index is one that hasn’t been queried in 72 hours" - so a weekly report will never be up to date?
Also, why do idle indexes wait for 10 minutes of inactivity instead of just working only when all other indexes are up to date?
"An index that was automatically set to idle will be set to normal on its first query."
What if you want the index to always be an idle index? Like a reporting index that pulls tons of things together, or a crazy reporting map/reduce that is not relevant to OLTP functionality at all?
Chris - while not covered explicitly in the entry above, there is a flag to "force idle" and this will be exposed in the studio
Can we get a way to set these flags on the index creators as well?
Patrik, This is already available at: http://hibernatingrhinos.com/builds/ravendb-unstable-v2.5
Configurator, You can force an index to not go into idle / abandoned mode. But in general, if you have an index that is queried weekly, you can afford to wake it up and then wait for it to catch up.
Configurator, And the reason we wait for 10 minutes on inactivity is that we don't want to get into: "we have 1 second of rest, let us start indexing all the idle indexes, which can be VERY expensive".
Alex, No, you can't do that at creation, but you can do that immediately after.
In my still limited experience with Raven, specifically trying to work with bundles like replication and versioning, I have noticed that it's not very straightforward to accomplish certain functionality without using the studio.
This specific feature is not that big of a deal to us, but we would really love to see functionality like this be configurable without going through the UI.
Alex, ALL of RavenDB functionality is exposed via the REST interface, and you can do absolutely everything the studio does. After all, the studio just uses HTTP to talk to RavenDB itself; it is not a privileged client.
Alex, In other words, anything that you can do through the UI can be done in code, and pretty easily, at that.
RavenDB already caches compiled indexes ( https://github.com/ayende/ravendb/blob/master/Raven.Database/Linq/QueryParsingUtils.cs#L334 , discussion https://groups.google.com/d/msg/ravendb/hsMc4lLnaXU/h0WRLOYog9EJ ) which makes second and subsequent test runs that create that index much faster.
I'm wondering if it would be possible to configure the indexes to be lazily compiled? That is, compiled and loaded when first queried?
Am currently doing system acceptance tests where we have an increasing number of indexes and am experiencing some time pain (20-30s +) on single test runs.
Damian, There is really no cost in doing the compilation (it happens once, and that is it.)
Oh, you are talking about the cost _per test run_, right? I was thinking about production runs, actually. In that case, can't you handle this via the index compilation caching that we already have?
Yes, the cost per test run, where I run _one test at a time_, in the usual TDD(-ish) scenario. The index compilation caching (which is great) only kicks in when I run 2 or more tests per session. http://i.imgur.com/38DF0fc.png - second test benefits from the caching.
My other approach is to be able to supply a predicate to my application so the test fixture can configure it to only create indexes that are going to be used. But that means my acceptance test fixtures need to know what indexes may be required which I find to be leaky. (I take a different approach with my unit tests, no problems there)
Yes, it's a development pain and not a production issue. I may be an edge case though.
Damian, In that case, how about implementing on disk caching for this?
Yeah, that sounds good too. Generate a hash from the source, use it as the CompilerParameters.OutputAssemblyName and if the assembly already exists on disk (in a location that will exist between test sessions i.e. users temp dir) load it.
Or something like that :)
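A rough sketch of the on-disk caching idea from the last few comments, written in Python rather than against the real .NET compiler API. Everything here is an assumption for illustration: `cached_compile`, the `compile_fn` callback, and the file naming are hypothetical and are not part of RavenDB. The only thing taken from the discussion is the approach: hash the index source, use the hash as the artifact name, and reuse the file if it already exists.

```python
import hashlib
from pathlib import Path


def cached_compile(source: str, compile_fn, cache_dir) -> bytes:
    """Return the compiled artifact for `source`, compiling at most once per
    distinct source text across processes that share `cache_dir`."""
    # Identical source always maps to the same file name.
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.bin"

    if path.exists():
        # Cache hit: skip compilation entirely.
        return path.read_bytes()

    artifact = compile_fn(source)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(artifact)
    return artifact
```

Because the key is derived from the source text, changing an index definition produces a new file, so a stale artifact is never loaded for a changed index. Pointing `cache_dir` at a location that survives between test sessions (the user's temp directory, for example) is what lets even the first test in a session benefit.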
Actually, that may be a nice-to-have from a production pov. An index that is deleted and then re-created, assuming it is exactly the same, would be slightly faster. Don't know how often that would happen though really.
Damian, We have 2K+ tests, most of them with some form of indexes. We run them a LOT. any saving there would be useful in general.
Cool. Created the issue: http://issues.hibernatingrhinos.com/issue/RavenDB-969