RavenDB Retrospective: Explicit indexes & auto indexes
RavenDB doesn’t provide any way for queries to do table scans*.
* That isn't actually true: we have Data Exploration, which does just that, but we don't provide an explicit API for it, and it is a DBA-driven feature (I want to get this report with a minimum of fuss, without regard to how much it is going to cost me) rather than an exposed API.
What this means is that the cost of query operations in RavenDB is always going to be O(log N), instead of O(N). How does this relate to the topic of RavenDB retrospectives?
One of the things that I kept seeing over and over as a database consultant was that databases are complex, and that it is easy to write a query that works perfectly fine for a period of time, then falls over completely once the size of the data goes over a certain threshold. Queries that rely on table scans are particularly vulnerable to this issue.
One of the design goals for RavenDB was to avoid that completely. We did it by simply forbidding any query that doesn't have an index. Initially, that was a pretty annoying requirement, because every time you needed a new query, you had to go ahead and create an index. But early on we got the Auto Indexes feature.
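To make the explicit-index workflow concrete, here is a minimal sketch of a static index using the RavenDB .NET client (3.x-era API; the Order type and its fields are illustrative assumptions, not something defined in this post):

```csharp
using System.Linq;
using Raven.Client.Indexes;   // RavenDB 3.x client namespace; newer versions differ

// Illustrative document type, assumed for the sake of the example.
public class Order
{
    public string Id { get; set; }
    public string Company { get; set; }
    public string Employee { get; set; }
}

// An explicit (static) index: you define the map function yourself.
public class Orders_ByCompany : AbstractIndexCreationTask<Order>
{
    public Orders_ByCompany()
    {
        // Index each order by the field we want to query on.
        Map = orders => from order in orders
                        select new { order.Company };
    }
}

// Deployed once, typically at application startup:
// IndexCreation.CreateIndexes(typeof(Orders_ByCompany).Assembly, documentStore);
```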
Basically, it means that you can query RavenDB without specifying which index you want to use; the query optimizer will inspect the query and decide which index can serve it. The most interesting point here is that if there isn't an index that can serve this query, the query optimizer is going to create one on the fly. See the previous post about BASE indexes and how we can afford to do that.
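As a rough sketch of what that looks like from the client side (reusing the illustrative Order type from the sketch above, and assuming an already-initialized documentStore), a plain query with no index named is all it takes:

```csharp
using (var session = documentStore.OpenSession())
{
    // No index is specified here. The query optimizer either picks an existing
    // index that can answer this query or creates an auto index for it
    // (named along the lines of "Auto/Orders/ByCompany").
    var orders = session.Query<Order>()
                        .Where(o => o.Company == "companies/1")
                        .ToList();
}
```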
The fun part here is that the query optimizer is actually learning over time, and it will shape its indexes to best fit the kind of queries you are doing. It also makes RavenDB much more robust against New Version Degradation effects. NVD is what happens when you push out a new version that issues slightly different queries, making previously used indexes ineffective and forcing all your queries to become full table scans. Here is an example of the kind of subtle issues that this can cause. With RavenDB, when you use auto indexes (in other words, when you don't explicitly state which index to use), the query optimizer will take care of that, and it will create all the appropriate indexes (and retire the unused ones) for you.
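Continuing the same sketch, if a new version ships a query that also filters on another field, the optimizer creates a wider auto index covering both fields rather than degrading to a scan, and the narrower, now-unused auto index can be retired over time (field names are still illustrative):

```csharp
using (var session = documentStore.OpenSession())
{
    // A slightly different query after a deployment. Instead of losing the
    // benefit of indexing, the optimizer builds a broader auto index over
    // Company and Employee, and unused auto indexes are retired later.
    var orders = session.Query<Order>()
                        .Where(o => o.Company == "companies/1" &&
                                    o.Employee == "employees/5")
                        .ToList();
}
```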
This in particular is a feature that I'm really proud of. It requires very little from the user, and it gets the Right Thing done.
More posts in "RavenDB Retrospective" series:
- (17 Oct 2016) The governors
- (14 Oct 2016) Explicit indexes & auto indexes
- (12 Oct 2016) BASE Indexes
- (28 Sep 2016) Unbounded result sets
Comments
I think auto-indexes were good for us initially, but we then started to suffer from different issues as a result of using them, again over time as usage / data volumes increased.
Maybe it is due to issues with historical build quality / bugs, but we have seen the creation of new indexes (including auto-indexes) really hammer CPU and memory, to the point that the database server is completely unusable.
We never want to be taken by surprise by the database suddenly deciding it needs to create an auto-index and then basically destroying itself in the process. As such, we have completely disabled the ability for auto-indexes to happen on both developer machines and in production. Now we know that any major cost of creating indexes (or bugs in this aspect) will happen at a time of our choosing, when we are upgrading to the latest version of our code.
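For reference, pinning a query to an explicit index (so the optimizer never needs to create one on the fly) looks roughly like this with the .NET client, reusing the illustrative Orders_ByCompany index sketched earlier:

```csharp
using (var session = documentStore.OpenSession())
{
    // Query against a named static index; no auto index will be created.
    var orders = session.Query<Order, Orders_ByCompany>()
                        .Where(o => o.Company == "companies/1")
                        .ToList();
}
```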
@Ayende. It is big business for consultants, adding and later removing index hints for SQL Server. Don't ruin the business :-) https://www.brentozar.com/archive/2013/10/index-hints-helpful-or-harmful/
Your query with auto-index is O(n log n) + O(log n), and it uses a lot of memory. I agree with @Ian that auto-indexes might sometimes be problematic.
I often have to do a one-off check/modification of all entries. A table scan is good for that, and the table will not be blocked by index creation (a feature of other database providers). Moreover, you are not changing the source code.
When I studied at university, I learned a rule of thumb: if you need to read more than 10-20% of the records in a table, a table scan is faster than searching by index. From my practical experience with other database providers, that rule of thumb holds.
My conclusion: auto-index is good because the source code is not changed. You should only run the big queries which need an auto-index when you are not disturbing the users.
Ian, We have done a lot of work around that; an index now can't affect other indexes, and the amount of resources it can take from the system is limited.
Carsten, A table scan isn't really an option for our use case. In 3.x, we store the data as JSON, so a table scan means reading the entire database, parsing the JSON, comparing the values, and then deciding.
And when you have queries whose results are more than 10% of the dataset (assuming reasonable sizes), you are no longer talking about OLTP scenarios.