Automatic indexes and garbage collection, features that you shouldn’t need which save the day

I recently got into an interesting discussion about one of the most interesting features of RavenDB: the ability to automatically deduce and create indexes on the fly, based on actual queries made to the server. This is a feature that RavenDB has had for a very long time, over a decade, and one that I’m quite proud of. The discussion was about whether such a feature is useful in real-world scenarios. I obviously lean toward it being incredibly useful, but I found the discussion good enough to post it here.

The gist of the argument against automatic indexes is that the developers should be in control of what is going on in the database and create the appropriate indexes of their own accord. The database should not be in the business of creating indexes on the fly, which is scary to do in production.

I don’t like the line of thinking that says that it is the responsibility of the developers / operators / database admins to make sure that all queries use the optimal query plans. To be rather more exact, I absolutely think that they should do that, I just don’t believe that they can / will / are able to.

In many respects, I consider the notion of automatic index creation to be similar to garbage collection in managed languages. There is currently one popular language that still does manual memory management, and that is C. Pretty much all other languages have switched to some other mode that means the developer doesn’t need to track things manually. Managed languages have a GC, Rust has its ownership model, C++ has RAII and smart pointers, etc. We have decades and decades of experience telling us that no, developers actually can’t be expected to keep track of memory properly. There is a direct and immediate need for systematic help for that purpose.

Manual memory management can be boiled down to: “for every malloc(), call free()”. And yet it doesn’t work.

For database optimizations, you need to have a lot of knowledge. You need to understand the system, the actual queries being generated, how the data is laid out on disk and many other factors. The SQL Server Query Performance Tuning book is close to a thousand pages in length. So that is decidedly not a trivial topic.

It is entirely possible to expect experts to know the material and to have a checkpoint before deployment that would ensure that you have done the Right Thing before going to production. Except that this is specialized knowledge, so now you have gatekeepers, and going back to the woes of manual memory management, we know that this doesn’t always work.

There is a cost / benefit calculation here. If we make it too hard for developers to deploy, the pace of work would slow down. On the other hand, if a bad query goes to production, it may take the entire system down.

In some companies, I have seen weekly meetings for all changes to the database. You submit your changes (schema or queries), they get reviewed in the next available meeting and are deployed to production within two weeks of that. The system was considered to be highly efficient in ensuring nothing bad happened to the database. It also ensured that developers would cut corners. In a memorable case, a developer needed to show some related data on a page. Doing a join to get the data would take three weeks. Issuing database calls over the approved API, on the other hand, could be done immediately. You can guess how that ended up, can’t you?
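To make the likely outcome concrete, here is a minimal sketch of what that kind of shortcut tends to look like. It uses Python's built-in sqlite3 module and a made-up schema; the table names, columns, and function names are all assumptions for illustration and have nothing to do with the actual system in that story. The first function issues one call per related row, the second gets the same data with a single join.

```python
import sqlite3


def setup():
    # In-memory database with a hypothetical schema, for illustration only.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Linus');
        INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 1, 15.0);
    """)
    return db


def orders_via_separate_calls(db):
    # One query for the orders, then one more query per order: N + 1 calls.
    calls = 1
    rows = []
    orders = db.execute(
        "SELECT id, customer_id, total FROM orders ORDER BY id").fetchall()
    for order_id, customer_id, total in orders:
        name = db.execute(
            "SELECT name FROM customers WHERE id = ?", (customer_id,)).fetchone()[0]
        calls += 1
        rows.append((order_id, name, total))
    return rows, calls


def orders_via_join(db):
    # The same data, fetched with a single query.
    rows = db.execute("""
        SELECT o.id, c.name, o.total
        FROM orders o JOIN customers c ON c.id = o.customer_id
        ORDER BY o.id
    """).fetchall()
    return rows, 1
```

Both functions return identical rows. The difference is that the number of calls in the first one grows with the number of orders on the page, which is exactly the sort of thing a review process is supposed to catch and, in this case, ended up encouraging.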

RavenDB has automatic indexes because they are useful. As you build your application, RavenDB learns from the actual production behavior. The more you use a particular aspect, the more RavenDB is able to optimize it. When changes happen, RavenDB is able to adjust, automatically. That is key, because it removes a very tedious and time-consuming chore from the developers. Instead of having to spend a couple of weeks before each deployment verifying that the current database structure still serves the current set of queries, they can rest assured that the database will handle that.

In fact, RavenDB has a mode where you can run your system on a test database, take the information gathered from the test run and apply it to your production system. That way, you can avoid having to learn the new behavior on the fly. You can introduce the new changes to the system model at an idle point in time and let RavenDB adjust to them while not much else is going on.

I believe that much of the objection to automatic indexes comes from the usual pain involved in creating indexes in other databases. Creating an index is often seen as a huge risk. It may lock tables and pages, it may consume a lot of system resources, and even if the system has an online index creation mode (and not all do), it is something that you Just Don’t do.

RavenDB, in contrast, has been running with this feature for a decade. We have had a lot of time to improve the heuristics and behavior of the system under this condition. New indexes being introduced are going to have bounded resources allocated to them, no locks are involved, and other indexes are able to serve requests with no interruption in service. RavenDB is also able to go the other way: it will recognize which automatic indexes are superfluous and remove them. And automatic indexes that see no use will be expired by the query optimizer for you. The whole idea is that there is an automated DBA running inside the RavenDB Query Optimizer that will constantly monitor what is going on, reducing the need for manual maintenance cycles.
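The lifecycle described above (create an index when a query needs one, drop the ones that became superfluous, expire the ones nobody uses) can be sketched as a toy model. To be clear, this is not RavenDB's implementation or API. `ToyOptimizer`, `AutoIndex`, the logical clock and the `idle_limit` threshold are all hypothetical names and simplifications of mine, and the sketch pretends there is a single collection.

```python
from dataclasses import dataclass, field


@dataclass
class AutoIndex:
    fields: frozenset
    last_used: int  # logical clock tick of the most recent query served


@dataclass
class ToyOptimizer:
    idle_limit: int = 3  # assumed threshold: ticks an index may sit unused
    clock: int = 0
    indexes: list = field(default_factory=list)

    def query(self, *fields):
        """Return the index serving a query on `fields`, creating one if needed."""
        self.clock += 1
        wanted = frozenset(fields)
        # Reuse any existing index that already covers every queried field.
        for idx in self.indexes:
            if wanted <= idx.fields:
                idx.last_used = self.clock
                return idx
        # Nothing covers the query, so create an index for it on the fly.
        new = AutoIndex(wanted, self.clock)
        # Narrower indexes fully covered by the new one are now superfluous.
        self.indexes = [i for i in self.indexes if not i.fields <= wanted]
        self.indexes.append(new)
        return new

    def expire_idle(self):
        """Drop indexes that served no query within the last `idle_limit` ticks."""
        self.indexes = [i for i in self.indexes
                        if self.clock - i.last_used <= self.idle_limit]
```

In this model, querying on `Name` and then on `Name` plus `Age` leaves a single index over both fields, because the narrower one is dropped as superfluous. An index on `City` that is created once and never queried again disappears the next time `expire_idle()` runs after enough other queries have gone by. The real thing obviously has to deal with resource limits, background indexing and many more heuristics, none of which are modeled here.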

As you can imagine, this is a pretty important feature and has been through a lot of optimization and work over the years. RavenDB is now usually good enough at this task that in many cases, you don’t ever need to create indexes yourself. That has an enormous impact on the ability to rapidly evolve your product, because you are able to do that instead of going over a thousand-page book telling you how to optimize your queries. Write the code, issue your queries, and the database will adjust.

With all the praise that I heap upon automatic index creation, I want to note that it is at most a copper bullet, not a silver one. Just like with garbage collection, you are freed from the minutiae and tedium of manual memory management, but you still need to understand some of the system behavior. The good thing about this is that you are free()-ed from having to deal with that all the time. You just need to pay attention in rare cases, usually at the hotspots of your application. That is a much better way to invest your time.