Debug considerations for high level system architecture

time to read 4 min | 602 words

I run into this twit:

This resonated very strongly with me, because when we architected RavenDB 4.0, one of the key considerations was the issue of debuggability. RavenDB instances often run for months on end, usually only restarted to apply updates to OS or database. They are often running in production environments where it is not possible to do any meaningful debugging. We rely heavily on resolving issues through minidumps, core dumps, etc. Part of the work we did in architecting RavenDB 4.0 was to sit down and think about supporting the system in production.

For many of the core components, async was right out. Part of that was because of issues relating to the unpredictability of async execution, we want certain things to always happen first, avoid thread pool starvation / growth policies / etc. But primarily, we were sick and tired of getting a dump (or even just pausing a running instance when we debug a complex situation) and having to manually reconstruct the state of the system. Parallel stacks alone is an amazing feature for figuring out what is going on in a complex system.

The design of RavenDB called for any long lived task to run on a dedicated thread. These threads are named, so if you stop in the debugger, you can very quickly see what is actually is going on there. This is also useful for things like account for memory, CPU time, etc. We had a problem in a particular component that was leaking memory at a rate of 144 bytes per second, just under 12 MB per day. This is something that is very easy to lose in the noise. But because we do memory accounting on a thread basis, it was easy to go to a system that was running for a few weeks and see that this particular thing had 500MB of memory in use, when we expected maybe 15MB.

We still use async for handling of short term operations. For example, processing of a single request, because these are fast and if there are problems with them, we’ll usually see them already executing.

I’m really happy with this decision, since it provided us many dividends down the line. We planned this for production, to be honest, but it ended up really helpful in normal debugging as well.

This also allow us to take advantage of the fact that a thread that is not runnable is effectively free (aside from some memory, of course), so we can dedicate a full thread for these long running tasks and greatly simplify everything. An index in RavenDB always has its own dedicated thread, which is woken up if there is anything that this index needs to process. This means that indexing code is simple, isolated and we can start applying policies at the index level easily. For example, if I have an index that has a low priority, I can just adjust the thread’s priority and let the OS do the hard work of scheduling it accordingly.

Async simplifies the programming model significantly, but it also come at a cost of system complexity and maintenance overhead. Figuring out that you have a request stuck on a task that will never return, for example, is never pleasant. The same thing using blocking operations is immediately obvious. That is a benefit that should absolutely not be discounted.