Designing RavenDB’s debug log system


I have talked before about the roles of logs (for debug statements) in RavenDB. In this post, I want to start talking about the concrete design we have for them.

Overall guidelines (in no particular order):

  • As little configuration as possible.
  • Development / Operational – the only two levels we support.
  • Idiomatic – it should be easy to read the logging code and see whether it is good or not.
  • Built-in understanding of RavenDB mechanics – scoped to a database, etc.
  • As fast & cheap as possible (especially when disabled).
  • Memory efficient.

Flexible log configuration is a great idea if you are talking about a generic logging framework. We, however, have just one configuration option: writing to a size-limited text file, and that is it. (This isn’t strictly true, we also have an HTTP endpoint or two that we need to take care of, but it is true enough.)

So the log file name format is going to be fixed at:

  • 2016-06-26.001.log
  • 2016-06-26.002.log

Each of which is going to be roughly 256MB in size.
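
As a minimal sketch of how such a fixed naming scheme could be produced (the class, the method name, and the exact roll-over constant below are my assumptions for illustration, not the actual RavenDB code):

```csharp
using System;

public static class LogFiles
{
    // Assumed roll-over threshold, matching the ~256MB figure mentioned above.
    public const long MaxFileSize = 256L * 1024 * 1024;

    // Produces names such as 2016-06-26.001.log, 2016-06-26.002.log, ...
    public static string GetLogFileName(DateTime date, int sequence)
    {
        return $"{date:yyyy-MM-dd}.{sequence:D3}.log";
    }
}
```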

How is this done? We spin off a single background logging task that has a list of per-thread queues and runs over all of those queues. Each time we log, we do all of the work of preparing the log entry on the sending thread, and just send the finished buffer to the background logging thread. The idea is that the background logging thread should be focused on just writing to the log file, nothing else. Technically speaking, it would be much better to have a log file per thread, but that would require us to merge them later using dedicated tools, and it isn’t really worth it at this point.

Each thread actually has two queues: one that is used to communicate with the background logging thread, and the other to get back the buffers it allocated, for reuse. This allows us to drastically reduce the amount of memory held by logging, and it also gives us a good way to bound the amount of in-flight logging information.
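
To make the flow concrete, here is a minimal sketch of the two-queues-per-thread handoff, assuming a ConcurrentQueue-based implementation; every type and member name below is invented for illustration and is not the actual RavenDB code:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading;

// Invented names, for illustration only.
public class ThreadLogQueues
{
    // Formatted entries waiting for the background logging thread.
    public readonly ConcurrentQueue<StringBuilder> ToWrite = new ConcurrentQueue<StringBuilder>();
    // Buffers handed back by the background thread, for reuse.
    public readonly ConcurrentQueue<StringBuilder> Free = new ConcurrentQueue<StringBuilder>();
}

public class BackgroundLogger
{
    private readonly List<ThreadLogQueues> _allThreads = new List<ThreadLogQueues>();
    private readonly ThreadLocal<ThreadLogQueues> _current;
    private readonly TextWriter _output;

    public BackgroundLogger(TextWriter output)
    {
        _output = output;
        _current = new ThreadLocal<ThreadLogQueues>(() =>
        {
            var queues = new ThreadLogQueues();
            lock (_allThreads)
                _allThreads.Add(queues);
            return queues;
        });
        new Thread(WriteLoop) { IsBackground = true }.Start();
    }

    public void Log(string message)
    {
        var queues = _current.Value;
        // Reuse a buffer the background thread handed back, or allocate a new one.
        if (!queues.Free.TryDequeue(out var buffer))
            buffer = new StringBuilder();
        buffer.Clear();
        // All of the formatting work happens on the calling thread.
        buffer.Append(DateTime.UtcNow.ToString("o")).Append(' ').Append(message);
        queues.ToWrite.Enqueue(buffer);
    }

    private void WriteLoop()
    {
        while (true)
        {
            ThreadLogQueues[] snapshot;
            lock (_allThreads)
                snapshot = _allThreads.ToArray();

            foreach (var queues in snapshot)
            {
                // Drain a bounded batch from each thread, then hand the
                // buffers back for reuse.
                for (int i = 0; i < 16; i++)
                {
                    if (!queues.ToWrite.TryDequeue(out var entry))
                        break;
                    _output.WriteLine(entry.ToString());
                    queues.Free.Enqueue(entry);
                }
            }
            _output.Flush();
            Thread.Sleep(1);
        }
    }
}
```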

We can process over 400MB/second of logs using this approach on a laptop, so I’m pretty happy with the performance.

This has some particular behaviors, though. Because we are processing things on a per-thread basis, we’ll group operations from the same thread together in the log for a certain stretch (by default, up to 16 messages), even if in practice they were interleaved. This actually makes it easier to read the log, but it makes it harder when you want to see how different operations interleaved at the same time (a key aspect of reading logs in a multi-threaded environment). You can get the correlation from the log timestamps, of course, but that isn’t the same.

The limit of at most 16 consecutive operations per thread in the log should bound that, and the amount of cross-thread coordination we want to see is likely to be lower in the RavenDB 4.0 codebase, so that is an acceptable compromise.

In terms of the API, we define the following:

[Image: the logger API definition]

The idea is that when you create a logger, you specify who the logger is for (the BulkInsertHandler, in this case), as well as which database this logger belongs to. This makes it easier to read the logs after the fact.
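
A hypothetical sketch of what such a constrained API might look like (the type and member names here are my assumptions, not the actual RavenDB definitions):

```csharp
// Hypothetical shape, for illustration; not the actual RavenDB API.
public class Logger
{
    public string Source { get; }    // who is logging, e.g. "BulkInsertHandler"
    public string Database { get; }  // which database this logger is scoped to

    public Logger(string source, string database)
    {
        Source = source;
        Database = database;
    }

    public bool IsInfoEnabled { get; set; }
    public bool IsOperationsEnabled { get; set; }

    public void Info(string message)
    {
        // Format on the calling thread, then enqueue for the background writer.
    }

    public void Operations(string message)
    {
        // Same, but for the operational log.
    }
}
```

Creating one would then look something like `new Logger("BulkInsertHandler", database: "Northwind")`, where "Northwind" is just a placeholder database name.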

Then there is the usage:

[Image: example usage of the logger]

Note that this looks like a regular log usage, but this API is actually very constrained. We have only two log levels:

  • Information – The full log
  • Operations – Stuff that admins need to look at

In particular, we don’t have Warn / Error levels, because they are incredibly hard to get right. What does it mean to log an error if it is handled?

Another thing to note is that this API is really primitive: it accepts just a string (relying on C#’s string interpolation to make it nicer looking), and it assumes that it is always going to be called behind a check of whether logging is enabled. This lets us reduce costs when we actually log, because we can skip all the checks and the additional API surface that exists only to save performance when the right log level isn’t enabled.

In this case, we are forcing the caller to take care of that, making our life much simpler.
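
A hypothetical usage sketch of that pattern, with made-up variable names and the hypothetical `Logger` from the sketch above, showing the caller doing the level check before building the interpolated string:

```csharp
// Made-up variables, for illustration only; `logger` is the hypothetical
// Logger shown earlier.
int count = 10_000;
double durationMs = 320.5;

// The caller does the check, so the interpolated string is only built
// when the corresponding level is actually enabled.
if (logger.IsInfoEnabled)
    logger.Info($"Bulk insert of {count:#,#} documents completed in {durationMs} ms");

if (logger.IsOperationsEnabled)
    logger.Operations($"Bulk insert took longer than expected: {durationMs} ms");
```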

Other things that we don’t allow are filtering by logger, logger hierarchies, etc. We don’t need any of those things, and they have a cost.

So this is how things stand right now with our logging system.