Logging & Production systems


I got a question from Dominic about logging:

Jeff Atwood wrote a great blog post about over-using logging, where stack traces should be all a developer needs to find the root cause of a problem. Therefore ...

When building an enterprise level system, what rules do you have to deem a log message 'useful' to a developer or support staff?

This is the relevant part in Jeff’s post:

So is logging a giant waste of time? I'm sure some people will read about this far and draw that conclusion, no matter what else I write. I am not anti-logging. I am anti-abusive-logging. Like any other tool in your toolkit, when used properly and appropriately, it can help you create better programs. The problem with logging isn't the logging, per se -- it's the seductive OCD "just one more bit of data in the log" trap that programmers fall into when implementing logging. Logging gets a bad name because it's so often abused. It's a shame to end up with all this extra code generating volumes and volumes of logs that aren't helping anyone.

We've since removed all logging from Stack Overflow, relying exclusively on exception logging. Honestly, I don't miss it at all. I can't even think of a single time since then that I'd wished I'd had a giant verbose logfile to help me diagnose a problem.

I don’t think I could disagree with this more vehemently. This might be a valid approach if/when you are writing what is essentially single-threaded, single-use code. It just so happens that most web applications are composed of exactly that: the request code very rarely does threading, and it is usually just dealing with its own stuff. But for systems where most of the code is doing ongoing work, there really isn’t any alternative to logging.

You cannot debug multi-threaded code efficiently. The only way to really handle it is printf debugging, in which you write out what happens and then reconstruct the actual execution from the traces. And that is leaving aside one very important issue: it isn’t the exceptions that will get you, it is when your system is subtly wrong. Maybe it missed an update, or skipped a validation, or something just doesn’t look right, and you need to figure out what is going on.
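To make the printf-debugging point concrete, here is a minimal sketch of the kind of logging that lets you reconstruct an interleaved execution after the fact. It uses Python’s standard logging module purely for illustration; the logger name and the worker function are made up for the example:

```python
import logging
import threading

# Record the timestamp and thread name on every line, so the
# interleaved execution can be reconstructed from the trace later.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s [%(threadName)s] %(levelname)s %(name)s: %(message)s",
)

log = logging.getLogger("orders")

def process(order_id):
    log.debug("starting to process order %s", order_id)
    # ... the actual work would go here ...
    log.debug("finished processing order %s", order_id)

threads = [
    threading.Thread(target=process, args=(i,), name=f"worker-{i}")
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The program output is not what you debug; the trace is. When something is subtly wrong, you read the log and rebuild what actually happened, in what order, on which thread.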

And then you have distributed systems, where things may be happening concurrently on multiple machines, and good luck trying to get a grip on how to resolve problems there without logging.
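One common way to keep distributed traces usable (not something from the original post, just an illustrative sketch) is to stamp every log line with a correlation id that travels with the request, so lines from different services can be stitched back together. Again using Python’s standard logging module, with made-up logger names:

```python
import logging
import uuid

# Every line carries a correlation id, so log entries from different
# services handling the same request can be joined during analysis.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(correlation_id)s] %(levelname)s %(name)s: %(message)s",
)

# In a real system the id would arrive with the incoming request
# (for example, in a message or HTTP header); here we just generate one.
correlation_id = uuid.uuid4().hex[:8]
log = logging.LoggerAdapter(logging.getLogger("billing"),
                            {"correlation_id": correlation_id})

log.info("received charge request for customer 42")
log.info("charge accepted by the payment gateway")
```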

Speaking of which, here is a reply to a customer from one of our developers:

[image: the developer’s reply to the customer]

There is absolutely no way this would have been found without logging. The actual client-visible issue happened quite a bit later than the actual bug, and no exception was thrown.

Of course, this is all just about solving problems on the developer’s machine. When you go to production, the rules are different. Usually the only thing you have are the logs, and you need to be able to figure out what went wrong and how to fix it while the system is running.

I find that I don’t really sweat the Debug vs. Info and Warn vs. Error debates. The developers will write whatever they consider relevant in each case, and you might get entries that show up in the logs as errors for that feature but are merely warnings for the product as a whole. I don’t care; all of that can be massaged later, using log configuration & filtering. But the very first thing that has to happen is to have the logs. Most of the time you don’t care about them, but it is pretty much the same as saying: “We removed all the defibrillators from the building, because they were expensive and took up space. We rely exclusively on CPR in the event of a heart failure. Honestly, I don’t miss it at all. I can’t think of a single time since then that I’d wished I had a machine to send electricity into someone’s heart to solve a problem.”
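As a sketch of what “massage it later with configuration” can look like, here is the idea expressed with Python’s standard logging module (the logger names are made up; any decent logging framework supports the same kind of per-component filtering):

```python
import logging

# The code logs at whatever level the developer felt was appropriate;
# the configuration decides what actually ends up in the log.
logging.basicConfig(level=logging.WARNING)              # product-wide default
logging.getLogger("payments").setLevel(logging.DEBUG)   # the area under investigation
logging.getLogger("cache").setLevel(logging.ERROR)      # a chatty subsystem, quieted down

logging.getLogger("cache").warning("eviction took longer than expected")  # filtered out
logging.getLogger("payments").debug("retrying gateway call, attempt 2")   # kept
```

None of this requires touching the code that emits the messages, which is exactly why getting the log statements in there in the first place is the part that matters.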

By the time you realize that you need it, it is likely going to be far too late.