Logging & Production systems
I got a question from Dominic about logging:
Jeff Atwood wrote a great blog post about over-using logging, where stack traces should be all a developer needs to find the root cause of a problem. Therefore ...
When building an enterprise level system, what rules do you have to deem a log message 'useful' to a developer or support staff?
This is the relevant part in Jeff’s post:
So is logging a giant waste of time? I'm sure some people will read about this far and draw that conclusion, no matter what else I write. I am not anti-logging. I am anti-abusive-logging. Like any other tool in your toolkit, when used properly and appropriately, it can help you create better programs. The problem with logging isn't the logging, per se -- it's the seductive OCD "just one more bit of data in the log" trap that programmers fall into when implementing logging. Logging gets a bad name because it's so often abused. It's a shame to end up with all this extra code generating volumes and volumes of logs that aren't helping anyone.
We've since removed all logging from Stack Overflow, relying exclusively on exception logging. Honestly, I don't miss it at all. I can't even think of a single time since then that I'd wished I'd had a giant verbose logfile to help me diagnose a problem.
I don’t really think that I can disagree with this more vehemently. This might be a valid approach if/when you are writing what is essentially a single threaded, single use, code. It just so happens that most web applications are actually composed of something like that. The request code very rarely does threading, and it is usually just dealing with its own stuff. For system where most of the code is actually doing ongoing work, there really isn’t any alternative to logging.
You cannot debug multi threaded code efficiently. The only way to really handle that is to do printf debugging. In which you write what happens, and then construct the actual execution from the traces. And that is leaving aside one very important issue. It isn’t the exceptions that will get you, it is when your system is subtly wrong. Maybe it missed an update, or skipped a validation, or something just doesn’t look right. And you need to figure out what is going on.
And then you have distributed system, when you might have things happening concurrently in multiple systems, and good luck trying to get a good grip about how to resolve problems without using logging.
Speaking of which, here is a reply to a customer from one of our developers:
There is absolutely no way this would have been found without logging. The actual client visible issue happened quite a bit later than when the actual bug was, and no exception was thrown.
Of course, this is all just solving problems on the developer machine. When you go to production, the rules are different, usually the only thing that you have are the logs, and you need to be able to figure out what was wrong and how to fix it, when the system is running.
I find that I don’t really sweat Debug vs. Info and Warn vs. Error debates. The developers will write whatever they consider to be relevant on each case. And you might get Errors that show up in the logs that are error for that feature, but are merely warnings for the entire product. I don’t care, all of that can be massages later, using log configuration & filtering. But the very first thing that has to happen is to have the logs. Most of the time you don’t care, but it is pretty much the same as in saying: “We removed all the defibrillators from the building, because they were expensive and took up space. We rely exclusively on CPR in the event of a heart failure. Honestly, I don’t miss it at all. I can’t think of a single time since then that I’d wished I’d a machine to send electricity into someone’s heart to solve a problem.”
When you’ll realize that you need it, it is likely going to be far too late.
Comments
Thank you very much for your reply on this.
I've worked on a multi-threaded system before, and ThreadIds (especially as they can be changed to more meaningful names) have saved many a late night on logging those pieces. In previous posts, you've mentioned using a CSV, ready to import into a database in a triage situation. Again, I couldn't agree more. DB backups are much harder to gain approval for their release, than zipping a CSV.
Logging is an incremental process, and I'm sure early in any development stage, (mainly out of out of developer curiosity) you learn what is relevant to solving a problem, and what is really noise. Do you really, really want to know every entry/exit from a method?? Especially on huge multi-threaded systems? Multiply by 50 threads on a multi-core behemoth server.
And possibly review these logs periodically, much like a code review to determine are these logs really giving us enough to support the application remotely.
Dominic, I very rarely care about method enter/exit. I try to ensure that our logging actually detail meaningful stuff. But yet, they can be mind numbingly boring at time, but that is still important to be able to trace what hapened.
Comment preview