Ayende @ Rahien

It's a girl

Logging & Production systems

I got a question from Dominic about logging:

Jeff Atwood wrote a great blog post about over-using logging, where stack traces should be all a developer needs to find the root cause of a problem. Therefore ...

When building an enterprise level system, what rules do you have to deem a log message 'useful' to a developer or support staff?

This is the relevant part in Jeff’s post:

So is logging a giant waste of time? I'm sure some people will read about this far and draw that conclusion, no matter what else I write. I am not anti-logging. I am anti-abusive-logging. Like any other tool in your toolkit, when used properly and appropriately, it can help you create better programs. The problem with logging isn't the logging, per se -- it's the seductive OCD "just one more bit of data in the log" trap that programmers fall into when implementing logging. Logging gets a bad name because it's so often abused. It's a shame to end up with all this extra code generating volumes and volumes of logs that aren't helping anyone.

We've since removed all logging from Stack Overflow, relying exclusively on exception logging. Honestly, I don't miss it at all. I can't even think of a single time since then that I'd wished I'd had a giant verbose logfile to help me diagnose a problem.

I don’t really think that I can disagree with this more vehemently. This might be a valid approach if/when you are writing what is essentially a single threaded, single use, code. It just so happens that most web applications are actually composed of something like that. The request code very rarely does threading, and it is usually just dealing with its own stuff. For system where most of the code is actually doing ongoing work, there really isn’t any alternative to logging.

You cannot debug multi threaded code efficiently. The only way to really handle that is to do printf debugging. In which you write what happens, and then construct the actual execution from the traces. And that is leaving aside one very important issue. It isn’t the exceptions that will get you, it is when your system is subtly wrong. Maybe it missed an update, or skipped a validation, or something just doesn’t look right. And you need to figure out what is going on.

And then you have distributed system, when you might have things happening concurrently in multiple systems, and good luck trying to get a good grip about how to resolve problems without using logging.

Speaking of which, here is a reply to a customer from one of our developers:

image

There is absolutely no way this would have been found without logging. The actual client  visible issue happened quite a bit later than when the actual bug was, and no exception was thrown.

Of course, this is all just solving problems on the developer machine. When you go to production, the rules are different, usually the only thing that you have are the logs, and you need to be able to figure out what was wrong and how to fix it, when the system is running.

I find that I don’t really sweat Debug vs. Info and Warn vs. Error debates. The developers will write whatever they consider to be relevant on each case. And you might get Errors that show up in the logs that are error for that feature, but are merely warnings for the entire product.  I don’t care, all of that can be massages later, using log configuration & filtering. But the very first thing that has to happen is to have the logs. Most of the time you don’t care, but it is pretty much the same as in saying: “We removed all the defibrillators from the building, because they were expensive and took up space. We rely exclusively on CPR in the event of a heart failure. Honestly, I don’t miss it at all. I can’t think of a single time since then that I’d wished I’d a machine to send electricity into someone’s heart to solve a problem.”

When you’ll realize that you need it, it is likely going to be far too late.

Comments

Dominic Zukiewicz
02/06/2014 02:39 PM by
Dominic Zukiewicz

Thank you very much for your reply on this.

I've worked on a multi-threaded system before, and ThreadIds (especially as they can be changed to more meaningful names) have saved many a late night on logging those pieces. In previous posts, you've mentioned using a CSV, ready to import into a database in a triage situation. Again, I couldn't agree more. DB backups are much harder to gain approval for their release, than zipping a CSV.

Logging is an incremental process, and I'm sure early in any development stage, (mainly out of out of developer curiosity) you learn what is relevant to solving a problem, and what is really noise. Do you really, really want to know every entry/exit from a method?? Especially on huge multi-threaded systems? Multiply by 50 threads on a multi-core behemoth server.

And possibly review these logs periodically, much like a code review to determine are these logs really giving us enough to support the application remotely.

Ayende Rahien
02/06/2014 02:42 PM by
Ayende Rahien

Dominic, I very rarely care about method enter/exit. I try to ensure that our logging actually detail meaningful stuff. But yet, they can be mind numbingly boring at time, but that is still important to be able to trace what hapened.

Carsten Hansen
02/21/2014 10:59 AM by
Carsten Hansen

On my servers I have got a disk full error a few times due to logging :-)

I think in development you should be able to due a lot of logging and if you get problems in production you enable logging again. Exception-logging should always be done. If you have the time and knowledge, logging can be used in e.g. recommandation systems.

Gondar
02/21/2014 11:04 AM by
Gondar

Interesting post. Two things I am still missing in this conversation is: 1. Strongly typed logs - what I mean by that is logging actual objects (e.g. request/response) not just messages. And keeping their data formats in not serialised form. The benefit of such logs is tremendous as you can search, aggregate, compare and filter them.

  1. And when we think about all this stuff all that comes to my mind is that it needs tool to actually go through them. It's too easy to get lost in excessive log file with vim. But with tools like elasticsearch and kibana you can really easily correlate events from distributed system.

That's ofcourse perspective of web site development with micro-architecture in place, but I am sure it can find it's way to stand alone application development.

Ayende Rahien
02/21/2014 11:48 AM by
Ayende Rahien

Carsten, Yes, logging can bring down the system it is serving. That happens to us a few times, and it is a common production issue. That doesn't detract from how useful the logs are, and they should be used frequently and often.

Paul Turner
02/21/2014 02:14 PM by
Paul Turner

I think the only time tracing is useful is when it's in a multi-threaded environment and each entry is correlated to an activity. It's really very valuable to see the story of some activity passing through the components of the system, so you have appropriate context for an errors flagged. It's also the only way to deal with what you have called "subtle" errors, where no actual exception occurred, but the system isn't doing what it's supposed to do.

I have been working on a message-oriented system deployed into Windows Azure. Without using correlated traces, it'd be virtually impossible to see what was happening to a given message.

One worry with extensive tracing is that you end up building haystacks around needles (credit to James Geall for teaching me that term http://acceptnobrokenwindows.com/). It's important to trace the right information - method enters and exits are largely pointless, but knowing the system received a message (in a message-centric system) is pretty valuable as a basis for testing "did the system even get a message" when trying to understand a problem.

I've found it quite useful to keep my tracing output very minimal (errors-only typically), but go to full tracing on a per-request basis by setting a header in the request. This way you can send a test through the system to see one thing in detail and avoid building up lots of "hay" from the other requests you're not interested in.

Ultimately, I think the only rule about tracing has to be: think about what's going to be hard to diagnose once the code leaves your debugger and goes into the wild.

James Arendt
02/21/2014 02:18 PM by
James Arendt

After two recent systems where I've employed a solid amount of logging, I've come to the same conclusion as you, Oren. Not only can you not debug multi-threaded code well without logging, you cannot easily pin-point the causes of issues that may have started two requests prior, but part of the same session. If you have a SOA architecture, the problem may have originated in a different service.

I'm forever thankful of learning about log4net's NDC (nested data context) from a colleague. Leveraging the NDC and logging has allowed me to get to the root of an issue in terms of minutes in many cases instead of hours of debugging. And the hard ones that take long time to research are at least possible vs. the alternative -- shrugging and going, "no idea".

Luke Sigler
02/21/2014 04:43 PM by
Luke Sigler

"If you have a SOA architecture, the problem may have originated in a different service."

This is all too often overlooked. I love the additions that EntLib team has made to SLAB around this. The new ActivityId and RelatedActivityIds in .NET 4.5.1 has made my life 20 times easier.

I also think that the complaint most people have against logging is because they are suffering from alarm fatigue because developers. in my experience, log everything either as an "error" or "critical." I would be against it too if I was getting non-critical notifications all day long.

Piotr Perak
02/22/2014 12:08 PM by
Piotr Perak

In our system we use Castle.Windsor interceptors to log data. Especially in points that call other systems that we integrate with that proved to be very useful. Whenever there is a bug where we and other systems have data that don't match we just check logs to see what we sent. If it's ok then work is done on our part. Other system has to find a bug. Other systems' developers started to notice it that we always have all the answers and more and more often they ask us if we have logs for this.

And thanks to interceptors our classes are clean and don't know anything about logging!

Ayende Rahien
02/22/2014 12:16 PM by
Ayende Rahien

Piotr, I used that several times, but I usually really care about the decisions my system does. For example, why did I decided to use the single reduce strategy rather than the multi reduce strategy for this data item? Why did I change the thersholds for indexing at this time, etc...

Piotr Perak
02/22/2014 12:17 PM by
Piotr Perak

And in some of algorithms we add additional logging. Those algorithms are so complicated that often clients were filing bugs when everything was working fine (clients came up with those algorithms, not us). So we added special logging that writes only one line of log that contains information about every step of algorithm, with it's inputs and outputs. Thanks to that something that could be very hard or even impossible to diagnose is now very easy to reject as 'Not a Bug'.

Alois Kraus
02/22/2014 08:32 PM by
Alois Kraus

If you are after multithreading bugs then you should check out ETW. The CLR emits ETW Events which makes it possible to find hidden exceptions or to check when a garbage collection causes latency issues: http://www.geekswithblogs.net/akraus1/archive/2014/02/17/155442.aspx

Since ETW Events can capture the call stacks you can get the stack trace for any exception with much less overhead than .NET ever can do. It is on of the fastest ways of tracing on Windows which can be essential to find hard multithreading bugs.

Paul Turner
02/24/2014 10:26 AM by
Paul Turner

ETW is the right way to do diagnostic tracing on Windows. It's available on every modern version of Windows and the most recent changes from the BCL team make it quite easy to both produce and consume ETW events. It's best feature is the ability to change your trace listeners whilst the application is running, so you can capture very detailed information only when you really need it, without dramatically affecting a deployed system.

Unfortunately it isn't supported by the Windows Azure PaaS offerings, so if you're deploying there, no such luck until the Azure team gets around to it: http://feedback.windowsazure.com/forums/231545-diagnostics-and-monitoring/suggestions/4852776-add-support-to-include-etw-events-in-azure-diagnos

flukus
02/25/2014 12:10 AM by
flukus

To be fair, Jeff Atwood is being used as a straw man here. Application developments is a different world to database development, at least enough that you could both be 100% correct.

Comments have been closed on this topic.