In normal systems, when we want to understand what is going on, or to investigate a problem, we have a really simple option: just debug it until we find the root cause. This is so common that we actually have best practices warning against being too dependent on the debugger.
I think that we can all agree that the industry is moving more and more toward parallelism, and distributed applications are more and more common. This presents an interesting problem when we come to understand and troubleshoot a system. It is no longer possible to simply step through the code as it is executing and thus gain the knowledge that we need in order to understand exactly what is going on.
In order to understand these kinds of systems, we need to develop new tools and approaches. Microsoft already did some of that when they built web services for .NET. It is literally possible to debug through a web service call and move from the client to the server (assuming that they are on the same machine) just by pressing F11.
This doesn't really work for most scenarios, however. A common example is a system that is distributed in both time and space (imagine an authorization process that can take minutes or hours), or a system where the volume of messages is simply too high to understand individually.
In order to deal with this type of system, we need to go a long way back. Before we had debuggers, we still needed a way to figure out what was going on, and we found one.
Welcome to printf() debugging (or Response.Write debugging, if you prefer it that way).
And no, we are not quite in the same position, but we are close. One of the main problems here is that we need to coordinate several different machines and correlate work that was done at different times, on different machines, and at different speeds.
WCF calls this end-to-end logging, and achieves it by attaching a GUID that you have to carry around, plus some tools that allow you to merge different log files to give a unified view across systems.
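The underlying idea is simple enough to sketch outside of WCF. Here is a minimal, hypothetical Python sketch (the service names, messages, and in-memory log are all invented for illustration) of minting a correlation ID once at the edge of the system and stamping it on every log entry as the work flows through downstream services:

```python
import uuid

LOG = []  # stands in for the separate log files on each machine


def log(correlation_id, machine, message):
    # Every entry carries the correlation id, so entries from
    # different machines can later be stitched back together.
    LOG.append(f"[{correlation_id}] {machine}: {message}")


def client_request():
    # The id is minted exactly once, at the edge of the system...
    cid = str(uuid.uuid4())
    log(cid, "client", "sending order #123")
    order_service(cid, "order #123")
    return cid


def order_service(cid, order):
    # ...and every downstream call passes it along unchanged.
    log(cid, "orders", f"received {order}")
    billing_service(cid, order)


def billing_service(cid, order):
    log(cid, "billing", f"charging for {order}")


cid = client_request()
```

In a real system the ID would travel in a message header rather than as an explicit parameter, but the discipline is the same: nobody in the chain generates a new ID, everyone forwards the one they received.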
BizTalk has the notion of Business Activity Monitoring, and other tools share the same concept. All of them are based on the notion of a common id that spans multiple messages and can be used across machines to get a single view of the entire set of actions.
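Once every machine stamps its entries with that common ID, the "single view" is just a filter-and-merge. A hedged sketch, with made-up log data standing in for files gathered from three machines:

```python
# Hypothetical log entries from three machines, as
# (timestamp, machine, correlation_id, message) tuples.
# In practice these would be parsed from separate log files.
web_log = [
    (10, "web", "abc-1", "request received"),
    (11, "web", "xyz-9", "request received"),
    (12, "web", "abc-1", "calling approval service"),
]
approval_log = [
    (13, "approval", "abc-1", "queued for manual review"),
    (95, "approval", "abc-1", "approved"),
]
billing_log = [
    (96, "billing", "abc-1", "invoice issued"),
]


def unified_view(correlation_id, *logs):
    # Pull every machine's entries for one business activity
    # and order them by time to reconstruct the full story.
    merged = [e for log in logs for e in log if e[2] == correlation_id]
    return sorted(merged)  # tuples sort by timestamp first


story = unified_view("abc-1", web_log, approval_log, billing_log)
```

Note the gap between timestamps 13 and 95: that is the "distributed in time" part, where the activity sat waiting for a human, and exactly the kind of flow no debugger can step through.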
In a world that is fast becoming more and more distributed, such tools are quickly becoming essential, and I foresee quite a few best practices that will be aimed solely at ensuring that we keep that single thread of traceability in place.
I find it quite amusing that we are basically going back to reading log files to figure out what is going on with our applications. Of course, there are logs and there are logs, and I'll talk about logging & auditing in a lot more detail in a future post.