The role of logs
You are probably familiar with logging, and log levels. In particular, the following scheme is very popular.
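For reference, the popular scheme in question is the conventional severity ladder (debug through fatal/critical). Python's standard logging module, for instance, defines it with these names and numeric values:

```python
import logging

# The conventional severity ladder as defined by Python's standard
# logging module; most logging frameworks use a very similar set.
for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"):
    print(f"{name} = {getattr(logging, name)}")
# DEBUG = 10, INFO = 20, WARNING = 30, ERROR = 40, CRITICAL = 50
```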
I have been using this scheme in pretty much every software project I've had for over a decade, and it works. Which is good.
But I found that these levels are not actually very useful for our scenarios. In particular, we found many cases where a component that logged something as a warn or error would generate a support ticket from an eagle-eyed admin, even though it was just a valid part of the software's behavior. The other side of that is that when we need to use the logs to find a problem, we never use fine-grained levels; we just need it all.
This led me to believe that we only actually need two and a half levels.
- Informative – we are reporting so that one of our support engineers or a team member can look at it and understand what the software is doing.
- Operational – something deserves attention from the customer’s operations team.
There isn’t really anything in between those. Either this is something that we fully expect an operations team at the customer to look at, or this is something that the development and/or support engineers need to see.
But I mentioned two and a half; what is that half? Well, above operational there is another half a level, which I like to call Pay Attention Now. This isn’t actually a log level; it is a way to attract someone’s attention to the log, or to some sort of operational alert: an SMS, an email, etc.
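A minimal sketch of this scheme, using Python's logging module for illustration (the OPERATIONAL level value and the alert hook are assumptions for the sketch, not something from the post):

```python
import logging

# Sketch of the two-and-a-half level scheme: INFORMATIVE maps onto the
# stock INFO level, OPERATIONAL is a custom level above WARNING, and
# "Pay Attention Now" is not a level at all but a notification hook.
OPERATIONAL = 35  # assumed value; anything between WARNING (30) and ERROR (40)
logging.addLevelName(OPERATIONAL, "OPERATIONAL")

log = logging.getLogger("app")

def informative(msg, *args):
    log.info(msg, *args)              # for support engineers / team members

def operational(msg, *args):
    log.log(OPERATIONAL, msg, *args)  # for the customer's operations team

def pay_attention_now(msg):
    operational(msg)
    # Stand-in for the real out-of-band channel (SMS, email, pager);
    # not an actual API.
    print(f"ALERT: {msg}")
```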
But beyond the “something needs attention now”, there is nothing else that is needed. The production logs are either for routine operations monitoring (in which case they should be relatively succinct and written with an eye to a reader who isn’t familiar with the internal workings of RavenDB) or for “what is going on here” mode, which is typically used by a team member who needs as much information as possible.
Comments
Have seen that numerous times. NHibernate could log a foreign key constraint issue as ERROR even when it was handled gracefully. As with everything, it requires context. Logging is not the tool to manage operations; it helps in diagnosing. However, specific log statements can be used for operations teams. With log4net you would set filters on certain namespaces and log levels so that those log entries would be written, for example, to the Windows event log.
Two points:
- although you don't need all the log-level items for filtering, it could be useful to see how each log entry is classified when inspecting them
Dani, That is a problem to do in a general fashion. Log dispatch isn't a simple problem if you have a lot of different customers. If you are building an in-house system, that is easiest, but otherwise? A file is your best bet.
I think there's a need to differentiate between severity and the information channel. Log items may be informative or operational, but the amount of logging should be controlled by the level of detail required. When dealing with Linux systems, at a detailed log level it's typical to see each relevant processing step of an application. The latter introduces a serious perf penalty - which in case of an error is very good for digging into the details. In production mode this is mostly over the top and just wastes resources - at that point a much more general view (operational) is required.
Ideally a logging backend would have some sort of retention buffer that, in case of an error, flushes the buffered (more detailed) events into the log. So it's not just the error that gets logged - maybe there's some information about its origin in the more detailed records, and this helps to resolve the issue. This is especially good if you're dealing with "random" errors that are not reproducible and the system is required to be running all the time. One cannot put just any system into debug mode at any time.
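For what it's worth, this retention-buffer idea exists in Python's standard library as `logging.handlers.MemoryHandler`: detailed records are held in memory and only pushed to the real target once a record at or above `flushLevel` arrives. A minimal sketch:

```python
import logging
from logging.handlers import MemoryHandler

target = logging.StreamHandler()
# Buffer up to 1000 records; dump them all only when an ERROR (or worse)
# shows up, so the error arrives together with its detailed context.
buffered = MemoryHandler(capacity=1000, flushLevel=logging.ERROR, target=target)

log = logging.getLogger("svc")
log.setLevel(logging.DEBUG)
log.addHandler(buffered)

log.debug("step 1")   # held in the buffer
log.debug("step 2")   # held in the buffer
log.error("boom")     # flushes step 1, step 2, and the error together
```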
I really liked the concept. I think that log dispatching is already solved by frameworks and applications with stable, developed products built to solve exactly this kind of problem, like Graylog, Prometheus or Logstash. You could even use Serilog + sinks, which enable easy log dispatching; with one line you can send log events to databases like RavenDB or indexers like Elasticsearch. Log information is a delicate thing that depends on who is developing it; an Information-level event for one person could be a Debug-level event for another. It is complicated to draw a line that separates them beyond doubt. It seems like this approach reduces those differences, too. Of course, you end up losing some granularity, but it seems worth it.
Daniel, You are looking at this from the consumption side. From the producing side, generating so many log entries has its own cost, especially if we output detailed and complete logs, which require us to dump internal data structures.
Yep - we're working on both sides: administration for many customers and also development of applications. So we know how hard it is if there are missing hints in the logs - but we also see that overly detailed logs are a pain. I think the trick is to provide detailed logs depending on the situation (increased severity => more detailed logs). When everything is fine, logs may be reduced to an absolute minimum.
Having a system in production, we normally use mid logging levels and push the data into systems like the ELK stack / Splunk to analyse it. This is very helpful and we can easily see many errors and their origins.
On the other side we have systems where - if logging in detailed mode is enabled - the application is nearly unusable, and - that's the fun fact - many errors like race conditions don't even occur anymore because logging slows everything down so much. In fact, we're currently seeing exactly such an issue at a customer using a third-party application at a point of sale. Since we're required to have deep logging, the system is nearly unusably slow, and the non-reproducible error that occurred several times a day before has not occurred for a few days now.
This administrative viewpoint is why I'm personally always willing to pay some overhead for logging. It's not that I want to give away 50% of the performance, but having 5-10% gone is still okay. Debugging and searching for issues that might otherwise never be seen or resolved may take days. That's not worth the optimization.
Hello Ayende,
we have the same problems with our logging and the eagle-eyed admin in production, and also with irrelevant exceptions which pollute the log files and hide real errors. We decided to use the existing log levels but downgrade some errors, because they are accepted/handled or irrelevant.
e.g. When a user enters an invalid password, an exception occurs because the username/password is wrong and a connection cannot be opened. Technically this generates an exception, but by intention this is not an exceptional or dangerous situation, and no real error has occurred. This is expected behaviour, but we want to see the login attempts in our log, so we downgraded the caught exception to the "Info" level.
e.g. The same happens with connection-lost exceptions. The admins feared something was going wrong with so many "connection problems". Connections can be lost when a client does not log off and just closes the browser; then a connection lost/timeout can occur, but this is not an error (although it is a technical exception). Other connection errors (network problems) we handle with multiple retries, so an exception appears in the log, but a line below it a retry is logged. So we downgraded them to "Debug".
The admins do not know how to interpret this (because they do not know the internals, or which exceptions are handled or relevant); they just see the errors. We do not log all technical exceptions as errors; some are downgraded to "Info"/"Debug" (in the code there is a comment explaining why it was downgraded).
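The downgrade pattern described in this comment could be sketched like this (a minimal illustration; `InvalidCredentials` and the `check` callback are hypothetical names, not from any real library):

```python
import logging

log = logging.getLogger("auth")

class InvalidCredentials(Exception):
    """Hypothetical exception raised on a wrong username/password."""

def try_login(check, username, password):
    try:
        check(username, password)
        return True
    except InvalidCredentials:
        # Downgraded to INFO: a failed login is expected behavior, not an
        # error, but we still want the attempt visible in the log.
        log.info("failed login attempt for %s", username)
        return False
```

The same shape works for the retry case mentioned above: log each failed attempt at Debug and emit a single Error only once all retries are exhausted.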
Hi, Daniel. Why can deep logging cause a system to be slow? It seems to me that independently of how deep a log is, it should never affect the application speed. If you are buffering in another context, why is the application speed affected?
"we found that there are many cases where a component that logged something as a warn or error would generate a support ticket by eagle eyed admin"
I believe the existing logging levels are just fine. If you don't want an error to auto-generate a ticket, then just log it as a warning instead. I really like having 5 log levels (which I can actually extend to more).
Best, John