Performance counters suck
A while ago we added monitoring capabilities to RavenDB via performance counters. The intent was to give our users the ability to easily see exactly what is going on with RavenDB.
The actual usage, however, was a lot more problematic.
- The performance counters API can just hang, effectively killing us (since we try to initialize it as part of the database setup routine).
- They require specific system permissions, and can fail without them.
- They get corrupted, for mysterious reasons, and then you need to reset them all.
- Even after you created them, they can still die on you for no apparent reason.
I would have been willing to assume that we are doing something really stupid. Except that SignalR had similar issues.
What is worse, it appears that using the performance counters needs to iterate the list of printers on the machine.
This is really ridiculous. It has reached the point where I am willing to give them up entirely, if only I had something to replace them with.
Currently I have thrown a lot of try/catch blocks and a background thread at the problem to hide it, but that is ugly and brittle.
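To make the workaround concrete, here is a minimal sketch of the idea in Python rather than the actual RavenDB code: run the risky setup on a background thread, wait a bounded time, and carry on without counters if it hangs or throws. The names `init_with_timeout` and `setup` are made up for illustration.

```python
import threading


def init_with_timeout(setup, timeout_seconds):
    """Run setup() on a daemon thread.

    Returns the result of setup(), or None if it raises or does not
    finish within timeout_seconds. (Hypothetical helper, not RavenDB code.)
    """
    result = {}

    def worker():
        try:
            result["value"] = setup()
        except Exception as exc:  # missing permissions, corrupted counters, ...
            result["error"] = exc

    # A daemon thread will not keep the process alive if setup() never returns.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)

    if thread.is_alive() or "error" in result:
        return None  # caller falls back to running without counters
    return result["value"]
```

The brittleness is visible even here: a hung thread is abandoned rather than cancelled, and every later use of the counters needs the same kind of guard.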
I had the same problem as Wallace. After a lot of digging and a Stack Overflow question (http://stackoverflow.com/questions/19536688/performancecounter-hang-when-using-vs-2012-iisexpress8), I found this solution: http://support.microsoft.com/kb/300956.
Is there any alternative?
We had similar problems with them, but they are even worse than you describe: 1. Their names are limited in length. 2. When you use them as percentages, the order of their creation actually matters. 3. It's a nightmare to collect counters from multiple computers.
Finally, we moved to Splunk, reporting to it over TCP. It is an amazingly fast and reliable system. But not a cheap one...
Hi, I've had problems with Windows perf counters too, but a few years ago I got rid of them completely and came up with an alternative solution, which has the additional benefit of being much easier to set up and maintain.
The solution is based on NLog library, which is used by my company's software as the primary logging tool. Performance data is logged just as any other log messages, and then directed through NLog configuration to a collector program. Performance data is sent over UDP, so the communication overhead is mainly taken by network hardware. Apart from that, there's a dedicated local event aggregator plugged into NLog that collects very frequent events and calculates some stats on them before forwarding to the network - this is to reduce UDP traffic for high freq events.
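A rough sketch of the local aggregation step, in Python and without NLog, might look like the following. It is only an illustration of the idea described above: the class name, the line format, and the default host and port are all assumptions, not what the commenter's setup actually uses.

```python
import socket
from collections import defaultdict


class LocalAggregator:
    """Buffers frequent events and forwards one summary line per name over UDP.

    Hypothetical sketch; the wire format below is invented for illustration.
    """

    def __init__(self, host="127.0.0.1", port=9999):
        self.address = (host, port)
        self.socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.samples = defaultdict(list)

    def record(self, name, value):
        # Called on every event; nothing touches the network here.
        self.samples[name].append(value)

    def summarize(self):
        # One line per metric name: count, min, max, and average.
        lines = []
        for name, values in sorted(self.samples.items()):
            average = sum(values) / len(values)
            lines.append("%s count=%d min=%s max=%s avg=%.3f"
                         % (name, len(values), min(values), max(values), average))
        return lines

    def flush(self):
        # Called periodically (e.g. from a timer): send summaries, then reset.
        for line in self.summarize():
            self.socket.sendto(line.encode("utf-8"), self.address)
        self.samples.clear()
```

Calling `record` thousands of times per second then costs one list append each, and only `flush` produces UDP traffic.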
This way all performance counters can be configured externally with NLog config file, without stopping the application, and remote/centralized monitoring is very easy.
Apart from that, I have implemented a data collector/perf monitor application based on the well-known RRDTool utility. It is just a prototype, but it has been working in production for two years without too many problems. It's open source, with code available at https://github.com/lafar6502/cogmon. Sorry for the complete lack of documentation. I'm using this for monitoring application and system performance, plus some business process KPIs.
If you're interested and want to know more pls email me.
Remi, I have pointed users to that on several occasions, but that is really stupid. I don't want to have the ops burden of having to do this.
Maybe it's a good idea to look outside the .NET world at how others have solved the problem of metrics. Etsy's StatsD is one of the most popular ways to log performance data.
They have a very simple way of collecting performance data including from C# apps and they integrate with lots of backends for dashboarding. The most widely used is Graphite (some screenshots: http://graphite.wikidot.com/screen-shots).
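For a sense of how little a client has to do, here is a minimal sketch in Python of the StatsD plain-text format (`name:value|type`, with `c` for counters, `ms` for timings, and `g` for gauges) sent over UDP to the conventional port 8125. The class and the metric names used with it are made up for illustration; a real application would more likely use an existing client library.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line, e.g. 'requests:1|c'."""
    return "%s:%s|%s" % (name, value, metric_type)


class StatsdClient:
    """Fire-and-forget UDP client (hypothetical sketch, not a real library)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.address = (host, port)
        self.socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def increment(self, name, count=1):
        self._send(format_metric(name, count, "c"))

    def timing(self, name, milliseconds):
        self._send(format_metric(name, milliseconds, "ms"))

    def gauge(self, name, value):
        self._send(format_metric(name, value, "g"))

    def _send(self, payload):
        try:
            self.socket.sendto(payload.encode("ascii"), self.address)
        except OSError:
            pass  # losing a metric is acceptable; failing the caller is not
```

A call such as `StatsdClient().increment("ravendb.requests")` sends a single small datagram and returns immediately, whether or not a server is listening.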
Another idea is to expose metrics data as a JSON feed/web service, in the same way Java's Metrics library (http://metrics.codahale.com/) does. If you take the time, you will discover the wealth of stats Metrics computes for you. There is a .NET port of it named Metrics.Net, but I've never used it in a production scenario because StatsD + Graphite is so cool.
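As a rough illustration of the JSON feed idea (not of how Metrics or Metrics.Net implement it), a process can keep its counters in memory and serve them from a small HTTP endpoint. In this Python sketch, the `/metrics` path, the port, and the counter names are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters, updated by the application as it runs.
METRICS = {"requests": 0, "indexing_errors": 0}


def render_metrics(metrics):
    """Serialize the current counters as a UTF-8 encoded JSON document."""
    return json.dumps(metrics, sort_keys=True).encode("utf-8")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    # Blocks forever; a real application would run this on its own thread.
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Anything that can issue an HTTP GET can then poll the feed, with no operating system level registration involved.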
Robert, Thank you very much, I'll be looking very closely at Metrics.NET
Please don't dismiss StatsD and Graphite just because they are not natively .NET. They are infrastructure indeed and live around your platforms, but it would be extremely beneficial for RavenDB to be able to report internal counters so they can be analyzed in the entire context (e.g. plotting counters on graphs along with deployments, OS data, etc.).
The number of statistical, aggregation, and customisation functions supported by Graphite is simply astonishing, and only professional paid monitoring solutions match what Graphite can do. I don't see any reason, for example, not to use them yourself in your company to monitor your cloud infrastructure for RavenHQ (e.g. plot performance by week/day/time of day, display requests/sec of any type, etc.). Even Metrics (the Java version) has a Graphite reporter.
Just take a look at some resources to discover their power:
http://www.codinginstinct.com/2013/03/metrics-and-graphite.html
http://www.slideshare.net/itnig/collecting-metrics-with-graphite-and-statsd
http://codeascraft.com/2010/12/08/track-every-release/
http://obfuscurity.com/2012/05/A-Precautionary-Tale-for-Graphite-Users
Robert, I think that you are missing a crucial point. There is a big difference between the reporting (StatsD, Graphite, etc.) and the actual metrics. What I need to do right now is collect the metrics; how I report them is a separate issue.
My point is that, as I see it, there is a fine line between basic metrics collection (timing durations, checkpointing or counting) and having additional logic inside the metrics library in order to compute during application's run stats like histograms (median and other percentiles) or rate of events (e.g. /1sec/1min/5min,etc). Codahale's Metrics library has this approach.
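To illustrate what "additional logic inside the metrics library" means, here is a deliberately naive in-process histogram in Python. It keeps every sample and uses the nearest-rank method for percentiles. That is an assumption made for brevity, and is not how Codahale's Metrics does it internally.

```python
import math


class Histogram:
    """Naive in-process histogram (illustrative sketch only)."""

    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def percentile(self, p):
        """Nearest-rank percentile for 0 < p <= 100."""
        if not self.values:
            raise ValueError("no samples recorded")
        ordered = sorted(self.values)
        rank = max(1, math.ceil(p * len(ordered) / 100.0))
        return ordered[rank - 1]

    def snapshot(self):
        # The kind of summary a metrics library computes inside the application.
        return {
            "count": len(self.values),
            "median": self.percentile(50),
            "p95": self.percentile(95),
            "p99": self.percentile(99),
            "max": max(self.values),
        }
```

The application can answer "what is my 95th percentile query time?" on its own, at the cost of memory and CPU spent inside the process.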
The other approach is to adopt the more lightweight basic collection of metrics inside the application and delegate the statistical calculations to infrastructure, by sending metrics as raw data, either one by one or in batches at regular intervals, to a central server. This is the approach of the StatsD client library and, to some extent, of Windows performance counters.
Of course I distinguish between collection and reporting; the sites given were merely meant to highlight the idea of optimizing the system as a whole vs. a local optimum. Anyway, I can't see any reason why you can't mix both techniques in order to address metrics collection and basic stats (maybe in your own admin dashboard or API endpoints) and also give your users the possibility of more insight into their platforms as a whole by playing nicely with some established infrastructure like Graphite or Cacti. I might've missed your point if you were just looking for metrics library API design, concepts or details about their implementation, but I've taken my chance. :)
Naturally, you know better what you are after and the last judgement is yours. I'm glad that you found something helpful or just interesting in Metrics.Net to solve this problem that you've raised in the post.
Robert, My main issue is that we cannot just rely on an external source, which may or may not be available. We have to be able to provide the full information to the user in a self contained package.
Why not use EventSource/ETW and log events if needed?
Harry, very complex to use.
@Ayende I was looking at that new logging stuff from Microsoft and my response directly on the msdn or codeplex post where they announced it was WTF. They made the most ridiculous system possible. I seriously don't know how they could make a worse experience for actually using it. I have no idea how good it is in use, but as a developer it is atrocious.
@Ayende EventSource/ETW is a lot easier to use now with the newly released libraries, e.g. the EventSource and EventTrace libraries available via NuGet. You probably already know this.
But yes it is probably more complex if your scenario is very simple. I guess multi-platform is also an issue.
Another vote for the statsd/graphite combination. It might not be what you need in your scenario, but when all you need is real-time metrics that combo is hard to beat.