Ayende @ Rahien

It's a girl

Reviewing Metrics .NET project

This is a review of the Metrics.NET project (commit cb52da325c0a88336e09412638f72620d9ba7992).

The project is supposed to give us a way to track metrics about our applications, and we want to make use of it in RavenDB instead of the highly unreliable performance counters. This is going to be a pretty short review, mostly because I don’t really have much to say.  There are a few things that I take issues with (async tasks using Thread.Sleep instead of Task.Delay, but that is probably because it is targeting .NET 4.5).

Other than that, most of the code is actually doing crazy math stuff, and there are proper links to the explanations, and you can see why it is doing so. The impressive thing is that pretty much everything that I wanted to do was already there. Including an easy way to expose metrics over the wire, and that the whole things seems pretty seamless.

Very good work, all around.

For our purposes, however, I think we’ll need to do some other things. In particular, one of the major assumptions throughout the code is that there is always a type associated with a metric, which isn’t the case for our purposes.  More than that, the code now assumes a static and fixed set of metrics for the entire system. That doesn’t work very well for us when we have different metrics for each database, so we’ll probably need to change that as well.

But I am very impressed, this looks like it could sort out a lot of the things that we need to do very quickly.

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

Daniel Crenna
01/16/2014 03:35 PM by
Daniel Crenna

Cool to hear that this project is being used for something, I wrote it when I was inspired by the original, and needed it for a particularly gnarly distributed app I was working on. I'm mostly doing games now but I'm happy to shepherd changes and like the idea of moving more to a registry pattern for a metrics manifest. The MetricsListener obviously needs some love too since it was just written for a single purpose.

Scooletz
02/05/2014 10:15 AM by
Scooletz

@Ayende, thx for sharing your positive opinion about the stuff u use @Daniel Crenna, I think you just got a nice kick for your project :)

Rémi BOURGAREL
02/05/2014 10:28 AM by
Rémi BOURGAREL

I was just having a look at it and saw your commits, I'll definitely try it. The next step is to use the results.

Joseph Daigle
02/05/2014 12:31 PM by
Joseph Daigle

@Ayende, do you have a post or comments on how windows performance counters have been unreliable for RavenDB? I'm curious to learn the pitfalls and shortcomings that you've found.

Russ
02/05/2014 12:47 PM by
Russ

Depending on how many updates you run through this, you might want to keep an eye on your throughput because some methods in here (especially HistogramMetric.Update, which uses multiple interlocks) will scale terribly on large boxes with a high number of cores. It's much faster to partition each metric across all cores and use normal locks. Compare http://blog.kejser.org/2014/01/04/synchronisation-in-net-part-3-spinlocks-and-interlocks/ to http://blog.kejser.org/2014/01/05/synchronisation-in-net-part-4-partitioned-data-structures/

Torkel Ödegaard
02/05/2014 03:42 PM by
Torkel Ödegaard

I have been obsessed with metrics and getting realtime and long term trend analysis insights into applications.

I just must plug Graphite, the industry leading open source time series database and graph plotter. Not so easy to install but once you have it up and running you gain incredible power and flexibility.

http://www.codinginstinct.com/2013/03/metrics-and-graphite.html

There are a gazillion metric frameworks that send data to graphite. I based mine on twitters ostrich (https://github.com/twitter/ostrich), very easy nice API. There is a .NET port but seems kind of dead, I haven't open sourced my .net clone yet :(

For anyone using Graphite to display/store metrics, check out Grafana (http://grafana.org), a feature rich dashboard editor and graph composer that has quickly gained popularity within the graphite community.

Ayende Rahien
02/05/2014 03:59 PM by
Ayende Rahien

Joseph, see: http://ayende.com/blog/165217/performance-counters-sucks

Ayende Rahien
02/05/2014 04:08 PM by
Ayende Rahien

Russ, That is very interesting information. I don't think that this is that big an issue, in particular because we don't need to do that many update ops in a single ms, but that is something that I'll keep in mind.

Ayende Rahien
02/05/2014 04:09 PM by
Ayende Rahien

Torkel, "Not so easy to install" pretty much guarantee that it won't be there in RavenDB, at least not as part of our default package. Our whole concept is around making things easy.

Torkel Ödegaard
02/05/2014 04:26 PM by
Torkel Ödegaard

Ayende, I did not mean that you would use/bundle Graphite with RavenDB. Graphite does not run on windows at all, if you haven't investigated it check it out, if you are interested in time series stores and graph plotting (which seems to be the case). There is a reason that it has become a leader in this space (mostly due to its simple metric format and ways of sending in metrics, and the hundreds of functions to combine, compare and plot metrics in incredibly flexible and powerful ways.

For another interesting project that is gaining interest is influxdb (time series store and graph query language), no dependencies, written in golang.

Chris Marisic
02/05/2014 04:46 PM by
Chris Marisic

" There are a few things that I take issues with (async tasks using Thread.Sleep instead of Task.Delay, but that is probably because it is targeting .NET 4.5)."

Did you mean to say targeting .NET 4.0 or 3.5? 4.5 certainly doesn't make sense there as it is the latest and greatest which obviously has task.delay

Bartosz Adamczewski
02/05/2014 09:42 PM by
Bartosz Adamczewski

@Russ, The situation is key here as interlocked will be generally faster if a proper back off is implemented and the structure is aligned. Spin locks tend to trash the pipeline with operations due to branch prediction strategies, but this can be avoided I think.

On the other hand should a lock be held longer then around 30 ms, then it will go to the kernel and emit a full lock instruction which is very expensive.

I haven't checked the method that you described but depending on the average running time of the method and loop implementation it just might be faster, some measurements would need to be made :)

Ayende Rahien
02/05/2014 09:45 PM by
Ayende Rahien

Chris, Yes, I meant 4.0, not 4.5

Bartosz Adamczewski
02/05/2014 09:59 PM by
Bartosz Adamczewski

@Russ, Another thing is that CAS loops are only effective up to their physical core count beyond that a hybrid structure is more preferable.

Chris Marisic
02/05/2014 10:27 PM by
Chris Marisic

So many things said in these comments that I know absolutely nothing about, even more so than alot of the file system discussions, and these are so many things I'm happy that I DON'T need to know about.

Ayende Rahien
02/06/2014 04:56 AM by
Ayende Rahien

Chris, Yeah, that is what makes writing RavenDB so much fun / frustrating.

Russ
02/06/2014 10:24 AM by
Russ

@Bartosz

Although interlocked update does not obtain a kernel lock, it DOES still obtain a lock on the cache line - so calling it many times in a highly-concurrent scenario can cause contention. The Update method I mentioned will, best case, obtain 7 individual cache line locks:

1 for _count.IncrementAndGet 1 for _sample.Update (plus a ReaderWriterLockSlim read lock and a ConcurrentDictionary bucket lock, plus another cache line lock and a ReaderWriterLockSlim write lock if rescaling is required) 1 for SetMax 1 for SetMin 1 for _sum.GetAndAdd 2 for UpdateVariance (assuming this isn't the very first Update, in which case it's just 1)

Of those, SetMax, SetMin, and UpdateVariance may need significantly more if they are forced to loop after a failed CAS, with no upper bound.

Take a look at the blog posts I linked (I work with Thomas, and implemented the library that this research was for). The partitioned data structure performs about as well (within a factor of about 1.3x) as the interlocked version at low scale, but as soon as you go above about 6 cores even the basic Monitor.Enter will leave interlocked in the dust when used like this.

Bartosz Adamczewski
02/06/2014 11:31 AM by
Bartosz Adamczewski

@Russ, I'm quite aware about the cache line lock by lock cmpxchg, the thing is that this lock is very short lived (1 instruction) but when introduced to lot's of spins this can get out of hand, this a proper back-off is always needed (like linear back-off and write back-off for e.g) as well as proper structure alignment and partitioning.

So in conclusion proper spin waits are hard, simple lock is much easier but on the other hand should a lock fall though to the kernel and lock all of the processing pipelines (and first wait for all of them to finish) this would be very bad so the time of holding the lock is a key factor here.

So in this case the Update Method would probably be better off with fine grained locks, but again measurements are everything :)

"Of those, SetMax, SetMin, and UpdateVariance may need significantly more if they are forced to loop after a failed CAS, with no upper bound." - So this proves your point, there is no back-off.

Thanks for pointing out the links, (awesome stuff) and sharing some cool insights :).

Russ
02/06/2014 11:54 AM by
Russ

The thing about back-offs when using interlocks, however, is what do you do then? Say I loop 5 times on my CAS and still haven't managed to do an update? Do I throw the metric away? Do I try to rollback the changes I made successfully (since these are individual atomic operations, I might successfully increment count then fail to SetMax). It can get nasty, fast. If I do all of this with normal increments inside a partitioned lock, I don't have any of these problems (simplicity) and I'm fast (normal increments are faster than interlocks).

You are right that the time holding the lock is vital, but that's also why we partition - if we have, say, 16 locks (on a 16-core box) and mod the thread id by 16 to get simple thread affinity, the chances of lock contention drop rapidly. An uncontested Monitor.Enter is amazingly fast, considering.

Incidentally, Windows Performance Analyzer is utterly great for seeing what's going on with your locks :-) It's not as well-known as it should be.

Bartosz Adamczewski
02/06/2014 12:42 PM by
Bartosz Adamczewski

@Russ,

You may start with user mode context switch, then go to wait you set the interrupt times in a linear fashion (linear back-off).

You may after a set of tries back of from calling lock cmpxchg by mataining a structure who owns the lock which may by not an atomic information (depending on a case) but this will increase the probability that next time you do cmpxchg it will be a hit.

You may implement a wait free guarantee based of windows events such that if you exceed the spin limit you block all of the other spinners, and do cmpxchg, then you start to unblock them either linearly or by small bursts so they will benefit the most from boosting, thus again increasing their chances.

But like you said simplicity is the key here, all of the methods described range from hard to extreme.

(we got a bit off-topic and off the rails here, sorry @Ayende)

Comments have been closed on this topic.