The metrics calculation methods
Any self-respecting database needs to be able to provide a whole host of metrics for the user.
Let us talk about something simple, like the requests/second metric. This seems like a pretty easy metric to have, right? Every second, you have N requests, and you just show that.
But it turns out that just showing the latest req/sec number isn't very useful, primarily because a lot of traffic actually has valleys & peaks. So you want to have the req/sec not for a specific second, but over some period (like the req/sec over the last minute & the last 15 minutes).
One way to do that is to use an exponentially weighted moving average. You can read about its use in Unix in these articles. But the idea is that as we add samples, we put more weight on the recent samples, while still taking historical data into account.
That has the nice property of reacting quickly to changes in behavior while smoothing them out, so you see a gradual change over time. The bad thing about it is that it isn't accurate (in the sense that it isn't easy to correlate the value to exact numbers), and it smooths out changes.
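To make that concrete, here is a minimal sketch of the technique (my own names and parameters, not RavenDB's actual code):

```csharp
// Minimal EWMA sketch: tick once per second with the latest request count.
// The smoothing factor decides how much weight the newest sample gets.
public class Ewma
{
    private readonly double _alpha; // 0 < alpha <= 1; higher = reacts faster
    private double _rate;
    private bool _initialized;

    // For a one-minute horizon ticked every second: alpha = 1 - e^(-1/60).
    public Ewma(double alpha) => _alpha = alpha;

    public void Tick(long requestsThisSecond)
    {
        if (!_initialized)
        {
            _rate = requestsThisSecond; // seed with the first sample
            _initialized = true;
            return;
        }
        // Nudge the running rate toward the newest sample.
        _rate += _alpha * (requestsThisSecond - _rate);
    }

    public double RequestsPerSecond => _rate;
}
```

This is the same shape of calculation the Unix load average uses; only the smoothing factor and tick interval differ.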
On the other hand, you can keep exact metrics. Going back to the req/sec number, we can allocate an array of 900 longs (enough for 15 minutes, with one measurement per second) and just use it as a cyclic buffer to store the per-second counts. The good thing about that is that it is very accurate, and we can easily correlate the results to external numbers (such as the results of a benchmark).
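A sketch of what that cyclic buffer might look like (again, illustrative names, not the actual implementation):

```csharp
// Exact per-second request counts for the last 15 minutes in a ring buffer.
public class RequestsRing
{
    private readonly long[] _buckets = new long[900]; // 15 min * 60 sec

    // Call once per second; overwrites the bucket from 15 minutes ago.
    // A real implementation also has to zero buckets for idle seconds.
    public void Record(long unixSeconds, long requests) =>
        _buckets[unixSeconds % 900] = requests;

    // Average req/sec over the trailing window, e.g. 60 or 900 seconds.
    public double Average(long nowUnixSeconds, int windowSeconds)
    {
        long total = 0;
        for (int i = 0; i < windowSeconds; i++)
            total += _buckets[(nowUnixSeconds - i) % 900];
        return (double)total / windowSeconds;
    }
}
```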
With the exact metrics, we get the benefit of having per-second data, so we can look at peaks & valleys and measure them. With the exponentially weighted moving average, we get a more immediate response to changes, but it is never actually accurate.
It is a bit more work, but it is much more understandable code. On the other hand, it can result in strangeness. If you have a burst of traffic, let's say 3,000 requests over 3 seconds, then the average req/sec over the last minute will stay fixed at 50 req/sec for nearly a whole minute. Which is utterly correct and completely misleading.
I'm not sure how to handle this specific scenario in a way that is both accurate and in line with what the user expects.
Comments
How about calculating the last-minute average like this: TotalReq = TotalReq - Req61SecondsAgo + CurrentReq; LastMinuteAvg = TotalReq / 60
Whenever a second passes, you subtract the request count from 61 seconds ago (which leaves the window) from the total and add the current request count (which enters the window), then compute the average. This way it's an average over the window of the last 60 seconds.
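A sketch of that suggestion in code (hypothetical names, just to show the shape of it):

```csharp
// Rolling one-minute average maintained incrementally, O(1) per second.
public class RollingMinute
{
    private readonly long[] _perSecond = new long[60];
    private long _totalReq;
    private int _pos;

    // Call once per second with that second's request count.
    public double OnSecondElapsed(long currentReq)
    {
        _totalReq -= _perSecond[_pos]; // count falling out of the window
        _perSecond[_pos] = currentReq; // count entering the window
        _totalReq += currentReq;
        _pos = (_pos + 1) % _perSecond.Length;
        return _totalReq / 60.0;       // avg req/sec over the last minute
    }
}
```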
Pop Catalin, We already have the number of requests per minute available (since we have the number of req per second in that time frame). We don't want the req per minute; we want the req per second in the last minute.
"we want the req per second in the last minute"
Isn't that a rolling window with the average number of requests in the past 60 seconds from the current time?
Pop Catalin,
Requests per minute don't really indicate the actual load on the server; req/sec is much more accurate, but you want to see it over time ranges, and that is the problem.
"but you want to see if over time ranges, that is the problem" You want to see it Aggregated (Avg, Max, Sum) ? or as series (graph)?
Provide several measures - total requests in the last second, last 10 seconds, last 5 minutes - then users choose what they want to see. Usually data collection tools record a sample every 5 minutes or so, so 5-minute resolution accommodates them well, and 1- or 10-second resolution is good for watching the status online. Further aggregation can be done by the monitoring tool itself.
You could track a rolling average with its standard deviation and additionally record "peaks" and "cliffs" over some time period, the peaks and cliffs signifying loads outside of the average ± n SD. Then next to the average and SD you could present the number of peaks/cliffs (or their ratio to "in-band" loads) and their most extreme values.
Also, or alternatively, you could make use of something like a simple process control statistic/chart (e.g. CUSUM or Shewhart) to detect and visualize whether there are certain "out-of-control" trends.
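A sketch of the first part of that idea, using Welford's online algorithm for the running mean and variance (hypothetical code; n = 3 is an arbitrary default):

```csharp
using System;

// Counts samples falling outside mean ± n standard deviations.
public class BandTracker
{
    private readonly double _n;  // band width in standard deviations
    private long _count;
    private double _mean, _m2;   // Welford running mean / sum of squared deltas

    public long Peaks, Cliffs;

    public BandTracker(double n = 3) => _n = n;

    public void Add(double sample)
    {
        if (_count > 1)
        {
            double sd = Math.Sqrt(_m2 / (_count - 1));
            if (sample > _mean + _n * sd) Peaks++;
            else if (sample < _mean - _n * sd) Cliffs++;
        }

        // Welford's update: numerically stable running mean and variance.
        _count++;
        double delta = sample - _mean;
        _mean += delta / _count;
        _m2 += delta * (sample - _mean);
    }
}
```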
Since you want to know the server load per second, I'm thinking about the most accurate values. That means saving every request; lots of data, of course, but you could just save all incoming requests and insert them into the database every X records or after a certain time.
Once you have the most accurate data, you can calculate the requests per second every minute or two.
Think about how Azure does it: they show the data after a certain amount of time, so they can do some calculations too.
I'd go for a per-interval (second?) histogram (HdrHistogram?) approach, since you can extract almost anything from histograms. They are of course larger than a single long integer, but having detailed information is usually well worth the effort when trying to meet some kind of internal SLA (like "no request should be slower than 50 ms").
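To illustrate the idea (this is a crude hand-rolled stand-in, not the real HdrHistogram API, which offers configurable precision at much lower cost):

```csharp
using System;

// Crude per-interval latency histogram: one bucket per millisecond up to 1s.
public class LatencyHistogram
{
    private readonly long[] _buckets = new long[1001]; // 0..999 ms + overflow

    public void Record(long milliseconds) =>
        _buckets[(int)Math.Clamp(milliseconds, 0, 1000)]++;

    // e.g. ValueAtPercentile(99) => "99% of requests took at most this many ms".
    public long ValueAtPercentile(double percentile)
    {
        long total = 0;
        foreach (long b in _buckets) total += b;

        long target = (long)Math.Ceiling(total * percentile / 100.0);
        long seen = 0;
        for (int i = 0; i < _buckets.Length; i++)
        {
            seen += _buckets[i];
            if (seen >= target) return i;
        }
        return 1000;
    }
}
```

Keeping one such histogram per interval would also answer the percentile question raised in the next comment.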
You want to show an average request rate and a current request rate. That's two metrics. One number just can't combine these two properties in a meaningful way, for the reasons you state. Why is your last example misleading? What number would you like to see there?
Have you considered percentiles? The 95th, 50th (median), and 5th are pretty standard. The spacing between them gives an indication of how "choppy" the underlying data is without the peak value blowing out the chart scale... or you can plot the peak using a different scale.
This reminds me of a post of yours from a few years back
https://ayende.com/blog/162273/raven-xyz-trying-out-some-ideas
Are they related? Did anything come of that thought?
Piers, No, this is about metrics for the req/sec on RavenDB.
Have you ever thought of pushing or pulling these "metrics" out of process and letting someone else do the actual math? A lot of systems provide such events to 3rd-party systems. For example, etcd, Kubernetes, and SkyDNS provide stats for Prometheus by default. Due to the nature of your process this has to be super optimized, of course...
Sotirios, We also do that (via SNMP), but the idea is that we also want to have some basic stats available in the product.