The metrics calculation methods


Any self-respecting database needs to be able to provide a whole host of metrics for the user.

Let us talk about something simple, like the requests / second metrics. This seems like a pretty easy metric to have, right? Every second, you have N number of requests, and you just show that.

But it turns out that just showing the latest req/sec number isn’t very useful, primarily because a lot of traffic actually has peaks & valleys. So you want to have the req/sec not for a specific second, but over some recent window (like the req/sec over the last minute & the last 15 minutes).

One way to do that is to use an exponentially-weighted moving average. You can read about their use in Unix in these articles. But the idea is that as we add samples, we’ll put more weight on the recent samples, but also take into account historical data.
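To make the weighting concrete, here is a minimal sketch in Python. It assumes one sample per second and uses the same decay form as the Unix load average; the `Ewma` class and its `window_seconds` parameter are names I made up for illustration, not anything from a real library.

```python
import math


class Ewma:
    """Exponentially weighted moving average of a req/sec reading.

    Hypothetical helper: assumes tick() is called exactly once per second
    with the number of requests seen during that second.
    """

    def __init__(self, window_seconds):
        # Weight given to the newest sample; a longer window means a smaller weight.
        self.alpha = 1.0 - math.exp(-1.0 / window_seconds)
        self.rate = None

    def tick(self, requests_this_second):
        if self.rate is None:
            # First sample: nothing to blend with yet.
            self.rate = float(requests_this_second)
        else:
            # Move the current estimate a fraction of the way toward the new sample.
            self.rate += self.alpha * (requests_this_second - self.rate)
        return self.rate
```

You would keep one instance per window, for example `Ewma(60)` and `Ewma(900)`, and feed both the same per-second count.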

That has the nice property that it reacts quickly to changes in behavior, but it smooths them out so that you see a gradual change over time. The bad thing about it is that it is not accurate (in the sense that it isn’t very easy for us to correlate it to exact numbers) and that it smooths out changes.

On the other hand, you can take exact metrics. Going back to the req/sec number, we can allocate an array of 900 longs (so enough for 15 minutes with one measurement per second) and just use this cyclic buffer to store the details. The good thing about that is that it is very accurate, we can easily correlate results to external numbers (such as the results of a benchmark).
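A rough sketch of that cyclic buffer, in Python rather than whatever the database itself is written in. The `RequestCounter` name and its methods are assumptions for illustration, and it relies on something calling `advance()` once per second.

```python
class RequestCounter:
    """Exact per-second request counts kept in a fixed-size cyclic buffer.

    Hypothetical helper: one slot per second, 900 slots = 15 minutes.
    """

    def __init__(self, capacity=900):
        self.slots = [0] * capacity
        self.pos = 0  # index of the slot for the current second

    def record(self, n=1):
        # Count n requests against the current second.
        self.slots[self.pos] += n

    def advance(self):
        # Called once per second: move to the next slot and overwrite the oldest data.
        self.pos = (self.pos + 1) % len(self.slots)
        self.slots[self.pos] = 0

    def rate(self, seconds):
        # Average req/sec over the most recent `seconds` slots, current one included.
        # Slots that were never written count as zero, so the value reads low
        # until the buffer has been running for at least `seconds`.
        seconds = min(seconds, len(self.slots))
        total = sum(self.slots[(self.pos - i) % len(self.slots)] for i in range(seconds))
        return total / seconds
```

With this, `rate(60)` and `rate(900)` give the one minute and 15 minute numbers, and the raw slots are still there if you want to look at individual seconds.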

With the exact metrics, we get the benefit of being able to get the per second data and look at peaks & valleys and measure them. With exponentially weighted moving average, we have a more immediate response to changes, but it is never actually accurate.

The exact approach is a bit more work, but it is much more understandable code. On the other hand, it can result in strangeness. If you have a burst of traffic, let’s say 1,000 requests per second for 3 seconds, then the average req/sec over the last minute will stay fixed at 50 req/sec for a whole minute. Which is utterly correct and completely misleading.
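Here is a small simulation of that burst, as a sketch. It assumes the burst is 1,000 requests in each of 3 consecutive seconds (3,000 in total), which is the reading that gives the 50 req/sec figure, followed by complete silence. The `rolling_average` function is just an illustrative name.

```python
from collections import deque


def rolling_average(traffic, window=60):
    """Return the exact average req/sec over the last `window` seconds, per second."""
    recent = deque([0] * window, maxlen=window)  # starts as a minute of silence
    averages = []
    for requests_this_second in traffic:
        recent.append(requests_this_second)  # the oldest second falls off the left
        averages.append(sum(recent) / window)
    return averages


# Assumed scenario: 3 seconds of 1,000 req/sec, then nothing at all.
traffic = [1000] * 3 + [0] * 60
averages = rolling_average(traffic)

# The one minute figure sits at 50 long after the traffic has stopped,
# and only starts dropping when the burst slides out of the window.
for second in (2, 30, 59, 60, 61, 62):
    print(second, round(averages[second], 2))
```

The number is exact at every step, but for most of that minute the server was doing nothing while the metric reported 50 req/sec.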

I’m not sure how to handle this specific scenario in a way that is both accurate and expected by the user.