A performance profile mystery: The cost of Stopwatch
Measuring the length of time that a particular piece of code takes is a surprisingly challenging task. There are two aspects to this. The first is how you ensure that the cost of getting the start and end times won't interfere with the work you are doing. The second is how to actually get the time (potentially many times a second) in as efficient a way as possible.
To give some context, Andrey Akinshin has a great overview of how the Stopwatch class works in C#. On Linux, that boils down to calling clock_gettime, except that it isn't actually a system call. It is a piece of code that the kernel maps into your process (the vDSO), which then integrates with the rest of the kernel so the time can be read without a kernel-mode transition. The idea is that this call is so frequent that you cannot afford to pay the cost of that transition. There is good coverage of this here.
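For reference, here is a minimal sketch of the timestamp API that Stopwatch is built on. The comments reflect the mechanism described above; DoSomeWork is just a placeholder for whatever you want to measure.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class TimestampSample
{
    static void Main()
    {
        // Stopwatch.GetTimestamp() returns the raw high-resolution tick count.
        // On Linux this resolves to clock_gettime served from the vDSO (no
        // kernel-mode transition); on Windows it is QueryPerformanceCounter.
        long start = Stopwatch.GetTimestamp();

        DoSomeWork();

        long end = Stopwatch.GetTimestamp();

        // Convert raw ticks to milliseconds using the advertised frequency.
        double elapsedMs = (end - start) * 1000.0 / Stopwatch.Frequency;
        Console.WriteLine($"Elapsed: {elapsedMs:F3} ms");
    }

    static void DoSomeWork()
    {
        // Placeholder for the code being measured.
        Thread.Sleep(10);
    }
}
```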
In short, that is a very well-known problem and quite a lot of brainpower has been dedicated to solving it. And then we reached this situation:
What you are seeing here is us testing the indexing process of RavenDB under the profiler. This is indexing roughly 100M documents, and according to the profiler, we are spending 15% of our time gathering metrics?
The StatsScope.Start() method simply calls Stopwatch.Start(), so we are basically looking at a profiler output that says that Stopwatch is accounting for 15% of our runtime?
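RavenDB's StatsScope is internal and its code isn't shown in the post; purely to illustrate the kind of thin wrapper the profiler is blaming, a hypothetical version might look roughly like this:

```csharp
using System;
using System.Diagnostics;

// Hypothetical sketch only - the real StatsScope in RavenDB is more involved.
// The point is that Start() and Stop() do little more than drive a Stopwatch.
public class StatsScope
{
    private readonly Stopwatch _sw = new Stopwatch();

    public void Start() => _sw.Start();

    public void Stop() => _sw.Stop();

    public TimeSpan Duration => _sw.Elapsed;
}
```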
Sorry, I don’t believe that. I mean, it is possible, but it seems far-fetched.
In order to test this, I wrote a very simple program that generates 100K integers and checks whether each of them is prime. That gives me compute-bound work, and it lets me compare calling Start() and Stop() once around the whole loop versus once per iteration.
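The benchmark code itself isn't shown in the post; a minimal sketch of the idea, assuming a naive trial-division IsPrime and simply testing the first 100K integers, might look like this:

```csharp
using System;
using System.Diagnostics;

class StopwatchOverheadBenchmark
{
    const int Count = 100_000;

    static void Main()
    {
        // Variant 1: one Stopwatch around the whole loop.
        var total = Stopwatch.StartNew();
        int primes = 0;
        for (int i = 2; i < Count + 2; i++)
        {
            if (IsPrime(i))
                primes++;
        }
        total.Stop();
        Console.WriteLine($"Whole loop: {total.ElapsedMilliseconds} ms ({primes} primes)");

        // Variant 2: Start()/Stop() around every single iteration.
        var perIteration = new Stopwatch();
        primes = 0;
        var outer = Stopwatch.StartNew();
        for (int i = 2; i < Count + 2; i++)
        {
            perIteration.Start();
            if (IsPrime(i))
                primes++;
            perIteration.Stop();
        }
        outer.Stop();
        Console.WriteLine($"Per iteration: {outer.ElapsedMilliseconds} ms ({primes} primes)");
    }

    // Naive trial division; the point is to have compute-bound work,
    // not an efficient primality test.
    static bool IsPrime(int n)
    {
        if (n < 2)
            return false;
        for (int i = 2; i * i <= n; i++)
        {
            if (n % i == 0)
                return false;
        }
        return true;
    }
}
```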
I ran that a few times and got:
- Windows: 311 ms with Stopwatch per iteration and 312 ms without
- Linux: 450 ms with Stopwatch per iteration and 455 ms without
On Linux, there is about 5 ms of overhead when we use a per-iteration stopwatch; on Windows, it is either the same cost or slightly cheaper with the per-iteration stopwatch.
Here is the profiler output on Windows:
And on Linux:
Now, that is what happens when we are doing a significant amount of work. What happens if the amount of work is negligible? I made the IsPrime() method very cheap, and I got:
So that is a good indication that this isn’t free, but still…
Comparing the costs, it is utterly ridiculous that the profiler says that so much time is spent in those methods.
Another aspect here may be the impact of the profiler itself. There are differences between using Tracing and Sampling methods, for example.
I don’t have an answer, just a lot of very curious questions.
Comments
No clue if this is it, but two possible guesses:

- possibly TieredPGO moved code around so that it looks like it's in Stopwatch methods when it's actually other code
- I've noticed that some profilers will assign prologue code of a function to the first line of the function, so it looks like the first line (which is starting the StatsScope) is doing a lot, when it's actually the prologue that's doing a lot, at least relatively. Similarly, epilogue code can end up being counted as the last line.

Possibly one or both together is confusing the profiler into assigning blame to the wrong method.
Is it tracing?
Stuart,
I agree, there is absolutely something funky going on here.
The fact that we aren't seeing that in Tracing mode, only in Sampling, also lends credence to this being a profiler issue rather than a real problem.

Mario,

We observed this with Sampling, but not with Tracing.

Isn't it because the sampling MUST do the sampling at a safe location? And a pseudo syscall is one such thing, so you end up with an overrepresentation for that reason.
Dennis,
I guess that is possible. We are using a lot of unsafe calls, etc., which may have an impact.
Did you try a different profiler, like perf, or did you run strace to count the number of system calls? Perhaps some things, like the mentioned kernel optimization, are disabled when running under the profiler?
This might be due to the clock/timer interrupt interval being 15 ms by default to save power, and that clock interrupt being the source of all time inside the OS.
Adam,
I think you are confusing the timer interrupt interval with the high-frequency timer. Those are actually different from one another.