#266 Eliminating A Source of Measurement Errors in Benchmarks

About six months ago, I did a presentation that talks about how to conduct reliable benchmarking in Go. Recently, I submitted an issue #41641 to the Go project, which is also a subtle problem that you might need to address in some cases.

The issue is all about the following code snippet:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
func BenchmarkAtomic(b *testing.B) {
var v int32
atomic.StoreInt32(&v, 0)
b.Run("with-timer", func(b *testing.B) {
for i := 0; i < b.N; i++ {
b.StopTimer()
// ... do extra stuff ...
b.StartTimer()
atomic.AddInt32(&v, 1)
}
})
atomic.StoreInt32(&v, 0)
b.Run("w/o-timer", func(b *testing.B) {
for i := 0; i < b.N; i++ {
atomic.AddInt32(&v, 1)
}
})
}

On my target machine (CPU Quad-core Intel Core i7-7700 (-MT-MCP-) speed/max 1341/4200 MHz Kernel 5.4.0-42-generic x86_64), running the snippet with the following command:

1
go test -run=none -bench=Atomic -benchtime=1000x -count=20 | tee b.txt && benchstat b.txt

The result shows:

1
2
3
name                         time/op
Atomic/with-timer-8 32.6ns ± 7%
Atomic/w/o-timer-8 6.60ns ± 6%

Is it interesting to you? As you can tell, the measurement without introducing StopTimer/StartTimer pair is 26ns faster than the one with the StopTimer/StartTimer pair. How is this happening?

To learn more reason behind it, let's modify the benchmark a little bit:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
func BenchmarkAtomic(b *testing.B) {
var v int32
var n = 1000000
for k := 1; k < n; k *= 10 {
b.Run(fmt.Sprintf("n-%d", k), func(b *testing.B) {
atomic.StoreInt32(&v, 0)
b.Run("with-timer", func(b *testing.B) {
for i := 0; i < b.N; i++ {
b.StopTimer()
b.StartTimer()
for j := 0; j < k; j++ {
atomic.AddInt32(&v, 1)
}
}
})
atomic.StoreInt32(&v, 0)
b.Run("w/o-timer", func(b *testing.B) {
for i := 0; i < b.N; i++ {
for j := 0; j < k; j++ {
atomic.AddInt32(&v, 1)
}
}
})
})
}
}

This time, we use a loop of variable k to increase the number of atomic operations in the bench loop, that is:

1
2
3
for j := 0; j < k; j++ {
atomic.AddInt32(&v, 1)
}

Thus with higher k, the target code grows more costly. Using similar command:

1
go test -run=none -bench=Atomic -benchtime=1000x -count=20 | tee b.txt && benchstat b.txt

One can produce similar result as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
name                          time/op
Atomic/n-1/with-timer-8 34.8ns ±12%
Atomic/n-1/w/o-timer-8 6.44ns ± 1%
Atomic/n-10/with-timer-8 74.3ns ± 5%
Atomic/n-10/w/o-timer-8 47.6ns ± 3%
Atomic/n-100/with-timer-8 488ns ± 7%
Atomic/n-100/w/o-timer-8 456ns ± 2%
Atomic/n-1000/with-timer-8 4.65µs ± 3%
Atomic/n-1000/w/o-timer-8 4.63µs ±12%
Atomic/n-10000/with-timer-8 45.4µs ± 4%
Atomic/n-10000/w/o-timer-8 43.5µs ± 1%
Atomic/n-100000/with-timer-8 444µs ± 1%
Atomic/n-100000/w/o-timer-8 432µs ± 0%

What's interesting in the modified benchmark result is that: By testing target code with a higher cost, the difference between with-timer and w/o-timer gets much closer. More specifically, in the last pair of outputs, when n=100000, the measured atomic operation only has (444µs-432µs)/100000 = 0.12 ns time difference, which is pretty much accurate other than the error (34.8ns-6.44ns)/1 = 28.36 ns when n=1.

Why? There are two ways to trace the problem down to the bare bones.

Initial Investigation Using go tool pprof

As a standard procedure, let's benchmark the code that interrupts the timer and analysis profiling result using go tool pprof:

1
2
3
4
5
6
7
8
9
10
func BenchmarkWithTimer(b *testing.B) {
var v int32
for i := 0; i < b.N; i++ {
b.StopTimer()
b.StartTimer()
for j := 0; j < *k; j++ {
atomic.AddInt32(&v, 1)
}
}
}
1
go test -v -run=none -bench=WithTimer -benchtime=100000x -count=5 -cpuprofile cpu.pprof

Sadly, the graph shows a chunk of useless information where most of the costs shows as runtime.ReadMemStats:

pprof
pprof

This is because of the StopTimer/StartTimer implementation in the testing package calls runtime.ReadMemStats:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
package testing

(...)

func (b *B) StartTimer() {
if !b.timerOn {
runtime.ReadMemStats(&memStats) // <- here
b.startAllocs = memStats.Mallocs
b.startBytes = memStats.TotalAlloc
b.start = time.Now()
b.timerOn = true
}
}

func (b *B) StopTimer() {
if b.timerOn {
b.duration += time.Since(b.start)
runtime.ReadMemStats(&memStats) // <- here
b.netAllocs += memStats.Mallocs - b.startAllocs
b.netBytes += memStats.TotalAlloc - b.startBytes
b.timerOn = false
}
}

As we know that runtime.ReadMemStats stops the world, and each call to it is very time-consuming. Yes, this is yet another known issue #20875 regarding reduce runtime.ReadMemStats overhead in benchmarking.

Since we do not care about memory allocation at the moment, to avoid this issue, one could just hacking the source code by comment out the call to runtime.ReadMemStats:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
package testing

(...)

func (b *B) StartTimer() {
if !b.timerOn {
// runtime.ReadMemStats(&memStats) // <- here
b.startAllocs = memStats.Mallocs
b.startBytes = memStats.TotalAlloc
b.start = time.Now()
b.timerOn = true
}
}

func (b *B) StopTimer() {
if b.timerOn {
b.duration += time.Since(b.start)
// runtime.ReadMemStats(&memStats) // <- here
b.netAllocs += memStats.Mallocs - b.startAllocs
b.netBytes += memStats.TotalAlloc - b.startBytes
b.timerOn = false
}
}

If we re-run the test again, the pprof shows us:

pprof
pprof

Have you noticed where the problem is? Obviously, there is a heavy cost while calling time.Now() in a tight loop (not really surprising because it is a system call).

Further Verification Using C++

As we discussed in the previous section, the Go's pprof facility has its own problem while executing a benchmark, one can only edit the source code of Go to verify the source of the measurement error. You might want ask: Can we do something better than that? The answer is yes.

Let's write the initial benchmark in C++. This time, we go straightforward to the issue of calling now():

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#include <iostream>
#include <chrono>

void empty() {}

int main() {
int n = 1000000;
for (int j = 0; j < 10; j++) {
std::chrono::nanoseconds since(0);
for (int i = 0; i < n; i++) {
auto start = std::chrono::steady_clock::now();
empty();
since += std::chrono::steady_clock::now() - start;
}
std::cout << "avg since: " << since.count() / n << "ns \n";
}
}

compile it with:

1
clang++ -std=c++17 -O3 -pedantic -Wall main.cpp

In this code snippet, we are trying to measure the performance of an empty function. Ideally, the output should be 0ns. However, there is still a cost in calling the empty function:

1
2
3
4
5
6
7
8
9
10
avg since: 17ns
avg since: 16ns
avg since: 16ns
avg since: 16ns
avg since: 16ns
avg since: 16ns
avg since: 16ns
avg since: 16ns
avg since: 16ns
avg since: 16ns

Furthermore, we could also simplify the code to the subtraction of two now() calls:

1
2
3
4
5
6
7
8
9
10
11
12
13
#include <iostream>
#include <chrono>

int main() {
int n = 1000000;
for (int j = 0; j < 10; j++) {
std::chrono::nanoseconds since(0);
for (int i = 0; i < n; i++) {
since -= std::chrono::steady_clock::now() - std::chrono::steady_clock::now();
}
std::cout << "avg since: " << since.count() / n << "ns \n";
}
}

and you will see that the output remains end in the cost of avg since: 16ns. This proves that there is an overhead of calling now() for benchmarking.

Thus, in terms of benchmarking, the actual measured time of a target code equals to the execution time of the target code plus the overhead of calling now(), as showed in the figure below.

Mathematically speaking, assume the target code consumes in T ns, and the overhead of now() is t ns. Now, let's run the target code N times. The total measured time is T*N+t, then the average of a single iteration of the target code is T+t/N. Thus, the systematic measurement error becomes: t/N. Therefore, with a higher N, the systematic measurement error gets smaller.

The Solution

Back to the original question, how can we avoid the measurement error? A quick and dirty solution is subtract the now()'s overhead:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#include <iostream>
#include <chrono>

void target() {}

int main() {
int n = 1000000;
for (int j = 0; j < 10; j++) {
std::chrono::nanoseconds since(0);
for (int i = 0; i < n; i++) {
auto start = std::chrono::steady_clock::now();
target();
since += std::chrono::steady_clock::now() - start;
}

auto overhead = -(std::chrono::steady_clock::now() -
std::chrono::steady_clock::now());
since -= overhead * n;

std::cout << "avg since: " << since.count() / n << "ns \n";
}
}

And in Go, you could do something like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
var v int32
atomic.StoreInt32(&v, 0)
r := testing.Benchmark(func(b *testing.B) {
for i := 0; i < b.N; i++ {
b.StopTimer()
// ... do extra stuff ...
b.StartTimer()
atomic.AddInt32(&v, 1)
}
})

// do calibration that removes the overhead of calling time.Now().
calibrate := func(d time.Duration, n int) time.Duration {
since := time.Duration(0)
for i := 0; i < n; i++ {
start := time.Now()
since += time.Since(start)
}
return (d - since) / time.Duration(n)
}

fmt.Printf("%v ns/op\n", calibrate(r.T, r.N))

As a take-away message, if you are writing a micro-benchmark (whose runs in nanoseconds), and you must interrupt the timer to clean up and reset some resources for some reason, then you must do a calibration on the measurement.

If the Go's benchmark facility addresses the issue internally, then it is great; but if they don't, at least you are aware of this issue and know how to fix it now.

Further Reading Suggestions

  • Changkun Ou. Conduct Reliable Benchmarking in Go. March 26, 2020. https://golang.design/s/gobench
  • Changkun Ou. testing: inconsistent benchmark measurements when interrupts timer. Sep 26, 2020. https://golang.org/issue/41641
  • Josh Bleecher Snyder. testing: consider calling ReadMemStats less during benchmarking. Jul 1, 2017. https://golang.org/issue/20875
  • Beyer, D., Löwe, S. & Wendler, P. Reliable benchmarking: requirements and solutions. Int J Softw Tools Technol Transfer 21, 1–29 (2019). https://doi.org/10.1007/s10009-017-0469-y
打赏催更