A microbenchmark measures the performance of a small piece of code, usually to guide performance optimization. Writing a good microbenchmark is hard, which is why we should use a microbenchmark framework (e.g. JMH for Java or Google Benchmark for C++) to help us. This post collects microbenchmarks that I find interesting.
Don’t use the performance numbers in this post directly; do your own measurements! These numbers are highly dependent on the environment in which the benchmarks run.
#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstring>

// BenchmarkFixture is a benchmark::Fixture subclass; its definition is not shown here.

// Fill a 2048-byte buffer one element at a time.
BENCHMARK_DEFINE_F(BenchmarkFixture, ForLoopAssignmentBenchmark)(benchmark::State& state) {
  for (auto _ : state) {
    char c[2048];
    for (size_t i = 0; i < 2048; ++i) {
      c[i] = 'a';
    }
    benchmark::DoNotOptimize(c);
  }
}
BENCHMARK_REGISTER_F(BenchmarkFixture, ForLoopAssignmentBenchmark);

// Fill the same buffer with a single memset call.
BENCHMARK_DEFINE_F(BenchmarkFixture, MemsetBenchmark)(benchmark::State& state) {
  for (auto _ : state) {
    char c[2048];
    memset(c, 'a', 2048);
    benchmark::DoNotOptimize(c);
  }
}
BENCHMARK_REGISTER_F(BenchmarkFixture, MemsetBenchmark);
Run on (12 X 2500 MHz CPU s)
------------------------------------------------------------------------------------
Benchmark                                             Time          CPU   Iterations
------------------------------------------------------------------------------------
BenchmarkFixture/ForLoopAssignmentBenchmark        1013 ns      1013 ns       750073
BenchmarkFixture/MemsetBenchmark                     84 ns        84 ns      6209926
As we can see, memset is much faster than the for-loop assignment in this case (84 ns versus 1013 ns, roughly 12x). Looking at the generated assembly, the memset version uses the rep stos instruction, which may be the reason it is faster.
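Before comparing timings like the ones above, it is worth confirming that both variants do the same work. The following is a minimal sketch of such a check, independent of Google Benchmark; the helper names (FillWithLoop, FillWithMemset, FillsAgree) are hypothetical and not part of the original benchmark code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kBufferSize = 2048;  // same size as in the benchmarks above

// Hypothetical helper: fills the buffer one byte at a time,
// mirroring the body of ForLoopAssignmentBenchmark.
void FillWithLoop(char* buf, std::size_t n, char value) {
  assert(buf != nullptr);
  for (std::size_t i = 0; i < n; ++i) {
    buf[i] = value;
  }
}

// Hypothetical helper: fills the buffer with a single memset call,
// mirroring the body of MemsetBenchmark.
void FillWithMemset(char* buf, std::size_t n, char value) {
  assert(buf != nullptr);
  std::memset(buf, value, n);
}

// Returns true when both strategies leave identical bytes in their buffers.
bool FillsAgree(char value) {
  char a[kBufferSize];
  char b[kBufferSize];
  FillWithLoop(a, kBufferSize, value);
  FillWithMemset(b, kBufferSize, value);
  return std::memcmp(a, b, kBufferSize) == 0;
}
```

Calling FillsAgree('a') from a test or from main should return true. To look at the instructions your own compiler emits for each helper, compiling with something like g++ -O2 -S is one option; the result may well differ from the rep stos code mentioned above, depending on compiler, flags, and target.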
// https://shipilev.net/blog/2014/nanotrusting-nanotime/
#include <benchmark/benchmark.h>
#include <cstdint>
#include <time.h>

// NS_PER_SEC was not defined in the original listing; this is the obvious value.
constexpr uint64_t NS_PER_SEC = 1000000000ULL;

inline uint64_t NanosecondsSinceEpoch() {
  timespec tp;
  clock_gettime(CLOCK_REALTIME, &tp);
  return static_cast<uint64_t>(tp.tv_sec) * NS_PER_SEC + static_cast<uint64_t>(tp.tv_nsec);
}

// Latency: the cost of a single call.
BENCHMARK_DEFINE_F(BenchmarkFixture, NanoTimeLatencyBenchmark)(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(NanosecondsSinceEpoch());
  }
}
BENCHMARK_REGISTER_F(BenchmarkFixture, NanoTimeLatencyBenchmark);

// Granularity: how long it takes until the returned value changes.
BENCHMARK_DEFINE_F(BenchmarkFixture, NanoTimeGranularityBenchmark)(benchmark::State& state) {
  uint64_t cur_nano = 0;
  uint64_t last_nano = 0;
  for (auto _ : state) {
    do {
      cur_nano = NanosecondsSinceEpoch();
    } while (cur_nano == last_nano);
    last_nano = cur_nano;
  }
}
BENCHMARK_REGISTER_F(BenchmarkFixture, NanoTimeGranularityBenchmark);
Run on (12 X 2500 MHz CPU s)
------------------------------------------------------------------------------------
Benchmark                                             Time          CPU   Iterations
------------------------------------------------------------------------------------
BenchmarkFixture/NanoTimeLatencyBenchmark            39 ns        39 ns     18361529
BenchmarkFixture/NanoTimeGranularityBenchmark        39 ns        39 ns     18073581
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

public class NanoTimeBenchmark {
  @State(Scope.Benchmark)
  public static class BenchmarkState {
    public long _lastNano;

    @Setup(Level.Iteration)
    public void setup() {
      _lastNano = 0;
    }
  }

  // Latency: the cost of a single System.nanoTime() call.
  @Benchmark
  @Fork(1)
  @Threads(1)
  @Warmup(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
  @Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public long nanoTimeLatency() {
    return System.nanoTime();
  }

  // Granularity: how long it takes until System.nanoTime() returns a new value.
  @Benchmark
  @Fork(1)
  @Threads(1)
  @Warmup(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
  @Measurement(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
  @BenchmarkMode(Mode.AverageTime)
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  public long nanoTimeGranularity(BenchmarkState benchmarkState) {
    long cur;
    do {
      cur = System.nanoTime();
    } while (cur == benchmarkState._lastNano);
    benchmarkState._lastNano = cur;
    return cur;
  }
}
Benchmark                              Mode  Cnt   Score   Error  Units
NanoTimeBenchmark.nanoTimeLatency      avgt   10  42.705 ± 0.302  ns/op
NanoTimeBenchmark.nanoTimeGranularity  avgt   10  43.875 ± 0.674  ns/op
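This shows the overhead of getting the time in both C++ and Java: roughly 40 ns per call on this machine. These calls are not free! In both languages the granularity result is about the same as the latency result, which suggests that consecutive calls almost always return distinct values, so the granularity we can observe is bounded by the cost of the call itself.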
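The granularity idea can also be tried outside a benchmark framework. The sketch below is only an illustration under a few assumptions: it swaps clock_gettime(CLOCK_REALTIME) for std::chrono::steady_clock so that it is portable and readings never go backwards, and the function names NowNs and SmallestObservedStepNs are hypothetical. It has no warmup or statistics, so treat its result as a rough sanity check rather than a measurement.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: current steady_clock reading in nanoseconds.
// Assumption: steady_clock replaces the CLOCK_REALTIME clock used above.
std::uint64_t NowNs() {
  auto since_epoch = std::chrono::steady_clock::now().time_since_epoch();
  return static_cast<std::uint64_t>(
      std::chrono::duration_cast<std::chrono::nanoseconds>(since_epoch).count());
}

// Hypothetical helper: spins until the clock value changes, `samples` times,
// and returns the smallest step seen between two distinct readings.
// Returns UINT64_MAX when samples <= 0.
std::uint64_t SmallestObservedStepNs(int samples) {
  std::uint64_t smallest = UINT64_MAX;
  std::uint64_t last = NowNs();
  for (int i = 0; i < samples; ++i) {
    std::uint64_t cur;
    do {
      cur = NowNs();
    } while (cur == last);
    std::uint64_t step = cur - last;  // steady_clock is monotonic, so cur > last here
    if (step < smallest) {
      smallest = step;
    }
    last = cur;
  }
  return smallest;
}
```

For example, printing SmallestObservedStepNs(100000) from main gives the smallest step seen on your machine. Note that this reports a minimum, while the benchmarks above report an average time per iteration, so the two numbers are not directly comparable.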