Update August 9: urikanegun has kindly contributed a Japanese translation of this article.

Developers love speed, so developers love benchmarks. Benchmarks on programming language performance, app server performance, JavaScript engine performance, etc. have always attracted a lot of attention. However, there are lots of caveats involved in running a good benchmark. One of those caveats is benchmark stability: if you run a benchmark multiple times, the timings usually differ a bit. A lot of people tend to hand-wave this caveat away by just shutting down all apps, rerunning the benchmark a few times and averaging the results. Is that truly good enough?

Lately, I have been researching the topic of benchmark stability because I am interested in creating reliable benchmarks that third parties can reproduce and verify for themselves (for example, users of my software checking that my benchmark results hold up). This research led me to Victor Stinner, a Python core developer who has been focusing on improving Python 3 performance for several years.

Stinner recently gave a talk on this topic at PyCon 2017 (with an excellent summary by Linux Weekly News). Stinner needed a way to test whether optimizations really helped, and whether performance regressions had been introduced. He needed this in a form that could be integrated into a continuous integration tool, so that performance changes could be tracked reliably over time and across commits. To that end, he developed a benchmarking methodology, as well as a benchmarking framework named perf.

Stinner's methodology can be summarized as:

  • Warm up the benchmark
  • Combine time budgets with iterations
  • Use multiple processes, in a sequential manner
  • Investigate large deviations
  • Tune the system at the kernel and hardware level

System tuning work consists of:

  • Use CPU affinity
  • Run the kernel scheduler on a dedicated core
  • Set the CPU to a constant, minimum clock speed
  • Use dedicated hardware

Let's take a closer look at his methodology and tuning strategies. I have not yet put the methodology into practice myself, although I hope to get the chance to do so soon. In the meantime, I will add my own analysis along the way.

Warm up the benchmark

This one is pretty obvious and standard:

Stinner considers the first benchmark iteration a warmup round, and discards its results. The idea is to let the warmup round initialize lazily-loaded stuff and fill the necessary caches (whether those are CPU caches, I/O caches, JIT caches or whatever). In some cases one warmup round may not be enough, so Stinner's perf tool allows configuring the number of warmup rounds.
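
To make the idea concrete, here is a rough sketch of a warmup step in plain Python (this is only an illustration of the concept, not Stinner's perf API):

import time

def run_benchmark(func, iterations, warmups=1):
    # Warmup rounds: run the workload but throw away the timings, so that
    # lazily-loaded code paths and caches are primed before we measure.
    for _ in range(warmups):
        func()

    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings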

Combine time budgets with iterations

Choosing the number of iterations to run in the benchmark is not trivial. Having too few iterations means that the benchmark may produce unreliable results, while having too many iterations means that the benchmark takes too much time, which negatively impacts developer motivation.
Stinner's methodology embraces the concept of time budgets (also known as "stopwatch benchmarking"). Instead of choosing a number of iterations up front, he runs the benchmark for a reasonable amount of time and observes the number of iterations. He then locks down the test to that number of iterations, repeats it several more times (see next section), and averages the results.
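
A rough sketch of that calibration step (again just an illustration of the idea, not perf's actual implementation):

import time

def calibrate_iterations(func, time_budget=1.0):
    # Run the workload for roughly time_budget seconds and count how many
    # iterations fit. That count is then locked down and reused for all
    # subsequent benchmark runs, so that runs remain comparable.
    iterations = 0
    deadline = time.perf_counter() + time_budget
    while time.perf_counter() < deadline:
        func()
        iterations += 1
    return max(iterations, 1)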

Use multiple processes, in a sequential manner

Instead of running the benchmark inside a single process only, Stinner repeats his benchmark in 20 different processes, which run sequentially. Each process run performs the benchmark with the number of iterations calculated from the time budget described in the previous section. The results from all process runs are then averaged. Stinner does this because each process has some amount of random state, which can affect performance.

For example, Address Space Layout Randomization (ASLR) is a commonly-employed technique for preventing certain classes of security exploits, but this randomization can affect the locality of code and data, which impacts benchmark performance. Another example is the hash function seed used by Ruby and Python: in order to mitigate attacks that exploit hash table collisions, their hash functions depend on a seed that is randomly chosen during startup. This results in variable hash table performance.

Disabling ASLR, random hash table seeds, and other similar techniques is not a good idea. Not only does disabling them make systems potentially more vulnerable, it is also not representative of real-world performance. Furthermore, the aforementioned random factors are just the tip of the iceberg and there are many more. Stinner accounts for such variations in performance by averaging the results over multiple process runs.
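
A simplified sketch of such an outer loop, assuming a hypothetical bench_worker.py script that runs the locked-down number of iterations and prints its mean timing on stdout:

import statistics
import subprocess
import sys

def bench_in_processes(script="bench_worker.py", processes=20):
    # Each child process starts with its own ASLR layout and hash seed, so
    # averaging over many sequential runs smooths out that randomness.
    results = []
    for _ in range(processes):
        completed = subprocess.run(
            [sys.executable, script],
            check=True, capture_output=True, text=True,
        )
        results.append(float(completed.stdout.strip()))
    return statistics.mean(results)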

Investigate large deviations

Stinner not only looks at the average, but also at the standard deviation and at how much various metrics deviate from the average. If the standard deviation is too high, or if the maximum or minimum deviates too much from the average, then Stinner considers the results invalid and proceeds to investigate what causes the deviations. His perf tool prints a warning if the standard deviation is greater than 10% of the average, or if the maximum/minimum deviate from the average by at least 50%.
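
Those thresholds are easy to reproduce with Python's statistics module; a minimal sketch:

import statistics

def stability_warnings(timings):
    # Mirror the thresholds described above: warn if the standard deviation
    # exceeds 10% of the mean, or if the minimum/maximum deviate from the
    # mean by 50% or more.
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings)
    warnings = []
    if stdev > 0.10 * mean:
        warnings.append("standard deviation exceeds 10% of the mean")
    if mean - min(timings) >= 0.50 * mean or max(timings) - mean >= 0.50 * mean:
        warnings.append("min/max deviate from the mean by 50% or more")
    return warnings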

Use CPU affinity

The operating system can schedule a process or a thread on any CPU core at any time. Every time a process or thread moves to a different CPU core, benchmark stability suffers because the new core does not have the same data in its L1 cache as the old core. By the time the process or thread is scheduled back on the old core, that core's cache may already have been polluted by other work scheduled there. So it is important to:

  • ensure that a process/thread "sticks" to a certain CPU core.
  • ensure that a CPU core only runs one thread (and thus only one process).

This is called CPU affinity.

A benchmark should be run on a system with a sufficient number of real CPU cores. HyperThreading does not give you extra "real CPU cores": two hyperthreads share the same physical core (and thus the same L1 cache), whereby the CPU sort-of time slices the core based on which hyperthread is waiting for RAM.

Linux provides multiple ways to set CPU affinity. Two commonly used ones are the taskset command, which pins a process to a given set of cores, and the isolcpus kernel boot parameter, which keeps cores free of other work so that only explicitly pinned processes run on them.
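
Besides pinning a benchmark externally (for example with taskset -c 3 python bench.py), a process can also pin itself from within Python via os.sched_setaffinity, which is available on Linux; a minimal sketch:

import os

def pin_to_core(core=3):
    # Pin the current process (pid 0 means "this process") to a single core,
    # so the benchmark cannot migrate between cores mid-run. Linux-only.
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)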

Run the kernel scheduler on a dedicated core

The Linux kernel has a scheduler tick: an interrupt that fires up to 1000 times per second on each CPU core in order to perform time slicing. This is bad for benchmark stability because every time the tick fires, the CPU core context switches into the kernel, which pollutes the CPU caches (the kernel has to load its own code and data structures into memory). It is therefore recommended to configure the kernel to limit the scheduler tick to a core that isn't used for running benchmarks.[1][2][3]

This can be done with two kernel command line options: nohz_full and rcu_nocbs. Both are documented in the kernel's NO_HZ.txt document. The former enables "full tickless" mode on the listed cores, while the latter offloads RCU callback processing away from them. Both options have to be used together because of a compatibility problem with Linux's Intel P-state power saving driver, as described on Stinner's blog and as reported on Red Hat's Bugzilla.
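
Since these are boot-time options, they end up on the kernel command line; for example, nohz_full=1-3 rcu_nocbs=1-3 makes cores 1 through 3 tickless and offloads their RCU callbacks (the exact core list depends on your machine). A quick way to check whether the running kernel was booted with such options:

def kernel_cmdline_has(*options):
    # Check the running kernel's boot parameters for the given options,
    # e.g. kernel_cmdline_has("nohz_full=", "rcu_nocbs="). Linux-only.
    with open("/proc/cmdline") as f:
        cmdline = f.read()
    return {option: option in cmdline for option in options}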

Set the CPU to a constant, minimum clock speed

Modern CPUs do not have a constant clock speed (or frequency). They change their clock speed frequently in order to maximize performance when needed, but save energy when there is little work to be done. As a result, performance on a system can vary wildly based on the workload and when the workload is run. Furthermore, temperature also plays a big role: modern CPUs have lots of mechanisms to prevent overheating. If the CPU gets too hot then it will dial back its own performance.

The following technologies and mechanisms govern clock speed scaling: Intel Turbo Boost (the modern equivalent of the CPU turbo button, but automatic), C-states (idle power-saving states) and P-states (performance states, i.e. frequency/voltage operating points). These are described on Stinner's blog: part 1 and part 2.

All of this behavior is detrimental to benchmark stability, so Stinner's methodology is to set the CPU clock speed to the lowest possible value and to disable all clock speed autoscaling mechanisms. He picks a constant minimum rather than a constant maximum in order to avoid heat buildup (which would trigger thermal throttling and hurt stability). After all, the raw performance of a benchmark doesn't matter: what matters is that one benchmark result can be reliably compared to another.

Setting a CPU's clock speed to the minimum value can be done at the kernel level with:

cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq

However, this setting does not persist between reboots, and userspace tools such as cpufrequtils also manage it. It therefore seems best to configure cpufrequtils instead of writing to the kernel interface directly. On Debian, cpufrequtils can be configured by editing /etc/default/cpufrequtils and setting GOVERNOR=powersave. The Red Hat Power Management Guide explains what the different cpufrequtils governor values mean.
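
If you do want to experiment with the sysfs approach anyway, note that the command above only affects cpu0. A small sketch that applies the same write to every core (it must run as root, assumes the cpufreq sysfs interface is present, and still does not persist across reboots):

import glob

def clamp_all_cores_to_min_freq():
    # For every core, copy cpuinfo_min_freq into scaling_max_freq so that
    # the core cannot clock above its minimum frequency.
    for cpufreq_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq"):
        with open(cpufreq_dir + "/cpuinfo_min_freq") as f:
            min_freq = f.read().strip()
        with open(cpufreq_dir + "/scaling_max_freq", "w") as f:
            f.write(min_freq)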

Use dedicated hardware

Running a benchmark locally on your laptop is not good for benchmark stability, because other apps can interfere. Even if you close all apps, it is still not good enough because some things — which are unnecessary for the benchmark but necessary for normal laptop use — still need to run in the background. The various kernel- and hardware-level tunings described above cannot always be conveniently performed on your laptop. For example, my Mac doesn't run Linux, and I'm not keen on figuring out how to do that just so I can run stable benchmarks.

Another problem is that a laptop setup isn't reproducible: other people cannot easily get an exact replica of my laptop's hardware and software in order to reproduce the benchmark.

Your laptop is convenient though because it allows a short development-testing cycle. Buying and setting up dedicated hardware is expensive and time-consuming. Cloud servers and virtual machines are not ideal either because they cannot guarantee stable performance. Other VMs running on the same host system will leech away CPU and will cause hypervisor context switches. The way that a lot of cloud providers provide burstable CPU performance (giving your VM more CPU when other VMs are idle) is great for production environments, but bad for benchmark stability. There are also various hardware-level tunings that cannot be performed on cloud servers because they are virtualized.

In search of a viable hosting provider

So what do we do then? I have found the following options that seem reasonably achievable, convenient and cost-effective:

  • Amazon Dedicated Hosts. Amazon allows you to launch EC2 instances on defined dedicated hardware, which is billed on an hourly basis. They don't allow you direct access to the hardware because the instances are still virtualized, but it comes pretty close.

    Pros:

    • Hourly billing.
    • Good automation support through their API.

    Cons:

    • Unable to set CPU affinity to specific physical CPU cores because the CPU is still virtualized. I don't know how big of a problem this is; it's something to investigate in the future.
  • Hetzner is a budget dedicated hardware hosting provider. You can get a dedicated server for as little as 39 EUR/mo (though there is a ~79 EUR setup fee).

    Pros:

    • Not hourly billing, but still cheap.
    • Full access to underlying hardware. You can set CPU affinity, disable CPU frequency scaling, etc.

    Cons:

    • Depending on the exact server model, there is a setup fee.
    • Monthly commitment. Not bad but not as good as hourly billing.
    • Hetzner regularly changes their hardware offerings, making older hardware models unavailable. This is a problem for third parties who want to reproduce a benchmark on the exact same hardware.
    • Limited automation possible. Their API is not as good as Amazon's.
  • RackSpace OnMetal provides dedicated hardware with hourly billing. I haven't been able to try this out because of a problem with their control panel, so I can't tell whether they allow true access to dedicated hardware or whether there's still a virtualization layer.

  • Scaleway Bare Metal Cloud provides dedicated x86_64 (Intel Atom) and ARM servers.

    Pros:

    • Hourly billing.
    • Automatable through their API and CLI tools.
    • Very very cheap.

    Cons:

    • I have not had a good experience with their ARM hardware: the servers seem to crash in strange ways.

    Potential pitfalls:

    • It is not clear whether their x86_64 servers are virtualized in any way, so I don't know whether it's possible to set CPU affinity to physical cores.
    • Their x86_64 offerings use Atom CPUs. I do not have any experience with ARM. At first glance, it appears that ARM does not do CPU frequency scaling (or at least, it's not configurable), but who knows?
    • Their reliability in general is pretty dodgy. I would not recommend them (yet?) for serious production workloads, but running benchmarks is fine.

Conclusion

Stinner's benchmark methodology does not involve a lot of statistical techniques; it is mostly system tuning. Stinner checks the standard deviation of the benchmark results, and if the deviation is too big he discards the results and proceeds to investigate what needs tuning in order to make the benchmark more stable. A large part of his methodology focuses on the operating system and hardware level. Benchmarks are run in multiple processes in order to cope with the effects of ASLR and random seeds; the kernel's CPU settings are tuned; and the CPU's clock speed is tuned.

Most of this methodology is codified in Stinner's perf tool. Stinner reported that his methodology has helped him a lot with obtaining stable and reliable results. This not only led to a new Python benchmark suite, but also to a continuous integration system where the benchmark results are tracked over time.

I have not yet put this methodology into practice, but I will definitely do it next time I run a benchmark and I will report on my practical findings.

Are you also interested in benchmarking? Let me know whether this article has helped you and whether you have any insights to share. Good luck and happy benchmarking.

Sources and further reading