The field computer science is surprisingly lax with statistics, when it comes to benchmarking. When you see a variance or standard deviation, rejoice! Usually it is just an average and a claim like "bigger, thus better". Nobody seems to talk about the null hypothesis and significance. Here I want to provide a simple schema to make better benchmarks.

Usually there is some baseline. Then someone changed something and wants to compare. E.g. a new compiler optimization, a different data structure, or another application. So let's just run the benchmark five times with A and B each, average the score and look which one is better.

An example: A = {9,7,4,6,3} and B = {2,3,3,8,4}. Could be the number of seconds our test ran. A is the baseline and B our optimized version.

The averages are 5.80 and 4.00, so B seems to be 45% faster and thus better. Good.

Wait a moment. For serious statistics you need to include standard deviations. The stdevs are 2.39 and 2.35. You could say A is 5.80±2.39 on average. B is more than a standard deviation faster. Looks good? Well, the stdevs overlap (5.80-2.39 > 4.00+2.35), so it does not look that good, but still an improvement, isn't it?

Let's do a significance test. This means we assume the null hypothesis is true and then compute the probability (p-value) that we observe our data. The null hypothesis traditionally is "no difference". In other words, we assume that our optimizations are useless and compute the probability that we still see the results above. If this probability is low (e.g. less than 5%), we conclude we "significantly optimized".

A common approach is Student's T-Test. That leads to the question "which variant of the t-test?" First, we are "unpaired" (aka independent). The 9 and 2 have no special connection. We could randomly permute our measurements of each set. In other words, the order of our measurements is irrelevant. Second, we have "equal variance", because both our measurements have the same number of results. Third, we are "two-tailed", because our averages can deviate in both directions.

Using LibreOffice's TTEST(A,B,'two-tailed','unpaired,equal variance'), we compute a p-value of 0.26. Unfortunately, 26% is way above the usual 5% barrier to reject the null hypothesis. Well, you are allowed say "I am 74% sure my optimization makes a difference". You are not allowed to say "my optimization significantly improves".

What can you do? Use more runs for your benchmark. Let us assume 15 instead of 5 runs, which accidentally just repeat the 5 runs 3 times. Now LibreOffice computes a p-value of 0.01, which is quite good.

You can download my example as LibreOffice document or as IPython Notebook (view online). Now go and do better benchmarking!