Evaluating randomised tests
Having run a randomised test on a service or product for some time, we may see that version B produces 3% better results on average than version A. But how can we be certain that this is a significant difference? That it is statistically significant? The ideas below helped me understand what's behind this question.
A randomised test usually means that some users see a different version of the product than others. The success of both versions is then measured by a score we give to each user (e.g. how much money they ultimately spend). We see a difference in the overall average (or arithmetic mean) of these scores, but it may not be caused by the different versions.
The reason for this is simple. Each user has a different score, and from the whole pool of users we have selected a subset (e.g. those who were shown version A) for which we calculated the mean score. This mean will vary slightly from subset to subset, and that variation alone could explain the difference between the mean scores of users shown version A and those shown version B. Our goal, therefore, is to decide whether this could be the explanation.
To do this, we assume that this is indeed the case and that the difference between the means is caused only by this normal variation between subsets. This is our null hypothesis, and it would mean that the two subsets of users came from the same distribution and that the different versions of the product shown to them made no difference whatsoever. So we pool the users who were shown version A (let's say there were n of them) and those shown version B (m of them), and from this combined set we randomly select subsets of n and m users with replacement. This process is called bootstrapping, and it results in many pairs of subsets for which we calculate the difference between their mean scores.
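To make this concrete, here is a minimal sketch of the resampling step in C. It is only an illustration built on my own assumptions, not Manatee's actual code: the names resample_mean and bootstrap_diffs are invented for it, and the standard rand() stands in for whatever random number generator a real implementation would use.

```c
#include <stdlib.h>
#include <stddef.h>

/* Mean of one bootstrap sample: `count` scores drawn uniformly at random,
 * with replacement, from the pooled scores of both groups. Assumes the
 * caller has seeded rand() with srand(); the modulo draw is good enough
 * for a sketch. */
static double resample_mean(const double *pooled, size_t pooled_count,
                            size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += pooled[rand() % pooled_count];
    return sum / (double)count;
}

/* Fill `diffs` with `iterations` differences between the mean of a
 * resampled "B" group of m scores and that of a resampled "A" group of
 * n scores, both drawn from the same pool - i.e. the differences we
 * would expect to see under the null hypothesis. */
void bootstrap_diffs(const double *pooled, size_t n, size_t m,
                     double *diffs, size_t iterations)
{
    const size_t pooled_count = n + m;
    for (size_t i = 0; i < iterations; i++)
        diffs[i] = resample_mean(pooled, pooled_count, m)
                 - resample_mean(pooled, pooled_count, n);
}
```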
This gives us a number of differences that we expect to see just because the users are divided into two subsets. We can arrange them in a histogram and check where the difference we saw in our test falls on this histogram. If, say, it turns out that 70% of the differences we produced by bootstrapping are smaller than the difference given by the test, and 30% are larger, then it is entirely possible (and quite likely) that the difference in the test is caused merely by separating the users into subsets. In other words, we cannot reject the null hypothesis, and the difference in the test is not statistically significant.
However, if only 5% or fewer of the generated differences are greater than the difference in the test (5% being the usual benchmark), then the null hypothesis is rejected: it is very unlikely that the difference in the test could have occurred by chance alone, and we say that it is statistically significant. That is, showing the different versions to users mattered.
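Continuing the sketch above, the decision itself only requires counting how many of the generated differences reach the one observed in the test. Again, bootstrap_p_value and is_significant are names invented for this illustration, not part of Manatee.

```c
#include <stdbool.h>
#include <stddef.h>

/* Fraction of the bootstrapped differences that are at least as large as
 * the difference observed in the test - a one-sided p-value. */
double bootstrap_p_value(const double *diffs, size_t iterations,
                         double observed)
{
    size_t at_least_as_large = 0;
    for (size_t i = 0; i < iterations; i++)
        if (diffs[i] >= observed)
            at_least_as_large++;
    return (double)at_least_as_large / (double)iterations;
}

/* Reject the null hypothesis when 5% or fewer of the generated
 * differences reach the observed one. */
bool is_significant(double p_value)
{
    return p_value <= 0.05;
}
```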
There are many ways to calculate whether some result is statistically significant, but some rely on assumptions about the underlying distribution of scores. Actually performing the bootstrapping doesn't, and a simple C program, Manatee, can do just that - it's available on GitHub.