An A/B Test In Slow Motion: Before Your Experiment Becomes Significant
The final snapshot of an experiment result completely hides how lively its effects were along the way. In reality, any A/B test is quite dynamic, as anyone who watches experiments in real time will know. Understanding these subtle dynamics can be useful to anyone reading test results. So I've slowed down one such positive experiment and laid out its various time frames. My hope is that this sheds some light on what to expect from your data as you run A/B tests.
Note About The Experiment
One important note about this experiment: it is a positive one, based on the highly repeatable Canned Response pattern. We knew going in that a positive result was highly likely (multiple earlier positive experiments combine into a higher degree of prediction). If this were a flat experiment (with no effect, or only a tiny insignificant one), we could expect far more up and down movement.
- Effects Fluctuate - Especially In The Beginning
Early experiment results and their effects can be extremely misleading. Effects at this point fluctuate often, with what seems negative turning positive and vice versa. This is very normal behavior, which can also be seen in our experiment above. Comparing day 1 and day 4, our effects have completely reversed. What's more, in the early days of the experiment the effect range is also more drastic. In the first 10 days of the experiment we can observe effects ranging from -23% all the way to +17%, whereas between days 10 and 31 the effect stabilises between +13% and +20% (again, keep in mind that this is a positive experiment at the end of the day).
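To get a feel for this, here is a minimal simulation sketch. It does not use the data from the experiment above: the conversion rates, daily traffic, seed, and function names are all assumptions chosen for illustration. It tracks the cumulative relative effect day by day, which is the number that tends to swing early and settle later.

```python
import random


def relative_effect(conv_a, n_a, conv_b, n_b):
    """Relative lift of B over A (0.17 means +17%)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    return (rate_b - rate_a) / rate_a


def simulate_cumulative_effects(days=31, visitors_per_day=200,
                                rate_a=0.10, rate_b=0.117, seed=1):
    """Simulate a test with a true +17% lift; return the cumulative effect per day.

    All default values are hypothetical, not taken from the article's experiment.
    """
    rng = random.Random(seed)
    conv_a = conv_b = n_a = n_b = 0
    effects = []
    for _ in range(days):
        n_a += visitors_per_day
        n_b += visitors_per_day
        # Each visitor converts independently with the variation's true rate.
        conv_a += sum(rng.random() < rate_a for _ in range(visitors_per_day))
        conv_b += sum(rng.random() < rate_b for _ in range(visitors_per_day))
        # Guard against a zero control rate on very small samples.
        if conv_a == 0:
            effects.append(float("nan"))
        else:
            effects.append(relative_effect(conv_a, n_a, conv_b, n_b))
    return effects


if __name__ == "__main__":
    for day, effect in enumerate(simulate_cumulative_effects(), start=1):
        print(f"day {day:2d}: {effect:+.1%}")
```

Running it with different seeds should show the same general shape: wide swings while the sample is small, then a narrower band around the true effect.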
- P-Values Fluctuate - Also More In The Beginning
At the beginning of any experiment, it's also common for p-values to move up and down dramatically (reminder: a p-value is a measure of statistical significance; it is the probability of seeing a difference at least this large if A and B were truly no different, so the lower the number, the stronger the evidence that they differ). A decreasing p-value (rising significance) is no guarantee that it will keep decreasing. If we simply look at days 5, 6 and 7, we can easily spot such significance reversals. At the same time, given enough sample (visitors or test participants), if there is a true effect, the p-value should resume its downward trend (as visible in the later stages of the experiment).
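As a rough illustration of that long-run trend, the sketch below computes a two-sided p-value with a pooled two-proportion z-test. That choice of test is an assumption, since the article does not say which test produced its p-values, and the visitor and conversion counts are hypothetical. It holds the observed rates fixed (10% vs 11.7%) and only grows the sample.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / se
    # erfc(|z| / sqrt(2)) equals the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical snapshots: same observed rates, growing sample per variation.
    for n in (500, 1000, 2000, 5000, 10000):
        p = two_proportion_p_value(round(0.10 * n), n, round(0.117 * n), n)
        print(f"n = {n:5d} per variation: p = {p:.4f}")
```

In a live test the observed rates also move from day to day, which is where the short-term reversals come from; this sketch isolates only the sample-size part of the story.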
- Confidence Intervals Tighten
Finally, if we're looking at an experiment with a true effect, then as more data accumulates, our confidence intervals will also narrow. This essentially means that the range of statistically plausible effects shrinks, so we can pin the effect down more precisely. For example, both days 10 and 31 show the same +17% effect, but the latter confidence interval is far tighter, ranging between 8.2% and 26% (as opposed to the wider earlier range of 1.1% to 33%).
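The narrowing can be sketched with a simple approximation. The function below builds a 95% interval for the relative lift using the delta method on the ratio of conversion rates. This is an assumed method, not necessarily the one behind the intervals quoted above, and the counts are hypothetical, so the output will not match the article's figures.

```python
import math


def relative_lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate CI for the relative lift of B over A (delta method).

    Returns (low, lift, high). Requires at least one conversion per variation.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    ratio = rate_b / rate_a
    # Variance of log(rate_b / rate_a) for two independent binomial samples.
    var_log_ratio = (1 - rate_a) / conv_a + (1 - rate_b) / conv_b
    se = ratio * math.sqrt(var_log_ratio)
    lift = ratio - 1
    return lift - z * se, lift, lift + z * se


if __name__ == "__main__":
    # Same observed +17% lift at an earlier and a later (3x larger) sample.
    for label, n in (("earlier", 2000), ("later", 6000)):
        low, lift, high = relative_lift_ci(round(0.10 * n), n, round(0.117 * n), n)
        print(f"{label}: {lift:+.1%} (95% CI {low:+.1%} to {high:+.1%})")
```

With the observed rates held constant, the interval width shrinks in proportion to one over the square root of the sample size, so tripling the data cuts the width by roughly 42%.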