Simulating How Long To Run Your Test

How much time is enough for the true performance of your variations to come through the noise?

In this video, we'll see a simulation of an A/A/B/C/D test as it moves from an initial state dominated by chance towards a state of equilibrium. Along the way, we observe how the performance of variations can change over time due to chance alone and what sorts of intermediate outcomes we can expect. How does a false positive tend to behave over time? What is a true +10% winner likely to do halfway into the test? Answering these questions helps me interpret real tests.

To speed things up, this simulation is based on a 20% baseline conversion rate and 1,000 visitors per day. The 10-day duration is just an example. In your real tests, the conversion rate might be as low as 1%, which means it would take far longer to reach a similar equilibrium.
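As a rough illustration, here is a minimal sketch (in Python) of this kind of simulation. The true rates for B, C, and D below are illustrative assumptions, not the values used in the video:

    import numpy as np

    rng = np.random.default_rng(seed=42)

    # Five arms (A/A/B/C/D) sharing 1,000 visitors per day for 10 days.
    # Both A arms convert at the 20% baseline; the true rates for B, C,
    # and D are illustrative assumptions.
    true_rates = {"A1": 0.20, "A2": 0.20, "B": 0.22, "C": 0.20, "D": 0.18}
    days = 10
    per_arm_daily = 1000 // len(true_rates)  # 200 visitors per arm per day

    totals = {arm: [0, 0] for arm in true_rates}  # arm -> [conversions, visitors]

    for day in range(1, days + 1):
        for arm, p in true_rates.items():
            totals[arm][0] += rng.binomial(per_arm_daily, p)
            totals[arm][1] += per_arm_daily
        # Cumulative observed conversion rate per arm at the end of each day
        rates = {arm: c / n for arm, (c, n) in totals.items()}
        print(f"day {day:2d}: " + "  ".join(f"{a}={r:.3f}" for a, r in rates.items()))

Re-running with different seeds shows how often an A arm can look like a winner early on, or how a true winner like B can trail for days by chance alone.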

Exercise:

  1. Use Evan Miller's Sample Size Calculator to calculate the sample size needed to detect a 10% relative lift over a 20% baseline (the answer is at the bottom of this post); leave the power and significance settings at their defaults. A programmatic alternative is sketched after this list.
  2. Rewatch the simulation video and see how the test behaves as it approaches this sample size target.
  3. Consider: How accurate is the relative performance of each variation at this point? What sorts of outcomes are still possible by chance alone that would obscure the true performance of the variations? Based on this simulation, would you run your test for more or less time than this target?
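For step 1, if you would rather compute the target programmatically, here is a sketch using the normal-approximation power analysis in statsmodels. It relies on an arcsine effect size, so it lands around 6,400 per arm rather than exactly matching Evan Miller's calculator:

    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    baseline = 0.20
    variant = baseline * 1.10  # 10% relative lift -> 0.22

    # Cohen's h effect size for two proportions (arcsine transform)
    h = proportion_effectsize(variant, baseline)

    # Per-arm sample size at the defaults: alpha = 0.05 (two-sided), power = 0.80
    n = NormalIndPower().solve_power(effect_size=h, alpha=0.05, power=0.8,
                                     ratio=1.0, alternative='two-sided')
    print(f"~{n:,.0f} visitors per variation")  # roughly 6,400

Swapping in a 1% baseline with the same relative lift pushes the requirement well past 100,000 visitors per arm, which is why low conversion rates mean much longer tests.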

Do you use simulations for planning and analysis? Share with us.

(Answer: 6,347 visitors per variation)




Comments

  • Gavin Morrice · 8 years ago

    I just built a small Ruby program that simulates this; the results are pretty mind-blowing.
    It takes several thousand samples before you get an accurate result with Split A/B testing, and even more if you're using something like an epsilon-greedy algorithm.
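    For the bandit case, a minimal epsilon-greedy sketch (in Python here rather than Ruby, with illustrative rates) shows why: the current best arm hogs traffic, so the remaining arms accumulate samples even more slowly than under an even split:

        import random

        EPSILON = 0.1  # explore 10% of the time, exploit the rest
        true_rates = [0.20, 0.20, 0.22, 0.20, 0.18]  # illustrative
        conversions = [0] * len(true_rates)
        visitors = [0] * len(true_rates)

        def observed_rate(i):
            return conversions[i] / visitors[i] if visitors[i] else 0.0

        random.seed(1)
        for _ in range(10_000):
            if random.random() < EPSILON:
                arm = random.randrange(len(true_rates))               # explore
            else:
                arm = max(range(len(true_rates)), key=observed_rate)  # exploit
            visitors[arm] += 1
            conversions[arm] += random.random() < true_rates[arm]

        for i, p in enumerate(true_rates):
            print(f"arm {i}: true={p:.2f}  observed={observed_rate(i):.3f}  n={visitors[i]}")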

    Have a play around with it; the code is on GitHub:
    https://github.com/Bodacious/OptimisationTestComparisons/

    • Jakub Linowski · 8 years ago

      Hey Gavin. Awesome. These simulations are nice in that they visualize false positives and false negatives. And yes, agreed: it usually takes thousands of visitors for the patterns to stabilize. I think the key factors are sample size, baseline conversion rate, and the magnitude of the effect.

      Unfortunately, I don't know Ruby well enough to run your script :(
      Cheers,
      J