Agile A/B Testing: Using Stop Rules To Minimize Losses & Time Wasted
Time and money are precious, among other things. So when we seek gains through risky a/b tests while minimizing losses, we also face the interesting problem of making optimal decisions in an optimal time frame. The reality is that some tests will contain variations which hurt the business by exposing it to losses, or which waste its time with insignificant changes. This challenges the classic approach to experimentation: estimate the sample size, set it, and forget it. I'm beginning to believe that a more effective approach to optimization is one where stop rules are combined with good enough data. By not committing to run every test to its fullest estimated duration, and by being agile, winning tests should have a greater chance to emerge.
Wasting Time With Insignificant Tests - Problem One
Here is the first problem. Let's say that you estimate your a/b test to detect a +20% increase off your magical 10% purchase rate, and realize you need over 7,200 visitors (3,600 per variation x 2) to detect it. Great. You design the A/B test, set it, and the wizards warn you not to peek or touch anything until the full 7,200 sample size has been exposed. The hardcore statisticians do have a valid point, as repeated significance testing (checking on the test) increases your chances of detecting a false positive. But setting that aside for now, remind yourself that for some online businesses the wait for a 7,200 sample might be hours, and for others possibly months. And let's say that you do glance at the test midway through and see a highly insignificant +3% increase with a humongous ±22% margin of error. This is as grey as it gets, and here the problem begins to shine through - your chance of turning the test around and getting B to jump from a +3% increase to over +20% is far smaller than a measly false positive rate from checking. When you made the estimate to detect the +20% increase, that was just a guess, and now you have more concrete data. If you continue running the test longer, you will most likely keep exposing traffic to a change that has no effect - essentially wasting time.
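For the curious, here is a minimal sketch of how such a sample size estimate can be computed with a standard two-proportion power calculation. The 80% power and 0.05 significance level below are assumptions on our part, so the output is a ballpark figure rather than an exact reproduction of the 7,200 number (which depends on the calculator and settings used).

```python
# Rough sample size estimate for detecting a +20% relative lift on a 10%
# baseline conversion rate. Power and alpha are assumptions here, so the
# result is a ballpark, not an exact reproduction of the 7,200 figure.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10                      # current purchase rate
lift = 0.20                          # minimum relative effect we want to detect
variant = baseline * (1 + lift)      # 12%

effect = proportion_effectsize(variant, baseline)   # Cohen's h
n_per_variation = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0
)
print(f"visitors per variation: {n_per_variation:.0f}")
print(f"total (2 variations):   {2 * n_per_variation:.0f}")
```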
Wasting Money With Losing Variations - An Even Bigger Problem
There is an even bigger problem than the above, and it happens to the best of us. Building on the example above, let's say that you peek at the test midway through and see a frightening -23% drop in sales. Now you are faced with the problem of exposing your sales funnel to losses, and the question arises of how bad it has to get before you pull out - a very valid question indeed. If you continue running the test, the chances of it turning around diminish, and your losses widen.
Setting Stop Rules
To address the above problems, we are exploring the idea of using stop rules for variations when they match particular criteria (we might actually perform and share simulations of these in a follow-up post). The problem of introducing false positives by peeking to stop variations early is closely tied to how the stopping rule is defined. The most commonly criticized stop rule is to stop the test as soon as a p-value of 0.05 is reached. We agree that this is probably a bad stopping rule, as it has been observed to generate a false positive rate as high as 30% or so. With that in mind, here are our latest thoughts on more conservative stopping rules, which combine a minimum conversion threshold, an effect magnitude, the p-value, and the p-value's persistence into the test.
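To give a flavour of the kind of simulation we have in mind, here is a minimal sketch (not the promised follow-up) that estimates how often naive peeking at p < 0.05 declares a winner in an A/A test where no real difference exists. The traffic level, peek schedule, and number of runs are illustrative assumptions.

```python
# Simulate an A/A test (no real difference) and stop as soon as any peek
# shows p < 0.05, to estimate the false positive rate of naive peeking.
# The traffic level, peek schedule, and number of simulations are
# illustrative assumptions, not figures from the article.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
true_rate = 0.10          # both variations convert at 10%
visitors_per_arm = 5000   # total traffic per variation
peek_every = 250          # check the p-value every 250 visitors per arm
simulations = 2000

false_positives = 0
for _ in range(simulations):
    a = rng.random(visitors_per_arm) < true_rate
    b = rng.random(visitors_per_arm) < true_rate
    for n in range(peek_every, visitors_per_arm + 1, peek_every):
        table = [[a[:n].sum(), n - a[:n].sum()],
                 [b[:n].sum(), n - b[:n].sum()]]
        _, p, _, _ = chi2_contingency(table)
        if p < 0.05:      # naive rule: declare a winner at the first p < 0.05
            false_positives += 1
            break

print(f"false positive rate from peeking: {false_positives / simulations:.1%}")
```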
Stop Rule For Losing Variations (Proposed)
Disable any tested variation in revenue-sensitive tests as soon as all of the following are met (a code sketch follows the list):
- Has at least 100 conversions
- Has a negative effect of -10% or worse
- Has a p-value of 0.03 or lower, maintained for 50 more conversions
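Translated into code, the rule above might look something like the following minimal sketch. The two-proportion z-test and the bookkeeping for "maintained for 50 more conversions" are one possible interpretation, not a definitive implementation.

```python
# Sketch of the proposed losing-variation stop rule. The z-test and the
# persistence tracking are one interpretation of the rule, not a spec.
from statsmodels.stats.proportion import proportions_ztest

def should_stop_loser(conv_b, n_b, conv_a, n_a, state):
    """Return True once all losing-variation criteria are met.

    `state` remembers the conversion count at which p first dropped to
    0.03 or below (None until then), so persistence can be checked.
    """
    if conv_b < 100:                              # needs at least 100 conversions
        return False

    lift = (conv_b / n_b) / (conv_a / n_a) - 1.0
    if lift > -0.10:                              # needs a -10% or worse effect
        return False

    _, p_value = proportions_ztest([conv_b, conv_a], [n_b, n_a])
    if p_value > 0.03:
        state["p_hit_at"] = None                  # streak broken, reset
        return False
    if state["p_hit_at"] is None:
        state["p_hit_at"] = conv_b                # p <= 0.03 first reached here
    # stop only once p <= 0.03 has held for 50 additional conversions
    return conv_b - state["p_hit_at"] >= 50

state = {"p_hit_at": None}
# e.g. re-check after every new batch of data (counts are made up):
# if should_stop_loser(conv_b=120, n_b=1400, conv_a=170, n_a=1400, state=state): ...
```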
Stop Rule For Insignificant Variations (Proposed)
Disable any tested variation as soon as all of the following are met (again, a code sketch follows the list):
- New tests are waiting in the pipeline and ready to be started
- At one third of the estimated test duration, the p-value is 0.75 or higher
- At one half of the estimated test duration, the p-value is 0.65 or higher
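And a similar minimal sketch for the insignificance rule. How the progress fraction and p-value are obtained from your testing tool is left open, and treating the one-third and one-half marks as thresholds that apply from that point onward is our own interpretation.

```python
# Sketch of the proposed insignificant-variation stop rule. The progress
# fraction and p-value are assumed to come from your testing tool; this
# is one interpretation of the checkpoints, not a definitive implementation.
def should_stop_insignificant(progress, p_value, tests_waiting):
    """progress: fraction of the estimated test duration completed (0..1)."""
    if not tests_waiting:             # only stop if something better is queued
        return False
    if progress >= 0.5:
        return p_value >= 0.65        # half way through and still very insignificant
    if progress >= 1 / 3:
        return p_value >= 0.75        # a third of the way through
    return False

# e.g. should_stop_insignificant(progress=0.5, p_value=0.7, tests_waiting=True) -> True
```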
What About You?
Please share your comments below on what you do to avoid losing variations or wasting time with insignificant changes (aside from increasing your chances of generating winning tests - a different topic).
Comments
Georgi 8 years ago
Hi Jakub,
It's nice to see someone taking into account both the peeking problem AND the need for futility stopping rules, that is, being able to cut unpromising tests early. It is exactly these kinds of problems that I addressed in my A/B testing white paper, where I borrowed statistical methods from the field of medical experiments and packaged them for application in A/B testing. If you want to take a look, I've published it as a free white paper, available here: https://www.analytics-toolkit.com/whitepapers.php?paper=efficient-ab-testing-in-cro-agile-statistical-method
Best,
Georgi
Will 9 years ago
My ideas on stopping rules are basically formulated around the premise that statistics is a tool to aid reasoning, not a substitute for it. As such, I tend to think that generic stopping rules are a poor substitute for reasoning about the results.
Most of my general advice can be found in this post I did for the KISSmetrics blog: https://blog.kissmetrics.com/your-ab-tests-are-illusory/ The basic point is that it's okay to stop early, or at a low confidence, if you have another idea lined up and ready to test. In computer science talk this is the trade-off between exploration and exploitation: the former is trying new ideas, the latter is getting the most out of what you know works. Favoring exploring new ideas tends to perform a bit better in simulation.
If you don’t have another idea in the queue I highly, highly recommend waiting until you are quite certain that you have a superior variant before switching. In this post https://www.countbayesie.com/blog/2015/7/6/how-big-is-the-difference-between-being-90-and-99-certain I cover the often missed detail that being 99% certain is 100x more certain than 90%. My testing setup is typically more Bayesian where I am determining a true posterior probability (i.e. what is the probability that A is better than B) rather than a p-value (what is the probability this difference would be observed by mere chance), but in practice these results are very similar.
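A minimal sketch of that kind of posterior comparison, assuming uniform Beta(1, 1) priors and illustrative counts rather than any particular setup:

```python
# Monte Carlo sketch of the Bayesian comparison described above: the
# posterior probability that B's conversion rate beats A's, assuming
# uniform Beta(1, 1) priors. The counts below are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
    return (post_b > post_a).mean()

# e.g. a mid-test snapshot of 100/1000 vs 103/1000
print(prob_b_beats_a(conv_a=100, n_a=1000, conv_b=103, n_b=1000))
```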
Jakub Linowski 9 years ago
Will. Thanks for the comment.
My point is about what rules, if any, people apply to avoid waiting till the very end of an estimated experiment when it shows strong weakness or insignificance. Let's say this is what week 6 of 8 looked like: https://www.thumbtack.com/labs/abba/#Baseline=100%2C1000&Variation+1=103%2C1000&abba%3AintervalConfidenceLevel=0.95&abba%3AuseMultipleTestCorrection=true Would you wait 2 more weeks to most likely end with an insignificant result, or would you rather stop it and put something better into testing? If so, what triggers / rules would you use to remove such weak variations?