Agile A/B Testing: Using Stop Rules To Minimize Losses & Time Wasted
Time and money are precious. So when we seek gains while minimizing losses through risky A/B tests, we also face an interesting problem: making the best decisions in the shortest reasonable time frame. The reality is that some tested variations will hurt the business by exposing it to losses, and others will waste its time with insignificant changes. This challenges the classic approach to experimentation: estimate the sample size, then set and forget. I'm beginning to believe that a more effective approach to optimization is one where stop rules are combined with good-enough data. If we stay agile and don't commit to running every test for its full estimated duration, winning tests should have a greater chance to emerge.
Wasting Time With Insignificant Tests - Problem One
Here is the first problem. Let's say you plan an A/B test to detect a +20% relative increase over your magical 10% purchase rate, and you find that you need over 7,200 visitors (3,600 per variation x 2) to detect it. Great. You design the A/B test and start it, and the wizards warn you not to peek or touch anything until the full 7,200 sample has been exposed. The hardcore statisticians do have a valid point: repeated significance testing (checking on the test) increases your chances of a false positive. But setting that aside for now, remember that for some online businesses the wait for a 7,200 sample might be hours, and for others possibly months. Now let's say you do glance at the test midway through and see a highly insignificant +3% increase with a humongous ±22% margin of error. This is as grey as it gets, and here the problem begins to shine through: the chance of B jumping from a +3% increase to over +20% is way smaller than the measly false positive risk introduced by checking. The +20% increase you planned for was just a guess, and now you have more concrete data. If you keep running the test, you will most likely keep exposing visitors to a change that has no effect, essentially wasting time.
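To make the sample size estimate concrete, here is a rough sketch of one common way to compute it: the normal-approximation formula for a two-sided test of two proportions. The function name, the 5% significance level and the 80% power are assumptions for illustration, not settings taken from this post. Different calculators make different assumptions, so the result lands near, not exactly on, the 3,600 per variation quoted above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation (two-sided two-proportion test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # e.g. 10% lifted by +20% is 12%
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


n = sample_size_per_variation(0.10, 0.20)
print(n, "visitors per variation,", 2 * n, "in total")
```

Because the required sample grows with the inverse square of the difference you want to detect, a true effect much smaller than +20% would need far more traffic than was planned for.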
Wasting Money With Losing Variations - An Even Bigger Problem
There is an even bigger problem than the one above, and it happens to the best of us. Building on the same example, let's say you peek at the test midway through and see a frightening -23% drop in sales. Now you are exposing your sales funnel to losses, and the question arises of how bad it has to get before you pull out - a very valid question indeed. If you continue running the test, the chances of it turning around diminish and your losses widen.
Setting Stop Rules
To address the above problems, we are exploring the idea of using stop rules that disable variations when they match particular criteria (we might actually perform and share simulations of these in a follow-up post). How many false positives peeking introduces is closely tied to how the stopping rule is defined. The most typical stop rule, and one that is often criticized, is to stop the test as soon as a p-value of 0.05 is reached. We agree that this is probably a bad stopping rule, as it has been observed to generate a false positive rate as high as 30% or so. With that, here are our latest thoughts on more conservative stopping rules, which combine a minimum conversion threshold, an effect magnitude, the p-value, and the p-value's persistence as the test continues.
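For a feel of why the naive "stop at p = 0.05" rule is criticized, here is a small A/A simulation sketch. Both arms share the same true conversion rate, so every "significant" result is a false positive. Everything in it is an assumption for illustration: the function names, the 10% rate, 2,000 visitors per arm, a peek every 100 visitors, and the pooled two-proportion z-test. It is not the simulation mentioned above, and the rates it prints depend on these choices.

```python
import random
from math import erfc, sqrt


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value (normal approximation, pooled variance)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation observed yet, nothing to test
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return erfc(abs(z) / sqrt(2))


def simulate_aa_tests(runs=400, visitors_per_arm=2000, peek_every=100,
                      rate=0.10, alpha=0.05, seed=1):
    """Return (false positive rate with peeking, false positive rate at the end only)."""
    rng = random.Random(seed)
    flagged_at_any_peek = 0
    flagged_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        hit_alpha = False
        p = 1.0
        for n in range(1, visitors_per_arm + 1):
            conv_a += rng.random() < rate  # both arms convert at the same true rate
            conv_b += rng.random() < rate
            if n % peek_every == 0:
                p = two_sided_p_value(conv_a, n, conv_b, n)
                hit_alpha = hit_alpha or p <= alpha
        flagged_at_any_peek += hit_alpha
        flagged_at_end += p <= alpha  # p now holds the final look
    return flagged_at_any_peek / runs, flagged_at_end / runs


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"stop at first p <= 0.05: {peeking_rate:.1%} false positives")
print(f"single look at the end:  {fixed_rate:.1%} false positives")
```

Since the final look is one of the peeks, the peeking rate can never be lower than the end-only rate. The gap between the two is the cost of stopping at the first significant reading.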
Stop Rule For Losing Variations (Proposed)
For revenue-sensitive tests, disable any tested variation as soon as all of the following are met:
- Has at least 100 conversions
- Has a negative effect of -10% or worse
- Has a p-value of 0.03 or lower that is maintained for 50 more conversions
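The rule above could be sketched in code roughly as follows. This is one possible reading, not a definitive implementation: it assumes the 100 conversions are counted on the variation, that all criteria must keep holding while the variation collects 50 more conversions, and that the p-value comes from a pooled two-proportion z-test. The function names and the snapshot format (cumulative control visitors, control conversions, variation visitors, variation conversions, each with at least one visitor) are hypothetical.

```python
from math import erfc, sqrt


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value (normal approximation, pooled variance)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return erfc(abs(z) / sqrt(2))


def should_stop_loser(snapshots, min_conversions=100, max_effect=-0.10,
                      max_p=0.03, persistence=50):
    """Check cumulative snapshots in order; True once the losing rule is satisfied."""
    streak_start = None  # variation conversions when all criteria first held
    for n_a, conv_a, n_b, conv_b in snapshots:
        rate_a, rate_b = conv_a / n_a, conv_b / n_b
        effect = (rate_b - rate_a) / rate_a if rate_a > 0 else 0.0  # relative change
        criteria_met = (
            conv_b >= min_conversions
            and effect <= max_effect
            and two_sided_p_value(conv_a, n_a, conv_b, n_b) <= max_p
        )
        if not criteria_met:
            streak_start = None  # the streak is broken, start over
            continue
        if streak_start is None:
            streak_start = conv_b
        if conv_b - streak_start >= persistence:
            return True
    return False


# Made-up data: control converts at 10%, the variation at 7%, checked every 500 visitors.
losing = [(500 * k, 50 * k, 500 * k, 35 * k) for k in range(1, 11)]
print(should_stop_loser(losing))      # True
print(should_stop_loser(losing[:4]))  # False, the 50 extra conversions have not accrued yet
```

The persistence requirement is what separates this from the naive rule: a single low p-value reading is not enough to pull the variation.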
Stop Rule For Insignificant Variations (Proposed)
Disable any tested variation as soon as the following are met:
- New tests are waiting in the pipeline and ready to be started
- At one third of the estimated test duration, the p-value is 0.75 or higher
- At one half of the estimated test duration, the p-value is 0.65 or higher
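A minimal sketch of this second rule, under stated assumptions: the function name and parameters are hypothetical, `elapsed_fraction` is the share of the estimated test duration already run, and once the halfway checkpoint is reached the 0.65 threshold replaces the 0.75 one. The p-value is assumed to be computed elsewhere.

```python
def should_stop_insignificant(elapsed_fraction, p_value, tests_waiting):
    """Return True when an insignificant variation should make room for the next test."""
    if not tests_waiting:
        return False  # nothing is queued, so there is no reason to cut the test short
    # (fraction of estimated duration, minimum p-value), latest checkpoint first
    checkpoints = [(1 / 2, 0.65), (1 / 3, 0.75)]
    for fraction, p_threshold in checkpoints:
        if elapsed_fraction >= fraction:
            return p_value >= p_threshold
    return False  # earlier than one third of the way in: keep running


print(should_stop_insignificant(0.34, 0.80, tests_waiting=True))  # True
print(should_stop_insignificant(0.34, 0.70, tests_waiting=True))  # False
print(should_stop_insignificant(0.50, 0.70, tests_waiting=True))  # True
```

The threshold loosens as the test progresses because a variation that is still nowhere near significance halfway through has had more opportunity to show an effect.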
What About You?
Please share your comments below on what you do to avoid losing variations or wasting time with insignificant changes (aside from increasing your chances of generating winning tests - a different topic).