10 Answered Questions About A/B Testing

This week, Rodrigo Maués asked me a number of interesting questions about running experiments. Instead of answering him directly (under our highly productive mentorship), I thought it would be even better to share my answers right here with everyone - this way more people benefit. Here we go.

Q1: When should you run an A/B test and when shouldn’t you?

Answer: If testing is seen as a way to generate certainty, then the more certainty you already have, the lower the need for testing. Certainty can come from many different sources such as: past data (ex: your baseline), your own tests, other people's tests, customer complaints, qualitative research, books, theories and your experience. Where you draw the line between testing and implementation is a highly subjective one (ex: do you need one solid a/b test to implement or two?). Although it is subjective, it's still good to write down your criteria for implementation that you accept as good enough.

One example of a high certainty scenario might be around a possible bug that was accidentally introduced. Let's say you know your typical conversion rate on a given page and somehow the form stopped submitting. Having an understanding of what the conversion rate should be (past data), you could simply change/fix the form to submit again without the need for testing - common sense.

Q2: Why are A/B tests important?

Answer: Tests allow us to measure and talk about the real effects of changes. With testing we move beyond guesses, beliefs and opinions as we carve away at uncertainty. When we share test results and methods we further enable comparisons and reproducibility that can unlock even greater degrees of certainty.

Q3: Is there a limit to the number of metrics to track during a test?

Answer: I don't think there should ever be a limit to the number of metrics in a test. You should be able to measure whatever you like and make as many comparisons as you like - otherwise science would move in the direction of censorship. What is important, however, is to separate two ways that you could approach experiments: exploratory and strict.

When you run an exploratory experiment (usually with a larger number of metrics or variations) you might be looking for relationships that you did not even imagine before. This open and informal way of collecting data is acceptable in my opinion as long as it's used as a starting point for further experimentation. Usually these types of experiments don't validate hypotheses, but actually generate hypotheses and questions as their outcome! Of course, the more metrics we track, the higher the chance that we could cherry-pick data. And that's why we need to follow up on such observations with stricter experiments.

Most experiments that people run, however, lean toward the formal or strict sense in that they test hypotheses. In this case, we make a guess and strictly check whether our prediction holds true. Even in this type of experiment, we might measure a number of metrics (ex: progressions through a funnel with multiple measures). Or we might wish to increase sales of a given product without hurting the sales of 99 other products.

Q4: How did you come up with the logic to calculate the effect strength?

(Rodrigo is referring to our classifications of test results: strong, possible and insignificant.)

Answer: Our test effect classification was made up based on rules of thumb and experience. The problem started when we began to analyze other people's tests, where we didn't always know how the test was run (when it was stopped and whether it had enough statistical power). So we needed a quick way of gauging how believable the test result was. We knew that we could not rely on significance and p-values alone as there are highly significant yet under-powered tests like this one. So we combined significance with a threshold of successes. Essentially, we're saying that if a variation has 300+ successes and a p-value of less than 0.03, then we see such a test result as a pretty strong one.
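As a rough sketch of that rule of thumb in Python: only the "strong" rule (300+ successes and a p-value below 0.03) comes from the description above. The `possible_p` cutoff between "possible" and "insignificant" is a hypothetical placeholder, and the pooled two-proportion z-test is just one common way to get a p-value, not necessarily how ours are computed.

```python
import math


def two_proportion_p_value(succ_a, n_a, succ_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (an assumed method)."""
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variance at all, so no evidence of a difference
    z = (succ_b / n_b - succ_a / n_a) / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def classify_effect(successes, p_value,
                    min_successes=300, strong_p=0.03, possible_p=0.25):
    """Classify a variation's result as strong, possible or insignificant.

    min_successes and strong_p follow the rule of thumb in the text.
    possible_p is a made-up cutoff used only for illustration.
    """
    if successes >= min_successes and p_value < strong_p:
        return "strong"
    if p_value < possible_p:
        return "possible"
    return "insignificant"


# Example: control 300/10,000 vs. variation 380/10,000 conversions.
p = two_proportion_p_value(300, 10_000, 380, 10_000)
print(classify_effect(380, p))
```

Note how a result with a tiny p-value but only 120 successes would fall into "possible" rather than "strong": that is the under-powered case the success threshold is meant to catch.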

Q5: If a variation did not win or lose with significance (according to Optimizely for instance) are the results still worth anything?

Answer: Test results are never delivered as black and white but instead come in ranges of effect, gradients, probabilities and likelihoods. You as a human determine where you draw the line. If you are too strict with yourself, you might pass on potential winners while lowering your improvement rate (throwing out valuable results). Weaker or suggestive results may also begin to matter more when supported by data from other experiments. That is, if your test result is only suggestive and you also know of other similar and significant experiments, then I think you have more certainty to account for.

Q6: What are the worst kinds of errors when running an A/B test?

Answer: Anytime you have a technical setup issue, you could invalidate the experiment (or cut out the data that was unreliable). A perfect example of this might be a case where instead of seeing the variation, a group of people in the test is only exposed to the control (where they should have seen the variation).

Q7: What percentage of the traffic should be used for an A/B test?

Answer: Our default is to always test with 100% of the traffic for the greatest speed. Some people lower the test exposure because of fear - which does nothing good, as you still have to collect the same amount of data, further extending your testing duration.
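The arithmetic behind this is simple enough to sketch. The function name and the numbers below (40,000 visitors needed, 2,000 visitors per day) are made up purely for illustration; the point is that the required sample stays the same, so halving exposure doubles the duration.

```python
import math


def days_to_collect(required_visitors, daily_visitors, exposure):
    """Days needed to reach the required sample at a given test exposure.

    exposure is the fraction of traffic included in the test (0 < exposure <= 1).
    """
    if not 0 < exposure <= 1:
        raise ValueError("exposure must be in the range (0, 1]")
    return math.ceil(required_visitors / (daily_visitors * exposure))


# Hypothetical site: 2,000 visitors a day, 40,000 visitors needed in total.
for exposure in (1.0, 0.5, 0.25):
    print(exposure, days_to_collect(40_000, 2_000, exposure))
```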

Related to this question, it's also very important to keep the ratios between the control and variations equal. When experiments are set up with uneven ratios (ex: 70% of traffic to control and 30% of traffic to a variation) it may introduce time-based fluctuations that skew the results (ex: A is more effective at the end of an experiment than at the beginning).

Q8: Which precautions should we take against external influences in a test, and how can we mitigate them?

Answer: Measure everything. Don't avoid tests just because of an event. If you measure before, during, and after an event, at least you'll be able to answer whether that event mattered or not (by slicing your data by time). We once saw a discount campaign skew our results. This external email campaign provided an overall surge in sales, while minimizing the effect we were testing in a variation (the external discount was so good that it rendered our own changes temporarily obsolete, bringing the control and variation closer together). One quick way of checking for this is to look at the day charts and see if there are any weird spikes.
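If you prefer numbers over eyeballing a chart, a minimal sketch of the same check might look like this. The daily sales figures are invented, and the "more than 2x the median day" rule is an arbitrary threshold I'm assuming for illustration, not an established standard.

```python
from statistics import median


def flag_spikes(sales_by_day, factor=2.0):
    """Return the days whose sales exceed factor times the median day."""
    typical = median(sales_by_day.values())
    return [day for day, sales in sales_by_day.items() if sales > factor * typical]


def daily_lift(control, variation):
    """Relative lift of variation over control for each day both have data."""
    return {
        day: variation[day] / control[day] - 1
        for day in control
        if day in variation and control[day] > 0
    }


# Hypothetical daily sales; day 3 is when the external discount email went out.
control = {"day 1": 30, "day 2": 32, "day 3": 95, "day 4": 31}
variation = {"day 1": 36, "day 2": 38, "day 3": 97, "day 4": 37}

print(flag_spikes(control))
print(daily_lift(control, variation))
```

In this made-up data the variation leads by roughly 20% on ordinary days, but the gap nearly disappears on the spike day, which is the pattern described above.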

Q9: Can we change a test while it is running?

Answer: Sure thing. If you have 5 variations and one is losing heavily, if you are sure enough, you could remove just that losing one. Of course when you pause or add a new variation into a test, you have to keep in mind that comparing can only be done within the same time frames. Furthermore, if you have long lag times between exposure to variation and the action being measured (ex: two week purchase lag) then comparison gets messy quite quickly.
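Here is a small sketch of what "comparing within the same time frames" could mean in practice: only the days on which both variations were live are counted. The data layout (a dict of day to visitors and conversions) and the numbers are assumptions for the example, and the sketch ignores purchase lag entirely.

```python
def compare_on_shared_days(arm_a, arm_b):
    """Compare conversion rates using only days present in both arms.

    Each arm maps a day label to a (visitors, conversions) tuple.
    Returns (shared_days, rate_a, rate_b).
    """
    shared = sorted(set(arm_a) & set(arm_b))
    if not shared:
        raise ValueError("the two arms have no days in common")

    def rate(arm):
        visitors = sum(arm[day][0] for day in shared)
        conversions = sum(arm[day][1] for day in shared)
        return conversions / visitors

    return shared, rate(arm_a), rate(arm_b)


# Hypothetical data: variation C was added one day after the control started.
control = {"2024-01-01": (100, 10), "2024-01-02": (100, 12), "2024-01-03": (100, 8)}
variation_c = {"2024-01-02": (100, 15), "2024-01-03": (100, 9)}

print(compare_on_shared_days(control, variation_c))
```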

Q10: Do you prefer to determine the sample size and the duration of the test in advance, or to let the test run until you have enough data?

Answer: I prefer to be more agile with testing. The problem with hardcore frequentist set-and-forget approaches is that they force you to guess the effect upfront (on which the sample size estimate is based). Most of our guesses will be wrong and will only slightly improve as we remember past test results. What's more, if the sample size estimates are taken to heart, they may force us to run a test for the full estimated duration (ex: 6 weeks?). If we then see a very strong negative result in week 1, we might lock the business into an unnecessary loss. Instead, I believe we should have solid stop rules to protect the business. If we can act faster (on losing and insignificant tests), then that's one way of increasing the testing velocity needed for more positive results. So yes, I do believe in acting on good enough data instead. One thing to keep in mind when peeking at test results is to be more conservative with additional testing time (and to avoid using p-values as stop rules).