Is It Correct To Make Multiple Design Changes In A Single Test Variation?
As you look at and analyze a given UI screen or flow, design ideas for how you might improve it rush into your consciousness. Here you are faced with two general approaches: you either a/b test each change individually, or you group some of them together into a single variation. Which is right?
Your Opinion [Poll Closed]
My Thoughts & Experience [Updated Sep 27, 2018]
First of all, thank you for voting, everyone! I'm surprised we received that many votes. Thanks again!
As for the results, it clearly looks like more people are in favor of isolating a single change within an experiment. This, I think, is often motivated by a desire to understand whether a given change has an impact or not (as seen in the comments below). Alternatively, grouping multiple changes together in a single variation is done in the hope of achieving a higher overall impact, on the assumption that most of the changes are positive and stack up. Of course, the reality is that multiple changes also run the risk of cancelling each other out (with some being negative and some positive).
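To make the stacking-versus-cancelling tension concrete, here is a tiny illustration in Python. The effect sizes are made up for the example, and real changes are rarely this clean or independent:

```python
# Illustrative only: hypothetical per-change effects on conversion.
# Positive numbers help, negative ones hurt.
effects = [+0.04, +0.03, -0.02, +0.05, -0.03, +0.02]

# If the effects were roughly independent and multiplicative, a grouped
# variation would compound them into one net lift:
combined = 1.0
for e in effects:
    combined *= 1 + e

print(f"net grouped effect: {combined - 1:+.1%}")  # ~ +9.1%
```

The positives stack up to a meaningful lift, but the two negative changes quietly drag the grouped result down, and in a grouped test you can't tell which change did what.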
How Do Bigger Vs Smaller Tests Compare In Terms Of Impact?
To answer the above question, we should look at evidence of how larger tests (with multiple changes) compare to smaller tests (with isolated changes). Luckily, we have some data on this.
For a number of years we've been running larger tests for our clients and writing about them under the GoodUI Datastories project. Some of these stories include retests of failed attempts, so the median impact is slightly inflated. Nevertheless, when we look across 26 such projects, we see a median impact of 23%.
In comparison, we have also been collecting more isolated test results as patterns. Currently, we have published 159 smaller a/b tests with more isolated changes, showing a median impact of 6.6%.
This comparison is our first signal validating the potential of grouping changes into single variations.
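As a rough back-of-the-envelope check (assuming the changes are independent and their effects multiply, which is a strong assumption), stacking a few changes around that 6.6% median lands in the same neighborhood as the 23% median of the larger tests:

```python
median_isolated = 0.066  # median impact of our smaller, isolated tests

# If k such changes stacked multiplicatively, the combined lift would be:
for k in range(1, 6):
    stacked = (1 + median_isolated) ** k - 1
    print(f"{k} change(s) -> {stacked:+.1%}")

# 1 -> +6.6%, 2 -> +13.6%, 3 -> +21.1%, 4 -> +29.1%, 5 -> +37.7%
```

On paper, three or four well-chosen changes would be enough to explain the gap between the two medians.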
Should We Make As Many Changes As We Can Think Of?
Quite recently we were also privileged to run an interesting experiment on a landing page for a premium service of an online driving school. I just want to focus on the high-level test setup, which was designed in the following way:
- A) CONTROL
- B) ONLY HIGH PROBABILITY CHANGES (BASED ON PAST TESTS)
- C) AS MANY CHANGES AS THE TEAM COULD THINK OF
So how did the variations compare? This suggestive test (it was stopped early for external business reasons) is a subtle demonstration that more changes are not necessarily better, as seen in the C variation. Instead, the B variation combined only a handful of changes based on already tested patterns (all with a net positive probability). Essentially, we used positive past test results to decide which changes to group together. The approach taken in the B variant outperformed both C and A.
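For readers curious what a "net positive probability" might look like in code, here is a minimal sketch using Beta posteriors over conversion rates. This is a standard Bayesian approach, and the function name and numbers below are illustrative assumptions, not necessarily the exact method behind our patterns:

```python
import random

random.seed(1)

def prob_positive(conv_a, n_a, conv_b, n_b, samples=50_000):
    """Estimate P(variation rate > control rate) with Beta(1, 1) priors."""
    wins = 0
    for _ in range(samples):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / samples

# Hypothetical past test: 120/2000 conversions (control) vs 150/2000 (variation)
print(f"net positive probability: {prob_positive(120, 2000, 150, 2000):.0%}")
```

A change whose past test clears a high bar here (say, well above 50%) becomes a candidate for grouping into a B-style variation.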
Summarizing This As Principles
Taking all of the above into consideration, we have come up with the following two guiding principles to be used on our projects:
WHEN AIMING FOR A HIGHER OVERALL IMPACT, THEN: GROUP HIGH PROBABILITY IDEAS INTO 1 VARIATION
WHEN AIMING TO LEARN WHETHER A SINGLE CHANGE HAS AN EFFECT, THEN: ISOLATE THE CHANGE INTO 1 VARIATION
To make things more interesting, please also share your thoughts on why you voted the way you did in a comment. Perhaps there are cases when both answers are true? Let's talk about it.