GoodUI BETTERDATA

Get results you can trust. Choose the right things to measure, improve the accuracy of your A/B testing tool, and know how much data is enough.

Estimate sample size upfront for more confidence

Use a sample size calculator to figure out how many visitors you need based on conversion rate, desired outcome, and risk tolerance. A planned sample size goal will give you a measure of confidence in your data and an objective criterion for stopping the experiment.

Example: Your conversion rate is 20% and you want to see if your variations managed to lift that by at least 10%. The calculator will show that you can expect to detect a statistically significant lift with 6,000 visitors per variation (including control) 80% of the time. This is called "power analysis". If you ran the experiment with 1,000 visitors and saw a statistically significant 60% lift, you would know you're still far from your original plan of 6,000. You might be sceptical of this and decide to run your experiment longer, by which time the lift might drop to 20%.
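If you want to sanity-check the calculator, the standard two-proportion power calculation can be approximated in a few lines of JavaScript. This is a rough sketch with z-scores hard-coded for 95% significance and 80% power; the function name is ours:

// Approximate visitors needed per variation to detect a lift from p1 to p2
// at 95% significance (two-sided) and 80% power. Rough approximation only.
function sampleSizePerVariation(p1, p2) {
  var zAlpha = 1.96;  // z-score for 95% confidence (two-sided)
  var zBeta = 0.84;   // z-score for 80% power
  var pBar = (p1 + p2) / 2;
  var numerator = zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
                  zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil(Math.pow(numerator, 2) / Math.pow(p2 - p1, 2));
}

sampleSizePerVariation(0.20, 0.22); // roughly 6,500, the same ballpark as the calculator's 6,000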

Be realistic. Analysis can show that an experiment has no chance of success. In that case, adjust your tactics and avoid running a futile experiment.

To detect smaller effects or have a higher chance of success, you'll need more visitors.

If time is constrained, you'll need to accept a higher risk of failure or aim for less definitive results.

Track page visits for more accuracy

Make page visits your primary goal. In our tests, we've seen visit goals up to 50% more accurate than clicks or form submits. Ensure there is only one way to get to the unique goal page URL once a visitor is part of the experiment.

Add a URL parameter that uniquely ties your goal page URL to the page you're testing. For example, setting visits to /goalpage?from=home as your primary goal and directing experiment participants there ensures that visitors who bypass your experiment and land on /goalpage don't count as conversions.
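For instance, if the tested page redirects to the goal page after a successful submit, the redirect itself can carry the parameter (the paths and parameter name here are just the example above):

// After a successful submit on the tested home page, send the visitor to the
// goal URL that is uniquely tied to this experiment.
window.location.href = "/goalpage?from=home";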

Clicks as a secondary metric can still be useful. For example, clicks running higher than goal page visits could mean your form validation design is stopping users from getting to the goal page (e.g., captchas and inflexible format requirements).

If a user is likely to immediately close the goal page or to click a link to another page, it is possible the page visit won't get tracked. Just in case, ensure your goal page can keep a visitor for 1-2 sec.

Set Immediate Goal as Primary for more accuracy

The primary goal should be on the page you are testing or the immediate next page. If you have 5 steps in your funnel, and you're testing a change in Step 1, you should track visits to Step 2 as your primary goal. This gives you the greatest chance of detecting a significant change.

If you track visitors from Step 1 to conversions at Step 5, for example, you'll have to run your experiment longer to get reliable data. One reason is that you'd be measuring the smallest conversion rate in the funnel, since visitors drop off at each step. Another is that the intervening steps add noise that can distort what you're trying to measure.

When improving Step 1, track other steps in the funnel as secondary metrics and be sceptical of them.

Target each step with a separate experiment. So to improve Step 3, start your test on Step 3 and track visits to Step 4. This raises your conversion rate and gives reliable data sooner.


Include Redundant Goals for more accuracy

Track the same primary metric in more than one way. For instance, always track clicks on Step 1 and Visits to Step 2. Even page visit goals are not fool-proof, and in testing you may find situations where visits do not get tracked as often as they should.

Plain clicks are not equivalent to visits (someone can click a button without having filled in the form, for example), though they are still useful if there is no alternative. If you are testing a form, the better option is to set up a smarter Custom click goal that fires only after validation passes on Step 1. This metric would be nearly equivalent to visits to Step 2.
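One way to set this up, sketched with jQuery (the selector, validateStep1Form, and trackGoal are placeholders for your own form, your validation routine, and your tool's custom conversion call):

// Fire the custom click goal only when the Step 1 form passes validation,
// so it approximates "arrived at Step 2".
$("#step1-form").on("submit", function () {
  if (validateStep1Form(this)) {           // placeholder: your validation routine
    trackGoal("step1-validated-submit");   // placeholder: your tool's custom goal call
  }
});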

Once or twice, we've found a smart click goal outperformed the visit goal on one variation but not the other by a few conversions. In these situations, we take the higher number from either goal, since it's the more accurate one (assuming the goals are truly equivalent).

Use Naming Conventions for easy identification

Instead of naming a variation "B" or "Variation 2", try "B: Larger button", so you can identify it at a glance. Instead of naming a goal "Goal 1", try "1: Clicks primary", "1: Clicks secondary", "2: Arrives on payment", where 1/2 designate page number in the funnel and clicks/arrives designates the type of goal.

Bind to mousedown for more accuracy

A mousedown event fires slightly earlier than the click event. If you are binding Custom goals to an event, use mousedown. Those few milliseconds help ensure the event fires before the browser redirects and can prevent dropped events in Chrome and Safari, especially for events bound to links.
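With jQuery, the binding might look like this (the selector and the trackGoal helper are placeholders):

// Bind to mousedown so the goal fires before the browser starts navigating away.
$("a.signup-link").on("mousedown", function () {
  trackGoal("clicked-signup");  // placeholder: your tool's custom goal call
});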

Test your experiment to catch errors early

Test each variation to make sure it's behaving properly. Submit each form and make sure the data is in the database and your 3rd-party analytics look right. Then launch the experiment internally (so only your team sees it) and check that each type of goal is being tracked.

Some issues are transient, so test at least twice, clearing cookies and cache in between.

If you catch errors and have to stop the experiment, it is best to duplicate the experiment and start a clean version.

You can QA an experiment safely in production by restricting it to your IP or a URL parameter like ?include=true.

Target New Visitors for more confidence

If you have run many experiments on your site, you may not want returning visitors to join your current experiment. Visitors who notice a change will behave differently than new visitors. We have seen conversion rates differ by up to 400% depending on whether returning visitors are included or not.

If you keep targeting new visitors in experiments, you will notice your traffic and conversion rate change as you run out of new visitors (in VWO, a new visitor has not been part of any experiment). At that point, you may reopen experiments to all traffic.


Exclude Browsers for more accuracy

Look through your visitor stats and target the most common browsers. Test your design thoroughly on each targeted browser to avoid contaminating your experiment with browser-specific effects. Unless your visitors are heavy IE users and you have tested thoroughly in it, target IE users in a separate experiment run in parallel (unless your A/B tool has segmenting features).

Use CSS3 and JavaScript features that your target browsers support.

Separate Mobile and Desktop for more confidence

Make sure each visitor to your experiment has an equal chance to see the same page. A mobile visitor will essentially see a different page than a desktop visitor. So, you should always target an experiment to only mobile or only desktop. Your tool may exclude mobile by default.

If your Control page is responsive, make sure your variations are responsive as well. All your visitors should have the same experience, except for the effect you are testing.

Separate Competing Goals to reveal relationships

If you are tracking user choices that compete with each other, use a deeper goal to ensure the metrics are mutually exclusive. If you were to track clicks or intermediate page visits, you could be double-counting, since nothing stops a user from going back and triggering a different goal too.

For shallower metrics, try tracking how often users go back and change their mind as a secondary metric. You can also add goals that track just the user's first choice and separately their final choice. Identifying why people change their minds can itself be a valuable insight and can help you interpret the effect of double-counting in your shallower metrics.

All the A/B testing tools we know of count only the first activation of a goal per visitor, so there is no risk of double-counting there.

Track time on site for greater insight

It can be useful to know not merely that an event happened but how long it took for a user. You can add goals to your experiment to track things like duration of the page visit or how long it takes a user to start or complete a form. To do this, frame the time as a binary goal, such as "User has been on the page for 2 minutes". You can then set a timer and fire the conversion once the target time has been reached:

// Fire the Custom goal once the visitor has been on the page for 2 minutes (120,000 ms)
setTimeout(function() { trigger_2min_goal(); }, 120000);

Have a good reason for tracking the time. Your hypothesis might be: "If people stay on the page longer, they are more likely to read the content and make a purchase" or "If people are not able to complete the form in 1 min, they will quit".

Try Value Instead of Revenue for greater insight

You can track value using a revenue goal. For example, a $100 plan might actually be more valuable to your business than a $200 plan, because the $100/month plan leads to greater Customer Lifetime Value. You can also assign a non-zero value to a free product. If 5% of your free users upgrade to a $100 plan, then a free plan is really worth $5. This way you can see whether a statistically significant trade-off between free and paid plans will benefit your business.

Track Both Revenue & Choices for greater insight

For a purchase, use a revenue goal on the purchase confirmation page, where a dollar value has been assigned. This brings competing goals together into one handy metric that tells you whether the change benefits your business. It can also reveal situations where total sales volume stays unaffected but revenue changes, because users have shifted to a higher- or lower-value choice.

However, revenue is not always appropriate and does not always tell the whole story (e.g., if multiple products are purchased and you need to track which, or if the dollar value can be zero). Add a URL parameter to the goal page to track not just the dollar value but track changes in users' choices (e.g., ?plan=free).

If you are tracking a URL based on multiple parameters, make sure the order of parameters is fixed. For instance, if you want to track Plan A purchases of Product B with parameter ?plan=A&product=B, make sure the parameters always go in that order or the URL won't match.

Eliminate flicker in A/B tests for more accuracy

When you set up an A/B experiment, your A/B tool will inject changes into the existing page, turning it into one of the variations (this doesn't apply to back-end tools). Most tools will prevent displaying a page until it is ready, but this is not always fool-proof. The page can momentarily flicker or briefly show original page elements. If users notice this and behave differently, your experiment will be invalid. Test for the flickering effect and ascertain whether any remaining flicker might skew your experiment results (clear the cache in between and try different browsers).

Using your tool's built-in editor, while less flexible, should reduce flicker better than injecting custom CSS or JavaScript. In VWO, a workaround is to use the built-in editor to hide and then show an element and only then inject custom CSS and JavaScript to modify it. This tells VWO to hold rendering this element until it is ready.

Another way to minimize flicker is to optimize your code and reduce the amount of injected code. If possible, insert the Variation content directly into your production site, tag it with an id or class like "variation1", and hide it with CSS. Then tag the Control content with class "control". Now, instead of injecting lots of code, your A/B experiment just shows the "variation1" elements and hides the "control" elements.
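With the content already in production and hidden by CSS, the injected variation code shrinks to a couple of jQuery lines (class names as described above):

// The production page ships both versions; CSS hides .variation1 by default.
// The variation only has to swap visibility, so there is very little to inject.
$(".control").hide();
$(".variation1").show();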

Ignore inconclusive results

A test of statistical significance (p-value) will tell you whether your result is real or likely to be the result of chance. If the difference is not statistically significant, you can't say a variation won, lost, or is the same. A p-value of 0.5 does not mean it's a 50/50 chance of winning. A p-value close to 1 does not mean the variations are the same. Any high p-value simply means there is not enough evidence to make any determination.

An inconclusive negative result is not a loser. The only time you should say that a variation lost is when you detect a negative effect that is statistically significant.

A p-value can be low enough to be "suggestive". It is evidence, but weak evidence. You should definitely note suggestive results and run an experiment to confirm them. For example, if you're aiming for 95% confidence, a p-value of 0.1 is suggestive. Remember that even a p-value of 0.01 is not proof, just strong evidence.

With a sample that shows Control and Variation performing about the same, it is tempting to conclude they are the same. However, even with a large sample, there is a small probability that you won't detect a true effect just by chance. A sample size calculator allows you to mitigate this risk, called beta, by running your experiment longer. To say that variations truly perform about the same, you should have a large sample with confidence intervals that are narrow and almost completely overlapping.

Test fewer variations to avoid false positives

Cut non-essential variations, and retest any result that matches what you would expect by chance. Simple A/B tests are ideal; we recommend avoiding multivariate tests.

For each comparison you make, there is a risk that the winner is a false positive (called alpha or significance level). If you make multiple comparisons, whether by adding more variations or more experiments, be aware that the overall likelihood of finding a false positive is inflated.

Example: You use a sample size calculator to estimate the number of visitors you need for a 5% chance of a false positive. You run 20 variations over 2 tests and find a winner. You should retest this result, because with a 5% false positive rate per comparison, 20 comparisons are expected to produce about one false positive just by chance. If you condensed the 10 variations in each experiment down to 5, you would halve the number of comparisons and with it the expected number of false positives.
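A quick way to see how fast the risk grows, assuming independent comparisons:

// With a 5% false positive rate per comparison, 20 comparisons are expected to
// produce about one false positive, and the chance of at least one is far above 5%.
var alpha = 0.05;
var comparisons = 20;
var expectedFalsePositives = alpha * comparisons;              // 1.0
var chanceOfAtLeastOne = 1 - Math.pow(1 - alpha, comparisons); // about 0.64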

If you include numerous variations in the spirit of experimentation, that is valid. Just keep in mind the increased probability of finding a winner or loser just by chance. We have found that experiments with more than 4 variations (including the baseline) are more likely to run long, end up underpowered, and produce results that are hard to interpret.

A false positive can be any statistically significant effect. A losing variation may also be a false positive.

Track Shallow Goals to get data faster

If your site has low traffic or a low conversion rate, a sample size calculator will tell you that you don't have a good chance of measuring the ideal metrics, like revenue. However, you can increase your conversion rate by measuring shallower metrics instead, like clicks, scrolls, or searches, which are nonetheless solid indicators of desired behaviour.

Example: If people search products, they can find them and purchase them. Therefore, by increasing searches, we are likely to help sales. To raise your 1% sales conversion rate by 10%, you'd need 150K+ visitors. However, to raise your 50% search conversion rate by 10%, you'd only need 1-2K visitors.
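Using the sampleSizePerVariation sketch from earlier, the contrast is easy to reproduce (both figures are rough approximations):

// Deep goal: 1% sales conversion, 10% relative lift (0.01 -> 0.011)
sampleSizePerVariation(0.01, 0.011); // roughly 160,000 visitors per variation

// Shallow goal: 50% search conversion, 10% relative lift (0.50 -> 0.55)
sampleSizePerVariation(0.50, 0.55);  // roughly 1,500 visitors per variation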

Have a hypothesis for more confidence

If the outcome of your experiment confirms a hypothesis that you had stated upfront, it makes the outcome more trustworthy. In contrast, if you test many variations hoping to hit on something by chance or if you discover something by accident, there is greater risk of false positive.

If you did not state a hypothesis upfront or got some unexpected results, come up with a post-hoc hypothesis that explains the results. If your sample size is still inadequate, but the overall pattern among variations makes sense, it makes the result more trustworthy.

Example: Say you're running an ABCD experiment. A and B are visually similar, minor changes. C is a bigger redesign, which tests a hypothesis about what motivates your visitors. You therefore expect C to perform better. If A and B indeed perform similarly and C outperforms both, this result is trustworthy. On the other hand, if the results are unexpected, and B did best, then your scepticism would lead you to seek an alternative hypothesis to explain this.

A good hypothesis is a theory about the motivations, goals, and behavior of visitors. It does not describe what you plan to do but explains the "why" you are doing it. For example, this is not a hypothesis: "We can increase conversions by adding a security badge". A good formula is: IF [we remove dollar signs before our prices], THEN [people will spend longer on the page and be more likely to purchase], BECAUSE [it may be that dollar signs trigger negative associations for people].

Agree on drop rules for more confidence

Decide ahead of time what constitutes strong enough evidence against a variation that you can drop it. For example: "drop a variation if the p-value is at most 0.2, and we have at least 50% of our planned sample size". An early stopping rule like this allows you to make decisions rationally and consistently.

Stopping a losing variation mitigates the risk of further losses, but it comes at the risk of dropping a winner, wasting effort, missing insights, and even reaching the wrong conclusion. If you drop a variation when the evidence against it is still weak, you are saying "I’m not willing to find out for sure, because of perceived risk". As a result, you will always have lingering uncertainty about whether it really did better or worse.

We do not recommend dropping variations or stopping experiments early unless the evidence is reasonably strong and the cost of ongoing losses is high.

Add !important for more accuracy

If you are injecting code in an A/B experiment to create your variations, add !important to all your CSS. This ensures that existing styles don't override the styles you are injecting if they happen to load last.
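For example, if your variation injects its styles with jQuery, every declaration carries !important (the selector and values are only illustrative):

// Injected variation styles; !important keeps late-loading site CSS from overriding them.
$("head").append(
  "<style>.cta-button { background-color: #2e7d32 !important; font-size: 18px !important; }</style>"
);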

Know URL parameter order for more accuracy

If you are tracking page visits based on a combination of URL parameters, make sure you get the order of the parameters correct. Sometimes, the order of the parameters is not fixed. For instance, if you track visits to http://example.com/?plan=free&success=1, this won't match the URL http://example.com/?success=1&plan=free.
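If your own code builds the link or redirect to that page, constructing the query string in one fixed order avoids the mismatch (the URL and parameter names are those of the example above):

// Always emit the parameters in the order the goal URL expects:
// http://example.com/?plan=free&success=1
var plan = "free";
var goalUrl = "http://example.com/?" + [
  "plan=" + encodeURIComponent(plan),
  "success=1"
].join("&");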

Track HTTPS and HTTP for more accuracy

If your site can be visited through both http and https, you need to track each one explicitly. In VWO, use the wildcard http* to ensure both http and https visitors are included.

Check Significance Yourself for more confidence

Most tool vendors show optimistic results. Winners are declared too early, and confidence intervals (margin of error) are shown at 80% level. Seeing stronger results and more winners keeps you motivated about testing. It also reduces your risk of missing true effects. The problem is you'll see many false positives and inflated effect sizes. To get a truer measure of confidence, use a tool like Abba to see 95% confidence intervals.

Even if 80% Confidence is sufficient to support your decisions, we recommend checking 95-99% Confidence Intervals to see the full extent of the margin of error. For example, an 80% confidence interval might show an effect in the 5% to 15% range (a winning variation), but a 95% confidence interval would show this effect is really in the -15% to 35% range, some possibility of being a losing variation.
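If you want to check the margin of error yourself, a normal-approximation confidence interval for the difference between two conversion rates takes only a few lines (a rough sketch; tools like Abba use more careful methods):

// 95% confidence interval for the difference between two conversion rates
// (normal approximation; reasonable for large samples).
function diffConfidenceInterval(conversionsA, visitorsA, conversionsB, visitorsB) {
  var pA = conversionsA / visitorsA;
  var pB = conversionsB / visitorsB;
  var diff = pB - pA;
  var standardError = Math.sqrt(pA * (1 - pA) / visitorsA + pB * (1 - pB) / visitorsB);
  var margin = 1.96 * standardError;  // 1.96 = z-score for 95% confidence
  return { lower: diff - margin, upper: diff + margin };
}

diffConfidenceInterval(200, 1000, 230, 1000); // roughly -0.006 to +0.066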

Segment with caution for more confidence

Segmenting is very likely to produce false or exaggerated effects and to understate true ones. Each additional analysis you run on the data increases the chance of finding an effect by chance (it inflates your alpha). At the same time, segmenting reduces sample size, which also increases false positive risk, tends to exaggerate effect sizes, and makes it harder to detect true effects (it reduces power). These factors distort comparisons, especially between unequal segments.

Have a hypothesis before you segment to reduce these risks. Avoid the practice of segment-until-you-find.

Consider any significant differences you find between segments only "suggestive", especially if you did not have a hypothesis. Run a separate experiment to confirm any such effect.

If you find a sub-segment in your sample with a different conversion rate, you can ignore it as long as the effect is of similar size and points in the same direction, especially if the segment sizes are greatly unequal. The important thing is that all your segments favour the same variation.

Test For Whole Weeks for greater confidence

Run tests for at least 1 full week (or retest on different days at different times), even if your traffic is high. You should not run tests for hours nor aim to run several tests per day.

Sample size is important but so is duration. Anything can affect user behavior - day of the week, time of day, holidays, weather, a surge of traffic from an unanticipated source. If you want your data to have greater predictive power, results need to hold over time.

If your testing service charges by the visitor, throttle your traffic to 50% or less to make sure you don't exceed your limit. If that is not an issue, you can run a concurrent test on the remaining traffic by adding mutual exclusion criteria to the two tests, to ensure that a visitor may enter one but not both tests.

If you need to run your test longer, make it 2 weeks, 3 weeks, and so on. Try to start and end your tests at the same time on the same day. If you start at 5pm on Tuesday, end at 5pm on Tuesday. Think in whole weeks.

Show same variation when visitors return

Each visitor must only ever see one branch of your test. Tools like VWO take care of showing a random variation to each visitor and showing them the same variation each time they return. If you implement an A/B test manually using JavaScript or some other tool, make sure you show a random variation on the first visit, store that variation ID in a cookie, and show the same variation when the visitor returns.
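A minimal sketch of that logic in plain JavaScript (the cookie name and variation list are just examples):

// Assign a random variation on the first visit, then keep showing the same one.
function getVariation() {
  var match = document.cookie.match(/(?:^|; )abVariation=([^;]+)/);
  if (match) {
    return match[1];  // returning visitor: reuse the stored variation
  }
  var variations = ["control", "variation1"];
  var chosen = variations[Math.floor(Math.random() * variations.length)];
  document.cookie = "abVariation=" + chosen + "; path=/; max-age=" + 60 * 60 * 24 * 90;
  return chosen;
}

var variation = getVariation();
// ...then render or redirect to the page for this variation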

Run split tests even on URLs you can't change

Sometimes you need to run a split test, but you can't create a new page URL for your variations. For example, on a dynamically generated or CMS-based site, a page may have to resolve to a specific URL like www.mysite.com/checkout.

To run a split test, use the same URL for all variations but add a URL parameter to each variation. For example, your variation B will point to www.mysite.com/checkout?v=b. On your back end, check for the URL parameter and then display the right version of the page. If the changes are minor, consider setting up a dynamic A/B test instead.
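The back-end check described above might look like this on a Node/Express-style server (a sketch that assumes Express with a view engine configured; the route and template names are assumptions):

var express = require("express");
var app = express();

// Serve the variation when the split-test parameter is present,
// otherwise the Control, all at the same /checkout URL.
app.get("/checkout", function (req, res) {
  if (req.query.v === "b") {
    res.render("checkout-variation-b");  // assumed template name
  } else {
    res.render("checkout");              // Control template
  }
});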

Update: with a recent VWO update, asterisks are no longer needed. Previously, you had to add asterisks in VWO, making your Control URL *www.mysite.com/checkout and your variation URL www.mysite.com/checkout?v=b* (see this VWO article for details).

Develop in Greasemonkey to speed up dev

If necessary, you can build complex, dynamic A/B tests in VWO (inserting code dynamically into the page instead of doing back-end coding). When doing so, develop in Greasemonkey or another user script tool first, then move the code into the in-browser editor provided by VWO. VWO's editor has advantages but can be finicky and lacks version control.

Develop the JavaScript and CSS in separate files using your favourite coding tool. Then install Greasemonkey and set it to load your JavaScript and CSS dynamically when you visit the test page:

$("head").append("<script src='http://yoursite.com/dev/seating.js'></script>");
$("head").append($("<link>").attr({"rel" : "stylesheet" , "href" : "http://yoursite.com/dev/seating.css"}));

If you're inserting a lot of HTML using JavaScript, use a tool like this HTML to JS converter to create a JavaScript-friendly string you can insert into the DOM.

Once your script and CSS are ready, you can usually copy them straight into VWO's editor.

If you find flickering on page load, then use VWO's "Edit HTML" feature instead of JavaScript for the affected elements, because this tells VWO to hold off showing those elements until they are ready. The disadvantage is the HTML is then locked into VWO's environment instead of residing in one .js file.

Use Heat Maps to corroborate results

Your primary metric is not the only evidence of a positive change in user behavior. Sometimes a change in behavior shows up on a heat map, which tells you where users click or don't click. For one thing, a heat map can show what visitors were interacting with that had an effect. It can also tell you how focused or distracted visitors were.

In one test, featured in Data Story #9, we used the heat map to corroborate our statistically strong finding. We tested a new home page with a simple gradual engagement element: a clear question with several buttons to choose from. We also tested a very minimal and a more complicated variation.

Our hypothesis was that gradual engagement would guide visitors better toward signing up. The heat maps for the other variations showed clicks on the top menu links and all over the page. In contrast, our winner showed clicks on just the component we were testing, with fewer menu clicks and virtually no distracted clicking elsewhere. This was reassuring. We also saw a pattern in the choices visitors were clicking.