22 Jan 2020 by Jakub Cwynar
Planning and Running Effective A/B Tests
A few best practices that should greatly improve the quality of your A/B testing and bring fast results to your e-commerce business.
In our previous article on A/B testing, we learned why running experiments is vital to every modern business. Everything boils down to reducing guesswork and making decisions based on hard data. In a data-driven era, A/B testing is the primary tool for optimizing user experience, increasing conversion rates and the business value of decisions. We previously talked about what can be tested but there's still plenty to bear in mind when it comes to the actual design and execution of your experiments. Here are a few best practices that should greatly improve the quality of your A/B testing and bring fast results to your e-commerce business.
A/B Test Design 101
When designing an experiment, we need to ask ourselves these questions:
- What will be tested?
- When will we run the experiment?
- How long should it last?
- How can we measure the impact of changes?
- How do we interpret the results?
Determining what to test
The possible changes to an application are virtually limitless, so deciding what to test can be a daunting task. To make it easier, we can split the types of changes you can A/B test into four basic groups:
- UI changes - modifications of the visual side of an application
- Navigation - for example, changing the flow of your checkout process
- Content - changing the copy and images
- Algorithms - new algorithms like recommendation systems or personalization
We can use our intuition and understanding of our store to decide which elements to test, or we can analyze metrics using tools like Hotjar and Google Analytics to identify bottlenecks and UX issues. Once we see where the problem lies, we need to come up with our own ideas of what changes we might suggest for testing - which is where domain expertise and creativity come to the fore. Tools will tell you what needs tweaking, but you need to come up with your own proposed solution.
The timing and duration of experiments
Timing is crucial. We want to run tests at a time that best reflects our average user experience, so testing at the same time as we launch a new product or big promo event is not recommended as the site will experience an atypical spike in traffic. This does not mean we have to throw away our results, just that we need to collect more data to validate what we learned from the experiment.
There is no single answer as to how long tests should run for, but the timespan strongly depends on two factors:
- The usual traffic in your store
- Our confidence in the results of the test
If the difference between variants is clear and consistent (i.e. results from variant A are markedly different from those from variant B), we don't need much data to conclude that the results are valid. However, if the results for one or more variants look suspicious or deviate widely from the mean, we might need to double-check their accuracy with another test.
Best practices suggest running experiments for more than a week¹, as the results extracted from that data have a lower chance of being influenced by day-to-day fluctuations and we see a wider picture of aggregated customer behavior. However, if you've got a huge international store with tens of thousands of daily visitors from across the world, you may get the results you need in a matter of hours and won't need to leave the test running for a week.
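To get a rough feel for how traffic translates into test duration, a sample size calculation can help. The sketch below assumes a frequentist setup (a two-sided two-proportion z-test) with conventional defaults of 5% significance and 80% power; the function names, the example conversion rates and the traffic figure are illustrative assumptions, and a Bayesian analysis would call for a different approach.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant (two-sided two-proportion z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    pooled = (baseline_rate + expected_rate) / 2
    spread = (
        z_alpha * sqrt(2 * pooled * (1 - pooled))
        + z_beta * sqrt(
            baseline_rate * (1 - baseline_rate)
            + expected_rate * (1 - expected_rate)
        )
    )
    return ceil(spread ** 2 / (expected_rate - baseline_rate) ** 2)


def estimated_days(n_per_variant, daily_visitors, variants=2):
    """Days needed if daily traffic is split evenly between all variants."""
    return ceil(n_per_variant * variants / daily_visitors)


if __name__ == "__main__":
    # Hypothetical store: 5% baseline conversion, hoping to detect a lift to 6%.
    n = sample_size_per_variant(0.05, 0.06)
    print(n, "visitors per variant")
    print(estimated_days(n, daily_visitors=1000), "days at 1,000 visitors per day")
```

The smaller the difference you want to detect, the more visitors you need, which is why low-traffic stores usually need longer tests than high-traffic ones.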
Smaller sites with more limited data interpretation resources should also avoid running multiple experiments simultaneously, as they can interfere with each other. When one changed element affects another, it becomes difficult to pinpoint the exact cause of customer behavior, which can lead to incorrect conclusions. If you are confident about running experiments, you can use the more complex multivariate testing method. But even when running a single test, you can still compare multiple variants of your website and gain clear insight.
Direct and indirect impacts on business
Measuring results is a crucial step in gaining value from our experimentation. Our A/B tests expose different users to alternate versions of our storefront, but we need to define what we measure and interpret the outcome correctly if the results are to improve future performance.
The most generic and broadest approach is to look at revenue - after all, it's the metric that directly corresponds to business value. Even though this is the most obvious choice, we can compare our variants using a range of different behaviors and actions, including:
- Average session duration
- Number of checkouts
- Product details views
- Any Call-To-Action response on our site
Whilst these may not always seem to translate directly into revenue, they can be indicators of long-term business value. For example, longer average session duration and an increased number of checkouts may be a sign of better user experience, indicating that customers want to spend time exploring the store. But beware, using these metrics in isolation could be misleading: higher session duration combined with lower revenue may mean that the UX is actually worse and that users have trouble navigating the storefront. More time is spent on the page, but fewer visitors convert. This is why it makes sense to always back up initial assumptions with new confirmatory tests.
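As a minimal sketch of reading several metrics side by side, the snippet below summarizes sessions per variant. The `Session` record, its field names and the sample numbers are hypothetical and only serve to illustrate the point; real data would come from your analytics tooling.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Session:
    variant: str       # which version of the storefront the user saw
    duration_s: float  # session length in seconds
    revenue: float     # 0.0 when the session ended without a purchase


def summarize(sessions):
    """Return per-variant duration, conversion rate and revenue per session."""
    by_variant = {}
    for session in sessions:
        by_variant.setdefault(session.variant, []).append(session)

    summary = {}
    for variant, group in by_variant.items():
        summary[variant] = {
            "sessions": len(group),
            "avg_duration_s": mean(s.duration_s for s in group),
            "conversion_rate": sum(s.revenue > 0 for s in group) / len(group),
            "revenue_per_session": sum(s.revenue for s in group) / len(group),
        }
    return summary


if __name__ == "__main__":
    # Made-up data: B keeps users longer but earns less per session.
    sessions = [
        Session("A", 60, 0.0), Session("A", 120, 50.0),
        Session("B", 200, 0.0), Session("B", 240, 0.0),
    ]
    for variant, metrics in summarize(sessions).items():
        print(variant, metrics)
```

In this made-up data, variant B wins on session duration but loses on revenue, which is exactly the situation where judging by a single metric would mislead.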
Once your results roll in, how do you decide which variant of your storefront is the best? Analyzing raw numbers might not be enough, as slight fluctuations in user behavior might lead to incorrect interpretations.
There are two approaches that allow us to verify that our results are not skewed by random user behavior: the frequentist method (Null Hypothesis Significance Testing, or NHST) and the Bayesian approach. In Saleor Cloud, we use the latter, as it is much easier to interpret than NHST. Using Bayesian inference, we can ask direct questions like: What is the probability of variant X being better than baseline? or What is the probability that variant X is best overall? In NHST, all we get is the p-value, which is, roughly speaking, the probability of seeing results at least as extreme as ours if there were actually no difference between the variants. This probability is difficult to interpret in terms of A/B testing, and decisions made with NHST might not always be the best.
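To make the Bayesian question concrete, here is a minimal sketch that estimates the probability of one variant having a higher conversion rate than another. It assumes a simple Beta-Binomial model with a uniform Beta(1, 1) prior and Monte Carlo sampling from the standard library; the function name and the numbers are illustrative, and this is not a description of how Saleor Cloud implements its analysis.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Estimate P(rate_B > rate_A) by sampling from Beta posteriors.

    Assumes a uniform Beta(1, 1) prior on each conversion rate, so the
    posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


if __name__ == "__main__":
    # Hypothetical results: 120 of 2,400 visitors converted on A, 150 of 2,400 on B.
    probability = prob_b_beats_a(120, 2400, 150, 2400)
    print(f"P(B is better than A) = {probability:.3f}")
```

The output reads directly as an answer to "how likely is it that B is better than A?", which is the kind of statement a p-value does not provide.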
We should also accept that not every A/B test will give us clear and actionable results. This is to be expected, as not every change is significant and sometimes our original ideas were already good. We can still draw positive conclusions from 'failed' tests; they may not allow us to significantly increase the metrics that we care about, but they show which elements are working or at least prove that our proposed changes were not worth the expenditure in resources and effort.
Our top A/B testing tips
You don't need to remember everything. Just take away these top 5 tips which we believe are the most important elements of running A/B tests:
- Use A/B tests to verify new ideas for your store
- Choose an appropriate result validation method (Bayesian or NHST)
- Run tests for a minimum of a week to achieve trustworthy data
- Avoid testing of overlapping changes
- Don't be discouraged by negative results
A/B tests are an essential part of any modern e-commerce business, and Saleor Cloud is one of the few solutions on the market that lets you unlock their potential straight out of the box. We hope that this short guide will help you plan better experiments and get the most from your store.
If you'd like to learn more about the Bayesian approach to validating results, check out John Kruschke's book Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. For those with a Python background, we recommend the PyMC3 library, as well as the Edward package.
If you want to find out more about Data Science capabilities or start a conversation about bringing data-driven insight to your business with Saleor Cloud, contact us at firstname.lastname@example.org.
¹ Multiple sources suggest periods of more than one week. The recommended length of an experiment for Google Optimize is about 2 weeks, and Invesp suggests a period of 1–2 weeks. To further approximate test duration, a sample size calculation can be helpful, but the exact method depends on the statistical validation approach (frequentist or Bayesian).