The A/B testing culture — a guide about how to design successful online controlled experiments (1/3)

Claudia Chitu
6 min read · Aug 19, 2021

1. To whom is this article addressed and WHY

I wrote this article based on the book “Trustworthy online controlled experiments — A practical guide to A/B testing” by Kohavi, Tang and Xu, probably the most helpful business AND technical book I have read. The motivation is to share the beautiful story of how science plays a vital role in business today more than ever, and to encourage product owners to get closer to engineers so that they have fun while building successful growth for their organizations. At the same time, I add a grain of salt from witnessing the impact of experimentation in a digital product, and share it in an actionable manner so you can spot the parts missing from your current strategy.

The shift in product design has already happened, tilting the balance from intuition and expertise toward experimentation. Moreover, to achieve growth at an organizational level, decisions can increasingly be made through experimentation and by relying mostly on data, not only for developing the product but also for the marketing and engineering units.

However, keep in mind that even though data is objective, the interpretation of data is very subjective and can be misleading if the full context is not presented. This is where experimentation comes into the picture, bridging intuition with objective data and improving the product over time. Additionally, even though experimentation (A/B tests, A/B/n tests, randomized controlled experiments, split tests) is a common term in the business space, it is still not adopted by many startups and scaleups, and is often perceived as one of the biggest challenges when competing in the digital environment.

Highlighting two of the big wins of testing ideas in the online environment: experimentation provides an iterative path for improving over time AND a tangible way to measure business impact, by attaching metrics to ideas.

Going one step further, online controlled experiments are the best way to establish causality with high probability, and they provide a way to detect unexpected changes. Additionally, they are probably the best method to bring in the customer’s voice while combining product ideas with machine learning and statistics.

2. Prerequisites to start online experiments

Experimentation is a critical tool in the digital space, and online controlled experiments are used heavily by the big players such as Airbnb, Amazon, Facebook, Google, LinkedIn, Microsoft, Netflix, Booking.com, Uber, Twitter, etc. So, what do they test? They run thousands of experiments per year, ranging from relevance algorithms (recommendation and ranking) and UI changes to latency/performance, across multiple channels: mobile apps, web and e-mail.

Prerequisites for running online controlled experiments:

1. The organization is committed to making data-driven decisions and has formalized an Overall Evaluation Criterion (OEC), a quantitative measurement of the experiment’s objective. For example, profit is not a good OEC, but LTV (user lifetime value) is a strategically powerful one.

2. The organization is willing to invest in infrastructure and to follow the methodology to ensure the results are trustworthy.

3. The organization is open to accepting that its ideas might not help growth and might even hurt the metrics. A powerful illustration of this point is what Slack’s Director of Product and Lifecycle tweeted about only 30% of monetization experiments showing positive results: “get used to, at best, 70% of your work being thrown away”.

The results of running online controlled experiments, seen through different lenses, can be mapped to the following table (Table 1), which could help you understand the hidden value of A/B test results:

Table 1. The hidden value of A/B tests’ results

3. Running and analyzing experiments

An end-to-end example provided by the authors is to evaluate the impact of adding a coupon code field. The aim is not only to assess the impact on revenue but also to verify the concern that it might distract people from checking out.

  1. Start with formulating a hypothesis: “Adding a coupon code field to the checkout page will degrade the revenue”.
  2. Define success metrics to measure the impact of this change. Revenue itself isn’t a good metric, as it depends on the size of the user groups involved in the experiment. The experiment will contain two variants, A and B, called Control and Treatment. Then a pseudo-random assignment is applied to units (e.g., users) to map them to variants, independently and in a persistent manner (see the sketch after Fig. 1 below). The key recommendation here is to normalize the key metrics by the actual sample size, which makes revenue-per-user a good OEC.
  3. Select which users to include in the experiment: only those who start the purchase process.
Fig. 1 Steps in writing a good Hypothesis
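As a minimal sketch of step 2 (the function names and the 50/50 split are my own assumptions, not from the book), users can be mapped to Control or Treatment by hashing the user id together with an experiment-specific salt, which keeps the assignment pseudo-random yet persistent, and the OEC can then be computed as revenue-per-user within each variant:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str = "coupon-code-exp") -> str:
    """Deterministically map a user to Control or Treatment.

    Hashing the user id together with an experiment-specific salt keeps the
    assignment pseudo-random across users but persistent for each user.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100                    # bucket in [0, 99]
    return "treatment" if bucket < 50 else "control"  # 50/50 split

def revenue_per_user(revenues: list[float]) -> float:
    """OEC normalized by the actual sample size of the variant."""
    return sum(revenues) / len(revenues) if revenues else 0.0

# The same user always lands in the same variant:
print(assign_variant("user-42"))  # e.g. 'treatment'
print(assign_variant("user-42"))  # identical result on every call
```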

Now, go back to refine the hypothesis to: “Adding a coupon code field to the checkout page will degrade the revenue-per-user for users who start the purchase process”.

Before diving into the details of designing, running and analyzing an experiment, it is important to mention sensitivity: the ability to detect statistically significant differences. It can be increased by either allocating more traffic to the Control and Treatment groups or running the experiment longer, so that more users enter the experiment.
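To get a feel for how many users that means, here is a hedged sketch based on the rule of thumb cited by Kohavi et al. of roughly 16·σ²/δ² users per variant for about 80% power at a 0.05 significance level; the standard deviation and minimum detectable change below are made-up numbers for illustration:

```python
def users_per_variant(std_dev: float, min_detectable_delta: float) -> int:
    """Rule-of-thumb sample size per variant (~80% power, alpha = 0.05).

    n ≈ 16 * sigma^2 / delta^2, where delta is the smallest absolute change
    in the mean of the OEC that we care to detect.
    """
    return int(16 * std_dev ** 2 / min_detectable_delta ** 2)

# Illustrative numbers only: revenue-per-user with a $30 standard deviation,
# and we want to detect an absolute change of $0.50.
n = users_per_variant(std_dev=30.0, min_detectable_delta=0.5)
print(f"~{n:,} users needed in Control and ~{n:,} in Treatment")  # ~57,600 each
```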

With experiments, we quantitatively check whether the observed difference between the Treatment and Control samples would be unlikely under the Null hypothesis that the means are the same. If it is unlikely, we reject the Null hypothesis and claim that the difference is statistically significant. To perform this, the p-value is used, with a standard threshold of 0.05: if the treatment truly has no effect, we would wrongly declare a statistically significant difference fewer than 5 times out of 100.
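As a concrete sketch of that comparison on synthetic data (not numbers from the book), a two-sample t-test on revenue-per-user yields the p-value we compare against the 0.05 threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# Synthetic revenue-per-user samples, for illustration only.
control   = rng.exponential(scale=10.0, size=50_000)         # mean ≈ $10
treatment = rng.exponential(scale=10.0, size=50_000) * 0.98  # ~2% lower on average

# Welch's t-test; the Null hypothesis is that the two means are equal.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the Null hypothesis: the difference is statistically significant.")
else:
    print("Cannot reject the Null hypothesis at the 0.05 level.")
```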

Equally important from a business perspective is understanding how big a difference in the treatment matters to us (practical significance). For Google, a 0.2% change is significant, but if you are a startup, you might be looking for changes that improve revenue by 10% or more.

In the design phase, several questions arise; one of the first is usually: how long should the experiment run?

- The longer the experiment runs, the more users enter the groups of the experiment, resulting in increased statistical power.

- Keep in mind that you might have a different population of users on weekends than on weekdays, or that behavior during the weekend is often different. Ensure that the experiment captures weekly cycles; the same applies to holidays and seasonality.

- Is your experiment subject to a novelty effect for any of the features? In other words, is there a risk that clicks on a button will decrease over time? On the other hand, features that require adoption take time to build an adopter base (see the sketch below for a simple way to check the effect over time).
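A simple way to check for a novelty (or adoption) effect, sketched below with hypothetical column names, is to compute the Treatment-minus-Control difference per day and look at how it evolves:

```python
import pandas as pd

# Hypothetical per-user, per-day data; the column names are assumptions.
df = pd.DataFrame({
    "day":     ["2021-08-01"] * 4 + ["2021-08-02"] * 4,
    "variant": ["control", "control", "treatment", "treatment"] * 2,
    "clicks":  [3, 5, 9, 8, 4, 4, 6, 5],
})

# Daily treatment effect: mean clicks in Treatment minus mean clicks in Control.
daily = df.pivot_table(index="day", columns="variant", values="clicks", aggfunc="mean")
daily["effect"] = daily["treatment"] - daily["control"]
print(daily)
# An "effect" column that keeps shrinking over the days hints at a novelty
# effect rather than a durable improvement; a growing one can indicate that
# the feature needs time to build an adopter base.
```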

From results to decisions: do you need to make tradeoffs between different metrics? If revenue goes down and user engagement goes up, should the change be rolled out? As another example, do you think that the hit rate for a search request would be a satisfactory metric, or would you also add a quality-related metric for the served name (here, the correctness of the name)?

In practice, the process works as follows: if the result is neither statistically nor practically significant, abandon the idea or iterate on it. If you are confident about the magnitude of the change but that magnitude is not sufficient to outweigh other factors such as cost, the change may not be worth it. If the result is not statistically significant but likely practically significant, the recommendation is to repeat the test with greater power to gain more precision; the same applies if the result is statistically significant and only likely practically significant. And if the result is both statistically and practically significant, roll it out!
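The decision logic above can be summarized in a small helper; this is my own simplification rather than code from the book, and the 0.05 threshold and confidence-interval inputs are assumptions:

```python
def launch_decision(p_value: float, ci_low: float, ci_high: float,
                    practical_threshold: float) -> str:
    """Combine statistical and practical significance into a recommendation.

    ci_low, ci_high: confidence interval for the change in the OEC.
    practical_threshold: smallest change worth launching (e.g. +1% revenue-per-user).
    """
    statistically_significant = p_value < 0.05
    clearly_practical  = ci_low >= practical_threshold   # whole CI above the bar
    possibly_practical = ci_high >= practical_threshold  # CI reaches the bar

    if statistically_significant and clearly_practical:
        return "Roll it out."
    if statistically_significant and not possibly_practical:
        return "The effect is real but too small to outweigh costs: likely not worth it."
    if possibly_practical:
        return "Imprecise result: repeat the test with greater power."
    return "Abandon the idea or iterate on it."

print(launch_decision(p_value=0.20, ci_low=-0.5, ci_high=3.0, practical_threshold=1.0))
# -> "Imprecise result: repeat the test with greater power."
```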

Next parts will follow soon :)

Fun fact about one of the authors of the book:

Special thanks to Liniker Seixas for a thorough review.
