The A/B testing culture — a guide about how to design successful online controlled experiments (3/3)

Claudia Chitu
5 min read · Oct 12, 2021


This is the last part of my summary of “Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing” by Kohavi, Tang, and Xu, probably the most helpful business AND technical book I have read. I wrote this for you, data people, and for you, product owners and product managers, to help improve the process of building better products that make happier users.

Online controlled experiments are performed on real people, so ethics in controlled experiments and end-user considerations are very important. Why should you care? “Understanding the ethics of experiments is critical for everyone, from leadership to engineers to product managers to data scientists; all should be informed and mindful of the ethical considerations”. How sensitive is the data? What is the re-identification risk of individuals from the data? These are only two of the questions we should ask ourselves. I will pause and share a related fact here as food for thought: Facebook and Cornell researchers ran a study on emotional contagion via social media, observing whether randomly selected participants exposed to slightly more negative posts posted more negative content a week later, and conversely, whether randomly selected participants exposed to slightly more positive posts posted more positive content a week later.

Running experiments requires maturity that develops over time and implies continuous learning and innovation. Therefore, we need to include human judgement, user experience research (e.g., diary studies), log-based analysis, and so on. Log-based analysis helps build intuition (for instance, by answering questions such as: How do the distributions shift over time? How does engagement evolve over time?) and generates ideas for A/B tests grounded in the underlying data.
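
For instance, here is a minimal pandas sketch of the kind of log-based exploration the authors describe, assuming a hypothetical event log with `user_id`, `timestamp`, and `event` columns (the data and column names are my own, not from the book):

```python
import pandas as pd

# Hypothetical raw event log: one row per user action.
logs = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 2, 3],
    "timestamp": pd.to_datetime([
        "2021-09-01", "2021-09-08", "2021-09-01",
        "2021-09-02", "2021-09-09", "2021-09-08",
    ]),
    "event": ["click", "click", "search", "click", "search", "click"],
})

# Engagement over time: events per active user, per week.
weekly = (
    logs.set_index("timestamp")
        .groupby([pd.Grouper(freq="W"), "user_id"])
        .size()
        .rename("events")
        .reset_index()
)
engagement = weekly.groupby("timestamp")["events"].agg(["mean", "median", "count"])
print(engagement)

# Distribution shift: how per-user event counts change week over week.
print(weekly.groupby("timestamp")["events"].describe())
```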

From a more practical point of view, this section presents some key points of instrumentation in the context of experimentation. What first comes to mind is client-side versus server-side instrumentation. Among the drawbacks of client-side instrumentation, the ones that probably impact the user experience the most are significant CPU cycle utilization and network bandwidth consumption. Another issue, one that I often encountered while working with data, is that the client clock can be changed, manually or automatically.
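
As an illustration of the clock problem, here is a small sketch (the field names and the five-minute tolerance are my own assumptions, not from the book) that stamps incoming events with server time and flags suspicious client clocks:

```python
from datetime import datetime, timezone, timedelta

MAX_SKEW = timedelta(minutes=5)  # tolerance; an assumption for illustration

def ingest_event(event: dict) -> dict:
    """Stamp an incoming client event with server time and flag clock skew."""
    server_time = datetime.now(timezone.utc)
    client_time = datetime.fromisoformat(event["client_timestamp"])
    event["server_timestamp"] = server_time.isoformat()
    # Client clocks can be changed manually or drift automatically; never
    # trust them for ordering or sessionization without a sanity check.
    event["clock_skew_s"] = (server_time - client_time).total_seconds()
    event["suspect_clock"] = abs(server_time - client_time) > MAX_SKEW
    return event

event = ingest_event({"user_id": 42,
                      "client_timestamp": "2021-10-12T09:00:00+00:00"})
print(event["suspect_clock"], round(event["clock_skew_s"], 1))
```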

To build a healthy culture of instrumentation, the authors of the book propose some tips:

  • Don’t ship anything without instrumentation
  • Invest in testing instrumentation during development
  • Monitor the raw logs for quality (ensure there are tools to detect outliers on key observations and metrics); creating this habit will also help in other scenarios, and perhaps the data engineering team can take part in this too (see the sketch after this list)
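
As a concrete illustration of the third tip, here is a minimal outlier check on daily log volume using a robust median-based score (the threshold and the counts are hypothetical):

```python
import statistics

def flag_outliers(daily_counts: dict[str, int], threshold: float = 3.0) -> list[str]:
    """Flag days whose log volume deviates strongly from the median.

    Uses a robust score based on the median absolute deviation, so a
    single bad day does not mask itself by inflating the mean.
    """
    values = list(daily_counts.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1.0
    return [
        day for day, v in daily_counts.items()
        if abs(v - med) / mad > threshold
    ]

# Hypothetical daily event counts from the raw logs.
counts = {"Mon": 10_200, "Tue": 9_950, "Wed": 10_480, "Thu": 2_100, "Fri": 10_050}
print(flag_outliers(counts))  # ['Thu'] -- a likely logging outage
```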

It is common practice to run experiments with a traffic allocation that provides enough statistical power. A new feature is first exposed as the Treatment to only a small percentage of users; if the metrics look reasonable and the system scales well, it is then safe to expose more users to the Treatment. To further mitigate the risk of ramping, the rollout can also be staged by geographical area, for instance.
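
To make “enough statistical power” concrete, here is a standard sample-size calculation for a conversion-rate experiment (my own sketch, not code from the book), based on a two-sided z-test for two proportions:

```python
from scipy.stats import norm

def sample_size_per_arm(p_base: float, mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per variant to detect an absolute lift `mde` over a
    baseline conversion rate `p_base` with a two-sided z-test."""
    p_alt = p_base + mde
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the test
    z_beta = norm.ppf(power)            # quantile for the desired power
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    n = variance * (z_alpha + z_beta) ** 2 / mde ** 2
    return int(n) + 1

# E.g. baseline 5% conversion, detect an absolute +0.5% lift:
print(sample_size_per_arm(0.05, 0.005))  # roughly 31,000 users per arm
```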

By measuring and monitoring progress, one can find the right pace of ramping. Ramping too slowly wastes time and resources; ramping too quickly may hurt users and also risks suboptimal decisions.

A good example of planning and executing risk mitigation (trading off speed against risk) is to create “rings” of testing populations and gradually expose the Treatment to successive rings (a minimal sketch follows the list). Commonly used rings are:

  • Whitelisted individuals (team members)
  • Company employees
  • Beta users
  • Users from different markets
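
Here is a minimal sketch of ring assignment; the ring membership sets, function name, and hashing scheme are my own illustration (the book describes the concept, not this code):

```python
import hashlib

# Hypothetical ring definitions, ordered from lowest to highest risk.
WHITELIST = {"alice", "bob"}         # whitelisted team members
EMPLOYEE_IDS = {"emp_1", "emp_2"}    # company employees
BETA_USERS = {"beta_7", "beta_9"}    # opted-in beta users

def ring_for(user_id: str, rollout_pct: float) -> str:
    """Assign a user to the innermost ring they belong to."""
    if user_id in WHITELIST:
        return "whitelist"
    if user_id in EMPLOYEE_IDS:
        return "employees"
    if user_id in BETA_USERS:
        return "beta"
    # General population: hash the id so the bucket is sticky across
    # sessions, then ramp up by percentage.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < rollout_pct else "control"

print(ring_for("alice", rollout_pct=5))     # whitelist
print(ring_for("user_123", rollout_pct=5))  # control or treatment, sticky
```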

Consider that Bing, LinkedIn, and Google all process terabytes of experiment data daily, and delays in experiment scorecard generation mean delays in decision making. That is why a near-real-time path that processes the raw logs, triggers alerts, and automatically shuts off experiments can be highly beneficial.
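
Here is a sketch of what such an automatic shut-off might look like; `fetch_recent_metrics` and `shut_off_experiment` are hypothetical hooks into your own streaming pipeline and experimentation platform, and the guardrail thresholds are illustrative:

```python
# Guardrail metrics where higher values are worse, with the maximum
# relative increase tolerated before automatic shut-off (assumptions).
GUARDRAILS = {
    "page_load_time_ms": 0.05,   # Treatment may not be >5% slower
    "error_rate": 0.10,          # nor raise errors by >10%
}

def check_guardrails(experiment_id, fetch_recent_metrics, shut_off_experiment):
    """Compare Treatment vs Control on guardrails; shut off on breach."""
    metrics = fetch_recent_metrics(experiment_id)  # e.g. last 15 min of logs
    for name, max_increase in GUARDRAILS.items():
        control = metrics[name]["control"]
        treatment = metrics[name]["treatment"]
        degradation = (treatment - control) / control
        if degradation > max_increase:
            shut_off_experiment(experiment_id,
                                reason=f"{name} degraded {degradation:.1%}")
            return False
    return True

# Stubbed usage with fake near-real-time numbers:
stub_metrics = lambda _: {
    "page_load_time_ms": {"control": 420, "treatment": 470},
    "error_rate": {"control": 0.010, "treatment": 0.0102},
}
check_guardrails("exp_42", stub_metrics,
                 lambda eid, reason: print(f"Shutting off {eid}: {reason}"))
```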

The ultimate goal is to visualize a summary and highlight key metrics in a way that is accessible to people with various technical backgrounds, from marketers to data scientists, engineers, and product managers. Allow individuals to subscribe to the metrics they care about and receive an email digest, or hide technical metrics, such as debugging metrics, from less technical audiences. You can categorize the metrics by tier or function (a small sketch follows the lists below); for instance, LinkedIn categorizes metrics into three tiers:

  1. Company wide
  2. Product specific
  3. Feature specific

Microsoft groups metrics into:

  1. Data quality
  2. Overall Evaluation Criterion
  3. Guardrail
  4. Local features
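
As a rough illustration (the metric names and roles are invented, not from the book), such a taxonomy can be encoded as a simple registry that drives what each audience sees in the scorecard:

```python
# Hypothetical metric registry, tiered the way LinkedIn tiers its metrics.
METRIC_TIERS = {
    "company_wide":     ["daily_active_users", "revenue"],
    "product_specific": ["feed_sessions", "messages_sent"],
    "feature_specific": ["reaction_clicks", "debug_render_time_ms"],
}

# Which tiers each audience is shown; debugging-level metrics stay
# hidden from the less technical roles.
AUDIENCE_TIERS = {
    "marketer": ["company_wide"],
    "product_manager": ["company_wide", "product_specific"],
    "engineer": list(METRIC_TIERS),  # sees everything
}

def metrics_for(role: str) -> list[str]:
    return [m for tier in AUDIENCE_TIERS[role] for m in METRIC_TIERS[tier]]

print(metrics_for("product_manager"))
```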

Most of the time, the Treatment effect measured over a short timeframe is sufficient for understanding the effects, provided it is stable and generalizes to the long-term Treatment effect. But there are cases when the long-term effect differs from the short-term one. For example, raising prices increases short-term revenue, but over the longer term users may renew or purchase less and less, and overall revenue will be negatively impacted. Similarly, showing more ads, including more low-quality ads, could in the long term lead to decreased revenue and even fewer searches. Network effects are another example: it takes a while for a feature to reach its full effect as it propagates through the users’ network.

The most popular approach to measuring long-term effects is to keep an experiment running for a long time (perhaps six months to a year), measure the effect at the beginning of the experiment (the first week or two), then measure it again at the end (the last week); with these results, the analysis also includes the average effect over the entire Treatment period. An alternative is to create a holdout group and compare against it (most analytics tools offer this option, e.g., CleverTap).
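
Here is a minimal sketch of the first approach, assuming a hypothetical file `experiment_daily.csv` with per-user daily rows (columns `date`, `user_id`, `variant`, `metric`, where `variant` is "control" or "treatment"):

```python
import pandas as pd

# Hypothetical per-user daily metric for a long-running experiment.
df = pd.read_csv("experiment_daily.csv", parse_dates=["date"])

start, end = df["date"].min(), df["date"].max()

def effect(window: pd.DataFrame) -> float:
    """Average Treatment-minus-Control difference within a time window."""
    means = window.groupby("variant")["metric"].mean()
    return means["treatment"] - means["control"]

first_two_weeks = df[df["date"] < start + pd.Timedelta(weeks=2)]
last_week = df[df["date"] >= end - pd.Timedelta(weeks=1)]

print("short-term effect:", effect(first_two_weeks))
print("long-term effect: ", effect(last_week))
print("average effect:   ", effect(df))  # over the whole Treatment period
```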

With these lessons in mind, ready to start the journey into the experimentation world, the reader can now go through the last chapters of the book, which present the statistics behind online controlled experiments: the two-sample t-test, the p-value (a measure of the probability that an observed difference could have occurred just by random chance) and confidence interval, the normality assumption (under the null hypothesis, the standardized statistic has mean 0 and variance 1), Type I/II errors and power, variance estimation and improved sensitivity, the A/A test, and leakage and interference between variants (there may be indirect connections, as in the Airbnb marketplace: if the conversion rate for Treatment users goes up, resulting in more bookings, it naturally leaves less inventory for Control users).
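
To close, here is a self-contained example of the core machinery on simulated data (my own sketch): Welch's two-sample t-test plus a normal-approximation confidence interval for the difference in means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# Simulated per-user metric (e.g. sessions) for Control and Treatment.
control = rng.normal(loc=10.0, scale=3.0, size=5_000)
treatment = rng.normal(loc=10.2, scale=3.0, size=5_000)

# Two-sample t-test (Welch's variant, not assuming equal variances).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# 95% confidence interval for the difference in means (normal approximation).
delta = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment)
             + control.var(ddof=1) / len(control))
ci = (delta - 1.96 * se, delta + 1.96 * se)

print(f"delta={delta:.3f}, t={t_stat:.2f}, p={p_value:.4f}, "
      f"95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```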

With Ethics, Instrumentation, Key Metrics, and Measuring Long-Term Effects, plus a couple of personal examples, I end my summary of a book that I will always come back to when working with online controlled experiments.


Claudia Chitu

Hi! This is Claudia, data strategist and data science evangelist! I love working on changing organizational cultures to make data-driven decisions.