The A/B testing culture — a guide about how to design successful online controlled experiments (2/3)

Claudia Chitu
6 min read · Sep 15, 2021

1. To whom is this article addressed and WHY

I wrote this article after reading the book “Trustworthy Online Controlled Experiments — A Practical Guide to A/B Testing” by Kohavi, Tang and Xu, probably the most helpful business AND technical book I have read. The motivation is to share the beautiful story of how science plays a vital role in business today more than ever, and to encourage product owners to get closer to engineers so that they have fun while building successful growth for their organizations. At the same time, I share some personal points of view from my book summary about the impact of experimentation on a digital product, in an actionable manner, so you can pick out the parts missing from your current strategy (for a deep dive into any of the sections I present in this article, please read the book).

This article covers how to avoid experimentation design mistakes, the experimentation platform and culture (including organizational metrics), and institutional memory.

2. How to avoid design experimentation mistakes

Experience tells us that many surprisingly positive results are more likely to be the result of an instrumentation error or a human error. For instance, enthusiasm about an increase in the number of purchases following a price increase in a new market might be explained by a human error in the price setting, such as the price being incorrectly set n(!) times smaller than in the other markets (especially if the decimal separator was simply shifted). To increase trust in an experiment’s results, I recommend creating a checklist, building best practices to verify and diagnose whether something may be wrong with the results, and keeping a log of the learnings and mistakes found in former experiments. On a more granular level, run analyses by different dimensions (country, platform, time of the week, user type — new or returning) to spot insights that could lead to discoveries, as in the sketch below.
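To make this granular breakdown concrete, here is a minimal sketch in Python, assuming a pandas DataFrame of per-user results; the column names (variant, converted, country, platform, user_type) are my own illustration, not the book's:

```python
import pandas as pd

# Hypothetical per-user results; the schema (variant, converted, country,
# platform, user_type) is illustrative, not prescribed by the book.
results = pd.read_csv("experiment_results.csv")

def rates_by_dimension(df: pd.DataFrame, dimension: str) -> pd.DataFrame:
    """Conversion rate per variant within each segment, plus the absolute lift."""
    rates = (
        df.groupby([dimension, "variant"])["converted"]
          .mean()
          .unstack("variant")
    )
    rates["lift"] = rates["treatment"] - rates["control"]
    return rates

# A segment whose lift diverges sharply from the overall lift deserves a
# closer look -- it may be an insight, or an instrumentation bug.
for dim in ["country", "platform", "user_type"]:
    print(rates_by_dimension(results, dim), "\n")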

These techniques will keep you away from human errors, but there is another angle worth mentioning here: stay away from what is called survivorship bias, one of the most frequent mistakes when designing experiments. The term goes back to World War II, when engineers debated whether to add armor to bombers in the places where the most damage was observed. Abraham Wald pointed out that those were the worst places to reinforce, since the planes hit elsewhere never made it back to be inspected.

Another important point to consider when performing A/B tests is the novelty effect. A newly released feature initially attracts many users, but unfortunately the lift is mostly not sustained. One way to catch such situations is to take the user cohorts from the first days and plot the treatment effect for them over time, as in the sketch below.
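As a minimal sketch of such a plot, assuming event-level data with hypothetical columns (date, first_seen, variant, metric) of my own choosing:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical event-level data; the schema is illustrative.
df = pd.read_csv("experiment_events.csv", parse_dates=["date", "first_seen"])

# Restrict to the cohort that entered the experiment in its first 3 days.
start = df["date"].min()
cohort = df[df["first_seen"] <= start + pd.Timedelta(days=3)]

# Daily mean metric per variant; a treatment effect that decays toward zero
# over time is a hint of a novelty effect.
daily = cohort.groupby(["date", "variant"])["metric"].mean().unstack("variant")
effect = daily["treatment"] - daily["control"]

effect.plot(title="Daily treatment effect for the early cohort")
plt.axhline(0, color="gray", linestyle="--")
plt.ylabel("treatment - control")
plt.show()
```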

3. The Experimentation platform and Culture

This is one of my favorite parts, as it outlines the phases that help embed the process of experimentation into the organizational culture. The experimentation maturity model is presented in Fig. 1, and its phases are described as follows:

  1. CRAWL: in this phase, the efforts are focused on building the basic data science capabilities, such as computing the summary statistics needed for hypothesis testing.
  2. WALK: what is different here is that the goal moves to defining standard metrics and running A/A tests to validate the instrumentation.
  3. RUN: this is when experiments are performed at a large scale and tradeoffs between multiple metrics are codified.
  4. FLY: in this last phase of the maturity model, the company runs an A/B test for every change, and the operational business units analyze most experiments without the help of data scientists. This is the phase when institutional memory is established.
Fig. 1 — The experimentation maturity model. Each company lives through these phases while building the experimentation platform and consolidating the culture and values

As a high-level picture, in the Crawl phase an organization runs an experiment about once per month, while in each of the next phases the frequency increases by 4–5x, reaching a rate of thousands of experiments per year in the last maturity phase.

This has a massive impact on the team itself, and you will notice that the team setup, including the leadership and the processes, shifts as it approaches a more mature phase. Reaching the last phase of the model requires a multidimensional strategy to get buy-in from executives. Here I will list two of these dimensions:

- Setting goals in terms of % improvements in KPIs, instead of bullet points about shipping feature X and Y. The shift in thinking is that a feature should ship only if it improves the metrics (how much of an improvement is another discussion, but the short answer is that it depends on the scale of the business).

- Establishing a culture of failing fast.

How do you share the knowledge and get buy-in from executives? The learnings from experiments are crucial for developing the institutional memory in the Fly phase. Sharing can be done through geek lunches, regular blog posts, a wiki, a Twitter feed, etc.

There is a question that always pops up sooner or later, whether you join a conference or a meetup: can an external platform provide the functionality you need? Many third-party solutions are not versatile enough to cover the different types of experiments you need to run (front-end, back-end, server-side vs. client-side, etc.) with all the metrics, and to give you access to the data. Read that again! You also need to consider whether you can integrate additional data sources and whether there are tools to reconcile results when summary statistics diverge.

As we discuss the experimentation platform, one example in a nutshell would be implementing an experiment on a website. This implies the following (a minimal code sketch follows the list):

1. A randomization algorithm — a function that maps end users to a variant (pseudo-randomization with caching or a good hash function — MD5, SHA256)

2. An assignment method — determines which experience each user will see on the web (client-side, server-side assignment etc.)

3. A Data path — captures raw observations as users interact with the website, aggregates them, determines statistical significance using a statistical test, and prepares reports of the experiment’s outcome
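Here is the promised sketch for point 1, a deterministic hash-based assignment. The function and its details are my own illustration, not the book's code:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant with SHA-256.

    Hashing user_id together with experiment_id keeps the assignment
    stable per user but independent across experiments.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# The same user always lands in the same variant for a given experiment:
assert assign_variant("user-42", "new-checkout") == assign_variant("user-42", "new-checkout")
```

Because the assignment is a pure function of the user and experiment IDs, there is no need to store per-user assignments, and each experiment gets an independent split.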

Then the processed data is ready, and the goal is to summarize it into key metrics that guide decision makers toward a launch/no-launch decision. In the Run and Fly phases there can be thousands of metrics, and you might want to group them by tier (companywide, product-specific, feature-specific). One way of clustering the metrics is illustrated in Fig. 2.

Fig. 2 — Metrics taxonomy
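To make the launch/no-launch step concrete, here is a minimal sketch of the kind of statistical test the data path might run on a conversion metric; the use of statsmodels and the made-up counts are my assumptions, not the book's:

```python
from statsmodels.stats.proportion import proportions_ztest

# Made-up aggregates: conversions and sample sizes per variant.
conversions = [530, 584]        # control, treatment
samples     = [10_000, 10_000]

# Two-proportion z-test on the conversion rates.
z_stat, p_value = proportions_ztest(count=conversions, nobs=samples)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# A p-value below the pre-registered threshold (commonly 0.05) supports a
# launch decision -- assuming the guardrail metrics also look healthy.
```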

Remember that metrics are proxies, and they have their own failure modes; sometimes it might be easier to measure what you don’t want, such as user dissatisfaction.

4. Institutional memory and meta-analysis

The authors of the book present the idea of keeping a digital journal of all changes made through experimentation, including descriptions, creatives, and key results. Additionally, it is crucial to capture meta information on each experiment, such as who the owners/contributors are and how much impact the experiment had on various metrics. Why is this important? See Tab. 1. Again, this can be stored in a wiki space, Confluence, a dashboard, or a documentation repository, in an easy-to-access manner with a summary and the learnings highlighted.

Tab. 1 — Why you should keep a well-structured log of your experiments
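As a minimal sketch of what one entry in such a journal could capture (the field names are illustrative, not from the book):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ExperimentRecord:
    """One entry in the institutional-memory journal; fields are illustrative."""
    name: str
    owners: list[str]
    start: date
    end: date
    hypothesis: str
    key_results: dict[str, float]   # metric name -> relative change
    learnings: str
    launched: bool

record = ExperimentRecord(
    name="new-checkout-flow",
    owners=["alice", "bob"],
    start=date(2021, 8, 1),
    end=date(2021, 8, 15),
    hypothesis="A one-page checkout increases purchase completion.",
    key_results={"purchase_rate": +0.012, "latency_p95_ms": +3.0},
    learnings="Effect concentrated on mobile; desktop flat.",
    launched=True,
)
```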

This article continues the first one in this series (which can be found here) and sheds some light on how to avoid experimentation design mistakes and how to build the experimentation platform and culture, touching a little on the metrics taxonomy and concluding with the institutional memory section.

The last part will follow soon :)

Bonus: Check this resource about ideas for A/B tests.

