What To Do When You Can't AB Test
Will the new search software improve sales conversion? What’s the incremental impact of our new store pickup process on omni-channel sales? Can you find out today?
I’m a data scientist at Best Buy Canada and these are some of the important questions we work to answer to support product development and corporate strategy, in parallel with all the cool machine learning products we build (more on that in a future post).
Why is a data scientist answering questions that a digital analyst could? Of course, AB testing can answer some of these questions, but it is not always possible to run one.
There may be technology or product barriers to implementing full AB test randomization in the time you need to deliver answers. Or there may be a need to test changes in the offline world, as we do at Best Buy with a huge physical retail presence, where customer-level randomization isn't possible.
So at my job, I’ve been using a variety of counterfactual methods to assess these inferential questions, and I’d love to share them with the community.
Beyond AB Testing: Counterfactuals
Counterfactuals are simply ways of comparing what happens given a change versus what would have happened had that change not occurred in the first place. There are a variety of statistical methods for this. Some of the powerful ones we've been using include geo/market-based approaches such as generalized synthetic control, and time series-based approaches such as Google's Causal Impact.
Market Based Approaches
Market-based approaches measure impact by comparing geographic areas where an intervention is introduced, for example a new marketing channel, with control areas where it has not been. Simply comparing raw outcomes between treatment and control regions isn't useful, because the regions aren't directly comparable.
What we need to do is compare our treatment regions with a counterfactual of those same regions had no intervention occurred. In most market-based approaches, we can use our control regions as inputs into a model that predicts the behavior of our treatment regions before the intervention with high accuracy. The difference between the actual outcomes and the model's predictions after the intervention gives us a measure of the true effect of the change on our business.
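As a toy illustration of that counterfactual logic (not the actual GSC model discussed below), here is a minimal Python sketch on simulated data: a treated region's pre-intervention sales are regressed on control regions, and the fitted model then predicts the post-intervention counterfactual. The weights, noise levels, and injected lift are all made-up assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake weekly sales: 5 control regions, 1 treatment region.
n_pre, n_post = 52, 12
controls = rng.normal(100, 5, size=(n_pre + n_post, 5))

# The treated region historically tracks a weighted mix of controls.
true_w = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
treated = controls @ true_w + rng.normal(0, 1, n_pre + n_post)
treated[n_pre:] += 8.0  # the intervention lifts sales in the post period

# Fit weights on the pre-period only. Real synthetic control methods
# add constraints/regularization; plain least squares keeps this short.
w, *_ = np.linalg.lstsq(controls[:n_pre], treated[:n_pre], rcond=None)

# Counterfactual: what the treated region would have done without
# the intervention, predicted from the controls.
counterfactual = controls[n_pre:] @ w
effect = (treated[n_pre:] - counterfactual).mean()
print(round(effect, 2))  # estimate should be close to the injected +8 lift
```

The key point is that the model is trained only on pre-intervention data, so any post-period gap between actual and predicted values is attributed to the intervention.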
One of the main methods we use is the Generalized Synthetic Control (GSC) method, introduced by political scientist Yiqing Xu, who used it to measure the effect of American election day registration laws on voter turnout [1]. It's an evolution of earlier synthetic control frameworks: more robust to differences between regions, and it generalizes to multiple treated units.
This was used to great effect recently to robustly test the effect of a strategic change to Best Buy’s fulfillment channels. The key problem we were trying to solve was measuring the revenue impact across the entire business, online and in-store. We could not accurately randomize users for an AB test with changes happening to the store experience.
For this project, we divided the country into over a hundred large geographic regions with sales mapped to each given the delivery postal code or the store location. Regions larger than single cities were especially key in large metropolitan areas where customers could move between cities in a single day. It’s quite typical for customers to make purchases downtown and pickup in the suburbs.
Of the roughly hundred regions, we picked twenty to switch over to the new fulfillment experience. These treatment regions were selected based on power analysis simulations of different combinations of treatment and control groups, and together were largely representative of the nationwide business including large & mid-sized cities, suburban and rural areas. We often find that the most predictive features are simply sales of other regions that are similar in size and relative proximity such as mid-sized cities in the same province/state.
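The power analysis simulations above can be sketched in Python: repeatedly inject a hypothetical lift into simulated treatment regions and count how often it is detected. The 0.3% lift, sales scale, noise levels, and the simple z-test are illustrative assumptions, not our actual parameters or test.

```python
import numpy as np

rng = np.random.default_rng(42)

def detects_lift(n_weeks=26, n_regions=20, lift=0.003):
    """One simulated experiment: inject a relative lift into the
    treatment regions and check whether a simple z-test flags it."""
    base = rng.normal(1_000, 80, size=(n_weeks, n_regions))
    treated = base * (1 + lift) + rng.normal(0, 20, size=(n_weeks, n_regions))
    control = base + rng.normal(0, 20, size=(n_weeks, n_regions))
    diff = (treated - control).ravel()
    z = diff.mean() / (diff.std(ddof=1) / np.sqrt(diff.size))
    return z > 1.96  # one-sided detection at roughly the 5% level

# Estimated power: share of simulated experiments that detect the lift.
power = np.mean([detects_lift() for _ in range(200)])
print(power)
```

Running this over many candidate treatment/control splits lets you pick the grouping that can detect the smallest effect you care about.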
Once the experiment began to roll out, the great open-source R library for GSC made reporting results to our stakeholders, with impact estimates, confidence intervals, and predicted values, incredibly easy. Example outputs can be seen below (results are based on fake data developed to test the model):
Time-Series Based Approaches
We also use a time-series based approach when the impact being measured can’t be split geographically, and again, when a fully randomized experiment isn’t available. Instead of comparing treatment and control regions, we can compare treatment and control time series.
The main method we use here is Google's Causal Impact [2], and the counterfactual principle is still the core of how we measure effects. Taking several control time series that are not affected by the intervention, we can build a predictive model of the treatment time series and measure the differences in their outcomes.
We used this method recently to test a change in shipping fees that could only be applied to a specific group of products. There were technical limitations to conducting the fully randomized AB test we typically would use, so the key problem was to measure any sales impact on the specific time series that received the new shipping experience.
Given different combinations of products, we built the control group so its time series would be relatively insensitive to changes in the treatment group. We also simulated potential effects with different combinations of treatment and control groups to find the final groupings.
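To make the time-series version concrete, here is a simplified Python sketch on fake data. Unlike Causal Impact's Bayesian structural time-series model, it uses plain least squares on the pre-period, but the counterfactual logic is the same; the series, seasonality, and the injected -25/day fee effect are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fake daily sales: two control product groups and one treatment group.
days, pre = 120, 90  # the fee change lands on day 90
season = 10 * np.sin(2 * np.pi * np.arange(days) / 7)  # weekly cycle
c1 = 500 + season + rng.normal(0, 5, days)
c2 = 300 + 0.5 * season + rng.normal(0, 5, days)
y = 0.6 * c1 + 0.8 * c2 + rng.normal(0, 5, days)
y[pre:] -= 25  # the fee change depresses treated sales

# Model the treated series from the controls, pre-period only.
X = np.column_stack([np.ones(days), c1, c2])
beta, *_ = np.linalg.lstsq(X[:pre], y[:pre], rcond=None)

# Post-period counterfactual and pointwise effect.
pointwise = y[pre:] - X[pre:] @ beta

# Crude uncertainty from pre-period residuals (Causal Impact produces
# proper Bayesian credible intervals instead).
resid_sd = (y[:pre] - X[:pre] @ beta).std(ddof=3)
half_width = 1.96 * resid_sd / np.sqrt(days - pre)
print(round(pointwise.mean(), 1), "+/-", round(half_width, 1))
```

The average pointwise gap estimates the daily sales impact, and its cumulative sum over the post period gives the total revenue effect of the fee change.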
Similar to the aforementioned GSC method, Causal Impact has a great open-source library that we used to produce the statistical results we communicated to our stakeholders. A simulated example of the outputs can be seen below, again based on fake data:
Why Inference Is A Key Part Of The Data Science Toolkit
Much of the focus in the popular press, online discussions, and blogs is on the machine learning and predictive side of data science: subjectively, the "sexy" stuff.
What is often overlooked, though, is the inferential side of data science. It's important to keep in mind the wealth of problems statistics can tackle, especially when predictive models are difficult or impossible to set up. Looking through the statistics literature, as well as fields such as economics, psychology, and biology, there are a ton of powerful methods a data scientist can use to support strategic decision making.
References:
[1] Y. Xu., Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models (2017), Political Analysis.
[2] K. H. Brodersen, F. Gallusser, J. Koehler, N. Remy, and S. L. Scott, Inferring causal impact using Bayesian structural time-series models (2015), The Annals of Applied Statistics.