**Overview**

WhatCounts estimates lift using split (i.e., A/B^{1}) testing. In marketing, split testing is often used to determine whether two alternative approaches differ in effectiveness. For example, an email marketer may wonder whether emails that include a large image have higher click-through rates than those with a smaller image.

Another classic example of a split test is a medical trial in which study participants are chosen to receive either a treatment dose of a proposed medication or a placebo. The difference in outcomes between the treatment and placebo groups at the end of the study is attributed to the effectiveness of the potential drug.

The ultimate goal of split group testing is to compare groups that are similar on all attributes *except for the one under study,* using inferential statistics. The sections below describe how WhatCounts achieves this goal in our lift estimation framework.

**Observed Response and Counterfactuals**

In a perfect experimental design, we would be able to determine the effectiveness of a treatment by observing every study participant twice: both under treatment and without treatment. Of course, this is impossible; once someone is treated, they cannot be untreated again. Since participants can't be both treated and untreated at the same time, split tests divide participants into two groups that are as similar as possible in every way. With this method we obtain a treatment group that provides us with data for those under treatment, and a control group that provides us with counterfactual evidence. In other words, although we can't observe what would have happened to the treatment group had they not been treated, we can identify a group of people similar to those we plan to treat who are, in aggregate, expected to behave just like the treated group would have without treatment. Because the members of the two groups are alike except for the treatment protocol, we assume that any changes we observe in the treated group are due to the treatment effect.

In the WhatCounts framework, automators define the treatment protocol. The treated group consists of those customers being sent an automator's emails. Once we treat a customer, hoping that our email will move her to make an order that she would not otherwise have made, or would have made with a different provider, we cannot tell how she would have behaved had we not treated her. Instead, we approximate it with a similar control case that is eligible to receive treatment but does not.

**Randomization**

As is clear from the discussion above, splitting customers into treatment and control groups is a key component of split test designs. Split testing is built on the assumption that the treatment and control groups are indistinguishable from each other on any attribute relevant to the effectiveness of the treatment. This property is referred to as *ignorability of exogenous^{2} factors.* If it does not hold, the results are biased and inaccurate.

The secret to creating treatment and control groups that are not biased on any exogenous attribute is random assignment. With appropriate randomization and large enough sample sizes, treatment and control cases will be the same in aggregate.

The proportion of cases randomly assigned to treatment vs. control does not impact the overall similarity of the groups and can therefore be based on other needs. Generally, a 50/50 split provides the most statistical power, but when the treatment is expected to be particularly helpful or harmful, other proportions can be used. By default, we use a 90/10 treated/control split to minimize the loss of revenue resulting from the control group not receiving WhatCounts emails.
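The 90/10 random assignment described above can be sketched in a few lines. This is a minimal illustration, not the production assignment logic; the function name, seed, and customer IDs are hypothetical.

```python
import random

def assign_group(customer_ids, treatment_share=0.9, seed=42):
    """Randomly assign each customer to 'treatment' or 'control'.

    With a large enough sample, an independent uniform draw per
    customer leaves the two groups indistinguishable, in aggregate,
    on all exogenous attributes.
    """
    rng = random.Random(seed)
    return {
        cid: "treatment" if rng.random() < treatment_share else "control"
        for cid in customer_ids
    }

groups = assign_group(range(10_000))
n_treated = sum(1 for g in groups.values() if g == "treatment")
# With 10,000 customers the realized split lands close to 9,000 / 1,000.
```

Note that the realized split is itself random: each customer is assigned independently, so the treated count varies around 90% rather than being exactly 9,000.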

##### Overlapping Automator Eligibility

Customers may be eligible for multiple automators at the same time. For example, a customer may be eligible for treatment from a Best Customer automator and a Cart Recovery automator concurrently. Because customers are assigned to treatment or control groups at the automator level and as they become eligible, a customer may be a member of the control group for one automator and the treatment group for another. This preserves the ignorability of exogenous factors for individual automator analyses, but complicates the estimation of an overall effect. This is discussed more below.

**Analysis**

The second phase of lift estimation is the analysis phase. Having randomly split customers into treatment and control cases and observed those groups for a given period of time, we apply inferential statistics to the task of determining whether we have a treatment effect. Our null hypothesis is that there is no difference in the revenue per customer (RPC) for members of the treatment and control groups at the end of the observation period. Our corresponding alternative hypothesis is that the two groups are not equal at the end of the period^{3}:

H0: RPC_treatment = RPC_control

H1: RPC_treatment ≠ RPC_control

There are multiple ways to test the null hypothesis outlined above. For example, one might choose a regression model with a binary treatment indicator or an aggregate comparison of means. Because customers can be members of the treatment group for one automator and the control group for another, the latter approach works better for computing an overall treatment effect. Since the methods are otherwise interchangeable, we also use it for the individual automator estimates.

##### Computing the Effect

A handful of acronyms and terms will be useful for the following methodological discussion:

- RPC = Average revenue per customer
- TR = Total revenue during the observation period
- N = Number of customers
- ATET = Average treatment effect on the treated
- Revenue Base = Total expected revenue, absent the treatment
- Revenue Lift = Additional revenue attributable to the treatment
- Percent Lift = The percent of revenue attributable to the treatment

The mathematical definition of average revenue per customer (RPC) is straightforward; it is the total revenue (TR) for a given group divided by the total number of customers (N) in that group, i.e.,

RPC = TR / N

RPC is calculated for both the treatment and control groups, once per automator and once for the overall estimate. For automator-level estimates, the control and treated Ns are simply the number of individuals assigned to each group. For overall estimates, customers can make fractional contributions to the control and treatment groups. That is, a customer who is in the treatment group for one automator and the control group for another contributes 0.5 to the total overall treatment N and 0.5 to the total control N.
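The fractional-N bookkeeping above can be sketched as follows. The function and variable names are illustrative, and one detail is an assumption on my part: the source specifies fractional contributions to each group's N but not how a customer's revenue is apportioned, so this sketch splits revenue across groups in the same proportions.

```python
from collections import defaultdict

def overall_rpc(assignments, revenue):
    """Overall treatment and control RPC with fractional membership.

    `assignments` maps customer -> list of group labels, one label per
    automator the customer was eligible for.  A customer assigned to
    treatment for one automator and control for another contributes
    0.5 to each group's N.  Revenue is split across groups in the same
    proportions (an assumption; the source does not specify this).
    """
    totals = defaultdict(float)   # group -> total revenue (TR)
    counts = defaultdict(float)   # group -> fractional N
    for cust, labels in assignments.items():
        share = 1.0 / len(labels)
        for label in labels:
            counts[label] += share
            totals[label] += share * revenue.get(cust, 0.0)
    # RPC = TR / N for each group
    return {g: totals[g] / counts[g] for g in counts}

rpc = overall_rpc(
    {"a": ["treatment"], "b": ["treatment", "control"], "c": ["control"]},
    {"a": 100.0, "b": 40.0, "c": 10.0},
)
# Customer "b" contributes 0.5 and $20 to each group:
# treatment RPC = (100 + 20) / 1.5 = 80; control RPC = (20 + 10) / 1.5 = 20.
```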

The two primary metrics of concern for our analysis are *Revenue Lift* and *Percent Lift.* These describe the amount of additional revenue created by WhatCounts in terms of both raw dollars and as a percent of the expected base revenue, absent any WhatCounts interventions. Revenue lift is the average treatment effect on the treated (ATET) in dollars, i.e.,

ATET = RPC_treatment − RPC_control

multiplied by the number of treated customers. The revenue base is estimated by multiplying the control RPC by the total number of treatment-eligible customers during the observation period. Percent lift, then, is the ratio of additional dollars (revenue lift) to the amount expected without any WhatCounts interventions (revenue base) multiplied by 100:

Percent Lift = 100 × (Revenue Lift / Revenue Base)

We express *lift* as a rounded percentage for a quick and ready idea of the relative difference that an automator makes, but the revenue lift is often the more important figure. An automator with a small lift that affects a large revenue base is more interesting than one that shows a large percent lift but moves few actual dollars.
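The lift arithmetic described above amounts to three lines of algebra. A worked sketch, with hypothetical function and argument names and made-up input figures:

```python
def lift_metrics(rpc_treated, rpc_control, n_treated, n_eligible):
    """Revenue lift, revenue base, and percent lift.

    ATET in dollars is the treated RPC minus the control RPC; revenue
    lift scales it by the number of treated customers.  The revenue
    base uses the control RPC as the counterfactual for every
    treatment-eligible customer.
    """
    atet = rpc_treated - rpc_control
    revenue_lift = atet * n_treated
    revenue_base = rpc_control * n_eligible
    percent_lift = 100.0 * revenue_lift / revenue_base
    return revenue_lift, revenue_base, percent_lift

lift, base, pct = lift_metrics(
    rpc_treated=55.0, rpc_control=50.0, n_treated=9_000, n_eligible=10_000
)
# lift = 5.0 * 9,000 = 45,000; base = 500,000; pct = 9.0
```

The example also illustrates the point about magnitudes: a 9% lift on a $500,000 base moves far more dollars than a 50% lift on a $10,000 base would.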

##### Statistical Significance

The calculations above provide a point estimate of the additional revenue attributable to WhatCounts' efforts. We don't yet know whether this is likely to be a true effect or whether it may be attributable to random variation in purchasing behavior. To be confident that we're showing a statistically robust effect, we need to obtain standard errors and 95% confidence intervals for the treatment effect.

Again, estimating a simple regression with a binary predictor in our favorite statistical package would provide us with appropriate standard errors and would work fine for the by-automator approach. However, the overlap in the treatment and control groups in the overall lift calculation prevents that approach. Therefore we manually calculate the standard error for all estimators as^{4}:

SE = sqrt( MSE × (1/N_treatment + 1/N_control) )

where MSE, the Mean Squared Error, is defined as:

MSE = SSE / (N_treatment + N_control − 2)

with SSE the sum of squared deviations of each customer's revenue from their group's mean.

Using this approach we obtain standard errors that can be multiplied by the appropriate critical value, +/- 1.96 in this case, and added to the treatment effect to obtain upper and lower confidence interval bounds. The effect is statistically significant if both the upper and lower confidence interval bounds have the same sign.
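The interval construction above can be sketched end to end, assuming the pooled-MSE standard error described in this section (the function name and the example revenue figures are hypothetical):

```python
import math

def treatment_ci(treated, control, z=1.96):
    """Treatment effect with a 95% confidence interval.

    Pools within-group squared deviations into a mean squared error
    (denominator N - 2, as in a regression with a binary treatment
    indicator), forms the standard error for a difference of means,
    and flags significance when both interval bounds share a sign.
    """
    n_t, n_c = len(treated), len(control)
    mean_t = sum(treated) / n_t
    mean_c = sum(control) / n_c
    sse = sum((y - mean_t) ** 2 for y in treated) + \
          sum((y - mean_c) ** 2 for y in control)
    mse = sse / (n_t + n_c - 2)
    se = math.sqrt(mse * (1 / n_t + 1 / n_c))
    effect = mean_t - mean_c
    lower, upper = effect - z * se, effect + z * se
    significant = (lower > 0) == (upper > 0)  # bounds share a sign
    return effect, lower, upper, significant

effect, lower, upper, significant = treatment_ci(
    [10, 12, 14, 16], [5, 7, 9, 11]
)
# effect = 5.0; the interval excludes zero, so the effect is significant.
```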

**Some Final Details**

##### Intended to treat vs. actually treated

Not everybody intended to be treated is actually treated: some emails don't reach their recipients. For this reason, the effective split between the treated and control groups is not always 90/10, but more often somewhat less than 90 and somewhat more than 10. We count as treated only those recipients whose emails show as successfully sent.

##### Winsorization^{5}

Some customers, whether in the treated or the control group, are outliers on the high end: they spend a lot more than others. It's possible that in small treated and control groups, outliers will move the average revenue per customer to such a degree that they alone are responsible for the observed effect.

A way around this problem is *winsorization.* For these analyses, data are winsorized to the 99th percentile: after we collect the actual order data, customers whose order totals are above the 99th percentile have their figures replaced with the value at the 99th percentile. For example, if the 99th percentile is $700, all order totals in the input data that are higher than $700 are overwritten with $700. Extremes at the low end are left unchanged.
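The one-sided capping described above can be sketched as follows. This is an illustration with a simple nearest-rank percentile, which may differ from the exact percentile definition used in production.

```python
import math

def winsorize_high(order_totals, pct=99):
    """Cap order totals at the given upper percentile.

    Values above the percentile are replaced by the value at the
    percentile; the low end is left untouched.  Uses a nearest-rank
    percentile: the smallest value with at least pct% of the data
    at or below it.
    """
    ranked = sorted(order_totals)
    idx = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    cap = ranked[idx]
    return [min(x, cap) for x in order_totals]

# Order totals of $1..$100: the 99th percentile is $99, so the
# single $100 order is overwritten with $99.
capped = winsorize_high(list(range(1, 101)))
```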

=====

**Notes**

^{1} https://en.wikipedia.org/wiki/A/B_testing

^{2} https://en.wikipedia.org/wiki/Exogeny

^{3} There is little reason to expect that a WhatCounts treatment will have a negative effect on RPC, but such an effect could highlight a delivery or formatting error, for example. For that reason, and a general preference for conservatism in claiming significant results, we use a two-sided hypothesis test.

^{4} https://en.wikipedia.org/wiki/Simple_linear_regression#Normality_assumption

^{5} Nope, we didn't create this. All props go to Charles Winsor: https://en.wikipedia.org/wiki/Winsorizing.