In Q4 of 2016 Windsor Circle released a Predicted Customer Value (PCV) module based on the "Buy Til You Die" approach. At the time, the model was vetted on a subset of historical data. Now that it has been in the field for roughly 6 months, we have an opportunity to use current data to evaluate its real-world effectiveness. To do so, we use predictions generated using data through January 1, 2017 and evaluate how well those predictions held up over the next 6 months, through July 1, 2017.
Based on the analysis described below, the PCV model is able to predict future spend and churn at very high rates (83% and 86% accuracy, respectively).
A simple performance metric for two continuous variables (e.g., predicted vs. actual total spend, or predicted vs. actual number of orders) is the Pearson correlation coefficient. Correlation coefficients range from -1 to 1, with either extreme indicating a perfect linear relationship. A correlation coefficient of -1 indicates a perfect negative relationship: as one variable increases, the other always decreases. A perfect positive relationship, indicated by a correlation coefficient of 1, means the two variables always move in the same direction, increasing or decreasing together. When evaluating how well our models predict the intended outcomes, we hope for a strong positive relationship between our predicted and actual values. We consider three pairs of predicted and actual metrics: predicted and actual total spend, predicted and actual AOV, and predicted and actual number of orders.
Across roughly 150 clients, we see an average correlation of 0.66, 0.76, and 0.65 for these metrics, respectively. These values suggest a strong positive relationship between our predictions and the customer behavior observed in subsequent months.
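As a rough illustration of the correlation check described above, here is a minimal sketch in plain Python. The function name `pearson` and the sample spend values are made up for this example; they are not taken from the PCV module or from any client data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-customer values for a single client (illustration only).
predicted_spend = [120.0, 40.0, 310.0, 0.0, 75.0]
actual_spend = [100.0, 55.0, 280.0, 10.0, 60.0]

r = pearson(predicted_spend, actual_spend)
print(round(r, 2))
```

In an analysis like the one described here, a coefficient of this kind would be computed once per client and per metric, and the per-client values then averaged.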
These correlations are a good quick check of our model performance, but we can’t stop there. With a few minor adjustments to our data we can use three standard performance metrics (accuracy, precision, and recall) to evaluate how well our model predicts certain categories of our outcome variables.
Binary Accuracy, Precision, and Recall
When dealing with binary variables (variables that can take only two values, e.g., on/off, purchased/didn’t purchase, churned/not churned), classification metrics allow for a more intuitive interpretation of model performance. The most straightforward model validation metric is simple accuracy: out of all of our classifications, what percent of the time were we correct? However, in some cases it’s possible for a model to have very high accuracy but do a poor job of discriminating between possible outcome categories. As such, it’s good practice to consider two other metrics, precision and recall.
Below is a quick refresher on these metrics.
But first... Confusion Matrices
All three metrics are based on just four numbers that are very easy to derive. We’ll start by constructing a confusion matrix — a 2 x 2 table that crosses the number of actual and predicted cases in each group. Convention is that the matrix columns represent the predicted classifications and the rows represent actual (i.e., ground truth) classifications:
Table 1. Confusion Matrix

| | Predicted: A | Predicted: B |
|---|---|---|
| Actual: A | tp | fn |
| Actual: B | fp | tn |
Cases that are correctly predicted to be members of the group of interest — group "A" in Table 1 — are referred to as "true positives" and are counted in the top left cell. Cases that are predicted to be in group A, but are actually in group B are counted in the false positive count in the bottom left cell. False negatives and true negatives are placed in the top and bottom right cells, respectively. Table 1 illustrates this using the common abbreviations for these groups: tp = true positive, tn = true negative, fp = false positive, and fn = false negative.
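To make the four cells concrete, the following sketch counts them from paired lists of actual and predicted labels. The helper name `confusion_counts` and the example labels are hypothetical; a label of 1 stands for membership in the group of interest (group A).

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for paired actual and predicted labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted A, actually A
        elif p == positive:
            fp += 1  # predicted A, actually B
        elif a == positive:
            fn += 1  # predicted B, actually A
        else:
            tn += 1  # predicted B, actually B
    return tp, fp, fn, tn


# Hypothetical labels for eight cases (1 = group A, 0 = group B).
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(actual, predicted)

# Rows are actual classes, columns are predicted classes.
matrix = [[tp, fn],
          [fp, tn]]
print(matrix)  # [[4, 1], [1, 2]]
```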
Calculating accuracy is a natural next step once the confusion matrix is created. Accuracy is simply the proportion of correct classifications out of all classifications made; stated differently, it is the sum of the correct predictions (the top-left tp and bottom-right tn cells of the table) divided by the sum of all cells in the table:

$$\text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$
Although straightforward and easy to understand, accuracy can be misleading — particularly in cases where the outcome is either very rare or very common. In these situations, an algorithm can achieve a very high accuracy by classifying everyone into the largest group without really being able to discriminate between the groups at all.
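A small worked example shows how this can happen. The counts below are invented for illustration: 950 of 1,000 hypothetical customers churn, and a naive classifier labels every customer as churned. Accuracy here is computed as the correct predictions (tp + tn) divided by all cases.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all classifications that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no classifications to score")
    return (tp + tn) / total


# Hypothetical client: 950 churned customers, 50 active customers.
# A naive classifier that predicts "churned" for everyone produces:
tp, fn = 950, 0   # every churned customer is labeled churned
fp, tn = 50, 0    # every active customer is also labeled churned

acc = accuracy(tp, tn, fp, fn)
print(acc)  # 0.95

# Despite 95% accuracy, not one active customer was identified.
print(tn)   # 0
```

The classifier scores well only because the churned group is so large; it has learned nothing that separates the two groups.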
Precision and Recall
Precision and recall both focus on the number of correct "positive" classifications in an algorithm’s predictions. Remember that "positive" doesn’t necessarily mean "good," it’s just data-science-speak for the event of interest, e.g., purchase, churn, etc.
Precision is the proportion of correct classifications out of all cases that were predicted to be in the category of interest:

$$\text{precision} = \frac{tp}{tp + fp}$$
In the context of churn prediction, precision is the proportion of cases that had actually churned, out of those we predicted were churned.
Recall is the proportion of cases correctly predicted to be in the category of interest out of all cases that were actually in the category of interest:

$$\text{recall} = \frac{tp}{tp + fn}$$
In the context of churn prediction, recall is the proportion of cases correctly predicted to be churned, out of all those who had actually churned.
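Both definitions reduce to simple ratios of confusion-matrix cells. The sketch below uses made-up counts for a churn example; the function names are illustrative and do not come from the PCV module.

```python
def precision(tp, fp):
    """Of the cases predicted positive, the share that were actually positive."""
    if tp + fp == 0:
        raise ValueError("precision is undefined when nothing is predicted positive")
    return tp / (tp + fp)


def recall(tp, fn):
    """Of the cases that were actually positive, the share predicted positive."""
    if tp + fn == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    return tp / (tp + fn)


# Hypothetical churn counts (illustration only):
# 40 customers correctly predicted as churned,
# 10 predicted as churned who were still active,
# 20 churned customers the model missed.
tp, fp, fn = 40, 10, 20

p = precision(tp, fp)
r = recall(tp, fn)
print(p)            # 0.8
print(round(r, 3))  # 0.667
```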
For simplicity we convert precision and recall to percentages. For churn prediction, average accuracy, precision, and recall across all clients are 86%, 99%, and 83%, respectively.
Recall from our initial correlation analyses that some of our variables of interest have more than two possible values. We can still apply the metrics above to our continuous variables by first making them binary. To do that, we create two new indicator features that identify cases in the top quintiles of predicted and actual values. Total spend is often a variable of interest, so we will focus on it. For each client, if a case has a predicted value in the top quintile of predicted spend, it is assigned a value of "1" in our new binary, prediction-based feature; all other cases are assigned a value of "0." Likewise, if a case has an actual spend in the top quintile of actual spend values, it is assigned a "1" in our new binary, actual spend-based feature; all other cases are assigned a value of "0."

This re-mapping allows us to apply the classification metrics discussed above to our total spend variable. In doing so, we observe an average classification accuracy of 83% and precision and recall of 60% and 59%, respectively. On average, 83% of the time we predict a case will be in or out of the top spend quintile over the next 6 months, we are correct.
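The re-mapping can be sketched as follows. This is a minimal illustration, not the module's implementation: the source does not say how quintile boundaries or ties are handled, so the rank-based cut and tie-breaking by list position are assumptions, and the spend values are invented.

```python
def top_quintile_flags(values):
    """Flag the top 20% of cases by value with 1, all others with 0.

    Assumption: the cut is rank-based (the len // 5 highest values),
    with ties broken by original position.
    """
    n_top = len(values) // 5
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    top = set(order[:n_top])
    return [1 if i in top else 0 for i in range(len(values))]


# Hypothetical six-month spend for ten customers of one client.
predicted_spend = [500, 20, 35, 410, 60, 15, 90, 5, 120, 45]
actual_spend = [450, 10, 40, 80, 55, 0, 300, 0, 95, 30]

pred_flags = top_quintile_flags(predicted_spend)
actual_flags = top_quintile_flags(actual_spend)

# Count the four confusion-matrix cells from the paired indicators.
pairs = list(zip(actual_flags, pred_flags))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)  # 0.8 0.5 0.5
```

Because only one case in five is in the top quintile by construction, the many true negatives lift accuracy well above precision and recall, which is the same pattern seen in the reported averages.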
- When predicting customers’ churn rates for the following six months, average accuracy, precision, and recall across all clients are 86%, 99%, and 83%, respectively. These metrics indicate a very well-performing churn prediction model.
- When predicting a customer’s total spend will be in the top spend quintile during the following six months, average accuracy, precision, and recall across all clients are 83%, 60%, and 59%, respectively. On average, 83% of the time we predict a case will be in or out of the top spend quintile over the next 6 months, we are correct.