When moving a machine learning model into production, the most critical thing is understanding the actual value the model will bring to your business. Because models are built for specific business applications, each model needs to be evaluated in the context of its intended use.
This article will work through a reference example and show how to think about model performance and capture business value. Most importantly, I will demonstrate why adopting AI into your unique business process requires critical thinking - machine learning models are tools, and like any other tool, they work best when used correctly.
For this example, let’s use the famous direct-mail banking dataset - a classic lead scoring challenge. It contains over 41 thousand customer records, each the target of a marketing campaign, and records whether each prospect subscribed to the service as a result of the campaign. It also contains 20 different demographic and financial data points on each customer - everything from age and education to job and housing information.
The objective is to build a machine learning model that can predict if a customer will subscribe to a service given the same demographic and financial information - allowing the bank to focus its marketing efforts and increase the efficiency of acquisition. This is a classification problem - we are looking for the model to bucket potential customers into subscribers and non-subscribers. To get a little more specific, this is a rare-case classification problem because only 11% of the customers (4.6 thousand of the total 41K) subscribed to the service. Rare case detection is an excellent use of machine learning and has lots of potential value to capture, as we will see!
You can train this model on pretty much any AutoML platform. When you train a model, most platforms will split the data randomly 80/20. The model is trained on 80% of the data, and then its performance is evaluated on the unused 20%. With this dataset, you will get results that look something like this: 91% accuracy and a 0.58 F1 score.
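That 80/20 workflow is easy to reproduce outside an AutoML platform. Here is a minimal sketch using scikit-learn; the synthetic data is just a stand-in shaped like the real dataset (41K rows, 20 features, roughly 11% subscribers), and the model choice and parameters are illustrative, not what any particular platform does.

```python
# Sketch of the standard 80/20 train-and-evaluate workflow with scikit-learn.
# The synthetic data below stands in for the real bank marketing dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Stand-in data: 41K rows, 20 features, ~11% positive ("subscribed") class.
X, y = make_classification(
    n_samples=41_000, n_features=20, weights=[0.89], random_state=0
)

# Hold out a random 20% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, preds):.2f}")
print(f"F1 score: {f1_score(y_test, preds):.2f}")
```

The `stratify=y` argument keeps the 11% subscriber rate identical in both splits, which matters for rare-case problems like this one.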
The first metric, accuracy, is pretty straightforward - when the model predicts whether a user will subscribe, it’s correct 91% of the time. That looks like a pretty good score, but it has the potential to be misleading. Because only 11% of the customers subscribe, a model that only ever predicted “not subscribed” would have an accuracy of 89% (while also being completely worthless).
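The arithmetic behind that misleading baseline is worth spelling out: a degenerate model that always answers “not subscribed” is correct for every non-subscriber, so its accuracy equals the non-subscriber rate.

```python
# A model that always predicts "not subscribed" is correct for every
# non-subscriber, so its accuracy is simply the non-subscriber rate.
total_customers = 41_000
subscribers = 4_600  # roughly 11% of the dataset

baseline_accuracy = (total_customers - subscribers) / total_customers
print(f"always-'no' accuracy: {baseline_accuracy:.0%}")  # prints ~89%
```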
F1 score is a combined measurement of precision (the percentage of the time the model was correct when predicting “subscribed” - 376/572, or 66%) and recall (the percentage of actual subscribers the model correctly captures - 376/718, or 52%). F1 score ranges from a low of 0 to a high of 1, but like accuracy, it requires a more in-depth look to understand its connection to the model’s actual business value.
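These three numbers fit together: recomputing F1 from the confusion counts quoted above (376 true positives out of 572 predicted and 718 actual subscribers in the holdout) recovers the 0.58 reported earlier.

```python
# Recompute precision, recall, and F1 from the confusion counts in the text.
tp = 376                  # true positives: predicted "subscribed" and did
predicted_positive = 572  # everyone the model predicted would subscribe
actual_positive = 718     # everyone in the holdout who actually subscribed

precision = tp / predicted_positive               # ~0.66
recall = tp / actual_positive                     # ~0.52
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```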
So how do you evaluate this model’s business value? For that, we need to think about how the model would be used in practice. First, we can see that among the set of customers the model predicts will subscribe, any given customer is 466% more likely to convert. Even though the model is not capturing all of the eventual subscribers, the density of subscribers in that group is much higher than the base rate of 11%. That alone presents an almost 5x efficiency improvement. On the other hand, if we simply ignored the customers the model predicts would not subscribe, we would lose half our potential customers.
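A back-of-envelope version of that lift calculation, using the counts above - note the exact figure depends on how the baseline is defined (the holdout’s base rate versus the full dataset’s), which is likely why platform-reported numbers differ slightly:

```python
# Lift: how much denser subscribers are in the model's "will subscribe"
# bucket than in the prospect population at large.
precision = 376 / 572        # ~66% of predicted subscribers convert
base_rate = 4_600 / 41_000   # ~11% of all prospects convert

lift = precision / base_rate
print(f"lift: {lift:.1f}x over the base rate")  # ~5.9x
```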
Fortunately, we can take advantage of another feature of our model to substantially increase its value to the business. Binary classification models pattern-match each new potential subscriber against what they learned in training, and the model can output the probability behind each of its predictions. Let’s look at a single prediction run against the model. For this set of inputs, the model assigns a 99.85% probability that the customer will not subscribe.
What’s happening under the hood is that when the model estimates a customer is more than 50% likely to subscribe, it predicts “subscribed”; when the probability is below 50%, it predicts “not subscribed.” If we could identify the predictions where the model assigns a high probability of not subscribing, we could drop those prospects from the campaign.
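A toy sketch of the two decision rules - the model’s default 50% cutoff versus the keep-unless-clearly-hopeless rule we actually want for the campaign. The five probabilities here are made up for illustration; in practice they would come from something like scikit-learn’s `model.predict_proba(X)[:, 1]`.

```python
import numpy as np

# Illustrative predicted subscription probabilities for five prospects.
probs = np.array([0.0015, 0.03, 0.12, 0.48, 0.73])

# Default decision rule: predict "subscribed" above a 50% probability.
default_labels = probs >= 0.50    # -> [False, False, False, False, True]

# Campaign rule: drop only prospects the model is highly confident will
# NOT subscribe (here, below a 5% probability).
keep_in_campaign = probs >= 0.05  # -> [False, False, True, True, True]
print(keep_in_campaign)
```

Note how the 0.48 prospect is labeled “not subscribed” by the default rule but is well worth keeping in the campaign.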
To do that, let’s look at the actual distribution of probabilities the model returns. We could track results on new data for a month or two to build this out, but a quick shortcut is to run the entire original dataset through the model and graph the subscription probabilities. Doing so results in the following distribution:
Now we are getting somewhere. We can see that the model thinks that almost three-quarters of the customers have a less than 5% likelihood of subscribing (the area highlighted in red). Checking the dataset, we can see that only 105 out of 27 thousand customers with a less than 5% probability of subscribing actually ended up subscribing. Remembering that the total dataset had 4.6 thousand subscribers, we can easily make the business decision to drop 27 thousand prospects from the marketing campaign (cutting our cost by almost two-thirds) while still capturing 98% of eventual subscribers. Now that is a substantial improvement in real business performance!
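The campaign math from that distribution checks out, using the numbers quoted directly above:

```python
# Verify the campaign economics of dropping the low-probability bucket.
total_prospects = 41_000
total_subscribers = 4_600
dropped = 27_000        # prospects with <5% predicted probability
subscribers_lost = 105  # dropped prospects who would have subscribed

cost_cut = dropped / total_prospects
captured = (total_subscribers - subscribers_lost) / total_subscribers
print(f"campaign cost cut: {cost_cut:.0%}")           # ~66%
print(f"subscribers still captured: {captured:.0%}")  # ~98%
```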
To sum it all up - it’s not enough to just look at model performance metrics. You need to go a bit deeper and think critically about applying the results to your actual operations to capture the full benefits of predictive modeling. Once you do, you will see how machine learning can unlock massive gains in real-world business performance.