The Data Guide to ML

As machine learning (ML) becomes more prevalent, organizations are looking to adopt the technology to improve their business processes. However, many organizations are not sure where to start or what data to use to train their models.

This guide will provide an overview of how ML works, what types of data are needed, and four hands-on examples of how ML can be applied in different business domains.

ML uncovers historical patterns

Your subconscious is very good at pattern recognition. An IT salesperson might intuitively know that a customer who buys a server rack may also be interested in supporting software and services. An insurance salesperson might guess that a customer with a low income and an old car who lives in a high-crime area is more likely to file a claim, and is thus a higher risk.

Similarly, machine learning algorithms look for patterns in data to make predictions. However, unlike humans, machines can analyze large amounts of data much faster and more consistently.

To understand how machine learning works, let’s take a look at a simple example. Let’s say we want to build a machine learning algorithm to predict how many items a customer will purchase in an online store.

To do this, we first need to collect data on past customers and their purchases. This data might include information such as the customer’s age, gender, location, what items they purchased, and how many items they purchased.

Once we have this data, we can train a machine learning algorithm to look for patterns. For example, the algorithm might find that customers who are female and live in urban areas tend to purchase more fashion items than other customers.

Once the algorithm has learned these patterns, we can then use it to predict purchases for new website visitors, and to recommend products to them, based only on high-level information about them.

How it works

Many different types of algorithms can be used for machine learning, but they all have one goal in common: to reduce error. Error is the difference between the predicted value and the actual value.

For example, if our machine learning algorithm predicts that a customer will purchase 10 items, but the actual number of items purchased is 12, the error is 2. The goal of the algorithm is to minimize this error.
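The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and every number except the 10-versus-12 pair are made up, and mean absolute error is just one common way to summarize error across many predictions, not necessarily the measure a given platform uses.

```python
def absolute_error(predicted, actual):
    """Size of the gap between one prediction and the true value."""
    return abs(predicted - actual)


def mean_absolute_error(predictions, actuals):
    """Average absolute error across many predictions."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    total = sum(absolute_error(p, a) for p, a in zip(predictions, actuals))
    return total / len(predictions)


# The example from the text: predicted 10 items, customer bought 12.
print(absolute_error(10, 12))  # 2

# Hypothetical predictions and outcomes for four customers.
print(mean_absolute_error([10, 4, 7, 1], [12, 4, 5, 2]))  # 1.25
```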

To do this, the algorithm starts with a random state and then iteratively makes small changes to reduce error. One way to achieve this is known as "gradient descent." With gradient descent, the algorithm looks at the error of the current state and then tries to find a new state that will reduce error. This new state is found by taking a small step in the direction that reduces error the most. The algorithm then repeats this process until it reaches a state where error is minimized.

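For a simple model with two parameters, the error can be visualized as a bowl: each point on the surface is one combination of the two parameter values, and its height is the error for that combination. The algorithm's goal is to find the lowest point in the bowl, which represents the state with the minimum error.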
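To make the idea concrete, here is a minimal sketch of gradient descent on a two-parameter model, a straight line y = w*x + b. The data, learning rate, step count, and function names are all assumptions chosen for illustration; real platforms use far more sophisticated optimizers.

```python
import random


def mean_squared_error(w, b, xs, ys):
    """Average squared gap between predictions w*x + b and actual values."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.05, steps=2000, seed=0):
    """Fit y = w*x + b by repeatedly stepping downhill on the error surface."""
    rng = random.Random(seed)
    # Start from a random state, as described above.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Take a small step in the direction that reduces error the most.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data: items purchased against site visits, built so y = 2x + 1.
visits = [1, 2, 3, 4]
items = [3, 5, 7, 9]
w, b = gradient_descent(visits, items)
print(w, b)  # both should end up very close to 2 and 1
print(mean_squared_error(w, b, visits, items))
```

Because this error surface really is a bowl, the starting point does not matter: any seed ends up at the same lowest point, as long as the learning rate is small enough for the steps not to overshoot.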

While highly complex algorithms with hundreds, thousands, or even millions of parameters can't be visualized the same way, the goal is still to find the minimum of the error function. In neural networks, "backpropagation" is the standard way to do this: it works backwards from the output layer to the input layer to calculate how much each connection contributed to the error, and those calculations are used to iteratively update the weights of the connections between neurons.
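As a rough sketch of what "working backwards" means, consider a deliberately tiny network with one input, one hidden neuron, and one output. The network shape, the tanh activation, the starting weights, and the learning rate are all assumptions made for illustration; production networks have many more neurons and rely on libraries to do this bookkeeping.

```python
import math


def forward(w1, w2, x):
    """One input, one tanh hidden neuron, one linear output neuron."""
    hidden = math.tanh(w1 * x)
    output = w2 * hidden
    return hidden, output


def backward(w1, w2, x, y):
    """Gradients of the squared error (output - y)**2 via the chain rule."""
    hidden, output = forward(w1, w2, x)
    d_output = 2 * (output - y)                  # error signal at the output layer
    grad_w2 = d_output * hidden                  # weight feeding the output neuron
    d_hidden = d_output * w2                     # error passed back to the hidden neuron
    grad_w1 = d_hidden * (1 - hidden ** 2) * x   # derivative of tanh(z) is 1 - tanh(z)**2
    return grad_w1, grad_w2


def train(x, y, w1=0.5, w2=0.5, learning_rate=0.1, steps=500):
    """Repeat: forward pass, backward pass, small weight update."""
    for _ in range(steps):
        grad_w1, grad_w2 = backward(w1, w2, x, y)
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
    return w1, w2


# Hypothetical single training example: input 1.0 should map to 0.8.
w1, w2 = train(x=1.0, y=0.8)
print(forward(w1, w2, 1.0)[1])  # should approach the target of 0.8
```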

The need to find relevant data

The saying "correlation is not causation" is especially relevant when it comes to machine learning. Just because two things are correlated does not mean that one caused the other.

For example, there might be a correlation between the number of hours of sunlight and the number of ice cream sales. Does this mean that sunlight itself causes people to buy ice cream? Not necessarily. The two can be correlated simply because both rise and fall with the weather: sunny days tend to be hot days, and when it’s hot outside, people are more likely to buy ice cream.

This is why it’s important to find data that is relevant to the task at hand. For instance, consider the task of churn prediction. There are many reasons that a customer might churn, from poor customer service to a lack of usage. These kinds of metrics are reflected in customer data sources, from Intercom to Salesforce. Similarly, if you wanted to score leads, you’d want to connect to a CRM where you can track the sales pipeline.

To give another example, consider fraud detection. A fraudster’s behavior will be different from that of a normal user, so data sources that track user behavior, such as web logs and clickstream data, can be used to train a model to detect fraud. Finance teams may also want to predict loan default rates, which can be done using data from past loans, or they may want to predict credit scores, which requires connecting to a credit report data source. Once you understand your business question, it’s usually not difficult to figure out which data sources will be most relevant.

It’s important to remember that not all data is created equal. The data that is most relevant to the task at hand will produce the best results.

Finding relevant data comes down to three steps:

1. Understanding the business task. No matter the KPI you'd like to optimize, it's important to consider the drivers of that metric. This step requires some domain knowledge and understanding of the business. For example, if you’re trying to predict customer churn, you might want to look at data on customer service interactions or product usage.

2. Identifying what data is available. This includes both internal data sources (like enterprise databases and spreadsheets) and external data sources (like social media data).

3. Assessing the quality of the data. Once you’ve identified relevant data sources, it’s important to assess the quality of the data. This includes things like completeness, accuracy, and timeliness. Data that is complete, accurate, and timely will produce better results than data that is incomplete, inaccurate, or stale.
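Parts of the third step can be checked mechanically. The sketch below measures completeness and timeliness for a handful of records; the column names, the sample rows, and the helper functions are hypothetical, and accuracy is left out because it usually has to be verified against a trusted source rather than computed from the data itself.

```python
from datetime import date


def completeness(rows, column):
    """Fraction of rows where the column has a non-empty value."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def days_since_last_update(rows, date_column, today):
    """Age in days of the most recent record; a rough timeliness signal."""
    latest = max(row[date_column] for row in rows if row.get(date_column))
    return (today - latest).days


# Hypothetical customer records; column names are made up for illustration.
customers = [
    {"plan": "pro", "monthly_usage_hours": 14.5, "updated": date(2024, 5, 1)},
    {"plan": "", "monthly_usage_hours": 3.0, "updated": date(2024, 4, 20)},
    {"plan": "basic", "monthly_usage_hours": None, "updated": date(2024, 5, 3)},
    {"plan": "basic", "monthly_usage_hours": 8.0, "updated": date(2024, 3, 11)},
]

print(completeness(customers, "plan"))                 # 0.75
print(completeness(customers, "monthly_usage_hours"))  # 0.75
print(days_since_last_update(customers, "updated", date(2024, 5, 10)))  # 7
```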

The next step is to connect the data to Akkio. Akkio integrates with popular business platforms like Salesforce and Snowflake, making it easy to find the data you need. You can also export your data to a CSV file or connect to your database directly.

Once the data is connected, Akkio will automatically identify the appropriate algorithms and run the machine learning models. All you need to do is select the column, or KPI, you’d like to predict, whether it’s churn, conversion, attrition, fraud, or any other metric.

Example 1: Churn prediction

Pandora, once the king of music streaming, has fallen out of favor in recent years. Though the reasons for this are many, one key factor has been its inability to properly predict and prevent customer churn. This is a difficult problem to solve, as it requires understanding a broad range of data points, but it is crucial for Pandora if it wants to regain its throne.

Any subscription-based service faces the challenge of churn, and it is important to take proactive steps to prevent it. A common way to think of churn is in terms of customer lifetime value (LTV). LTV is the total amount of money a customer is expected to spend with a company over the course of their relationship. Obviously, the longer a customer sticks around, the higher their LTV.

Churn, then, is any behavior that causes a customer to reduce their LTV. This could be anything from canceling their subscription to simply using the service less often. Predicting churn is difficult because it requires understanding not just what customers have done in the past, but also why they did it. Traditional methods like surveys and customer interviews can only get you so far.

This is where Akkio comes in. Our machine learning platform makes it easy to get started with churn prediction, even if you don't have a lot of experience with data science. We can start by connecting a demonstration dataset from Kaggle, called "Telco Customer Churn." This file has 7,043 rows of customers and 21 columns, or 20 features excluding the target column.

With just a few clicks, Akkio will automatically build and evaluate many different machine learning models to find the one that performs the best. In this case, the best model is a neural network with a raw accuracy of just over 80%. This means that, on average, our model will be able to correctly predict whether a customer will churn or not more than 8 out of 10 times.

Of course, accuracy is not the only thing that matters. We also need to consider things like false positives and false negatives. A false positive is when the model predicts that a customer will churn when they actually won't. A false negative is when the model predicts that a customer won't churn when they actually will.
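The four possible outcomes can be counted directly from a list of predictions and a list of what actually happened. The sketch below uses ten invented customers; the function name and the data are illustrative assumptions, not output from the model described above.

```python
def confusion_counts(predicted, actual):
    """Count the four outcomes for a yes/no churn prediction.

    Both arguments are lists of booleans where True means "churn".
    """
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for p, a in zip(predicted, actual):
        if p and a:
            counts["true_positive"] += 1
        elif p and not a:
            counts["false_positive"] += 1   # predicted churn, customer stayed
        elif not p and a:
            counts["false_negative"] += 1   # predicted stay, customer churned
        else:
            counts["true_negative"] += 1
    return counts


# Hypothetical predictions and outcomes for ten customers.
predicted = [True, False, False, True, False, False, True, False, False, False]
actual = [True, False, True, False, False, False, True, False, False, False]

counts = confusion_counts(predicted, actual)
print(counts)
accuracy = (counts["true_positive"] + counts["true_negative"]) / len(actual)
print(accuracy)  # 0.8
```

In this made-up sample the model is right 8 times out of 10, yet it still misses one of the three customers who actually churned, which is why the two error types are worth tracking separately.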

In this case, our model has a relatively low rate of both false positives and false negatives, and it took only a few clicks to connect the dataset, build and evaluate many different models, and find the one that performs best.

From here, we can deploy our churn prediction model into production, where it can start automatically making predictions and enabling businesses to take actions to prevent churn. For instance, if your churn model predicts that a customer is likely to cancel their subscription, you could reach out to them proactively and offer them a discount or other incentive to stay.

Example 2: Increasing order value

Order value is a fundamental driver of success for ecommerce businesses. Many of the largest and most successful companies in the world, such as Amazon, place a strong emphasis on order value because they know that it has a direct impact on their bottom line. In fact, McKinsey estimates that 35 percent of all purchases on Amazon are the result of product recommendations based on data-driven algorithms.

While order value is a crucial metric, it can be very hard to increase. Traditional methods of analysis, such as human intuition or trial-and-error, simply aren't up to the task of understanding the complex relationships between the hundreds or even thousands of variables that can impact order value. That's where AI comes in.

By using an AI platform like Akkio, businesses can automatically build predictive models that analyze a broad range of data to identify the factors that have the biggest impact on order value. With these models in place, businesses can then make recommendations and take actions that will increase order value, such as up-selling and cross-selling to customers who are most likely to be interested.

We can use a Kaggle dataset to demonstrate how this works in practice. The dataset, titled "Health Insurance Cross Sell Prediction," is from a health insurance provider that wants to know which customers would be interested in vehicle insurance. With a model to predict which customers are interested, they could then optimize their outreach strategy, and provide targeted offers that increase their revenue.

By selecting the "response" column, or whether the customer was interested in vehicle insurance, as our target, we can use Akkio to build a predictive model.

Looking at the results of our model, we can see that segment 1 highlights a customer group that's particularly interested in vehicle insurance. We can see that they’ve not been previously insured, and report having vehicle damage. This would be a good group of customers to target with a cross-sell, since they're likely to buy vehicle insurance.
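One simple way to sanity-check a segment like this by hand is to compare the share of interested customers across groups. The sketch below is not how Akkio derives its segments; the rows are invented and the field names only loosely mirror the ones discussed above.

```python
from collections import defaultdict


def response_rate_by_segment(rows, segment_keys, target="response"):
    """Share of interested customers within each combination of segment_keys."""
    totals = defaultdict(int)
    interested = defaultdict(int)
    for row in rows:
        segment = tuple(row[key] for key in segment_keys)
        totals[segment] += 1
        interested[segment] += row[target]
    return {segment: interested[segment] / totals[segment] for segment in totals}


# Hypothetical rows: response is 1 if the customer was interested, else 0.
customers = [
    {"previously_insured": False, "vehicle_damage": True, "response": 1},
    {"previously_insured": False, "vehicle_damage": True, "response": 1},
    {"previously_insured": False, "vehicle_damage": True, "response": 0},
    {"previously_insured": False, "vehicle_damage": True, "response": 1},
    {"previously_insured": True, "vehicle_damage": False, "response": 0},
    {"previously_insured": True, "vehicle_damage": False, "response": 0},
    {"previously_insured": False, "vehicle_damage": False, "response": 0},
    {"previously_insured": False, "vehicle_damage": False, "response": 1},
]

rates = response_rate_by_segment(customers, ["previously_insured", "vehicle_damage"])
# Print segments from most to least interested.
for segment, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(segment, rate)
```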

In conclusion, order value is a key metric for ecommerce businesses, and AI can be used to increase order value by making recommendations and taking actions that maximize revenue.

Beyond insurance, up-sell and cross-sell are relevant for many other industries. In the retail industry, for example, they can be used to recommend additional products to customers who are buying a specific item: a customer who is buying a dress might also be interested in a matching handbag.

In the hospitality industry, up-sell and cross-sell can be used to recommend additional services to guests, such as a spa treatment or an upgrade to a suite.

In the airline industry, up-sell and cross-sell can be used to recommend additional products and services to passengers, such as seat upgrades or travel insurance, depending on the passenger's specific needs and preferences.

AI-powered up-sell and cross-sell can also be used in the subscription business model, such as recommending additional products or services to customers who are already subscribed to a service. For example, a customer who is subscribed to a fitness app might be interested in additional content, such as workout videos or nutrition plans.

Example 3: Detecting fraud

Fraudsters are taking advantage of a perfect storm: The work-from-home shift during the pandemic has increased online activity, while many businesses are still struggling to implement effective fraud detection measures. As a result, fraud is on the rise.

In 2019 alone, new account fraud grew by a shocking 88%, and the surge in online activity since then has given fraudsters even more opportunities to steal sensitive information.

Fortunately, artificial intelligence can help. AI can analyze a broad range of data to detect patterns of fraud and automatically take action to prevent it.

Akkio is the leading AI platform for fraud detection. Our platform makes it easy to get started. Simply upload your dataset and select a column, and our platform will do the rest.

To show you how our platform works, let's take a look at an example of AI-powered fraud detection.

The dataset we used is a historical credit card transaction dataset sourced from Kaggle Datasets. This file has nearly 285,000 rows of real credit card transactions from the EU. Because this credit card information is sensitive, the original columns have been transformed using a technique called principal component analysis, or PCA. This obscures the private information but preserves the relationships between the different variables, so a machine learning model can still be built.

There are 28 different pieces of information that come along with each credit card transaction, along with the size of the transaction in euros, and finally whether or not that transaction was fraudulent. That final column is what we're trying to predict.

We ended up with a very high-quality model, with a raw accuracy of 99.94%. Among 32,875 normal transactions and 52 fraudulent transactions, only 12 fraudulent transactions were wrongly labeled "not fraud" (false negatives) and only 7 normal transactions were wrongly labeled "fraud" (false positives).
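These figures can be checked with a little arithmetic. The sketch below assumes the counts mean that 12 fraudulent transactions were labeled "not fraud" and 7 normal transactions were labeled "fraud"; the function and key names are our own. It also shows why raw accuracy alone can flatter a model when fraud is rare, so precision and recall are worth a look too.

```python
def summarize(normal_total, fraud_total, missed_fraud, false_alarms):
    """Derive accuracy, precision, and recall from confusion-matrix counts."""
    true_positive = fraud_total - missed_fraud     # fraud correctly flagged
    true_negative = normal_total - false_alarms    # normal correctly passed
    total = normal_total + fraud_total
    return {
        "accuracy": (true_positive + true_negative) / total,
        # Of the transactions flagged as fraud, how many really were fraud?
        "precision": true_positive / (true_positive + false_alarms),
        # Of the real fraud cases, how many did the model catch?
        "recall": true_positive / fraud_total,
        # Accuracy of a useless model that always answers "not fraud".
        "always_not_fraud_accuracy": normal_total / total,
    }


# Counts as interpreted from the text: 12 missed frauds, 7 false alarms.
metrics = summarize(normal_total=32875, fraud_total=52, missed_fraud=12, false_alarms=7)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under that reading, the accuracy works out to the 99.94% quoted above, while a model that never flags anything would already score about 99.84%. The model catches 40 of the 52 fraud cases, which is the more meaningful number for a fraud team.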

Akkio’s platform uses AI to find patterns in the 28 pieces of information associated with each credit card transaction. These patterns are then used to make predictions about future transactions. If a transaction is flagged as potentially fraudulent, Akkio can take action to prevent it, such as sending an SMS warning to the customer via an automated workflow.

If the prediction is very confident that a transaction is fraudulent, it may be blocked automatically. At scale, this kind of automation can prevent millions of dollars in fraud losses and save countless man-hours for fraud prevention teams.
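The decision logic described here can be sketched as a simple mapping from the model's confidence to an action. The thresholds, action names, and function below are hypothetical and are not part of any Akkio API; in practice the cut-offs depend on the cost of a missed fraud versus the cost of blocking a legitimate customer.

```python
def choose_action(fraud_probability, warn_threshold=0.5, block_threshold=0.95):
    """Map a model's fraud probability to an action.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")
    if fraud_probability >= block_threshold:
        return "block"               # very confident: stop the transaction
    if fraud_probability >= warn_threshold:
        return "send_sms_warning"    # suspicious: alert the customer
    return "allow"


# Three hypothetical transactions with different model scores.
for probability in (0.02, 0.70, 0.99):
    print(probability, choose_action(probability))
```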

Example 4: Predicting employee attrition

News of "The Great Turnover" has been dominating headlines lately. Employees have become a hot commodity, with the skills shortage and tight job market making it easy for top talent to jump ship for a better offer. And the cost of turnover is high: businesses lose a trillion dollars a year when good employees walk out the door.

As businesses struggle to keep their best workers, AI can be a powerful tool for understanding what drives attrition. With the right data, businesses can identify patterns and take action to prevent turnover before it happens.

IBM's HR Analytics Employee Attrition & Performance dataset from Kaggle is a great starting point for understanding employee turnover. This file has synthetic data on over 2,000 employees, including columns on the employee's wage, department, travel amount, education, overtime hours, and more. The column labeled "Attrition" is particularly important, as it indicates whether or not an employee has left the company.

Using Akkio, we were able to quickly build a high-quality model to predict whether or not an employee would leave the company. Our model had a raw accuracy of almost 90%, which means we correctly predicted which employees would quit and which wouldn't almost 9 times out of 10.

Armed with this predictive power, businesses can take action to prevent turnover. For example, if our model predicts that an employee is at risk of leaving, the company could take steps to increase job satisfaction, such as offering incentives or increasing pay. Alternatively, the company might change its hiring methods to target different types of candidates.

AI is a powerful tool for understanding and preventing employee turnover. With the right data, businesses can avoid the high cost of losing their best workers.

The Bottom Line

Akkio offers a user-friendly interface that makes it easy to get started with AI. While we've covered churn, attrition, fraud, and cross-sales, there are countless KPIs you could surface and optimize with Akkio.

To get started, check out our applications page, which includes demonstrations and sample datasets for tasks such as:

Sentiment Analysis
Lead Scoring
Fraud Detection
Time Series Forecasting
Cross-Selling

As we've covered, finding relevant data and curating it in a useful format is a critical first step in data science. Akkio’s built-in data cleaning functionality can take care of a lot of the tedious grunt work for you, so you can focus on results.
