Data Preparation For Machine Learning

Your data comes from many different places: from public APIs to internal company databases, from Excel spreadsheets to CRMs.

The key is that your data has to make it into your machine learning project in one way or another. And this can be a challenge if you're working with unstructured or semi-structured data, or even if you’re just unsure of what data to use. For example, if you want to use sentiment analysis on customer text, such as product reviews, you might first need to apply text pre-processing techniques, like stripping out markup and converting the text into a consistent format.

Or perhaps you need to add metadata about documents so that they can be easily classified by content type when using document classification. Other times, you may need to preprocess images before training neural networks for image recognition applications like facial recognition or object detection in photos.

In all cases, the goal is the same: prepare your raw input so that it's ready for modeling and prediction purposes without losing important information along the way. There are many ways in which this can be achieved depending on the nature and complexity of each dataset.

Why does machine learning require data preparation?

Data is the fuel for machine learning algorithms, which work by finding patterns in historical data and using those patterns to make predictions on new data. As such, data preparation is a fundamental prerequisite to any machine learning project.
The term "data preparation" refers broadly to any operation performed on an input dataset before it is used in a machine learning application.

This includes the gathering of the training data in the first place, as well as operations such as cleaning up noisy input values, transforming categorical variables into numerical values where needed (e.g., one-hot encoding), normalizing variables, and filling in missing values.
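To make two of these operations concrete, here is a minimal sketch in plain Python of turning a categorical column into numeric indicator columns and of rescaling a numeric column to the range 0 to 1. The column names and values are made up for illustration, and in practice a library such as Pandas would usually do this work.

```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator columns."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def min_max(values):
    """Rescale numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy columns, not taken from any real dataset.
industries = ["software", "hardware", "software"]
revenues = [100.0, 300.0, 200.0]

encoded, categories = one_hot(industries)
scaled = min_max(revenues)
```

Here `categories` lists the column order of `encoded`, so each row has exactly one 1 in the column matching its original label.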

There are many different kinds of preprocessing that can be performed depending on what kind of dataset you're working with — whether it's structured or unstructured; numeric or categorical; continuous or discrete; large or small — but they all aim at improving the quality and usefulness of your input dataset for downstream tasks such as model building.

What is data preparation for machine learning? What steps are involved?

Data preparation includes gathering, cleaning, and enriching data to make it suitable for machine learning. Traditionally, data scientists would use tools like Python’s Pandas for data prep - to turn raw data into good data - but with Akkio, it’s now possible for non-technical people to connect and merge data sources via no-code AI workflows.

1. Data gathering

Since data is the oil for machine learning engines, data collection is an important step. The data sources you’ll use depend entirely on your use case.

For example, marketers might use HubSpot data to predict customer churn, while sales teams might use Salesforce data for sales funnel optimization, and finance teams might use Snowflake data of transactions to detect financial fraud.

Besides these special-purpose data tools, teams can also build models off of simple Google Sheets or CSV files. The right data will include your target variable, whether that’s churn, conversion, attrition, or something else.

For use-cases like deep learning, big data gathering is a particularly crucial part of the artificial intelligence process, as these models are very data-intensive. Those use-cases typically involve large-scale cloud data platforms like Snowflake.

2. Data cleansing

After gathering data, it often needs to be cleaned. This is also called data preprocessing, and is a vital step in the data preparation process. Akkio does a lot of data processing in the background, but it’s still important for any machine learning project to be powered by clean, high-quality data.

Data cleansing involves removing noise and correcting errors in the data. It often goes hand in hand with related preprocessing techniques such as feature engineering, dimensionality reduction, or normalization. Further, it’s important that the data has consistent formatting.

For example, if you’re trying to use location as a variable in predicting financial fraud, but the location of “New York” is formatted in different places as “New York,” “NY,” and “NYC,” the model will have worse quality, as it won’t treat those values as part of the same category. Data cleansing will entail making sure these entries are all listed as the same value.
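As a rough sketch of what that clean-up can look like, the snippet below maps the different spellings to one canonical value. The alias table is a hypothetical example that you would build from the variants actually found in your own data.

```python
# Hypothetical alias table: every known variant maps to one canonical label.
CANONICAL_LOCATIONS = {
    "new york": "New York",
    "ny": "New York",
    "nyc": "New York",
}


def normalize_location(raw):
    """Return the canonical spelling for a location, if one is known."""
    key = raw.strip().lower()
    # Unknown values are kept (minus stray whitespace) rather than guessed at.
    return CANONICAL_LOCATIONS.get(key, raw.strip())


locations = ["New York", "NY", " nyc ", "Boston"]
cleaned = [normalize_location(value) for value in locations]
```

After this step, all three New York variants count as a single category, while values that are not in the table pass through unchanged.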

3. Data enrichment

In the data enrichment step, we add new information to the data set by applying additional transformations or importing external sources of information that were not present in the original dataset.

For example, if you want to predict sentiment from customer feedback forms, and you’ve used both Google Forms and JotForms to collect feedback, you can enrich your overall training data by merging these two sources. With Akkio, merging data can be done in a few clicks simply by selecting the columns you’d like to merge on.

This will help your ML models learn more accurate patterns and make better predictions based on your data. The end result will be a much more valuable dataset for your ML models since it will contain richer contextual information that was previously unavailable in your original, smaller dataset.

Work out what data you need for your application

Data is king in machine learning. The more data you have to work with, the better your AI will be at performing the tasks you want it to. This is why working out what data you need for your AI is so important. You'll also need to decide how and where that data should be stored.

For example, suppose you want to predict whether a new sales lead will close. You may want to know the following:

The lead's company size and revenue
The lead's product category (e.g., software, hardware)
The lead's stage of the buyer's journey (e.g., qualify, pre-qualify)
The internal point of contact (e.g., which salesperson is engaging)

The more potentially relevant data you can get, the better. No single data point will be perfectly predictive of conversion probability, but you can use a wide range of data points to better predict your selected KPI.

Work out what kind of data it is

Different machine learning models will require a different type of data and a different approach. Let's look at two main kinds in detail: Classification and forecasting.

Classification

In classification, you're trying to determine what category a data point falls into. For example, you might want to know whether a new lead is likely to convert into a paying customer for your company's products and services.

In this case, you'll need to collect some information about the lead (their name, their job title, their company name, and so on) and then use that information to make a prediction about whether they'll become a paying customer. To do this, you'll need to build a model that can analyze the data you have collected and estimate whether the lead is likely to convert into a paying customer or not. A common approach here is logistic regression.
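To give a feel for how logistic regression works, here is a minimal sketch with a single numeric feature, trained by plain gradient descent. The feature (number of sales touchpoints), the toy data, and the learning rate are all assumptions made for illustration. A real project would use an established library or an AutoML tool rather than hand-written training code.

```python
import math


def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Estimated probability that a lead with feature value x converts."""
    return sigmoid(w * x + b)


# Hypothetical toy data: touchpoints per lead, and whether the lead converted.
touchpoints = [0, 1, 2, 3, 4, 5]
converted = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(touchpoints, converted)
```

Calling `predict_proba(5, w, b)` then gives the model's estimated conversion probability for a lead with five touchpoints, which you can compare against a threshold such as 0.5 to get a yes or no answer.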

A related but distinct task is clustering. In clustering models like K-means clustering, you're trying to group similar objects together based on the similarities between them, without needing labels that say which group each object belongs to.

For example, you might want to group your leads into segments that behave alike, and then see which segments tend to convert. You could do this with the K-means algorithm, where k is the number of clusters you want to create, and each cluster contains a set of similar objects.
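The sketch below shows the core K-means loop in plain Python: assign each point to its nearest centroid, then move each centroid to the average of its points. The two-dimensional lead features are invented for illustration, and the centroids are initialized from the first k points purely to keep the example deterministic, whereas real implementations usually pick starting points more carefully.

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels


def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, labels)."""
    # Simplifying assumption: use the first k points as starting centroids.
    centroids = list(points[:k])
    for _ in range(iterations):
        labels = assign(points, centroids)
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # leave a centroid in place if its cluster is empty
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assign(points, centroids)


# Hypothetical leads described by two made-up numeric features.
leads = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 10.0)]
centroids, labels = kmeans(leads, k=2)
```

Leads that share a label ended up in the same cluster. Note that the algorithm never sees whether a lead converted, which is what separates clustering from classification.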

While machine learning engineers would traditionally have to select the model architecture and parameter details, now AutoML tools like Akkio can be used to automatically select and build models for you.

Forecasting

In regression models like linear regression, you're trying to predict a continuous outcome variable based on one or more continuous predictor variables.

The outcome variable could be anything from how many customers someone has acquired over time to how much they'll spend online today. In contrast to classification problems, which are about discrete outcomes (e.g., fraudulent or not fraudulent), forecasting is about predicting continuous outcomes (e.g., the dollar amount of revenue over time).
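As a minimal illustration of predicting a continuous outcome, the sketch below fits a straight line to a handful of points using ordinary least squares and then extends it one step ahead. The monthly revenue figures are made up and deliberately lie on a perfect line, which real data never does.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: month number and revenue (in thousands of dollars).
months = [1, 2, 3, 4, 5]
revenue = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(months, revenue)
forecast_month_6 = slope * 6 + intercept
```

The output is a number on a continuous scale rather than a category, which is the key difference from the classification case.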

Work out how to collect it

In most cases, internal teams are already collecting relevant data in some way. For example, your sales and marketing teams are likely already using a CRM of some sort - even if that’s just a Google Sheets or Excel file - and this data may be enough for modeling, as long as one of the columns represents the KPI you’re interested in (such as conversion), and there are other columns that are potentially predictive of that KPI.

If the needed data isn’t in one place but several, you can use Akkio to merge this data on shared columns.
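Conceptually, merging on a shared column is a join. The sketch below shows a simple inner join in plain Python, which keeps only the rows whose key appears in both sources. The `email` key and the field names are hypothetical, and the sketch assumes each key appears at most once in the second source. It illustrates the idea only and is not a description of how Akkio implements merging.

```python
def merge_on(left_rows, right_rows, key):
    """Inner join: combine rows from both sources that share the same key."""
    # Assumption: each key value appears at most once in right_rows.
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


# Hypothetical rows from two separate tools.
crm = [
    {"email": "a@example.com", "company_size": 50},
    {"email": "b@example.com", "company_size": 500},
]
billing = [
    {"email": "a@example.com", "converted": 1},
    {"email": "c@example.com", "converted": 0},
]

merged = merge_on(crm, billing, "email")
```

Only the lead that appears in both sources survives the join, so it is worth checking how many rows you lose when merging.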

If, however, the data you need isn’t there at all, you’ll need to work out how to collect it, which typically involves a bottom-up approach of using existing resources to collect data. For example, you could have your sales teams implement Salesforce, and collect relevant data with each sales interaction, and you could have your marketing team implement HubSpot to collect relevant data from marketing leads.

In short, your training set should reflect relevant, real-world metrics. Since your input data is automatically split, with a portion held out as a test set to measure accuracy, it’s especially important to have high-quality data in place.
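For readers curious about what such a split involves, here is a minimal sketch: shuffle the rows, then hold out a fraction for testing. The 20% test fraction and the fixed seed are assumptions chosen for illustration, not a description of what any particular tool does.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and hold out a fraction as the test set."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten rows of real data
train, test = train_test_split(rows)
```

The model learns only from `train`, and its accuracy is measured on `test`, which it has never seen.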

Check the quality of collected data

As we’ve seen, data is the fuel for AI, so to build accurate and robust models, your data should be accurate, valid, complete, consistent, and uniform. Errors in data are inevitable, so to minimize them, data quality is everyone’s responsibility. Let’s look at five characteristics of quality data that every data team should know.

1. Validity

Validity is how well the data measures what it’s supposed to measure. For instance, to predict customer satisfaction, you’d need a dataset with customer satisfaction data, like responses to a questionnaire that asks customers to rate their satisfaction.

If there was a glitch in the tracking system, such that ratings didn’t match up with their corresponding tickets, then the data would be invalid.

2. Accuracy

Accuracy is how well the data reflects the real value of the measurement. Going along with the customer satisfaction example, if a customer tells you how satisfied they are, that’s the ground truth. If, in contrast, you use an employee-determined value for customer satisfaction, you’ll have fairly inaccurate data.

3. Completeness

Completeness is the degree to which relevant data has been collected. For example, if you’re trying to predict customer satisfaction, but have very few satisfaction scores to train on, then your data has low completeness.

4. Consistency

Consistency is the degree of agreement between different sources. For instance, if you’re using customer support data from email and Intercom, you’ll need to ensure that customer satisfaction is measured in exactly the same way in both cases, such as a 1 to 10 rating.

5. Uniformity

Uniformity means that different systems refer to the same value in the same format. For example, the word “male” and the letter “M” may both refer to the male gender, but a computer would process them as two different categories. Ultimately, the closer the format, the higher the uniformity, and the better the data.

Data quality is crucial not only for AI, but even for accurate data analysis and visualization. Since AI learns from data, it’s important to use data transformation and feature selection to ensure that the AI has relevant, high-quality data to learn from.
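A few of these characteristics can be checked with very little code. The sketch below measures completeness, flags ratings outside an assumed 1 to 10 scale as a rough validity check, and lists the distinct labels in a column to surface uniformity problems. The sample values are invented, and missing entries are assumed to be represented as `None`.

```python
def completeness(values):
    """Share of entries that are not missing (None)."""
    return sum(v is not None for v in values) / len(values)


def invalid_ratings(values, low=1, high=10):
    """Entries that are present but fall outside the expected rating scale."""
    return [v for v in values if v is not None and not (low <= v <= high)]


def distinct_labels(values):
    """Sorted distinct values; near-duplicates hint at uniformity issues."""
    return sorted(set(values))


# Hypothetical sample columns.
ratings = [8, None, 10, 12, 7]
genders = ["male", "M", "female", "F", "male"]

share_present = completeness(ratings)
out_of_range = invalid_ratings(ratings)
labels_found = distinct_labels(genders)
```

Here `labels_found` contains four labels where only two were intended, which is exactly the kind of mismatch that uniformity is about.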

Prepare data for use in your ML application

There are many steps that can be taken to prepare data for use in your machine learning application.

The first step is de-duplication and the removal of irrelevant observations. This is important because duplicates and irrelevant observations can skew the results of an analysis and decrease accuracy. This is especially important when you’re merging data, which can inadvertently introduce duplicate rows.
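A minimal sketch of de-duplication, assuming each row is a dictionary with simple hashable values, keeps the first occurrence of every row and drops exact repeats. The sample rows are made up.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        # A sorted tuple of items identifies a row regardless of key order.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# Hypothetical rows, where the third repeats the first.
rows = [
    {"lead": "Acme", "stage": "qualify"},
    {"lead": "Globex", "stage": "pre-qualify"},
    {"lead": "Acme", "stage": "qualify"},
]
deduplicated = drop_duplicates(rows)
```

This only catches exact duplicates. Rows that differ in spelling or formatting need to be standardized first.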

The next step is data filtering, or selecting a subset of data from a larger set of data. This is mainly used to remove outliers, or values that are very different from other values in the dataset, which can be caused by issues like measurement errors, or simply natural variation.
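One common rule of thumb for spotting outliers, shown here as a sketch rather than a recommendation, is to drop values that lie more than 1.5 times the interquartile range outside the middle half of the data. The 1.5 multiplier and the sample values are assumptions. Because an outlier can also be genuine natural variation, it is worth inspecting what gets removed.

```python
import statistics


def filter_outliers(values, multiplier=1.5):
    """Keep values within multiplier * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical deal sizes with one suspicious entry.
deal_sizes = [10, 12, 11, 13, 12, 11, 95]
filtered = filter_outliers(deal_sizes)
```

In this toy example the value 95 falls far outside the range of the other entries and is filtered out, while the rest are kept.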

Another step is to handle missing data, or gaps in the dataset, where some numbers are missing, or some columns are blank. To handle missing data, you should fill in as many blanks as possible, such as with the median value of the column.
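Filling blanks with the column median can be sketched in a few lines. This example assumes missing entries are represented as `None` and uses made-up values.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the values that are present."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    return [median if v is None else v for v in values]


# Hypothetical column with two blanks.
deal_sizes = [100, None, 300, 200, None]
filled = fill_missing_with_median(deal_sizes)
```

The median is often preferred over the mean here because a few extreme values pull it around less.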

Lastly, there's the process of validation and quality assurance. Validation is important because it provides assurance that your data meets the expectations needed for quality analysis.

With Akkio, these complex data cleaning steps are done in the background, ensuring that machine learning models can be automatically built without any manual intervention. With tools like Microsoft Azure ML or Amazon AutoML, data science professionals are still needed, which makes them inaccessible to most.

Examples of prepared data from Akkio’s demos

There are many ways to get AI training data, with different degrees of quality. Since data quality is critical for building robust and accurate models, businesses should be picky about where they get data from. Perhaps the best quality data sources will come from internal company tooling—or software from Salesforce to Snowflake, which work with well-organized data by design.

There’s also a lot of free training data available from sources like Kaggle, which offers nearly 90,000 datasets. A variety of these are used for Akkio’s demos, from fraud detection to churn prediction.

You can follow these AI flows as simple tutorials to see what prepared data looks like across various data types.

Conclusion

We’ve now seen what data preparation is - perhaps not as daunting as it originally appeared. Data preparation is a key component of building robust, accurate, and fair models, so it’s worthwhile to spend time on this process, but nowadays, it’s much easier than it used to be.

Try Akkio for free to see how easy data preparation can be.
