Augmenting Data With AutoML: Fuzzy Merge

Leveraging AI for business applications depends on building and deploying accurate machine learning (ML) models. The more accurate the model, the greater the business value. Bringing more relevant data to the table when training models usually improves prediction performance.

A Basic Example of Data Augmentation

Consider, for example, ShovelCo, a manufacturer of high-end, direct-to-consumer snow shovels. The company captures current and past customer information, including name, address, and phone number. As at any business, some of the prospects in its database will purchase over time. Still, ShovelCo could drive a lot more revenue if it could predict which people are actively in the market for a new snow shovel and focus its marketing on those customers.

Training a machine learning model on historical purchase data will capture general trends, making the revenue acquisition engine more efficient. For example, people who live in snowy regions are more likely to buy, and they tend to purchase during the winter months. Such a model yields a meaningful improvement in business velocity, but what if ShovelCo could combine each prospect’s zip code with historical weather data?

By taking the date and zip code of each past purchase and merging in the weather reports for the week around the purchase date, ShovelCo can train an AI model that recognizes when customers are most likely to need a shovel. Once the model is trained, the company can run it in real time: take the latest weather forecast, merge it with the prospect database, and predict which customers are most likely to buy a shovel ahead of an upcoming snow day. ShovelCo now generates revenue far more efficiently and gains a clear edge in its market.
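The weather merge described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`zip`, `date`, `snow_week_in`), the sample values, and the three-day window on either side of the purchase are all made up for the example and are not taken from any real ShovelCo or Akkio dataset.

```python
from datetime import date, timedelta

# Hypothetical purchase history; field names are assumptions for illustration.
purchases = [
    {"customer": "A", "zip": "02139", "date": date(2023, 1, 10)},
    {"customer": "B", "zip": "94105", "date": date(2023, 1, 10)},
]

# Hypothetical daily snowfall in inches, keyed by (zip code, date).
snowfall = {
    ("02139", date(2023, 1, 8)): 4.0,
    ("02139", date(2023, 1, 9)): 6.5,
    ("02139", date(2023, 1, 12)): 1.0,
    ("94105", date(2023, 1, 9)): 0.0,
}


def add_weather(purchase, snowfall, days=3):
    """Attach total snowfall for the window of +/- `days` around the purchase."""
    total = 0.0
    for offset in range(-days, days + 1):
        day = purchase["date"] + timedelta(days=offset)
        # Days with no weather record contribute nothing to the total.
        total += snowfall.get((purchase["zip"], day), 0.0)
    return {**purchase, "snow_week_in": total}


# Each purchase row now carries a weather feature a model could train on.
augmented = [add_weather(p, snowfall) for p in purchases]
```

The same function could be pointed at a forecast instead of historical records to build the real-time inputs the paragraph above describes.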

Fuzzy Matching Without Unique IDs

Like our hypothetical ShovelCo, today's businesses are built on the backbone of big data. They often already have the information in-house that will allow them to unlock massive business value. Unfortunately, that data lives in multiple systems. Pulling it all together into a single database to use with a machine learning platform can be surprisingly difficult and time-consuming -- particularly when you try to match records that don’t have unique identifiers.

Matching records gets even harder when you need to augment your data with external information, such as supplementing a business email address with third-party demographics and firmographics. Historically, the task of pulling together disparate data sources has fallen on the data scientist -- if your organization is lucky enough to have one.

A fuzzy match, or fuzzy merge, matches records that are similar but not identical. Two quick examples: matching the email craig.wisneski@akk.io with the name Craig Wisneski, or matching facility ID "615" in one data system with "00615" in another.
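Simple cases like these can often be handled by normalizing both sides before comparing them. The sketch below is not Akkio's implementation. It is a minimal standard-library illustration, and the helper names (`normalize_id`, `email_matches_name`) are hypothetical.

```python
import re


def normalize_id(raw):
    """Strip whitespace and leading zeros so "615" and "00615" compare equal."""
    stripped = str(raw).strip().lstrip("0")
    return stripped or "0"  # keep a lone zero for IDs such as "000"


def email_name_tokens(email):
    """Split the local part of an email address into lowercase tokens."""
    local = email.split("@", 1)[0]
    return [t for t in re.split(r"[._\-+]", local.lower()) if t]


def email_matches_name(email, name):
    """True when the email's local part spells out the name, token by token."""
    return email_name_tokens(email) == name.lower().split()


same_facility = normalize_id("615") == normalize_id("00615")
same_person = email_matches_name("craig.wisneski@akk.io", "Craig Wisneski")
```

Normalization like this only covers formatting differences. It does nothing for abbreviations or misspellings, which is where similarity scoring comes in.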

Those examples are relatively easy -- it gets more complicated from there. You can imagine two records with multiple data fields that are similar, though no single field matches exactly. For instance, you’d likely want to match craig.wis@akk.io with another record that has FirstName=Craig, LastName=Wiz, Company=Akkio.
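One way to reason about a multi-field match like this is to score each field's similarity and combine the scores. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. It is an assumption-laden illustration, not the ML-based approach Akkio uses, and the second candidate record, the equal field weights, and the 0.7 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(email, record):
    """Average the per-field similarity between an email and a contact record."""
    local, _, domain = email.partition("@")
    first, _, last = local.partition(".")
    company_guess = domain.split(".")[0]  # "akk" from "akk.io"
    scores = [
        similarity(first, record["FirstName"]),
        similarity(last, record["LastName"]),
        similarity(company_guess, record["Company"]),
    ]
    return sum(scores) / len(scores)


candidates = [
    {"FirstName": "Craig", "LastName": "Wiz", "Company": "Akkio"},
    {"FirstName": "Carla", "LastName": "Weiss", "Company": "Acme"},  # made-up decoy
]

email = "craig.wis@akk.io"
THRESHOLD = 0.7  # hypothetical cutoff; a real one would be tuned on labeled pairs

best = max(candidates, key=lambda r: match_score(email, r))
matched = best if match_score(email, best) >= THRESHOLD else None
```

Averaging treats every field as equally informative. In practice some fields deserve more weight than others (a matching company domain is weaker evidence than a matching last name), which is one reason a learned matcher can outperform a fixed formula.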

Akkio has built a state-of-the-art fuzzy matching tool that leverages our in-house AI technology to make it simple to join datasets. Yes, we’re using ML to get your data ready for ML. Here is how it works.

Using Fuzzy Match

When you create a new flow, start with two datasets that you want to combine. The first dataset you select -- your “step 1” -- is your primary dataset. It should be the reference or trigger dataset (i.e., the primary data you’ll feed into your model to make a prediction). In your second step, add the dataset with backing information, the one that augments your first. As your third step, select “Merge Data.” Scroll down to “Advanced,” and you’ll see the option to combine on “Exact Match Only” or “Fuzzy Match.”

When Fuzzy Match is selected, you can choose any two columns to match. Add as many more columns as you like to refine the match, then click "Merge." Once you’ve combined your data, you can move on to the Predict step, where you can quickly build an AI model and check its performance.

Deploying Your Model

You can deploy your machine learning model in a few clicks as a web app, in a few lines of code with our API, or through no-code automation with our Zapier integration (with other integrations on the way).

Once a model with a merged dataset is deployed, here’s how the data flow works. You pass in data that matches the format of the primary dataset. Akkio will try to look up matching data from your second data source (your augmentation set) and then run the matched records through your model. The model returns (1) the merged record, (2) the model's prediction, and (3) the model's confidence in that prediction, expressed as a percentage likelihood.
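The lookup-then-predict flow can be pictured with a small, self-contained mock. Nothing here calls or represents Akkio's actual API: the function names, the field names, the rule-based `toy_model`, and its confidence numbers are placeholders invented purely to show the shape of the three-part result.

```python
def normalize_id(raw):
    """Strip whitespace and leading zeros so "615" and "00615" compare equal."""
    return str(raw).strip().lstrip("0") or "0"


# Hypothetical augmentation set, indexed by normalized facility ID.
augmentation = {
    normalize_id(row["facility_id"]): row
    for row in [
        {"facility_id": "00615", "region": "Northeast", "avg_snow_in": 42.0},
        {"facility_id": "00720", "region": "Southwest", "avg_snow_in": 0.5},
    ]
}


def toy_model(record):
    """Stand-in for a trained model; the rule and percentages are made up."""
    snowy = record.get("avg_snow_in", 0.0) > 10.0
    return ("will_buy", 85.0) if snowy else ("will_not_buy", 70.0)


def predict(incoming):
    """Look up augmentation data, merge it in, then score the merged record."""
    match = augmentation.get(normalize_id(incoming["facility_id"]), {})
    merged = {**match, **incoming}  # incoming values win on shared keys
    prediction, confidence = toy_model(merged)
    return {"record": merged, "prediction": prediction, "confidence_pct": confidence}


# The caller sends only primary-dataset fields; the rest is filled in by lookup.
result = predict({"facility_id": "615", "customer": "ShovelCo prospect"})
```

Note that the caller never supplies the augmentation columns. When no match is found, this mock falls back to scoring the unaugmented record. How a real deployment handles a missed lookup is not specified here.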

Akkio's mission is to make it incredibly easy for any user to build with AI. Getting the right data together is a core part of any AI strategy, so we created an easy-to-use tool that lets you quickly merge and augment records -- no data science or software skills required.

Why more data is (generally) better

All machine learning models have one commonality: They aim to minimize prediction error. In order to do so, they need data. More data often leads to more accurate models because it allows the model to learn the underlying patterns and relationships in the data better.

There are a few reasons for this. For one, more data provides more examples of the different types of inputs and outputs that the model can learn from. This allows the model to learn the correct mapping between input and output more effectively.

In addition, more data gives the model more chances to learn the relationships between different features. The model can learn which features are important for prediction and which are not. This is especially important in deep learning models, which learn by extracting features from data automatically.

Finally, more data allows the model to generalize better. That is, the model can learn from the data and then apply what it has learned to new data that it has never seen before. This is important because in the real world, we very rarely encounter data that is exactly like the data we used to train our models.

Of course, there are limits to how much data a model can effectively learn from. At some point, adding more data will not lead to further improvements in accuracy. But in general, more data leads to better machine learning models.

Data merging can help any use case

For instance, consider a sales team building a lead scoring model. They have data in HubSpot (contact information, activity history, etc.) and data in Salesforce (deals won, products purchased, etc.). Merging these two data sources would give the team more complete data to train their model on, leading to a more accurate lead score.

Another example is a team that wants to do predictive maintenance on a manufacturing line. They have sensor data in CSV files and machine data in BigQuery. Merging these two data sources would give the team more complete data to train their model on, leading to more accurate predictions about when equipment will need maintenance.

Or consider a human resources team that wants to build a model to predict employee turnover. They have data in an HRIS (human resources information system) and data in a learning management system. Merging these two data sources would give the team more complete data to train their model on, leading to more accurate predictions about which employees are likely to leave the company.

The possibilities are endless! Whatever machine learning task you’re working on, there’s a good chance that data merging can help you get more training data and improve your results.

Published on January 8, 2024