Data augmentation is the process of modifying, or “augmenting,” a dataset with additional data. This additional data can be anything from images to text, and using it in machine learning helps improve model performance.
For example, say we wanted to build a model to classify dog breeds, and we have a lot of images of most breeds, except for pugs. As a result, the model wouldn’t be able to classify pugs well. We could augment the data by adding some (real or fake) images of pugs, or by multiplying our existing pug images (e.g. by replicating and distorting them to make them artificially unique).
This article will cover the basics of data augmentation, as well as provide a practical overview of how to augment data in real-world ML pipelines.
Data augmentation is crucial for many AI applications, as accuracy generally increases with the amount of training data. Research has repeatedly found that even basic data augmentation can substantially improve accuracy on image tasks such as classification and segmentation. Further, large neural networks, or deep learning models, need huge amounts of data, so they benefit even more from data augmentation techniques.
The problem is that most companies don’t have enough data to train their AI models. This is where data augmentation comes in: even if you’re starting with very little, you can end up with massive amounts of data to generate insights, predictions, and recommendations that were previously unavailable due to a lack of relevant information. Additionally, using a small amount of data can increase the risk of overfitting, while having more data points helps to counter that.
But it’s not just about having enough data — it’s also about having the right type of data for your particular use case. For example, if you’re trying to predict the outcome of a soccer game using an algorithm built for predicting stock market trends, you may end up with some pretty strange results.
Another important consideration is how you will collect and cleanse your incoming augmented data before feeding it into your AI model. If the data is in the same format as your pre-existing data, then it’s easy, and you can just merge it with your existing data. However, if you’re generating entirely new data or using a new data source, things get a little more tricky, and you’ll need to ensure consistency in data formatting.
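As a rough sketch of what that formatting cleanup can look like, here is a hypothetical example using pandas: a new data source arrives with different column naming and date conventions, and we normalize it to match the existing dataset before merging. The column names and date format here are illustrative assumptions, not from any particular source system.

```python
import pandas as pd

# Hypothetical new data source whose formatting differs from our existing data.
new_rows = pd.DataFrame({
    "Customer ID": ["A4", "A5"],
    "Signup Date": ["01/15/2023", "02/20/2023"],
})

# Normalize column names to the existing dataset's snake_case convention.
new_rows.columns = [c.strip().lower().replace(" ", "_") for c in new_rows.columns]

# Parse the date strings into proper datetime values.
new_rows["signup_date"] = pd.to_datetime(new_rows["signup_date"], format="%m/%d/%Y")

print(list(new_rows.columns))  # ['customer_id', 'signup_date']
```

Once the incoming data matches the existing schema, merging it is straightforward.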
There are, broadly speaking, two types of data augmentation: Real data augmentation and synthetic data augmentation.
Real data augmentation is when you add real, additional data to a dataset. This can be anything from text files with additional attributes (e.g. for images that have been labeled), to images of similar objects, or even videos of the original subject.
For example, adding a few extra attributes to an image file can make it easier for a machine learning model to recognize the object in question. We could add more metadata about each image (e.g. its name and a description) so that our AI model has more information about what each image represents before it starts training on those images.
This could make the model more likely to correctly identify which objects are in an image, and helps improve performance overall when it comes time for us to classify new images into one of our predefined categories like “cat” or “dog.”
Real data augmentation can also mean merging data from other sources, or simply other data files. For instance, suppose you’re building a model to score incoming leads. You could export data from your HubSpot account, but if you’re also using Salesforce, you could merge that data to build a larger training data file to fuel a potentially more accurate predictive model.
Alternatively, you might be predicting customer churn based on Snowflake data from your APAC sales team, but you’re also operating in the MENA region. You could merge these two datasets to build a larger training data file for more accurate churn modeling (we’ll explore a similar example in a walkthrough later in this article).
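To make the idea concrete, here is a minimal pandas sketch of that churn scenario: two small, hypothetical regional exports are stacked into one larger training file. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical churn exports from two regional sales teams.
apac = pd.DataFrame({
    "customerID": ["A1", "A2", "A3"],
    "monthly_spend": [120, 80, 200],
    "churned": [0, 1, 0],
})
mena = pd.DataFrame({
    "customerID": ["M1", "M2"],
    "monthly_spend": [95, 150],
    "churned": [1, 0],
})

# Stack the two regional datasets into one larger training file.
training_data = pd.concat([apac, mena], ignore_index=True)

print(len(training_data))  # 5 rows instead of 3
```

Because both exports share the same schema, a simple row-wise concatenation is all that's needed; mismatched schemas would require the formatting cleanup discussed earlier.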
Besides adding additional real data, you can also add synthetic data: fake data that simply looks real. This is helpful for complex tasks like neural style transfer, but it's useful across architectures, whether you're using GANs (Generative Adversarial Networks), CNNs (Convolutional Neural Networks), or other deep neural networks.
For instance, we could add some fake images of pugs to a dataset of dog images if we want to be able to accurately classify pugs, without having to go out and snap a bunch of pictures. This type of data augmentation is particularly useful for improving accuracy in models when the data is difficult, expensive, or time-consuming to collect.
In this case, we’re making artificial additions to the dataset. For example, let’s say that our original set of 1,000 dog breed images contains just 5 pug photos. Instead of collecting more real pug images, we can create an artificial one by duplicating an existing photo and distorting it slightly so that it still looks like a pug.
With various distortions, we can easily turn, say, 5 real photos into 50 synthetic ones. This way, the model will be able to classify pugs accurately, even with just a few real photos. These image augmentation methods can include a number of edits to the original image, such as changing pixel colors and using various geometric transformations, like a horizontal flip. Simple methods, like rotations, allow the model to become invariant to the rotation of the object. Invariance is key to building robust image models.
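A minimal sketch of these distortions, using only NumPy: the 8×8 array below stands in for a real photo, and the `augment` helper (a name chosen here for illustration) applies random flips, 90-degree rotations, and slight pixel noise to turn one image into many variants.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in for one real photo: a small 8x8 grayscale image.
image = rng.integers(0, 256, size=(8, 8)).astype(np.float32)

def augment(img, rng):
    """Apply a few simple, label-preserving distortions."""
    out = img.copy()
    if rng.random() < 0.5:                         # random horizontal flip
        out = out[:, ::-1]
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # random 90-degree rotation
    out = out + rng.normal(0, 10, size=out.shape)   # slight pixel-value noise
    return np.clip(out, 0, 255)                     # keep valid pixel range

# Turn 1 real image into 10 synthetic variants.
variants = [augment(image, rng) for _ in range(10)]
print(len(variants))  # 10
```

Real pipelines typically apply richer transformations (crops, color jitter, small-angle rotations), but the principle is the same: each variant keeps the label while varying the pixels.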
The goal is essentially to build a larger training set of augmented images, improving accuracy on computer vision tasks like image recognition, object detection, and classification. Libraries like Keras’ ImageDataGenerator help with this.
That said, synthetic data augmentation can also be done on natural language processing tasks, such as generating text with language models in order to, in turn, build more accurate language models for specific purposes.
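As a lightweight illustration of text augmentation (a simple baseline, not the language-model generation mentioned above), the hypothetical helper below creates variants of a sentence by randomly dropping and swapping words:

```python
import random

def augment_text(sentence, n_variants=3, p_drop=0.1, seed=0):
    """Generate simple text variants by randomly dropping and swapping words.

    This is a lightweight baseline; language-model paraphrasing is a
    heavier-weight alternative.
    """
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for _ in range(n_variants):
        # Randomly drop words, but never produce an empty sentence.
        kept = [w for w in words if rng.random() > p_drop] or words[:]
        # Swap two randomly chosen word positions.
        i, j = rng.randrange(len(kept)), rng.randrange(len(kept))
        kept[i], kept[j] = kept[j], kept[i]
        variants.append(" ".join(kept))
    return variants

variants = augment_text("the quick brown fox jumps over the lazy dog")
for v in variants:
    print(v)
```

Techniques like this can multiply a small labeled text corpus, at the cost of occasionally producing ungrammatical variants.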
These techniques are commonly tested on CIFAR-10 and CIFAR-100, which are popular image datasets for testing classification tasks, augmentation policies, and more.
Let's look at a few examples of real-world industry tools for data augmentation.
The first example is CARLA, an open-source driving simulator built on Unreal Engine with a Python API, used for autonomous driving research. It employs high-end graphics to provide a faithful representation of the real world, conducive to reinforcement learning with sensor and camera data.
The simulator is designed to be highly realistic and can be used as a tool for teaching machine learning algorithms how to drive cars safely in dynamic environments. In order to achieve this level of realism, the authors had to make some simplifying assumptions about physics and road conditions.
Methods like this have inherent limitations. For example, the simulator limits the number of objects that can appear on screen at any given time and uses precomputed textures instead of rendering everything from scratch in each frame. These trade-offs let researchers create detailed scenes while keeping processing overhead low enough not to hurt performance. To add further realism, the authors also added configurable weather conditions.
AugLy is an open-source data augmentation library from Facebook AI that lets users augment several data types, including photos, text, and audio, with transformations such as cropping images, adding misspellings to text, and changing the pitch of a voice recording.
Overall, there are over 100 data augmentations to choose from, including overlaying text and emoji on images, and even a tool that makes an image look like it was screenshotted from another social media platform.
These tools are useful for Facebook as a social network, since it needs to moderate content like fake news and misinformation campaigns, which often involve sharing slightly altered text and images to get around filters that only check for exact matches.
That said, data augmentation is also a useful technique for any machine learning user that would like to increase the size of their training data to build more robust and accurate models.
There are a number of challenges that need to be solved in order to create effective data augmentation methods, including scalability, heterogeneous datasets, relevance, data duplication, and validation.
In terms of scalability, the augmentation process must scale so that it can feed large numbers of models. It can take a while to set up a data augmentation system that creates a large volume of relevant, useful augmented data, so you’ll want to ensure the process can be repeated for future models.
In terms of heterogeneity, different datasets have different characteristics, which must be taken into account when creating augmented data. In other words, data augmentation won’t look the same across datasets and use cases.
Further, the augmented data must be relevant to the task at hand so that it will not cause confusion or negatively affect model performance. Also, it’s important to keep in mind that when creating augmented data, there must not be any unnecessary duplication of existing data. Instead, unique information should be added to create new insights.
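One simple, concrete check against unnecessary duplication is to count and drop exact duplicate rows before training. Here is a hypothetical pandas example with invented data:

```python
import pandas as pd

# Hypothetical merged training data with one accidental duplicate row.
df = pd.DataFrame({
    "customerID": ["A1", "A2", "A2", "A3"],
    "churned":    [0, 1, 1, 0],
})

n_dupes = df.duplicated().sum()                   # count exact duplicate rows
deduped = df.drop_duplicates(ignore_index=True)   # keep one copy of each row

print(n_dupes, len(deduped))  # 1 3
```

Exact-match deduplication is only a first pass; near-duplicates (e.g. the same record with slightly different formatting) require fuzzier matching.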
Finally, to ensure that the benefits of the enhanced data outweigh any risks, the augmented data should be validated using appropriate metrics before being used by machine learning models. For example, image-based augmented data could negatively affect model performance if it contains excessive background noise or irrelevant objects.
The validation step allows us to highlight potential problems and mitigate risks before they become an issue for model performance. All that said, one big challenge for many teams is in the technical complexity of augmentation, particularly when using a tech stack that involves libraries like Keras and TensorFlow. Let’s follow a practical walkthrough below to see how simple data augmentation can be.
Suppose that you're modeling vehicle traffic flow using data from a region with few cars. The real-world accuracy of such a model would be poor, because the training data isn't broad enough for the model to generalize to other traffic conditions.
There’s a world of data available that you may not already have in your training data, and data augmentation is the process of using this additional data to improve the optimization process of your models. Let’s walk through a tutorial on real data augmentation.
Adding a real dataset can be done in two ways, known as an exact merge and a fuzzy merge.
Let’s see how this looks in Akkio’s no-code AI flow, which makes it easy to augment data. After connecting a dataset in Akkio, you can click “Add Step” in the Flow Editor to select an additional dataset, whether it’s an Excel Table, a Google Sheets connection, or from Snowflake or Salesforce.
Next, hit “Add Step” again, and select “Merge.” You can now select the Primary Dataset and Supplementary Dataset to merge, as well as the column to match on. In the below example, two customer churn datasets are merged on the customerID column.
The default matching sensitivity is a “fuzzy merge,” which means that the datasets will be merged even if the column names aren’t completely identical. Alternatively, you can select “Exact Match Only” under the match sensitivity drop-down, which will only merge the datasets on exact column matches.
The next setting is “Merge Type,” which allows you to select between these two options:
If you’re adding new rows of data or new records, you’ll want to select the first option: “Keep all rows in primary dataset.” If you’re adding new columns of data, or attributes (such as customer phone numbers), you’ll want to select the second option: “Keep only rows which appear in both datasets.”
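As a rough analogy outside of Akkio, these two merge types behave like a left join and an inner join in pandas (the interpretation here is ours, and the data is invented):

```python
import pandas as pd

primary = pd.DataFrame({
    "customerID": ["A1", "A2", "A3"],
    "churned": [0, 1, 0],
})
supplementary = pd.DataFrame({
    "customerID": ["A1", "A3"],
    "phone": ["555-0101", "555-0103"],
})

# "Keep all rows in primary dataset" behaves like a left join:
# every primary row survives, with missing values where there is no match.
left = primary.merge(supplementary, on="customerID", how="left")

# "Keep only rows which appear in both datasets" behaves like an inner join:
# only customerIDs present in both datasets survive.
inner = primary.merge(supplementary, on="customerID", how="inner")

print(len(left), len(inner))  # 3 2
```

The left join keeps customer A2 with a missing phone number, while the inner join drops it entirely; which behavior you want depends on whether the new columns are essential for every training row.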
Just like that, you can add an additional dataset to your training data! After making a prediction, check the accuracy in the model report; you may find it improves after adding the additional data.
This form of data augmentation is particularly useful when teams aren’t just using one single master file. For example, sales teams might be using multiple Salesforce databases and Google Sheets files to keep track of things, while marketing teams might be using several Excel sheets alongside HubSpot. With Akkio’s data merge functionality, teams can effortlessly merge multiple datasets and data sources to build more accurate, robust models.
Ultimately, data augmentation is a key technique to build more accurate, robust models, whether you’re trying to predict churn, detect financial fraud, or build better image classification models. Simple preprocessing with data augmentation can even help teams build state-of-the-art models, through a superior training process.
At a high level, there are two ways to go about it: Synthetic data augmentation and real data augmentation. For structured data tasks, you’ll mostly use real data augmentation, by merging data from other sources, whether it’s HubSpot, Snowflake, Salesforce, or just another Google Sheet. For building something like an image classifier, you’ll use various techniques to increase the number of input images, and improve the model’s generalization ability.
With Akkio, it’s easier than ever to merge additional data sources, and build and deploy better AI models.