In machine learning, overfitting refers to the problem of a model fitting data too well. In this case, the model performs extremely well on its training set, but does not generalize well enough when used for predictions outside of that training set.
On the other hand, underfitting describes the situation where a model is performing poorly on its training data - it doesn't learn much from that data.
The problem with underfit models is that they fail to capture the relationship between the input features and the target variable. The goal of any machine learning algorithm is to "learn" patterns in the data based on how it was presented through examples, without explicitly defining what those patterns are.
If there are no such patterns present in our data (or if they are too weakly defined), then the machine can only come up with something that isn’t there, and make up predictions that don’t pan out in the real world. This is why feature selection, or using the right training data, is crucial to building robust supervised learning models.
Let’s explore overfitting and underfitting in detail, as well as how you can mitigate these issues when building machine learning models.
Bias and variance are two important concepts in machine learning. Bias measures how far a model's predictions are, on average, from the actual values, while variance measures how much those predictions shift when the model is trained on different samples of the data.
So how do these characteristics relate to overfitting and underfitting?
Let's say we have three models to predict revenue. A good model won’t perfectly fit the training data, but it can generalize to new data points well. On the other hand, an underfitting model fails to find patterns at all, and typically has high bias and low variance, while an overfitting model fails to generalize to new data, and typically has low bias and high variance.
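To make the three kinds of model concrete, here is a minimal sketch in plain Python. The data, the function names, and the choice of a nearest-neighbor "memorizer" as the overfitting model are all assumptions made for illustration, not something a particular platform does.

```python
# Toy data (assumed for illustration): revenue grows by about 2 per unit of x,
# with alternating +1/-1 noise on the training targets.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_x)]

# Unseen points between the training inputs, scored against the noise-free trend.
test_x = [x + 0.4 for x in train_x]
test_y = [2 * x for x in test_x]


def fit_mean(xs, ys):
    """Underfit: ignores x and always predicts the average target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Reasonable fit: ordinary least-squares line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Overfit: returns the target of the nearest training point, noise included."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model over paired inputs and targets."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:10s} train MSE={results[name][0]:.2f}  test MSE={results[name][1]:.2f}")
```

The memorizer reaches zero error on the training set but does worse than the straight line on unseen points, while the mean model does poorly on both. Comparing training error with held-out error in this way is the usual first check for either problem.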
Both bias and variance can be uncovered by looking at error metrics. AI platforms like Akkio automatically create test data from your input data, as well as a cross-validation dataset, in order to analyze the performance of the models. For example, k-fold cross-validation can be used to estimate the skill of your model on new data, and help you pick models that keep both bias and variance low.
AutoML is used to automatically make a number of models with a varying number of features, from a simple model like logistic regression or a decision tree to more complex deep learning models. Finally, the model selection process involves balancing the bias-variance trade-off to find the best model across the test set and validation set.
If you’ve built AI models, you probably know that overfitting is one of the most common problems in real-world machine learning.
This type of error occurs when we build an overly complex model that tries to learn too much information from the dataset, which makes it hard to generalize to new information. Overfitting may happen when the model learns too much from too little data, so it processes noise as patterns and has a distorted view of reality.
It's like learning guitar but only ever practicing one song. You’d get very good at it, but when asked to strum a new song, you’d find that what you learned wasn’t all that useful.
Here's an AI example: Let's say I want to predict whether a lead will convert, but every lead in my training data set is a profitable conversion. So I build a model, fit it on my data set, and the model predicts that every lead will convert. This means that my model has overfit to an unrepresentative sample: it matches the training data perfectly, but it never had the chance to learn that leads often don’t convert.
Overfitting describes the phenomenon where a machine learning model (typically a neural network) is so complex and intricate that it can memorize nearly every case in its training data, but fails to generalize its predictions to unseen data points. This means that even when the model predicts correctly on the training set, it is not able to perform well in new cases.
The reason behind this is that a complex model requires a high number of parameters to capture the underlying relationships in the data. If these parameters are not carefully tuned, they may end up capturing irrelevant aspects of the data — leading to overfitting. Parameters in a network must have some degree of accuracy and precision; otherwise, you'll end up with an uninterpretable blob of numbers instead of an algorithm capable of making predictions and decisions.
In other words, if you overfit your model by providing it with too much information or too many free parameters, then your model will do a poor job at predicting future outcomes. As we mentioned earlier, this phenomenon applies equally well to humans when building models based on limited information or historical examples.
Suppose you want to predict whether a financial transaction is fraudulent, but you only have 100 transactions to work with. If you build a model that has 1,000 parameters, then your model will do a poor job of predicting future outcomes, as it’s overly complex for the data at hand. For this reason, it's important to constantly assess the general fit of your hypothesis by testing it on data points it wasn't trained on, rather than only checking how well it fits the training data.
There are many ways we can avoid overfitting while still using powerful models, including increasing training data, reducing complexity, regularization, cross-validation, dropout, and feature selection.
Larger datasets reduce overfitting by enabling more data points for each feature. For example, if you have 100 features in your model but train it on 2 million examples, then there will be 200 million data points used by the final model. This number of examples is far greater than what any human could possibly observe or process manually, so there's a lot of room for each feature to capture some variability in your target variable.
That means if one of your features isn't capturing as much variation as you think it should, then there will still be plenty of remaining variation across all other features for it to explain.
In summary: Larger datasets provide more observations per feature and enable higher-order inference methods.
Reducing model complexity, in terms of model architecture, can also reduce overfitting. This is because simpler models have fewer parameters, so each parameter can be estimated more reliably from the data available.
It’s also important that each independent variable you include carries its own information. Adding variables does not reduce overfitting by itself; a variable helps only when it has a genuine relationship with the target that the other variables don't already capture.
In order for any individual variable to make an accurate prediction about another variable, there have to be genuine relationships between them across different combinations of values for each variable. The more of your variables that meet this bar, the less likely it is that the model will latch onto spurious patterns.
We can also apply regularization methods such as dropout or penalization to reduce overfitting. These methods help our model generalize by keeping it from fitting the training data so closely that it fails when tested on new data with different characteristics.
Dropout is a method where, at each training step, a certain percentage of neural network nodes are randomly “dropped out” of the architecture. This keeps the network from relying too heavily on any single node, which discourages it from memorizing patterns that aren't present in real life and helps prevent overfitting.
With regularization methods like dropout, we are ensuring that our models generalize well across many different sets of input/output pairs without becoming overly sensitive to small differences between inputs/outputs within those sets.
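Penalization can be illustrated with a one-weight model. The sketch below assumes an L2 (ridge) penalty and made-up, already centered data; the function name `ridge_weight` is hypothetical. It is meant only to show how a larger penalty pulls the learned weight toward zero.

```python
def ridge_weight(xs, ys, penalty):
    """Closed-form weight for y ~ w * x with an L2 penalty on w.

    Minimizes sum((y - w*x)^2) + penalty * w^2, assuming xs and ys are
    already centered so that no intercept is needed.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Made-up centered data with a slope of roughly 2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.2, -1.9, 0.1, 2.1, 3.9]

weights = {penalty: ridge_weight(xs, ys, penalty) for penalty in (0.0, 1.0, 10.0, 100.0)}
for penalty, w in weights.items():
    print(f"penalty={penalty:6.1f}  weight={w:.3f}")
```

With a penalty of zero this is ordinary least squares. As the penalty grows, the weight shrinks, which makes the model less sensitive to small differences in its inputs at the cost of fitting the training data less tightly.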
Cross-validation is the process of partitioning our data set into separate training and validation sets, often several times over so that each part of the data takes a turn as the validation set. We use the training set to train our model, and the validation set to evaluate how well it performs.
Because the model is never evaluated on the same data it was trained on, a gap between training and validation performance is a clear sign of overfitting. This allows us to get a more accurate assessment of how well our model will generalize to still unseen data.
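As a rough sketch of the k-fold variant of this idea, the code below splits sample indices into k folds and averages the validation error of a simple least-squares line across them. The helper names and the toy data are assumptions for illustration; a real pipeline would normally shuffle the rows first and use a library implementation.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, val_indices) so each sample is validated exactly once."""
    indices = list(range(n_samples))
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds so sizes differ by at most one.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def fit_line(xs, ys):
    """Ordinary least-squares line, returned as a prediction function."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cross_val_mse(xs, ys, k):
    """Average validation MSE over k folds; each fold trains on the other k - 1."""
    fold_errors = []
    for train, val in k_fold_splits(len(xs), k):
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(sum((model(xs[i]) - ys[i]) ** 2 for i in val) / len(val))
    return sum(fold_errors) / k


xs = [float(i) for i in range(10)]
ys = [3 * x + 1 for x in xs]  # noise-free toy data, so the error should be near zero
print(cross_val_mse(xs, ys, k=5))
```

Averaging over folds gives a steadier estimate than a single train/validation split, because every row contributes to the validation score once.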
Time-series cross-validation is a type of cross-validation specifically designed for working with time series data where we want to avoid "leakage" from future data into our training set. This is especially important when working with time series data because our model could learn patterns in the data that are specific to the future and not present in the past.
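One common way to do this is an expanding-window split, sketched below. The function name and the exact sizing rule are assumptions chosen for illustration; the only property that matters is that every training index comes before every validation index.

```python
def time_series_splits(n_samples, n_splits):
    """Return (train_indices, val_indices) pairs where training data always
    precedes validation data, and the training window grows with each split."""
    val_size = n_samples // (n_splits + 1)
    splits = []
    for i in range(n_splits):
        # Everything before train_end is training data; the next block is validation.
        train_end = n_samples - (n_splits - i) * val_size
        train = list(range(train_end))
        val = list(range(train_end, train_end + val_size))
        splits.append((train, val))
    return splits


for train, val in time_series_splits(n_samples=10, n_splits=3):
    print("train:", train, "validate:", val)
```

Unlike ordinary k-fold cross-validation, rows are never shuffled here, so the model is always evaluated on observations that come later than anything it was trained on.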
Dropout is a form of regularization where, at each training step, we randomly drop out a certain percentage of nodes from our neural network. This prevents the network from memorizing patterns that aren't present in real life and helps prevent overfitting.
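A minimal sketch of how this works on one layer's activations is shown below. It assumes the common "inverted dropout" formulation, where surviving activations are scaled up so their expected value stays the same; the function name and the example values are made up for illustration.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate` and
    scale the survivors by 1 / (1 - rate) so the expected value is unchanged."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At prediction time every node is kept and nothing is rescaled.
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # seeded only so the example is repeatable
hidden = [0.5, 1.0, 1.5, 2.0]
print(dropout(hidden, rate=0.5, rng=rng))
print(dropout(hidden, rate=0.5, rng=rng, training=False))
```

Dropout is applied only during training. When the model is used for prediction, all nodes are active, which is why the `training=False` call returns the activations unchanged.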
We can also apply a similar idea to other machine learning models, such as decision trees. In this case, we would randomly leave out a certain percentage of features each time a tree is built or a split is chosen, which is what random forests do. This prevents the trees from becoming too sensitive to any one feature and helps prevent overfitting.
Using too many features in our model can sometimes lead to overfitting. This is because, the more features we have, the more likely it is that we will find spurious relationships between those features and our target variable.
One way to combat this is to use feature selection methods such as forward selection or backward elimination. These methods help us select the most relevant features for our model and prevent us from using features that are unrelated to our target variable.
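Forward selection can be sketched as a greedy loop: start with no features, add whichever one improves a validation score the most, and stop when nothing helps. The example below assumes a 1-nearest-neighbor classifier as the scoring model and uses made-up rows with one informative column (`signal`) and one irrelevant column (`noise`); all names are hypothetical.

```python
def knn1_accuracy(train_rows, train_labels, val_rows, val_labels, features):
    """Validation accuracy of a 1-nearest-neighbor classifier on the given features."""
    correct = 0
    for row, label in zip(val_rows, val_labels):
        def dist(i):
            return sum((row[f] - train_rows[i][f]) ** 2 for f in features)
        nearest = min(range(len(train_rows)), key=dist)
        if train_labels[nearest] == label:
            correct += 1
    return correct / len(val_rows)


def forward_select(candidates, score_fn):
    """Greedily add the feature that most improves score_fn; stop when none does."""
    selected, best_score = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        score, feature = max((score_fn(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected


# Made-up data: the label follows `signal`; `noise` is unrelated to it.
train_rows = [
    {"signal": 0.0, "noise": 4.0}, {"signal": 0.1, "noise": 1.0},
    {"signal": 1.0, "noise": 3.0}, {"signal": 0.9, "noise": 0.0},
]
train_labels = [0, 0, 1, 1]
val_rows = [
    {"signal": 0.2, "noise": 2.8}, {"signal": 0.8, "noise": 1.2},
    {"signal": 0.05, "noise": 0.1}, {"signal": 0.95, "noise": 3.9},
]
val_labels = [0, 1, 0, 1]

selected = forward_select(
    ["signal", "noise"],
    lambda feats: knn1_accuracy(train_rows, train_labels, val_rows, val_labels, feats),
)
print(selected)
```

On this toy data the loop keeps `signal` and rejects `noise`, because adding the irrelevant column lowers validation accuracy. Backward elimination follows the same pattern in reverse: start with every feature and remove the one whose absence hurts the score the least.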
Underfitting is a situation where the model you build doesn’t capture the patterns in your data. In other words, the model is too simple, or too poorly trained, to make good predictions even on the data it was trained on.
There are many reasons why your AI model might underfit, and we’ll explore some of the most common ones below.
This should be obvious, but sometimes people forget that when building an AI model, you need a large enough sample size for it to work well. If you don’t have enough training examples for your model to “see” and learn from, then it will have limited knowledge of the target domain or topic.
Another reason why an AI model might underfit is if there are any mistakes in how its inputs and outputs are defined within your system or business process. For example, if during training you use inconsistent definitions for features, then your algorithm could end up with poor accuracy.
What about cases where there are no obvious mistakes? In these cases, remember that humans also make assumptions when they see patterns that aren't really present—and if there are truly no patterns in the data, then no model will be able to accurately make predictions.
Fortunately, there are several techniques that can be used to reduce underfitting in AI, such as increasing complexity, removing noise, and increasing training time. Let’s explore each of these methods in depth.
Reaching a balance in model complexity is key. Too simple and the model will struggle to make predictions that are useful in practice. Too complex and the model may overfit, or simply not be able to scale, which could limit its practical application.
Let’s explore this idea using an example: credit scoring in financial services. In many cases, financial institutions are required by law or regulation to produce reports on their customers’ creditworthiness based on factors like income and debt repayment history.
Using traditional statistical approaches such as linear regression, we can build potentially useful models for predicting customer credit risk — but these models often fail because they lack the necessary domain knowledge about human behavior and decision-making processes in finance.
To avoid this outcome, most banks now focus on factors like past loan payment history or revenue volume. This level of sophistication is possible because modern AI platforms provide us with powerful tools for building complex machine learning models at scale — without having to write any code.
One thing that is very helpful in reducing the risk of underfitting is removing noise from your training set. Noise refers to any information that doesn't help your AI system make accurate predictions or inferences but rather confuses or distracts your system from learning useful features and patterns in your data.
Removing noise allows you to focus on only important features/patterns in your dataset while keeping irrelevant information out so that you can build a strong foundation for building powerful machine learning models.
Let's take this one step further with an example dataset: Imagine you want to build an AI-based recommendation system for hotels, like the ones Expedia or Airbnb use today - but instead of recommending hotels based on location or price, you want recommendations based on things like whether people liked their stay (a sentiment score) or whether they are likely to book another hotel through your website after staying at one particular hotel (an intent score).
However, as soon as you begin to include more variables in your equations, you run into a problem: having too many variables in an equation may actually make it harder for your AI system to calculate accurate results.
This is where feature engineering comes into play - we can use this information to remove variables that are not important or don't help us predict the outcome that we care about. For example, a column like a user’s name or email address would be non-predictive, and therefore noise.
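As a rough illustration of removing such columns, the sketch below drops any text column whose value is different in every row. This is only a heuristic, and an assumption of this example rather than a rule: it needs more than a handful of rows to be meaningful, and the column names and values are invented.

```python
def drop_identifier_columns(rows):
    """Remove text columns whose value is different in every row.

    Heuristic assumption: a text column that never repeats (a name, an email
    address, an ID) cannot help the model generalize, so it is treated as noise.
    """
    columns = list(rows[0].keys())
    noisy = [
        col for col in columns
        if all(isinstance(row[col], str) for row in rows)
        and len({row[col] for row in rows}) == len(rows)
    ]
    cleaned = [{c: v for c, v in row.items() if c not in noisy} for row in rows]
    return cleaned, noisy


# Hypothetical hotel-review rows.
rows = [
    {"name": "Ana", "email": "ana@example.com", "city": "Lisbon", "sentiment": 0.9},
    {"name": "Ben", "email": "ben@example.com", "city": "Lisbon", "sentiment": 0.2},
    {"name": "Cy", "email": "cy@example.com", "city": "Porto", "sentiment": 0.7},
]
cleaned, dropped = drop_identifier_columns(rows)
print(dropped)
print(cleaned[0])
```

Here `name` and `email` are removed, while `city` is kept because its values repeat and `sentiment` is kept because it is numeric. A check like this is a starting point; deciding what counts as noise usually also takes knowledge of the problem.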
Related to removing noise is the idea of merging in relevant features. This process goes beyond simply removing noise and can help you improve your data with new information that can make your AI system more accurate.
For example, imagine you want to build a machine learning model to predict stock prices. A naive approach would be to use only historical prices as your input features. However, this approach would likely underfit because there are many other factors that affect stock prices, such as economic indicators, news headlines, and analyst ratings.
A better approach would be to include these additional features in your dataset so that your machine learning model can learn from them and make more accurate predictions.
Model training time is like the story of Goldilocks and the Three Bears: it’s about finding the sweet spot between too short and too long.
In general, and up to a point, the longer you train your model on a given dataset, the better the result will be. This is particularly the case with more complex predictive models trained on a lot of data. With a small number of epochs, you’ll end up with a model with poor performance. However, simple linear models don’t need a high training time.
That said, if you train your model for too many epochs, it will often end up overfitting to that dataset. Therefore, limiting training time is a way to reduce overfitting, which can be done with a technique called early stopping: training is halted once performance on a validation set stops improving.
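The early stopping rule itself is simple enough to sketch in a few lines. The version below assumes a "patience" setting, meaning the number of consecutive epochs without improvement that are tolerated before stopping; the function name and the validation losses are made up for illustration.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the model from best_epoch is the one to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # The loss never stalled for long enough, so training ran to the end.
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses: improving at first, then rising as the model overfits.
losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
best_epoch, stopped_epoch = early_stopping(losses, patience=2)
print(best_epoch, stopped_epoch)
```

In this example the validation loss is lowest at epoch 3, and training stops at epoch 5 after two epochs without improvement. In a real training loop the losses would be computed one epoch at a time, and the weights from the best epoch would be saved and restored.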
Traditionally, data science professionals would wrangle with tools like Python and scikit-learn to find the right settings over multiple iterations, and build models that aren’t overfitted or underfitted. With Akkio, the right training time is automatically determined to prevent making models that underfit or overfit the data.
To recap, overfitting occurs when the model fits the training data too closely, resulting in models that are very accurate on the training set but perform poorly once tested on new data. This is because the model learns trends, and noise, from the training set that may or may not be present in new data.
In contrast, underfitting occurs when a model fails to fit even the training set. This results in models that perform poorly on the training set and usually struggle when tested on new data as well.
The key takeaway is this: No-code AI solutions can help you avoid overfitting and underfitting by automating regularization and testing your model’s generalization performance across a wide range of values. With no-code AI, anyone can build robust models, without needing to hire data scientists or follow months of technical tutorials. Try out a free trial of Akkio to see this in practice.