The Problem Of Overfitting
Introduction
Now you've seen a couple of different learning algorithms, linear regression and logistic regression. They work well for many tasks. But sometimes in an application, the algorithm can run into a problem called overfitting, which can cause it to perform poorly.
In this section, we'll explore what overfitting is, as well as the closely related, almost opposite problem of underfitting. In the next section, I'll share with you some techniques for addressing overfitting, in particular a method called regularization, a very useful technique for minimizing overfitting and improving your learning algorithms.
What is Overfitting?
To help us understand overfitting, let's revisit our example of predicting housing prices with linear regression, where the goal is to predict the price as a function of the size of a house.
Example: Housing Prices
Suppose your dataset looks like this:
The input feature (x) is the size of the house, and (y) is the price you're trying to predict. A linear fit to this data might look like this:
But this isn't a good model. As house size increases, the prices flatten out, and a straight line can't follow that trend, so it doesn't fit the training data well. The technical term for this is underfitting, also known as high bias.
High Bias
Underfitting means the model is too simple to capture the patterns in the training data. The term "bias" refers to the algorithm's strong preconception that the relationship between size and price is linear; that preconception is what keeps the model from fitting the data well.
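To see this numerically, here is a minimal sketch using NumPy with a tiny made-up dataset of house sizes and prices (the specific numbers are invented purely for illustration): fit a straight line and look at the training error.

```python
import numpy as np

# Made-up toy dataset: house size in 1000s of square feet vs. price in $1000s.
# The prices flatten out for the larger houses.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([300.0, 430.0, 480.0, 510.0, 525.0])

# Fit a straight line f(x) = w*x + b by least squares.
w, b = np.polyfit(x, y, deg=1)
predictions = w * x + b

print("fitted line: f(x) = %.1f*x + %.1f" % (w, b))
print("residuals (prediction - actual):", np.round(predictions - y, 1))
print("mean squared training error: %.1f" % np.mean((predictions - y) ** 2))

# The residuals are positive at both ends and negative in the middle: the line
# cannot bend to follow the flattening prices, which is the signature of underfitting.
```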
Next, let's look at another model:
With two features, (x) and (x^2), a quadratic function fits the data better:
This model generalizes better to unseen data:
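Here is the same made-up dataset again, this time with the quadratic fit, still just a sketch:

```python
import numpy as np

# Same made-up dataset as before.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([300.0, 430.0, 480.0, 510.0, 525.0])

# Fit f(x) = w1*x + w2*x^2 + b, i.e. a degree-2 polynomial with a squared feature.
coeffs = np.polyfit(x, y, deg=2)          # coefficients, highest degree first
predictions = np.polyval(coeffs, x)

print("mean squared training error: %.1f" % np.mean((predictions - y) ** 2))
print("predicted price for a 2.25 (1000 sq ft) house: %.1f" % np.polyval(coeffs, 2.25))

# The training error is far lower than for the straight line, and the prediction
# for the unseen 2,250 sq ft house lands sensibly between the observed prices
# for the 2,000 and 2,500 sq ft houses.
```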
Overfitting
At the other extreme, fitting a higher-order polynomial (for example, a fourth-order polynomial with features (x), (x^2), (x^3), and (x^4)) might pass through all the training points exactly, but it overfits:
You get a wiggly curve like this:
This model fits the training data too well, resulting in overfitting or high variance. It fails to generalize to new examples:
Overfitting is like trying to fit every single data point perfectly, even if it leads to a convoluted model that does not generalize well. The goal of machine learning is to strike a balance, finding a model that is just right.
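Here is the fourth-order polynomial on the same made-up dataset. With five parameters and only five training points, the curve can pass through every point exactly, yet its predictions nearby are poor.

```python
import numpy as np

# Same made-up dataset as before.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([300.0, 430.0, 480.0, 510.0, 525.0])

# Fit f(x) = w1*x + w2*x^2 + w3*x^3 + w4*x^4 + b.
# Five parameters, five points: the curve can pass through the training data exactly.
coeffs = np.polyfit(x, y, deg=4)
predictions = np.polyval(coeffs, x)
print("mean squared training error: %.6f" % np.mean((predictions - y) ** 2))  # essentially zero

# Now look just beyond the sizes seen in training.
for size in (3.2, 3.5):
    print("predicted price for a %.1f (1000 sq ft) house: %.1f" % (size, np.polyval(coeffs, size)))

# The curve turns downward just past 3,000 sq ft, so it predicts that these larger
# houses cost less than the 3,000 sq ft house in the training data. The model has
# fit the quirks of these five points rather than the overall trend.
```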
Generalization
The quadratic model, for instance, strikes a good balance, fitting the data reasonably well while also generalizing to new examples.
The Bias-Variance Tradeoff
In machine learning, we aim for models that balance underfitting (high bias) and overfitting (high variance). It's like Goldilocks trying to find the porridge that's just right:
- Too few features → underfitting/high bias
- Too many features → overfitting/high variance
- Just the right number of features → good generalization
Now, let's extend this concept to classification.
Overfitting in Classification
Consider a classification example with two features, (x_1) (tumor size) and (x_2) (age of the patient). The goal is to classify tumors as malignant or benign:
With logistic regression using just these two features, the decision boundary is a straight line:
A straight line may not separate the malignant and benign examples very well, which is again an example of underfitting, or high bias.
Now, if you add quadratic terms such as (x_1^2), (x_1 x_2), and (x_2^2), the decision boundary can curve, for example into an ellipse:
This model fits the data better, even though it still doesn't classify every training example correctly, and it is more likely to generalize well to new examples.
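To sketch this in code, the snippet below fits logistic regression twice on a small invented tumor dataset, once with just the raw features (x_1) and (x_2), giving a straight-line boundary, and once with the quadratic terms added, giving a boundary that can curve into an ellipse. The dataset and the scikit-learn setup are made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Invented toy data: x1 = tumor size (cm), x2 = patient age; y = 1 malignant, 0 benign.
X = np.array([
    [1.0, 30], [1.5, 55], [2.0, 40], [2.5, 62], [3.0, 35], [3.5, 50],   # benign
    [2.0, 70], [2.8, 45], [3.2, 68], [4.0, 40], [4.5, 60], [5.0, 30],   # malignant
])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Plain logistic regression: the decision boundary is a straight line in the (x1, x2) plane.
linear_model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Adding the quadratic terms x1^2, x1*x2, x2^2 lets the boundary curve,
# for example into an ellipse, while the model is still logistic regression.
quadratic_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(),
).fit(X, y)

print("training accuracy, linear features:   ", linear_model.score(X, y))
print("training accuracy, quadratic features:", quadratic_model.score(X, y))
```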
Extreme Overfitting
However, if you fit a high-order polynomial with many features, the decision boundary becomes overly complex:
This is an instance of overfitting and high variance. The model fits the training data too well, leading to poor generalization.
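Pushing the same sketch to an extreme (again with made-up data and an assumed scikit-learn setup), degree-6 polynomial features and very weak regularization give the model far more flexibility than twelve examples can pin down.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same invented toy data as in the previous sketch.
X = np.array([
    [1.0, 30], [1.5, 55], [2.0, 40], [2.5, 62], [3.0, 35], [3.5, 50],   # benign
    [2.0, 70], [2.8, 45], [3.2, 68], [4.0, 40], [4.5, 60], [5.0, 30],   # malignant
])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Degree-6 polynomial features (27 features from just x1 and x2) plus very weak
# regularization (large C) give the model far more flexibility than 12 examples justify.
overfit_model = make_pipeline(
    PolynomialFeatures(degree=6, include_bias=False),
    StandardScaler(),
    LogisticRegression(C=1e5, max_iter=10_000),
).fit(X, y)

print("training accuracy, degree-6 features:", overfit_model.score(X, y))

# A decision boundary this flexible can contort itself around individual training
# points, so a high training accuracy here tells us little about new patients.
```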
Conclusion
You've seen how algorithms can underfit (high bias) or overfit (high variance). In the next section, we'll look at techniques to address overfitting, such as regularization.