Now you've seen a couple of different learning algorithms, linear regression and logistic regression. They work well for many tasks. But sometimes in an application, the algorithm can run into a problem called overfitting, which can cause it to perform poorly.
In this section, we'll explore what overfitting is, as well as the closely related, almost opposite problem called underfitting. In the next section, I'll share some techniques for addressing overfitting, particularly a method called regularization, which is very useful for reducing overfitting and improving your learning algorithms.
To help us understand overfitting, let's revisit our example of predicting housing prices with linear regression, where the goal is to predict the price as a function of the size of a house.
The input feature (x) is the size of the house, and (y) is the price you're trying to predict. Suppose you fit a linear function to this data, which gives a straight line through the training examples.
But this isn't a good model. In the data, prices flatten out as house size increases, while a straight line keeps rising at a constant rate. So this model does not fit the training data well. The technical term for this is underfitting; the model is also said to have high bias.
Underfitting means the model is too simple to capture the pattern in the training data. Here, "bias" refers to the algorithm's strong preconception that price is a linear function of size. It holds to that preconception despite data to the contrary, and so it fits the data poorly.
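Here is a minimal NumPy sketch of the underfitting case. The five (size, price) pairs are made-up values chosen to flatten out at larger sizes; they are an assumption for illustration, not data from the course.

```python
import numpy as np

# Hypothetical training set: house size (1000 sq ft) and price ($1000s).
sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
prices = np.array([200.0, 310.0, 330.0, 400.0, 395.0])

# Fit the straight line f(x) = w*x + b by least squares.
w, b = np.polyfit(sizes, prices, deg=1)

# Residual = actual price minus the line's prediction.
residuals = prices - (w * sizes + b)

print(f"w = {w:.1f}, b = {b:.1f}")
print("residuals:", np.round(residuals, 1))
```

With these assumed numbers, the residuals are negative at both ends and positive in the middle. That systematic pattern is one sign that a straight line is too simple for data that levels off.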
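Next, let's look at a second model with one more feature. With both (x) and (x^2) as features, you fit a quadratic function to the data, and the resulting curve follows the flattening trend much better. It may not pass exactly through every training example, but it captures the overall shape.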
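As a rough illustration, the sketch below treats (x) and (x^2) as two input features to ordinary linear regression and compares the training error with a fit that uses (x) alone. The data are made-up values, and `fit_and_mse` is a hypothetical helper written for this example.

```python
import numpy as np

# Hypothetical training set: house size (1000 sq ft) and price ($1000s).
sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
prices = np.array([200.0, 310.0, 330.0, 400.0, 395.0])

def fit_and_mse(features, targets):
    """Least-squares fit with an intercept; returns (weights, training MSE)."""
    design = np.column_stack([features, np.ones(len(targets))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    mse = np.mean((design @ weights - targets) ** 2)
    return weights, mse

# Model 1: the single feature x.
_, mse_linear = fit_and_mse(sizes[:, None], prices)

# Model 2: the two features x and x^2.
_, mse_quadratic = fit_and_mse(np.column_stack([sizes, sizes**2]), prices)

print(f"training MSE, x only:    {mse_linear:.1f}")
print(f"training MSE, x and x^2: {mse_quadratic:.1f}")
```

The model is still linear in its parameters; only the feature set has changed. On this assumed data the quadratic model has a clearly lower training error without passing through every point.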
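At the other extreme, suppose you fit a higher-order polynomial, for example a fourth-order polynomial with features (x), (x^2), (x^3), and (x^4). With only a handful of training examples, this curve can pass through every training point exactly, but it comes out wiggly, rising and falling between the points.
This model fits the training data extremely well, perhaps with zero error, yet it is unlikely to generalize to new examples that were not in the training set. The technical term is overfitting; the model is also said to have high variance, because a training set that differed even slightly could produce a very different curve and very different predictions.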
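The following sketch fits a fourth-order polynomial to five made-up training points and then asks it about two house sizes it has not seen. The data and the new sizes are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical training set: house size (1000 sq ft) and price ($1000s).
sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
prices = np.array([200.0, 310.0, 330.0, 400.0, 395.0])

# Five points and five coefficients: the curve can hit every point.
coeffs = np.polyfit(sizes, prices, deg=4)
train_error = np.max(np.abs(np.polyval(coeffs, sizes) - prices))

# Predictions for sizes that were not in the training set.
new_sizes = np.array([2.75, 3.5])
new_predictions = np.polyval(coeffs, new_sizes)

print(f"largest training error: {train_error:.2e}")
for size, pred in zip(new_sizes, new_predictions):
    print(f"size {size}: predicted price {pred:.1f}")
```

The training error is essentially zero, yet with these numbers the curve predicts a price above both neighboring training prices at size 2.75 and a negative price at size 3.5. Neither prediction is plausible, which is what failing to generalize looks like in practice.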
Overfitting is like trying to fit every single training point perfectly, even when that leads to a convoluted model that does not generalize well. In machine learning, the goal is a model that neither underfits (high bias) nor overfits (high variance). It's like Goldilocks looking for the porridge that's just right:
Too few features → underfitting/high bias
Too many features → overfitting/high variance
Just the right number of features → good generalization
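One way to see this trade-off is to set a few examples aside and compare errors as polynomial features are added. The sketch below does that for degrees 1, 2, and 4. Both the training values and the set-aside examples are hypothetical, and `mse` is a helper defined only for this example.

```python
import numpy as np

# Hypothetical training data: size (1000 sq ft) -> price ($1000s).
train_x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
train_y = np.array([200.0, 310.0, 330.0, 400.0, 395.0])

# Hypothetical new examples the model never sees during fitting.
new_x = np.array([1.25, 2.25, 3.5])
new_y = np.array([250.0, 370.0, 410.0])

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse, new_mse = {}, {}
for degree in (1, 2, 4):
    coeffs = np.polyfit(train_x, train_y, deg=degree)
    train_mse[degree] = mse(coeffs, train_x, train_y)
    new_mse[degree] = mse(coeffs, new_x, new_y)
    print(f"degree {degree}: train MSE {train_mse[degree]:10.1f}, "
          f"new-example MSE {new_mse[degree]:10.1f}")

# The degree with the lowest error on the new examples.
best_degree = min(new_mse, key=new_mse.get)
print("best degree on new examples:", best_degree)
```

Under these assumed numbers, training error keeps falling as features are added, while error on the new examples is lowest for the quadratic model and far worse for the fourth-order one.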
Consider a classification example with two features, (x_1) (tumor size) and (x_2) (age of the patient). The goal is to classify tumors as malignant or benign.
With logistic regression on just these two features, the decision boundary is a straight line.
When no straight line separates the two classes well, this model is an example of underfitting, or high bias.
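To see why the boundary is a straight line, here is a small sketch of the logistic model with two features. The weights, the bias, the units (tumor size in cm, age in years), and the three patients are all hypothetical values picked by hand, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for z = w1*x1 + w2*x2 + b.
w = np.array([1.0, 0.05])
b = -5.0

def predict_proba(x):
    """Estimated probability that each tumor is malignant."""
    return sigmoid(x @ w + b)

# Columns: tumor size (cm), patient age (years). Made-up patients.
patients = np.array([
    [2.0, 30.0],   # z = -1.5, below the boundary
    [4.0, 60.0],   # z = +2.0, above the boundary
    [2.5, 50.0],   # z =  0.0, on the boundary
])

probs = predict_proba(patients)
labels = (probs >= 0.5).astype(int)
print(np.round(probs, 3))
```

The model predicts malignant when z is at least 0, and the set of points where z equals 0 is the line x_1 + 0.05 x_2 = 5 for these assumed parameters. Whatever values training produces, two linear features can only ever give a straight boundary.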
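Now, if you add quadratic terms such as (x_1^2) and (x_2^2), the decision boundary is no longer restricted to a straight line and can take the shape of an ellipse or part of an ellipse.
This model fits the training data better, even though it doesn't classify every training example correctly. It is likely to generalize well to new examples.
At the other extreme, a very high-order polynomial with many features could twist the decision boundary to classify every training example correctly, which would be overfitting, or high variance.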
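To make the effect of adding quadratic features concrete, the sketch below trains logistic regression twice on the same toy data: once with (x_1) and (x_2), and once with (x_1^2) and (x_2^2). The data are synthetic, standing in for centered and scaled tumor size and age, with one class surrounded by the other. `train_logistic` and `accuracy` are hypothetical helpers, and plain batch gradient descent is used only to keep the example self-contained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(features, labels, lr=0.1, steps=5000):
    """Batch gradient descent on the logistic loss; returns (w, b)."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(steps):
        error = sigmoid(features @ w + b) - labels
        w -= lr * (features.T @ error) / len(labels)
        b -= lr * error.mean()
    return w, b

def accuracy(features, labels, w, b):
    predictions = (sigmoid(features @ w + b) >= 0.5).astype(int)
    return float(np.mean(predictions == labels))

# Synthetic, standardized features: class 0 sits near the center,
# class 1 surrounds it.
inner = np.array([[0, 0], [0.5, 0], [-0.5, 0], [0, 0.5], [0, -0.5]], dtype=float)
outer = np.array([[2, 0], [-2, 0], [0, 2], [0, -2],
                  [1.5, 1.5], [-1.5, -1.5], [1.5, -1.5], [-1.5, 1.5]], dtype=float)
X = np.vstack([inner, outer])
y = np.array([0] * len(inner) + [1] * len(outer))

# Linear features: the boundary must be a straight line.
w_lin, b_lin = train_logistic(X, y)
acc_linear = accuracy(X, y, w_lin, b_lin)

# Squared features: the boundary w1*x1^2 + w2*x2^2 + b = 0 is an ellipse.
X_quad = X ** 2
w_quad, b_quad = train_logistic(X_quad, y)
acc_quadratic = accuracy(X_quad, y, w_quad, b_quad)

print(f"training accuracy, linear features:  {acc_linear:.2f}")
print(f"training accuracy, squared features: {acc_quadratic:.2f}")
```

On this toy data no straight line can separate the classes, so the linear-feature model misclassifies some training examples, while the squared features allow an elliptical boundary that encloses the inner class. Real data is noisier, so a well-fit model would not usually classify every training example correctly.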
You've seen how algorithms can underfit (high bias) or overfit (high variance). In the next section, we'll look at techniques to address overfitting, such as regularization.