In the last part, we saw that regularization tries to make the parameter values $W_1$ through $W_n$ small to reduce overfitting. In this part, we'll build on that intuition and develop a modified cost function that you can use to apply regularization effectively.
Let's jump in and recall this example from the previous part, where we saw that if you fit a quadratic function to the data, it provides a good fit.
However, fitting a high-order polynomial leads to overfitting. But suppose we had a way to make the parameters $W_3$ and $W_4$ small, say close to 0. If $W_3$ and $W_4$ are nearly zero, the cubic and quartic terms contribute almost nothing, so the fitted function behaves essentially like a quadratic, which fits the data well. One way to achieve this is to modify the cost function by adding large penalty terms on those parameters, say $1000 \cdot W_3^2$ and $1000 \cdot W_4^2$, so that minimizing the cost forces $W_3$ and $W_4$ to be tiny, as in the sketch below.
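Here is a minimal sketch of that idea, not code from the course: a hypothetical cost function that adds a large penalty on two specific parameters. The function name, the data arrays, and the constant 1000 are illustrative assumptions.

```python
import numpy as np

def cost_with_selective_penalty(w, b, X, y):
    """Squared-error cost plus a large penalty on W_3 and W_4 (w[2], w[3]).

    Illustrative sketch only: the 1000 multipliers are arbitrary large
    constants; minimizing this cost pushes w[2] and w[3] toward zero.
    """
    m = X.shape[0]
    predictions = X @ w + b                          # linear model: w·x + b
    mse = np.sum((predictions - y) ** 2) / (2 * m)   # (1/2m) * sum of squared errors
    return mse + 1000 * w[2] ** 2 + 1000 * w[3] ** 2
```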
In practice, we often don't know in advance which features to penalize, so we penalize all of the parameters $W_j$. Doing this reduces overfitting by keeping the model from becoming unnecessarily complex.
$$\text{Regularization term} = \lambda \sum_{j=1}^{n} W_j^2$$
where $\lambda$ is the regularization parameter, which determines how much importance this penalty is given.
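As a quick sketch, the penalty term can be computed in one line, assuming a NumPy parameter vector `w` (the names here are illustrative, not from the course):

```python
import numpy as np

def regularization_term(w, lam):
    """Compute lambda * sum_j W_j^2 for a parameter vector w."""
    return lam * np.sum(w ** 2)
```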
Let's penalize all the parameters $W_1$ through $W_{100}$. (You could also penalize $B$, but in practice it makes very little difference, so the cost function below leaves it out.) A common practice is to scale the regularization term by dividing $\lambda$ by $2m$, where $m$ is the training set size.
$$\text{Final cost function} = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} W_j^2$$
This scaling makes it more likely that the same value of $\lambda$ will continue to work well even as the training set grows larger.
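Putting the pieces together, here is a minimal NumPy sketch of the regularized squared-error cost; the function and variable names are assumptions for illustration, not code from the course.

```python
import numpy as np

def regularized_cost(w, b, X, y, lam):
    """Squared-error cost with an L2 regularization term.

    w   : (n,) parameter vector
    b   : scalar intercept (not penalized here)
    X   : (m, n) feature matrix
    y   : (m,) targets
    lam : regularization parameter lambda
    """
    m = X.shape[0]
    predictions = X @ w + b
    data_cost = np.sum((predictions - y) ** 2) / (2 * m)   # (1/2m) * sum of squared errors
    reg_cost = (lam / (2 * m)) * np.sum(w ** 2)             # (lambda/2m) * sum of W_j^2
    return data_cost + reg_cost
```

With `lam = 0` this reduces to the ordinary squared-error cost, while a very large `lam` pushes all the $W_j$ toward zero and can lead to underfitting.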
Regularization helps by striking a balance between fitting the training data and keeping the parameters small to avoid overfitting. In the next section, we will explore how to apply regularization to linear and logistic regression.