Here is the cost function we came up with in the last part for regularized linear regression. The first term is the usual squared error cost function, and now you have this additional regularization term, where Lambda (λ) is the regularization parameter. You'd like to find parameters w and b that minimize this regularized cost function.
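Written out in the notation used throughout these parts, with m training examples, n features, and f_{w,b}(x) = w · x + b as the linear model, the regularized cost is:

$$J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$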
Previously, we were using gradient descent for the original cost function, just the first term before we added the regularization term, and we had the following gradient descent algorithm: we repeatedly update the parameter w_j, for j equals 1 through n, and we update b as well. Again, Alpha (α) is a small positive number called the learning rate. In fact, the updates for regularized linear regression look exactly the same, except that now the cost, J, is defined a bit differently.
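The update rule itself keeps the familiar form; restated here as a reminder, with α the learning rate and n the number of features:

$$\begin{aligned}
&\text{repeat until convergence:} \\
&\quad w_j := w_j - \alpha \, \frac{\partial}{\partial w_j} J(\vec{w}, b) \quad \text{for } j = 1, \dots, n \\
&\quad b := b - \alpha \, \frac{\partial}{\partial b} J(\vec{w}, b)
\end{aligned}$$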
Previously, the derivatives of J with respect to w_j and with respect to b were given by the usual squared error expressions. Now that we've added the regularization term, the only thing that changes is that the derivative with respect to w_j ends up with one additional term: plus Lambda over m times w_j.
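Written out in the same notation, with f_{w,b}(x^{(i)}) denoting the model's prediction on the i-th of m training examples, the regularized derivatives are:

$$\frac{\partial}{\partial w_j} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j$$

$$\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$

Dropping the Lambda over m times w_j term from the first expression recovers the unregularized derivatives used before.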
Recall that we don't regularize b, so we're not trying to shrink b. That's why the update for b remains the same as before, whereas the update for w_j changes, because the regularization term causes us to try to shrink w_j.
To implement gradient descent for regularized linear regression, you would have your code carry out the update for w_j, for j equals 1 through n, along with the update for b, using these derivatives. As usual, please remember to carry out simultaneous updates for all of these parameters.
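As an illustration only, here is a minimal NumPy sketch of one way to implement a single update step; the names X, y, w, b, alpha, and lambda_ are placeholders rather than code from the course, and the gradients follow the expressions above:

```python
import numpy as np

def gradient_descent_step(X, y, w, b, alpha, lambda_):
    """One step of gradient descent for regularized linear regression.

    X: (m, n) array of training examples, y: (m,) targets,
    w: (n,) weights, b: scalar bias, alpha: learning rate,
    lambda_: regularization parameter.
    """
    m = X.shape[0]
    # Prediction error f_wb(x^(i)) - y^(i) for all m examples.
    err = X @ w + b - y                          # shape (m,)
    # Gradients: note the extra (lambda_ / m) * w term for w only;
    # b is not regularized.
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w  # shape (n,)
    dj_db = np.sum(err) / m                      # scalar
    # Simultaneous update: both gradients use the old w and b.
    w = w - alpha * dj_dw
    b = b - alpha * dj_db
    return w, b
```

Calling this step in a loop, for a fixed number of iterations or until the cost stops decreasing, gives the full algorithm.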
What I'd like to do in the remainder of this part is to go over some optional material to convey a slightly deeper intuition about what this formula is actually doing.
If you take the update for w_j and rearrange the terms, you can write it as w_j times one minus Alpha times Lambda over m, minus Alpha times the usual gradient term. You might recognize that second term as the usual gradient descent update for unregularized linear regression. The effect of the first term is that on every iteration of gradient descent, you're multiplying w_j by a positive number just slightly less than one, since Alpha times Lambda over m is small and positive. This gives us another view on why regularization has the effect of shrinking the parameters w_j a little bit on every iteration.
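Written out, the rearranged update is:

$$w_j := w_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

For example, with illustrative values (not taken from the course) such as α = 0.01, λ = 1, and m = 50, the multiplier is 1 − (0.01 × 1 / 50) = 0.9998, so each iteration shrinks w_j very slightly before applying the usual gradient step.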
Using this, you can reduce overfitting when you have a lot of features and a relatively small training set. This should let you get linear regression to work much better on many problems.
In the next part, we'll take this regularization idea and apply it to logistic regression to avoid overfitting for logistic regression as well.