Gradient Descent with Linear Regression
Introduction
Previously, you took a look at the linear regression model, then the cost function, and then the gradient descent algorithm. In this part, we're going to pull it all together and use the squared error cost function for the linear regression model with gradient descent. This will allow us to train the linear regression model to fit a straight line to the training data.
Let's get to it. Here's the linear regression model, here's the squared error cost function, and here's the gradient descent algorithm, written out below.
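For reference, here are those three pieces in display form; this only restates the definitions from the earlier parts of this week.

```latex
\begin{aligned}
&\text{Model:}\quad f_{w,b}(x) = wx + b \\[4pt]
&\text{Cost function:}\quad J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(f_{w,b}(x^{(i)}) - y^{(i)}\bigr)^2 \\[4pt]
&\text{Gradient descent: repeat until convergence}\;
\begin{cases}
w := w - \alpha \,\frac{\partial}{\partial w} J(w,b) \\[4pt]
b := b - \alpha \,\frac{\partial}{\partial b} J(w,b)
\end{cases}
\end{aligned}
```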
Key Notes
It turns out if you calculate these derivatives, these are the terms you would get.
The derivative with respect to w is:

(1/m) Σ_{i=1}^{m} (f(x^{(i)}) - y^{(i)}) x^{(i)}

That is, the error term, the difference between the predicted and the actual values, times the input feature x^{(i)}.

The derivative with respect to b is:

(1/m) Σ_{i=1}^{m} (f(x^{(i)}) - y^{(i)})

which looks the same as the formula above, except that it doesn't have that x^{(i)} term at the end.
If you use these formulas to compute these two derivatives and implement gradient descent this way, it will work.
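As a concrete illustration, here is a minimal NumPy sketch of those two formulas. The function name compute_gradient and its array-based inputs are illustrative choices for this example, not something defined in the lecture.

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Illustrative sketch: compute dJ/dw and dJ/db for univariate
    linear regression with the squared error cost."""
    m = x.shape[0]
    error = (w * x + b) - y          # f(x^(i)) - y^(i) for every example
    dj_dw = np.dot(error, x) / m     # (1/m) * sum of error * x^(i)
    dj_db = np.sum(error) / m        # (1/m) * sum of error
    return dj_dw, dj_db
```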
Now, you may be wondering, where did I get these formulas from? They're derived using calculus. If you want to see the full derivation, I'll quickly run through it on the next slide. But if you don't remember or aren't interested in the calculus, don't worry about it. You can skip the material on the next slide entirely and still be able to implement gradient descent and finish this part, and everything will work just fine. On this slide, which is one of the most mathematical slides of the entire specialization, and again is completely optional, we'll show you how to calculate the derivative terms.
Let's start with the first term, the derivative of the cost function J with respect to w. We'll start by plugging in the definition of the cost function J:

J(w,b) = (1/(2m)) Σ_{i=1}^{m} (f(x^{(i)}) - y^{(i)})^2

Now remember also that f(w,b,x^{(i)}) is equal to wx^{(i)} + b. What we would like to do is compute the derivative, also called the partial derivative with respect to w, of this expression:

(1/(2m)) Σ_{i=1}^{m} (wx^{(i)} + b - y^{(i)})^2

If you have taken a calculus class before, and again it is totally fine if you haven't, you may know that by the rules of calculus, the derivative is equal to:

(1/(2m)) Σ_{i=1}^{m} 2 (wx^{(i)} + b - y^{(i)}) x^{(i)}

The 2 from the power rule and the 2 in the denominator cancel out, leaving us with the equation you saw on the previous slide:

(1/m) Σ_{i=1}^{m} (f(x^{(i)}) - y^{(i)}) x^{(i)}

This is why we defined the cost function with the 1/2 earlier this week: it makes the partial derivative neater, because it cancels out the 2 that appears from computing the derivative.
For the other derivative, with respect to b, this is quite similar. I can write it out like this, once again plugging in the definition of f(x^{(i)}):

∂/∂b J(w,b) = ∂/∂b (1/(2m)) Σ_{i=1}^{m} (wx^{(i)} + b - y^{(i)})^2

By the rules of calculus, this is equal to:

(1/(2m)) Σ_{i=1}^{m} 2 (wx^{(i)} + b - y^{(i)})

where there's no x^{(i)} anymore at the end. The 2's cancel out, and you end up with this expression for the derivative with respect to b:

(1/m) Σ_{i=1}^{m} (f(x^{(i)}) - y^{(i)})

Now you have these two expressions for the derivatives, and you can plug them into the gradient descent algorithm.
Examples
Here's the gradient descent algorithm for linear regression. You repeatedly carry out these updates to w and b until convergence:

w := w - α (1/m) Σ_{i=1}^{m} (f(x^{(i)}) - y^{(i)}) x^{(i)}

b := b - α (1/m) Σ_{i=1}^{m} (f(x^{(i)}) - y^{(i)})

Remember that f(x) is the linear regression model, so it is equal to wx + b. The summation term in the first update is the derivative of the cost function with respect to w, and the one in the second update is the derivative of the cost function with respect to b.
Just as a reminder, you want to update w and b simultaneously on each step.
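To see how the update rule and the simultaneous update fit together in code, here is a minimal sketch of the full training loop. The function name, the defaults for alpha and num_iters, and the tiny made-up dataset are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def gradient_descent(x, y, w_init=0.0, b_init=0.0, alpha=0.01, num_iters=1000):
    """Illustrative sketch: batch gradient descent for univariate
    linear regression with the squared error cost."""
    w, b = w_init, b_init
    m = x.shape[0]
    for _ in range(num_iters):
        error = (w * x + b) - y
        dj_dw = np.dot(error, x) / m
        dj_db = np.sum(error) / m
        # Simultaneous update: both gradients are computed from the old
        # (w, b) before either parameter is changed.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Tiny made-up example: data generated from y = 200x + 100,
# so the fitted parameters should end up close to w=200, b=100.
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])
w_fit, b_fit = gradient_descent(x_train, y_train, alpha=0.1, num_iters=10000)
print(w_fit, b_fit)
```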
Special Topics
Now, let's get familiar with how gradient descent works. One issue we saw with gradient descent is that it can lead to a local minimum instead of the global minimum, where the global minimum means the point that has the lowest possible value of the cost function J of all possible points. You may recall the surface plot that looks like an outdoor park with a few hills.

That function has more than one local minimum. Remember, depending on where you initialize the parameters w and b, you can end up at different local minima. You can end up at any of those minima. But it turns out that when you're using a squared error cost function with linear regression, the cost function does not and will never have multiple local minima. It has a single global minimum because of its bowl shape.
The technical term for this is that this cost function is a convex function. Informally, a convex function is a bowl-shaped function and it cannot have any local minima other than the single global minimum. When you implement gradient descent on a convex function, one nice property is that as long as your learning rate is chosen appropriately, it will always converge to the global minimum.
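As a quick numerical sanity check of the convexity claim, the sketch below evaluates the squared error cost at two parameter settings and at their midpoint, and checks that the midpoint cost is no larger than the average of the two endpoint costs, which is the defining inequality of a convex function. The tiny dataset and the two test points are made up for illustration.

```python
import numpy as np

def cost(w, b, x, y):
    """Squared error cost J(w, b) = (1/(2m)) * sum((w*x + b - y)^2)."""
    m = x.shape[0]
    return np.sum((w * x + b - y) ** 2) / (2 * m)

# Made-up data for illustration.
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])

# Two arbitrary points in (w, b) space and their midpoint.
p1 = np.array([0.0, 0.0])
p2 = np.array([400.0, -200.0])
mid = (p1 + p2) / 2

j1 = cost(p1[0], p1[1], x_train, y_train)
j2 = cost(p2[0], p2[1], x_train, y_train)
j_mid = cost(mid[0], mid[1], x_train, y_train)

# Convexity: J at the midpoint <= average of J at the endpoints.
assert j_mid <= (j1 + j2) / 2
```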
Conclusion
Congratulations, you now know how to implement gradient descent for linear regression. We have just one last part of this section. In that part, you'll see this algorithm in action.