Gradient Descent for Multiple Linear Regression

Explanation and implementation of gradient descent for multiple linear regression using vectorization.

Introduction

You've learned about gradient descent for multiple linear regression and also vectorization. Let's put it all together to implement gradient descent for multiple linear regression with vectorization. This will be exciting.

Let's quickly review what multiple linear regression looks like. Using our previous notation, we can write it more succinctly with vector notation. We have parameters w_1 to w_n as well as b. But instead of thinking of w_1 to w_n as separate numbers, let's collect all of them into a vector W, so W is a vector of length n. We will think of the parameters as a vector W and a scalar b. Previously, we defined multiple linear regression as:

f(w,b)(x)=w1x1+w2x2+...+wnxn+bf(w, b)(x) = w_1 * x_1 + w_2 * x_2 + ... + w_n * x_n + b

Now, using vector notation:

f(w,b)(X)=WX+bf(w, b)(X) = W \cdot X + b

GDMR1

The dot here refers to the dot product.

Cost Function and Gradient Descent

Our cost function can be defined as:

J(w,b)=12mi=1m(f(w,b)(X(i))y(i))2J(w, b) = \frac{1}{2m} \sum_{i=1}^m (f(w, b)(X^{(i)}) - y^{(i)})^2

But instead of thinking of J as a function of multiple parameters w_1 through w_n and b, we write it as:

J(W,b)J(W, b)

Where W is the vector of parameters and b is a scalar.

Here's what gradient descent looks like:

GDMR2

We repeatedly update each parameter w_j as:

wj=wjαJ(W,b)wjw_j = w_j - \alpha \frac{\partial J(W, b)}{\partial w_j}

And for b:

b=bαJ(W,b)bb = b - \alpha \frac{\partial J(W, b)}{\partial b}

The partial derivatives represent the gradient of the cost function with respect to w_j and b.

Implementing Gradient Descent

With multiple features, gradient descent is slightly different. Here's what we had with one feature:

GDMR3

We had an update rule for w and a separate update rule for b.

Now, with n features, we have:

wj=wjαi=1m(f(w,b)(X(i))y(i))Xj(i)w_j = w_j - \alpha \sum_{i=1}^m (f(w,b)(X^{(i)}) - y^{(i)}) X_j^{(i)}

For each j = 1, 2, ..., n.

GDMR4

Similarly, we update b as:

b=bαi=1m(f(w,b)(X(i))y(i))b = b - \alpha \sum_{i=1}^m (f(w,b)(X^{(i)}) - y^{(i)})

This is how gradient descent works for multiple features.

GDMR5

Normal Equation

An alternative to gradient descent is the normal equation, which solves for W and b directly using linear algebra. The normal equation is:

W=(XTX)1XTyW = (X^T X)^{-1} X^T y

This method does not require iterations but can be computationally expensive when the number of features is large.

GDMR6

Although not as generalizable as gradient descent, some machine learning libraries might use this method in the backend for linear regression.

On this page

Edit on Github Question? Give us feedback