Gradient Descent Implementation

Introduction

In this section, we look at how to implement gradient descent for a logistic regression model. The goal is to find values of the parameters \(w\) and \(b\) that minimize the cost function \(J(w, b)\).

Introduction to Logistic Regression

To fit the parameters of a logistic regression model, we're going to try to find the values of the parameters \(w\) and \(b\) that minimize the cost function \(J(w, b)\), and we'll again apply gradient descent to do this. Let's take a look at how.

[Figure: Gradient descent illustration 1]

Once you've trained the model and found suitable parameters, you can use it to make predictions. For instance, given the input \(x\) for a new patient with a certain tumor size and age, the model can estimate the probability that the label is \(y = 1\) (e.g., diagnosis of a disease).
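As a minimal sketch of that prediction step, the snippet below computes the estimated probability for one patient. The parameter values, the feature values, and the helper name `predict_proba` are made up for illustration; a trained model would supply the real \(w\) and \(b\).

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Estimated probability that y = 1 for one example x (a list of features)."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical, hand-picked parameters (not the result of any training run).
w = [0.8, 0.05]             # weights for tumor size and age
b = -6.0
new_patient = [4.0, 50.0]   # made-up tumor size and age

p = predict_proba(new_patient, w, b)
label = 1 if p >= 0.5 else 0   # 0.5 is a common threshold, not a required one
```

The model outputs a probability; turning it into a yes/no label requires choosing a threshold, which is a separate decision from fitting the parameters.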

Gradient Descent Algorithm for Logistic Regression

The technique we use to minimize the cost function is gradient descent. Below is the cost function we wish to minimize:

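\[
J(w, b) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log f(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - f(x^{(i)})\right) \right]
\]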
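To make the cost concrete, here is a small loop-based sketch that evaluates it on a toy dataset. It assumes the standard cross-entropy (log loss) cost for logistic regression; the function name `compute_cost` and the data are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def compute_cost(X, y, w, b):
    """Average cross-entropy loss over the m examples in X (a list of feature lists)."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        f_i = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        # Loss for one example: -y*log(f) - (1 - y)*log(1 - f)
        total += -y_i * math.log(f_i) - (1 - y_i) * math.log(1.0 - f_i)
    return total / m

# Tiny made-up dataset: one feature, two examples per class.
X = [[0.5], [1.5], [2.5], [3.5]]
y = [0, 0, 1, 1]

cost_at_zero = compute_cost(X, y, w=[0.0], b=0.0)     # every prediction is 0.5
cost_better = compute_cost(X, y, w=[2.0], b=-4.0)     # boundary near x = 2
```

With all parameters at zero, every prediction is 0.5 and the cost equals log 2; parameters that separate the two classes give a lower cost, which is what gradient descent searches for.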

To minimize the cost \(J(w, b)\), we'll use the gradient descent algorithm with the following update rule for each parameter:

\[
w_j := w_j - \alpha \cdot \frac{\partial J(w, b)}{\partial w_j}
\]

where \(\alpha\) is the learning rate and \(\frac{\partial J(w, b)}{\partial w_j}\) is the gradient of the cost function with respect to \(w_j\).

[Figure: Gradient descent update rule]

The gradient descent updates are applied iteratively to optimize the parameters.

Derivative of the Cost Function with Respect to Parameters

The derivative of \(J(w, b)\) with respect to \(w_j\) is calculated as follows:

\[
\frac{\partial J(w, b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^m (f(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}
\]

where \(f(x^{(i)})\) is the sigmoid function applied to \(w \cdot x^{(i)} + b\).

[Figure: Derivative with respect to w_j]
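For readers who want to see where this expression comes from, here is a brief derivation sketch for a single training example. It assumes the standard cross-entropy loss for logistic regression; the symbols \(z\), \(\sigma\), and \(L\) are introduced here only for the derivation.

\[
\begin{aligned}
L &= -y \log f - (1 - y) \log(1 - f), \qquad f = \sigma(z), \quad z = w \cdot x + b \\
\frac{d\sigma}{dz} &= \sigma(z)\,(1 - \sigma(z)) \\
\frac{\partial L}{\partial z} &= \left( -\frac{y}{f} + \frac{1 - y}{1 - f} \right) f\,(1 - f) = -y\,(1 - f) + (1 - y)\,f = f - y \\
\frac{\partial L}{\partial w_j} &= \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w_j} = (f - y)\, x_j
\end{aligned}
\]

Averaging over the \(m\) training examples gives the expression for \(\frac{\partial J(w, b)}{\partial w_j}\). The same steps with \(\frac{\partial z}{\partial b} = 1\) give the derivative with respect to \(b\).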

Similarly, the derivative with respect to the bias \(b\) is:

\[
\frac{\partial J(w, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^m (f(x^{(i)}) - y^{(i)})
\]

[Figure: Derivative with respect to b]
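The two derivatives translate fairly directly into code. The following is a loop-based sketch that mirrors the summations term by term; the name `compute_gradient` and the toy data are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def compute_gradient(X, y, w, b):
    """Return (dJ/dw as a list, dJ/db) for logistic regression."""
    m, n = len(X), len(w)
    dj_dw = [0.0] * n
    dj_db = 0.0
    for x_i, y_i in zip(X, y):
        f_i = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        err = f_i - y_i                 # f(x^(i)) - y^(i)
        for j in range(n):
            dj_dw[j] += err * x_i[j]    # accumulate the sum for each w_j
        dj_db += err                    # accumulate the sum for b
    return [g / m for g in dj_dw], dj_db / m

# Tiny made-up dataset: one feature, two examples per class.
X = [[0.5], [1.5], [2.5], [3.5]]
y = [0, 0, 1, 1]

dj_dw, dj_db = compute_gradient(X, y, w=[0.0], b=0.0)
```

At \(w = 0\), \(b = 0\) every prediction is 0.5, so the errors are +0.5 for the negative examples and -0.5 for the positive ones. They cancel in the bias derivative, while the weight derivative is negative, meaning gradient descent will increase \(w\).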

Gradient Descent Update Rule

With the above derivatives in mind, the gradient descent update rules for logistic regression become:

\[
w_j := w_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^m (f(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}
\]

\[
b := b - \alpha \cdot \frac{1}{m} \sum_{i=1}^m (f(x^{(i)}) - y^{(i)})
\]

[Figure: Gradient descent for logistic regression]
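Putting the update rules into a loop gives a complete, if unoptimized, training routine. This is a sketch under stated assumptions rather than a reference implementation: the learning rate, iteration count, starting values, and dataset are arbitrary choices, and the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x, w, b):
    """Model output: sigmoid of w . x + b for one example."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)

def cost(X, y, w, b):
    """Average cross-entropy loss, used here only to check progress."""
    return sum(-y_i * math.log(f(x_i, w, b)) - (1 - y_i) * math.log(1.0 - f(x_i, w, b))
               for x_i, y_i in zip(X, y)) / len(X)

def gradient_descent(X, y, w, b, alpha, num_iters):
    m, n = len(X), len(w)
    w = list(w)  # copy so the caller's list is not modified
    for _ in range(num_iters):
        errors = [f(x_i, w, b) - y_i for x_i, y_i in zip(X, y)]
        dj_dw = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m for j in range(n)]
        dj_db = sum(errors) / m
        # Simultaneous update: both gradients were computed from the old w and b.
        w = [w_j - alpha * g for w_j, g in zip(w, dj_dw)]
        b = b - alpha * dj_db
    return w, b

# Tiny made-up dataset: one feature, two examples per class.
X = [[0.5], [1.5], [2.5], [3.5]]
y = [0, 0, 1, 1]

initial_cost = cost(X, y, [0.0], 0.0)
w_fit, b_fit = gradient_descent(X, y, w=[0.0], b=0.0, alpha=0.1, num_iters=5000)
final_cost = cost(X, y, w_fit, b_fit)
```

Computing all gradients before changing any parameter keeps the update simultaneous, matching the update rules. In practice it is worth recording the cost every few iterations to confirm that it keeps decreasing for the chosen learning rate.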

Linear vs. Logistic Regression

You might notice that the update rules look similar to those used in linear regression. The key difference lies in the definition of \(f(x)\). For linear regression:

\[
f(x) = w \cdot x + b
\]

Whereas for logistic regression:

\[
f(x) = \frac{1}{1 + e^{-(w \cdot x + b)}}
\]

[Figure: Comparison of f(x)]

Thus, while the gradient descent algorithm looks the same for both, the underlying functions are different, making the two algorithms distinct.
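One way to see this in code is to write the gradient computation once and pass in the model function. The sketch below assumes the usual squared-error cost with a \(\frac{1}{2m}\) factor for linear regression, so that both gradients share the same form; the function names and data are illustrative.

```python
import math

def f_linear(x, w, b):
    """Linear regression model: w . x + b."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

def f_logistic(x, w, b):
    """Logistic regression model: sigmoid of w . x + b."""
    return 1.0 / (1.0 + math.exp(-f_linear(x, w, b)))

def gradient(f, X, y, w, b):
    """Same formula for both models; only f differs."""
    m = len(X)
    errors = [f(x_i, w, b) - y_i for x_i, y_i in zip(X, y)]
    dj_dw = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m for j in range(len(w))]
    dj_db = sum(errors) / m
    return dj_dw, dj_db

# Made-up data and parameters, shared by both models.
X = [[1.0], [2.0], [3.0]]
y = [0, 0, 1]
w, b = [1.0], -2.0

grad_linear = gradient(f_linear, X, y, w, b)
grad_logistic = gradient(f_logistic, X, y, w, b)
```

The same data and parameters produce different gradients, because the errors \(f(x^{(i)}) - y^{(i)}\) are computed from different model outputs.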

Additional Tips: Feature Scaling

When implementing gradient descent, feature scaling can help speed up convergence. Scaling all features to a similar range (e.g., between -1 and 1) helps the algorithm reach the optimal parameters faster.

[Figure: Feature scaling]
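As one possible way to do this, the sketch below applies mean normalization, which rescales each feature to \((x - \text{mean}) / (\max - \min)\) so that values fall between -1 and 1. Other schemes such as z-score normalization are also common; the function name and data here are made up for illustration.

```python
def mean_normalize(X):
    """Rescale each feature column to (x - mean) / (max - min)."""
    n = len(X[0])
    cols = [[row[j] for row in X] for j in range(n)]
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    ranges = [(max(c) - min(c)) or 1.0 for c in cols]
    X_scaled = [[(row[j] - means[j]) / ranges[j] for j in range(n)] for row in X]
    return X_scaled, means, ranges

# Made-up examples: [tumor size, age]. The two features have very different ranges.
X = [[1.0, 30.0], [2.0, 45.0], [4.0, 60.0], [5.0, 75.0]]

X_scaled, means, ranges = mean_normalize(X)
```

The returned `means` and `ranges` should be kept and applied to any new input before prediction, so that new examples are scaled the same way as the training data.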

Conclusion

In this section, you've learned how to implement gradient descent for logistic regression. The next step is to use the scikit-learn library, which simplifies logistic regression implementations, as well as explore vectorized implementations to further optimize the performance of your gradient descent algorithm.

Congratulations on reaching the end of this section. You're now equipped to implement logistic regression using gradient descent!
