The cost function gives us a way to measure how well a specific set of parameters fits the training data. It allows us to evaluate and improve our parameter choices. In this part, we'll explore why the squared error cost function is not suitable for logistic regression. We'll also introduce an alternative cost function designed to optimize logistic regression models.
Consider what a sample training set for a logistic regression model might look like.
In this scenario, each row might represent a patient’s medical record, including various features like tumor size, patient age, and other factors. These are our features X_1 through X_n. As this is a binary classification problem, the target label y takes only two values: 0 or 1.
In linear regression, we typically use the squared error cost function. Here's the cost function:
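Written out, the squared error cost takes roughly the following form. This is a sketch under assumptions: it uses the common convention that places a factor of 1/2 inside the sum, with m the number of training examples and f the model's prediction. The superscript (i) for the i-th example is notation assumed here, not taken from the text above.

```latex
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( f\!\left(x^{(i)}\right) - y^{(i)} \right)^{2}
```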
This cost function works well for linear regression because it is convex, so gradient descent can take one step after another and converge to the global minimum.
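However, if we apply the same cost function to logistic regression, where f(x) is the sigmoid of a linear function, we get a non-convex cost function. A non-convex cost can have many local minima, and gradient descent can get stuck in one of them instead of reaching the global minimum.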
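The non-convexity can be checked numerically. The sketch below is a minimal illustration, not the course's own code: it assumes a made-up one-feature training set, fixes the bias at 0, and evaluates the squared error cost on a grid of w values. A convex function sampled on an evenly spaced grid never has a negative second difference, so a negative value is evidence of non-convexity.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D training set (made up for illustration).
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def squared_error_cost(w):
    """Squared error cost with f(x) = sigmoid(w * x), bias fixed at 0."""
    f = sigmoid(w * x)
    return np.mean((f - y) ** 2) / 2.0

# Evaluate the cost on an evenly spaced grid of w values.
w_grid = np.linspace(-6.0, 6.0, 121)
cost = np.array([squared_error_cost(w) for w in w_grid])

# Second differences approximate curvature; a convex function sampled on
# an even grid never produces a negative second difference.
second_diff = np.diff(cost, n=2)
print("most negative second difference:", second_diff.min())
print("cost is non-convex on this grid:", bool(second_diff.min() < 0))
```

This only demonstrates that the curve bends the wrong way in places. It does not by itself exhibit several local minima, which depend on the particular data set.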
For logistic regression, the squared error cost function is not ideal. We need a different cost function that ensures convexity, allowing gradient descent to reliably find the global minimum.
We now aim to build a new cost function for logistic regression. This involves redefining the cost function J(w, b).
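Let’s denote the loss for a single training example as L(f(x), y). The loss function takes the prediction f(x) and the true label y as inputs and measures how well the model is performing on that individual example. By choosing this loss function carefully, we can ensure the overall cost function is convex. The loss is defined in two parts, one for each value of y. When y = 1, the loss is -log(f(x)): it is close to 0 when the prediction f(x) is close to 1, and it grows without bound as f(x) approaches 0.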
Next, let’s examine the second part of the loss function when y = 0:
In this case, the loss is -log(1 - f(x)). If the prediction f(x) is close to 0, the loss is minimal. But if f(x) is far from 0, the loss increases, and can approach infinity as f(x) nears 1.
For example, if the model predicts a high probability (say 99.9%) of a malignant tumor, but the true label is 0 (non-malignant), the loss is -log(1 - 0.999), which is about 6.9. A confident correct prediction, by contrast, gives a loss close to 0.
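The per-example loss can be sketched in a few lines. The function name `logistic_loss` is a hypothetical choice for illustration. The y = 0 branch follows the formula above, and the y = 1 branch, -log(f(x)), is the standard counterpart included for completeness.

```python
import math

def logistic_loss(f_x, y):
    """Loss for one example, given prediction f_x in (0, 1) and label y in {0, 1}."""
    if y == 1:
        return -math.log(f_x)        # small when f_x is near 1
    return -math.log(1.0 - f_x)      # small when f_x is near 0

# Confident but wrong: predicts 99.9% malignant, true label is 0.
print(logistic_loss(0.999, 0))   # roughly 6.9

# Confident and right: predicts 0.1% malignant, true label is 0.
print(logistic_loss(0.001, 0))   # roughly 0.001
```

The loss is undefined at exactly f_x = 0 or f_x = 1 for the wrong label, which matches the statement that it approaches infinity there.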
In this part, we’ve discussed why the squared error cost function is not suitable for logistic regression. We then introduced a new loss function that ensures the overall cost function is convex, making gradient descent reliable for finding the global minimum.
Proving that this function is convex is beyond the scope of this course. Recall that the cost function is a function of the entire training set: it is the average of the loss over the individual training examples. For a given set of parameters w and b, the cost J(w, b) equals 1/m times the sum, over all m training examples, of the loss on each example. If you can find values of w and b that minimize this cost, you have a good set of parameters for logistic regression.
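As a rough sketch of that averaging step, the function below computes the cost as the mean of the per-example losses. The name `logistic_cost`, the tiny data set, and the parameter values are all made up for illustration, and the model is assumed to be f(x) = sigmoid(w · x + b).

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b):
    """Average logistic loss over all m rows of X for parameters w, b."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                       # predictions, shape (m,)
    # Per-example loss: -log(f) when y == 1, -log(1 - f) when y == 0.
    loss = np.where(y == 1, -np.log(f), -np.log(1.0 - f))
    return loss.sum() / m

# Hypothetical training set: one feature, four examples.
X = np.array([[0.5], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# With all parameters at zero every prediction is 0.5, so the cost is log(2).
cost_zero = logistic_cost(X, y, np.array([0.0]), 0.0)

# Parameters that separate the two classes give a lower cost.
cost_fit = logistic_cost(X, y, np.array([2.0]), -3.0)

print(cost_zero, cost_fit)
```

Comparing the two values shows the cost doing its job: parameters that fit the training data better receive a lower cost.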
In the upcoming notebook, you'll see how the squared error cost function doesn't work well for classification: its surface plot is a very wiggly cost surface with many local minima. You'll then look at the new logistic loss function, which produces a smooth, convex surface plot without those local minima. Please take a look at the code and the plots after this part. We've covered a lot in this part. In the next part, we'll take the loss function for a single training example and use it to define the overall cost function for the entire training set. We'll also work out a simpler way to write the cost function, which will later allow us to run gradient descent to find good parameters for logistic regression.