Logistic Regression
Overview
Let's talk about logistic regression, which is probably the single most widely used classification algorithm in the world. This is something that I use all the time in my work.
Problem Setup
Let's continue with the example of classifying whether a tumor is malignant.
As before, we're going to use the label 1, or "yes", for the positive class to represent malignant tumors, and 0, or "no", for the negative class to represent benign tumors. Here's a graph of the dataset where the horizontal axis is the tumor size and the vertical axis takes on only the values 0 and 1, because this is a classification problem.
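To make that 0/1 encoding concrete, here's a minimal sketch of such a dataset in Python, assuming NumPy is available; the tumor sizes and labels are made-up values purely for illustration:

```python
import numpy as np

# Hypothetical tumor sizes (cm) and labels: 1 = malignant, 0 = benign.
# These numbers are made up purely to illustrate the 0/1 label encoding.
x_train = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 5.0])
y_train = np.array([0,   0,   0,   0,   1,   1,   1,   1])
```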
You saw in the last part that linear regression is not a good algorithm for this problem. In contrast, what logistic regression ends up doing is fit an S-shaped curve like this one to the dataset. For this example, if a patient comes in with a tumor of the size shown on the x-axis, then the algorithm will output 0.7, suggesting that the tumor is closer to, or perhaps more likely to be, malignant than benign.
We'll say more later about what 0.7 actually means in this context. But note that the output label ( y ) is never 0.7; it is only ever 0 or 1. To build up to the logistic regression algorithm, there's an important mathematical function I'd like to describe, called the Sigmoid function, sometimes also referred to as the logistic function.
The Sigmoid Function
The Sigmoid function looks like this:

Notice that the x-axes of the graph on the left and the graph on the right are different. In the graph on the left, the x-axis is the tumor size, so it takes on only positive values. In the graph on the right, the horizontal axis is labeled ( z ) and takes on both negative and positive values; I'm showing just the range from -3 to +3. The Sigmoid function outputs a value between 0 and 1. If I use ( g(z) ) to denote this function, then the formula for ( g(z) ) is:

( g(z) = \frac{1}{1 + e^{-z}} ), where ( 0 < g(z) < 1 )
Here ( e ) is a mathematical constant that takes on a value of about 2.7, so ( e^{-z} ) is that constant raised to the power ( -z ). Notice that if ( z ) were really big, say 100, then ( e^{-z} ) is ( e^{-100} ), which is a tiny number, so ( g(z) ) ends up being:

( g(z) = \frac{1}{1 + \text{tiny number}} \approx 1 )

This is why, when ( z ) is large, ( g(z) ), that is, the Sigmoid function of ( z ), is going to be very close to 1.
Conversely, you can also check for yourself that when ( z ) is a very large negative number, ( e^{-z} ) becomes a huge number, so ( g(z) ) becomes:

( g(z) = \frac{1}{1 + \text{huge number}} \approx 0 )
That's why the Sigmoid function has this shape: it starts very close to 0 and gradually builds up, or grows, to the value of 1. Also, when ( z ) is equal to 0, then ( e^{-z} = e^{0} = 1 ), and so:

( g(0) = \frac{1}{1 + 1} = 0.5 )

That's why the curve crosses the vertical axis at 0.5.
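Here's a minimal sketch of the Sigmoid function in Python, assuming NumPy is available; the printed checks mirror the three cases just discussed:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function: g(z) = 1 / (1 + e^(-z))."""
    return 1 / (1 + np.exp(-z))

# The three cases discussed above:
print(sigmoid(100))   # ~1.0   (e^(-100) is tiny, so 1 / (1 + tiny) ≈ 1)
print(sigmoid(-100))  # ~0.0   (e^(100) is huge, so 1 / (1 + huge) ≈ 0)
print(sigmoid(0))     # 0.5    (e^0 = 1, so 1 / (1 + 1) = 0.5)
```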
Building the Logistic Regression Algorithm
Now, let's use this to build up to the logistic regression algorithm. We're going to do this in two steps. In the first step, recall that a straight-line function, like the linear regression function, can be defined as:

( f_{w,b}(x) = w \cdot x + b )

Let's store this value in a variable which I'm going to call ( z ):

( z = w \cdot x + b )

This will turn out to be the same ( z ) as the one you saw on the previous slide, but we'll get to that in a minute.
The next step then is to take this value of ( z ) and pass it to the Sigmoid function, also called the logistic function, ( g ).
Now, ( g(z) ) then outputs a value computed by this formula:

( g(z) = \frac{1}{1 + e^{-z}} )

which is going to be a value between 0 and 1. When you take these two equations and put them together, they give you the logistic regression model ( f(x) ), which is equal to ( g(w \cdot x + b) ), or equivalently ( g(z) ):

( f_{w,b}(x) = g(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}} )
This is the logistic regression model: it takes as input a feature or set of features ( x ) and outputs a number between 0 and 1.
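Putting the two steps together, here is a minimal sketch of the model in Python; the weight, bias, and tumor size below are made-up values, chosen only so the output lands near 0.7:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def f(x, w, b):
    """Logistic regression model: f(x) = g(w·x + b), a value in (0, 1)."""
    z = np.dot(w, x) + b   # step 1: the linear part, z = w·x + b
    return sigmoid(z)      # step 2: squash z through the Sigmoid

# Hypothetical parameter values and tumor size, purely for illustration:
w, b = np.array([0.8]), -1.5
print(f(np.array([2.9]), w, b))  # ≈ 0.7 for this made-up input
```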
Interpreting the Output
Next, let's take a look at how to interpret the output of logistic regression. We'll return to the tumor-classification example. The way I encourage you to think of logistic regression's output is as the probability that the class, or label, ( y ) will be equal to 1 given a certain input ( x ). For example, in this application, where ( x ) is the tumor size and ( y ) is either 0 or 1, if a patient comes in with a tumor of a certain size ( x ), and based on this input the model outputs 0.7, that means the model is predicting, or thinks, there's a 70 percent chance that the true label ( y ) is equal to 1 for this patient.
In other words, the model is telling us that it thinks the patient has a 70 percent chance of the tumor turning out to be malignant. Now, let me ask you a question; see if you can get this right. We know that ( y ) has to be either 0 or 1, so if ( y ) has a 70 percent chance of being 1, what is the chance that it is 0? Since ( y ) has to be either 0 or 1, the probability of it being 0 and the probability of it being 1 have to add up to 1, or 100 percent.
That's why if the chance of ( y ) being 1 is 0.7 or 70 percent chance, then the chance of it being 0 has got to be 0.3 or 30 percent chance.
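Written out, that's just the complement rule:

( P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = 1 - 0.7 = 0.3 )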
If someday you read research papers or blog posts about logistic regression, you will sometimes see this notation:

( f_{w,b}(x) = P(y = 1 \mid x; w, b) )
What the semicolon here is used to denote is just that ( w ) and ( b ) are parameters that affect this computation of what is the probability of ( y ) being equal to 1 given the input feature ( x ). For the purpose of this class, don't worry too much about what this vertical line and what the semicolon mean. You don't need to remember or follow any of this mathematical notation for this class. I'm mentioning this only because you may see this in other places.
Conclusion
In the notebook that follows this part, you'll also get to see how the Sigmoid function is implemented in code, along with a plot that uses the Sigmoid function to do better on the classification task you saw in the previous notebook. Remember that the code will be provided to you, so you just have to run it. I hope you take a look and get familiar with it.
Congrats on getting here. You now know what the logistic regression model is, as well as the mathematical formula that defines logistic regression. For a long time, a lot of Internet advertising was actually driven by a slight variation of logistic regression. This was very lucrative for some large companies, and this is basically the algorithm that decided what ad was shown to you and many others on some large websites. Now, there's even more to learn about this algorithm. In the next part, we'll take a look at the details of logistic regression. We'll look at some visualizations and also examine something called the decision boundary. This will give you a few different ways to map the numbers that this model outputs, such as 0.3, or 0.7, or 0.65 to a prediction of whether ( y ) is actually 0 or 1. Let's go on to the next part to learn more about logistic regression.