Classification is the problem of identifying which of a set of categories an observation belongs to. In the case of only two categories, it is called a binary classification problem. Let's look at a simple example.
Imagine that you have a set of sentences which you want to classify as "happy" or "angry", and you have identified that the sentences contain only two words: aack and beep. For each sentence (a data point in the given dataset) you count the occurrences of those two words ($x_1$ and $x_2$) and compare the counts. If there are more "beep" ($x_2 > x_1$), the sentence is classified as "angry"; otherwise ($x_2 \le x_1$), it is a "happy" sentence. This means that there will be some straight line separating the two classes.
Let's take a very simple set of 4 sentences:
"Beep!"
"Aack?"
"Beep aack..."
"!?"
Here both $x_1$ and $x_2$ will be either $0$ or $1$. You can plot those points in a plane and see that the observations belong to two classes, "angry" (red) and "happy" (blue), and that a straight line can be used as a decision boundary to separate them. An example of such a line is plotted.
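To make the rule concrete, here is a minimal sketch (the word-counting approach and variable names are illustrative, not part of the lab code) that counts the two words in each sentence and applies the rule $x_2 > x_1$:

```python
sentences = ["Beep!", "Aack?", "Beep aack...", "!?"]

for s in sentences:
    x1 = s.lower().count("aack")  # occurrences of "aack"
    x2 = s.lower().count("beep")  # occurrences of "beep"
    label = "angry" if x2 > x1 else "happy"
    print(f"{s!r}: x1={x1}, x2={x2} -> {label}")
```

Only "Beep!" has $x_2 > x_1$, so it is the single red point at $(0, 1)$; the other three sentences are blue.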
This particular line was chosen using common sense, just by looking at the visual representation of the observations. Such a classification problem is called a problem with two linearly separable classes.
The line $x_1 - x_2 + 0.5 = 0$ (or $x_2 = x_1 + 0.5$) can be used as a separating line for this problem. All of the points $(x_1, x_2)$ above this line, such that $x_1 - x_2 + 0.5 < 0$ (or $x_2 > x_1 + 0.5$), will be considered as belonging to the red class, and the points below this line, with $x_1 - x_2 + 0.5 > 0$ ($x_2 < x_1 + 0.5$), as belonging to the blue class. So the problem can be rephrased: in the expression $w_1x_1 + w_2x_2 + b = 0$, find the values of the parameters $w_1$, $w_2$ and the bias $b$ such that the line can serve as a decision boundary.
In this simple example you could solve the problem of finding the decision boundary just by looking at the plot: $w_1 = 1$, $w_2 = -1$, $b = 0.5$. But what if the problem is more complicated? You can use a simple neural network model to do that! Let's implement it for this example and then try it on a more complicated problem.
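As a quick sanity check (a sketch; the array layout with one example per column follows the convention used later in this lab), you can verify that these hand-picked parameters separate the four points:

```python
import numpy as np

W = np.array([[1, -1]])       # [w1, w2]
b = 0.5
X = np.array([[0, 1, 1, 0],   # x1: "aack" counts of the four sentences
              [1, 0, 1, 0]])  # x2: "beep" counts of the four sentences

print(W @ X + b)  # [[-0.5  1.5  0.5  0.5]]: negative -> red, positive -> blue
```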
You have already constructed and trained a neural network model with one perceptron. Here a similar model can be used, but with an activation function, so that the single perceptron basically works as a threshold function.
The neural network components are shown in the following scheme:
Similarly to the previous lab, the input layer contains two nodes $x_1$ and $x_2$. The weight vector $W = \begin{bmatrix} w_1 & w_2 \end{bmatrix}$ and the bias ($b$) are the parameters to be updated during model training. The first step of forward propagation is the same as in the previous lab. For every training example $x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}$:
$$z^{(i)} = w_1x_1^{(i)} + w_2x_2^{(i)} + b = Wx^{(i)} + b.\tag{1}$$
But now you cannot simply pass the real number $z^{(i)}$ to the output, as you need to perform classification. It could be done with a discrete approach: compare the result with zero, and classify as $0$ (blue) if it is below zero and $1$ (red) if it is above zero. Then define the cost function as the percentage of incorrectly identified classes and perform backward propagation.
This extra step in the forward propagation is actually an application of an activation function. It would be possible to implement the discrete approach described above (with the unit step function) for this problem, but it turns out that there is a continuous approach that works better and is commonly used in more complicated neural networks. So you will implement it here: a single perceptron with a sigmoid activation function.
The sigmoid activation function is defined as

$$a = \sigma(z) = \frac{1}{1 + e^{-z}}.\tag{2}$$
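In NumPy this is a one-liner; a minimal sketch (the name `sigmoid` matches how the function is referred to below, though the lab's own implementation may differ in details):

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid; works for scalars and NumPy arrays alike.
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))                     # 0.5, exactly at the decision threshold
print(sigmoid(np.array([-5, 0, 5])))  # ~[0.007, 0.5, 0.993]
```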
Then a threshold value of $0.5$ can be used for predictions: $1$ (red) if $a > 0.5$ and $0$ (blue) otherwise. Putting it all together, mathematically the single perceptron neural network with sigmoid activation function can be expressed as:
$$\begin{align}
z^{(i)} &= Wx^{(i)} + b,\\
a^{(i)} &= \sigma\left(z^{(i)}\right).
\end{align}$$
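A sketch of the corresponding prediction step (assuming the `sigmoid` helper above; the lab's own prediction function may be structured differently):

```python
def predict(X, W, b):
    # Forward pass followed by thresholding at 0.5: 1 (red) if a > 0.5, else 0 (blue).
    A = sigmoid(W @ X + b)
    return (A > 0.5).astype(int)
```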
If you have $m$ training examples organised in the columns of the $(2 \times m)$ matrix $X$, you can apply the activation function element-wise. So the model can be written as:
$$\begin{align}
Z &= WX + b,\\
A &= \sigma(Z),
\end{align}$$
where $b$ is broadcast to a vector of size $(1 \times m)$.
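A minimal sketch of this vectorized forward pass (assuming the parameters are stored in a dictionary with keys `"W"` and `"b"`, as in the previous lab; that storage format is an assumption here):

```python
def forward_propagation(X, parameters):
    W = parameters["W"]  # shape (1, 2)
    b = parameters["b"]  # shape (1, 1), broadcast across the m columns
    Z = W @ X + b        # shape (1, m)
    A = sigmoid(Z)       # element-wise activation, shape (1, m)
    return A
```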
When dealing with classification problems, the most commonly used cost function is the log loss, which is described by the following equation:

$$\mathcal{L}\left(W, b\right) = \frac{1}{m}\sum_{i=1}^{m} \left(-y^{(i)}\log\left(a^{(i)}\right) - \left(1 - y^{(i)}\right)\log\left(1 - a^{(i)}\right)\right),$$

where $y^{(i)} \in \{0, 1\}$ is the true label and $a^{(i)}$ is the model output for the $i$-th training example.
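A sketch of this cost in code (the name `compute_cost` matches the function mentioned below; the exact signature is an assumption):

```python
def compute_cost(A, Y):
    m = Y.shape[1]
    # Log loss, averaged over the m training examples.
    log_loss = -Y * np.log(A) - (1 - Y) * np.log(1 - A)
    return np.sum(log_loss) / m
```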
Differentiating the log loss with respect to the parameters (the derivation uses the chain rule together with the fact that $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$) gives

$$\frac{\partial \mathcal{L}}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m}\left(a^{(i)} - y^{(i)}\right)x_1^{(i)},\quad
\frac{\partial \mathcal{L}}{\partial w_2} = \frac{1}{m}\sum_{i=1}^{m}\left(a^{(i)} - y^{(i)}\right)x_2^{(i)},\quad
\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(a^{(i)} - y^{(i)}\right).\tag{7}$$

Note that the obtained expressions (7) are exactly the same as in section 3.2 of the previous lab, where the multiple linear regression model was discussed. Thus, they can be rewritten in matrix form:

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{1}{m}\left(A - Y\right)X^T,\quad
\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{m}\left(A - Y\right)\mathbf{1},$$

where $\mathbf{1}$ is an $(m \times 1)$ vector of ones.
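A sketch of these gradients in code (the function name and return format are assumptions for illustration):

```python
def backward_propagation(A, X, Y):
    m = X.shape[1]
    dZ = A - Y                                  # shape (1, m)
    dW = (dZ @ X.T) / m                         # shape (1, 2)
    db = np.sum(dZ, axis=1, keepdims=True) / m  # shape (1, 1)
    return dW, db
```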
Let's get the dataset you will work on. The following code will create $m = 30$ data points $(x_1, x_2)$, where $x_1, x_2 \in \{0, 1\}$, and save them in the NumPy array $X$ of shape $(2 \times m)$ (in the columns of the array). The labels ($0$: blue, $1$: red) will be calculated so that $y = 1$ if $x_1 = 0$ and $x_2 = 1$; in the rest of the cases $y = 0$. The labels will be saved in the array $Y$ of shape $(1 \times m)$.
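The generating code is not reproduced here; a minimal sketch that matches this description (the seed and the exact generation details are assumptions):

```python
import numpy as np

np.random.seed(3)                    # assumed seed, for reproducibility only
m = 30
X = np.random.randint(0, 2, (2, m))  # x1, x2 in {0, 1}, one example per column
Y = np.logical_and(X[0] == 0, X[1] == 1).astype(int).reshape(1, m)  # y = 1 iff x1 = 0 and x2 = 1
```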
Implementation of the described neural network will be very similar to the previous lab. The only differences will be in the functions `forward_propagation` and `compute_cost`!
Implement the function `initialize_parameters()`, initializing the weights array of shape $(n_y \times n_x) = (1 \times 2)$ with random values and the bias vector of shape $(n_y \times 1) = (1 \times 1)$ with zeros.
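A sketch of such an initialization (the scaling factor $0.01$ and the dictionary return format follow common practice and the previous lab's convention; treat both as assumptions):

```python
def initialize_parameters(n_x, n_y):
    W = np.random.randn(n_y, n_x) * 0.01  # small random values, shape (1, 2) here
    b = np.zeros((n_y, 1))                # zeros, shape (1, 1)
    return {"W": W, "b": b}

parameters = initialize_parameters(2, 1)
```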
You can see that after about 40 iterations the cost function still decreases, but much more slowly. This is a sign that it might be reasonable to stop training there. The final model parameters can be used to find the boundary line and to make predictions. Let's visualize the boundary line.
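One way to plot the learned boundary (a sketch assuming `matplotlib` and the trained `parameters` dictionary from above; it solves $w_1x_1 + w_2x_2 + b = 0$ for $x_2$):

```python
import matplotlib.pyplot as plt

W, b = parameters["W"], parameters["b"]
x1_line = np.linspace(-0.5, 1.5, 100)
x2_line = -(W[0, 0] * x1_line + b[0, 0]) / W[0, 1]  # boundary: w1*x1 + w2*x2 + b = 0

plt.scatter(X[0], X[1], c=Y[0], cmap=plt.cm.coolwarm)  # blue = 0, red = 1
plt.plot(x1_line, x2_line, "k-", label="decision boundary")
plt.xlabel("$x_1$"); plt.ylabel("$x_2$"); plt.legend(); plt.show()
```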