In the week 2 assignment, you implemented the gradient descent method to build a linear regression model, predicting sales given a TV marketing budget. In this lab, you will construct a neural network corresponding to the same simple linear regression model. Then you will train the network, implementing the gradient descent method. After that you will increase the complexity of the neural network to build a multiple linear regression model, predicting house prices based on their size and quality.
Note: The same models were discussed in Course 1 "Linear Algebra" week 3 assignment, but model training with backward propagation was omitted.
$$\hat{y} = wx + b,\tag{1}$$

where $\hat{y}$ is a prediction of the dependent variable $y$ based on the independent variable $x$, using a line equation with the slope $w$ and intercept $b$.
Given a set of training data points $(x_1, y_1)$, ..., $(x_m, y_m)$, you will find the "best" fitting line - the parameters $w$ and $b$ for which the differences between the original values $y_i$ and the predicted values $\hat{y}_i = wx_i + b$ are minimized.
### Neural Network Model with a Single Perceptron and One Input Node
The simplest neural network model that describes the above problem can be realized by using one perceptron. The input and output layers will have one node each (x for input and y^=z for output):
Weight (w) and bias (b) are the parameters that will get updated when you train the model. They are initialized to some random values or set to 0 and updated as the training progresses.
For each training example $x^{(i)}$, the prediction $\hat{y}^{(i)}$ can be calculated as:

$$\begin{align}
z^{(i)} &= wx^{(i)} + b,\\
\hat{y}^{(i)} &= z^{(i)},\tag{2}
\end{align}$$

where $i = 1, \dots, m$.
You can organise all training examples as a vector $X$ of size $(1 \times m)$ and perform scalar multiplication of $X$ $(1 \times m)$ by a scalar $w$, adding $b$, which will be broadcasted to a vector of size $(1 \times m)$:

$$\begin{align}
Z &= wX + b,\\
\hat{Y} &= Z,\tag{3}
\end{align}$$
This set of calculations is called forward propagation.
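As a rough sketch of how (3) could look in NumPy (the function name forward_propagation and the layout of the parameters dictionary are illustrative choices, not fixed by the lab):

```python
import numpy as np

def forward_propagation(X, parameters):
    # W has shape (n_y, n_x) and b has shape (n_y, 1); for this model both are (1, 1).
    W = parameters["W"]
    b = parameters["b"]
    # Z = w X + b, with b broadcast over the m columns; Y_hat = Z (no activation).
    Z = np.matmul(W, X) + b
    Y_hat = Z
    return Y_hat
```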
For each training example, you can measure the difference between the original value $y^{(i)}$ and the predicted value $\hat{y}^{(i)}$ with the loss function $\mathcal{L}(w, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$. The division by 2 is just for scaling purposes; you will see the reason below, when calculating the partial derivatives. To compare the resulting vector of predictions $\hat{Y}$ $(1 \times m)$ with the vector $Y$ of original values $y^{(i)}$, you can take an average of the loss function values for each of the training examples:
$$\mathcal{L}\left(w, b\right) = \frac{1}{2m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.\tag{4}$$
This function is called the sum of squares cost function. The aim is to optimize the cost function during the training, which will minimize the differences between original values y(i) and predicted values y^(i).
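A minimal NumPy sketch of the cost (4), assuming Y_hat and Y are both $(1 \times m)$ arrays (the name compute_cost is an illustrative choice):

```python
def compute_cost(Y_hat, Y):
    # Average of 1/2 * (y_hat - y)^2 over the m training examples, equation (4).
    m = Y.shape[1]
    cost = np.sum((Y_hat - Y) ** 2) / (2 * m)
    return cost
```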
When the weights have just been initialized with some random values and no training has been done yet, you can't expect good results. You need to calculate the adjustments for the weight and bias that minimize the cost function. This process is called backward propagation.
According to the gradient descent algorithm, you can calculate the partial derivatives as:

$$\begin{align}
\frac{\partial \mathcal{L}}{\partial w} &= \frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)x^{(i)},\\
\frac{\partial \mathcal{L}}{\partial b} &= \frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right).\tag{5}
\end{align}$$

You can see how the additional division by 2 in equation (4) helped to simplify the results of the partial derivatives. Then update the parameters iteratively using the expressions
$$\begin{align}
w &= w - \alpha \frac{\partial \mathcal{L}}{\partial w},\\
b &= b - \alpha \frac{\partial \mathcal{L}}{\partial b},\tag{6}
\end{align}$$
where α is the learning rate. Then repeat the process until the cost function stops decreasing.
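A possible NumPy sketch of equations (5) and (6); the function names backward_propagation and update_parameters and the grads dictionary are illustrative choices:

```python
def backward_propagation(Y_hat, X, Y):
    # Gradients of the cost (4) with respect to the parameters, equation (5).
    m = X.shape[1]
    dZ = Y_hat - Y                                  # shape (1, m)
    dW = np.matmul(dZ, X.T) / m                     # shape (1, n_x)
    db = np.sum(dZ, axis=1, keepdims=True) / m      # shape (1, 1)
    return {"dW": dW, "db": db}

def update_parameters(parameters, grads, learning_rate):
    # One gradient descent step, equation (6).
    parameters["W"] = parameters["W"] - learning_rate * grads["dW"]
    parameters["b"] = parameters["b"] - learning_rate * grads["db"]
    return parameters
```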
The general methodology to build a neural network is to:
1. Define the neural network structure (# of input units, # of hidden units, etc.).
2. Initialize the model's parameters.
3. Loop:
    - Implement forward propagation (calculate the perceptron output),
    - Implement backward propagation (to get the required corrections for the parameters),
    - Update parameters.
4. Make predictions.
You often build helper functions to compute steps 1-3 and then merge them into one function nn_model(). Once you've built nn_model() and learnt the right parameters, you can make predictions on new data.
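Assuming helper functions with the names used in these sketches exist (forward_propagation(), compute_cost(), backward_propagation() and update_parameters() from above, plus layer_sizes() and initialize_parameters() defined below), nn_model() might be wired together roughly like this; the default iteration count and learning rate are arbitrary example values:

```python
def nn_model(X, Y, num_iterations=100, learning_rate=1.2, print_cost=False):
    # Build and train the single-perceptron regression model with gradient descent.
    n_x, n_y = layer_sizes(X, Y)
    parameters = initialize_parameters(n_x, n_y)
    for i in range(num_iterations):
        Y_hat = forward_propagation(X, parameters)        # forward propagation
        cost = compute_cost(Y_hat, Y)                     # track the cost
        grads = backward_propagation(Y_hat, X, Y)         # backward propagation
        parameters = update_parameters(parameters, grads, learning_rate)
        if print_cost:
            print(f"Cost after iteration {i}: {cost:f}")
    return parameters
```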
Load the Kaggle Simple Linear Regression dataset (www.kaggle.com/code/devzohaib/simple-linear-regression/notebook), saved in a file data/tvmarketing.csv. It has two fields: TV marketing expenses (TV) and sales amount (Sales).
Print some part of the dataset.
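For example, with pandas (the variable name adv is an arbitrary choice):

```python
import pandas as pd

# Load the dataset and show the first few rows.
adv = pd.read_csv("data/tvmarketing.csv")
adv.head()
```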
And plot it:
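A possible plotting cell, using the pandas/matplotlib scatter plot:

```python
import matplotlib.pyplot as plt

# Scatter plot of sales against the TV marketing budget.
adv.plot(x="TV", y="Sales", kind="scatter", color="black")
plt.show()
```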
The fields TV and Sales have different units. Remember that in the week 2 assignment, to make the gradient descent algorithm efficient, you needed to normalize each of them: subtract the mean value of the array from each of its elements and divide them by the standard deviation.
Column-wise normalization of the dataset can be done for all of the fields at once and is implemented in the following code:
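Continuing with the adv DataFrame from the sketch above, a one-line pandas version could be:

```python
# Column-wise normalization: subtract each column's mean, divide by its standard deviation.
adv_norm = (adv - adv.mean()) / adv.std()
```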
Plotting the data, you can see that it looks similar after normalization, but the values on the axes have changed:
Save the fields into variables X_norm and Y_norm and reshape them to row vectors:
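For example, using NumPy reshape (-1 lets NumPy infer the number of examples m):

```python
# Row vectors of shape (1, m) expected by the neural network code below.
X_norm = np.array(adv_norm["TV"]).reshape((1, -1))
Y_norm = np.array(adv_norm["Sales"]).reshape((1, -1))

print(f"X_norm shape: {X_norm.shape}")
print(f"Y_norm shape: {Y_norm.shape}")
```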
Set up the neural network in a way that will allow you to extend this simple case of a model with a single perceptron and one input node to more complicated structures later.
Define two variables:

- n_x: the size of the input layer
- n_y: the size of the output layer

using the shapes of arrays X and Y.
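A minimal sketch of such a helper (the name layer_sizes is an illustrative choice):

```python
def layer_sizes(X, Y):
    # The number of rows of each data array gives the number of nodes in that layer.
    n_x = X.shape[0]   # size of the input layer
    n_y = Y.shape[0]   # size of the output layer
    return n_x, n_y
```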
Implement the function initialize_parameters(), initializing the weights array of shape $(n_y \times n_x) = (1 \times 1)$ with random values and the bias vector of shape $(n_y \times 1) = (1 \times 1)$ with zeros.
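One possible implementation; the 0.01 scaling of the random weights is a common convention rather than a requirement here:

```python
def initialize_parameters(n_x, n_y):
    # Small random weights and zero bias.
    W = np.random.randn(n_y, n_x) * 0.01   # shape (n_y, n_x)
    b = np.zeros((n_y, 1))                 # shape (n_y, 1)
    return {"W": W, "b": b}
```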
You can extend the same idea to a multiple linear regression model, predicting the dependent variable $y$ from two independent variables $x_1$ and $x_2$:

$$\hat{y} = w_1x_1 + w_2x_2 + b = Wx + b,\tag{7}$$

where $Wx$ is the dot product of the input vector $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ and the parameters vector $W = \begin{bmatrix} w_1 & w_2 \end{bmatrix}$, and the scalar parameter $b$ is the intercept. The goal of the training process is to find the "best" parameters $w_1$, $w_2$ and $b$ such that the differences between the original values $y_i$ and the predicted values $\hat{y}_i$ are minimized for the given training examples.
### Neural Network Model with a Single Perceptron and Two Input Nodes
To describe the multiple regression problem, you can still use a model with one perceptron, but this time you need two input nodes, as shown in the following scheme:
The perceptron output calculation for every training example $x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}$ can be written with a dot product:

$$z^{(i)} = w_1x_1^{(i)} + w_2x_2^{(i)} + b = Wx^{(i)} + b,\tag{8}$$
where the weights are in the vector $W = \begin{bmatrix} w_1 & w_2 \end{bmatrix}$ and the bias $b$ is a scalar. The output layer will have the same single node $\hat{y} = z$.
Organise all training examples in a matrix $X$ of shape $(2 \times m)$, putting $x_1^{(i)}$ and $x_2^{(i)}$ into the columns. Then matrix multiplication of $W$ $(1 \times 2)$ and $X$ $(2 \times m)$ will give a $(1 \times m)$ vector:

$$\begin{align}
Z &= WX + b,\\
\hat{Y} &= Z,\tag{9}
\end{align}$$

where $b$ is broadcasted to the vector of size $(1 \times m)$. These are the calculations to perform in the forward propagation step. The cost function will remain the same (see equation (4) in section 1.2):
$$\mathcal{L}\left(w, b\right) = \frac{1}{2m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$
To implement the gradient descent algorithm, you can calculate the cost function partial derivatives as:

$$\begin{align}
\frac{\partial \mathcal{L}}{\partial w_1} &= \frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)x_1^{(i)},\\
\frac{\partial \mathcal{L}}{\partial w_2} &= \frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)x_2^{(i)},\\
\frac{\partial \mathcal{L}}{\partial b} &= \frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right).\tag{10}
\end{align}$$
After performing the forward propagation as shown in (9), the variable $\hat{Y}$ will contain the predictions in an array of size $(1 \times m)$. The original values $y^{(i)}$ will be kept in the array $Y$ of the same size. Thus, $(\hat{Y} - Y)$ will be a $(1 \times m)$ array containing the differences $(\hat{y}^{(i)} - y^{(i)})$. The matrix $X$ of size $(2 \times m)$ has all the $x_1^{(i)}$ values in the first row and the $x_2^{(i)}$ values in the second row. Thus, the sums in the first two equations of (10) can be calculated as matrix multiplication of $(\hat{Y} - Y)$ of shape $(1 \times m)$ and $X^T$ of shape $(m \times 2)$, resulting in a $(1 \times 2)$ array:
$$\frac{\partial \mathcal{L}}{\partial W} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial w_1} & \frac{\partial \mathcal{L}}{\partial w_2} \end{bmatrix} = \frac{1}{m}\left(\hat{Y} - Y\right)X^T.\tag{11}$$
Similarly for $\frac{\partial \mathcal{L}}{\partial b}$:
$$\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{m}\left(\hat{Y} - Y\right)\mathbf{1},\tag{12}$$
where $\mathbf{1}$ is just an $(m \times 1)$ vector of ones.
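These are exactly the quantities that the backward_propagation() sketch above computes. A toy check of equations (11) and (12) with made-up numbers:

```python
# Toy check of equations (11)-(12) with m = 3 examples and two input features.
X_toy = np.array([[1.0, 2.0, 3.0],
                  [0.0, 1.0, 0.0]])        # shape (2, m)
Y_toy = np.array([[2.0, 4.0, 6.0]])        # shape (1, m)
Y_hat_toy = np.array([[1.0, 5.0, 5.0]])    # some predictions, shape (1, m)

m = X_toy.shape[1]
dW = np.matmul(Y_hat_toy - Y_toy, X_toy.T) / m             # (1, 2) array, equation (11)
db = np.sum(Y_hat_toy - Y_toy, axis=1, keepdims=True) / m  # (1, 1) array, equation (12)
print(dW)  # [[-0.66666667  0.33333333]]
print(db)  # [[-0.33333333]]
```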
See how linear algebra and calculus work together to make calculations so nice and tidy! You can now update the parameters using matrix form of W:
$$\begin{align}
W &= W - \alpha \frac{\partial \mathcal{L}}{\partial W},\\
b &= b - \alpha \frac{\partial \mathcal{L}}{\partial b},\tag{13}
\end{align}$$
where α is a learning rate. Repeat the process in a loop until the cost function stops decreasing.
In this section you will build a multiple linear regression model for a Kaggle dataset House Prices, saved in a file data/house_prices_train.csv. You will use two fields - ground living area (GrLivArea, square feet) and rates of the overall quality of material and finish (OverallQual, 1-10) to predict sales price (SalePrice, dollars).
To open the dataset you can use the pandas function read_csv:
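For instance (the variable name df is an arbitrary choice):

```python
# Load the house prices training data.
df = pd.read_csv("data/house_prices_train.csv")
```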
Select the required fields and save them in the variables X_multi, Y_multi:
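A possible selection cell:

```python
# Two input features and the target variable.
X_multi = df[["GrLivArea", "OverallQual"]]
Y_multi = df["SalePrice"]
```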
Preview the data:
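For example, showing the first few rows:

```python
# First few rows of the inputs and the corresponding targets.
print(X_multi.head())
print(Y_multi.head())
```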
Normalize the data:
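Reusing the same column-wise normalization pattern as for the TV marketing data:

```python
# Subtract the mean and divide by the standard deviation, column by column.
X_multi_norm = (X_multi - X_multi.mean()) / X_multi.std()
Y_multi_norm = (Y_multi - Y_multi.mean()) / Y_multi.std()
```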
Convert the results to NumPy arrays, transpose X_multi_norm to get an array of shape $(2 \times m)$, and reshape Y_multi_norm to bring it to shape $(1 \times m)$:
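One way to do this with NumPy:

```python
# (2 x m) array of inputs and (1 x m) array of targets.
X_multi_norm = np.array(X_multi_norm).T
Y_multi_norm = np.array(Y_multi_norm).reshape((1, -1))

print(f"X_multi_norm shape: {X_multi_norm.shape}")
print(f"Y_multi_norm shape: {Y_multi_norm.shape}")
```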
### Performance of the Neural Network Model for Multiple Linear Regression
Now... you do not need to change anything in your neural network implementation! Go through the code in section 2 and see that if you pass the new datasets X_multi_norm and Y_multi_norm, the input layer size n_x will become 2 and the rest of the implementation will remain exactly the same, even the backward propagation!
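For instance, reusing the nn_model() sketch from earlier without modification (the iteration count and learning rate are arbitrary example values):

```python
# Train the same single-perceptron model on the two-feature dataset; n_x is now read off as 2.
parameters_multi = nn_model(X_multi_norm, Y_multi_norm,
                            num_iterations=100, learning_rate=0.5, print_cost=True)
print(parameters_multi)
```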