In the week 2 assignment, you implemented the gradient descent method to build a linear regression model, predicting sales given a TV marketing budget. In this lab, you will construct a neural network corresponding to the same simple linear regression model. Then you will train the network, implementing the gradient descent method. After that you will increase the complexity of the neural network to build a multiple linear regression model, predicting house prices based on their size and quality.
Note: The same models were discussed in Course 1 "Linear Algebra" week 3 assignment, but model training with backward propagation was omitted.
Packages
Let's first import all the required packages.
import numpy as np
import matplotlib.pyplot as plt
# A library for data manipulation and analysis.
import pandas as pd

# Output of plotting commands is displayed inline within the Jupyter notebook.
%matplotlib inline 

# Set a seed so that the results are consistent.
np.random.seed(3)
You can describe a simple linear regression model as

$$\hat{y} = wx + b,\tag{1}$$

where $\hat{y}$ is a prediction of the dependent variable $y$ based on the independent variable $x$ using a line equation with the slope $w$ and intercept $b$.
Given a set of training data points $(x_1, y_1)$, ..., $(x_m, y_m)$, you will find the "best" fitting line - such parameters $w$ and $b$ that the differences between the original values $y_i$ and the predicted values $\hat{y}_i = wx_i + b$ are minimum.
Input Node
The simplest neural network model that describes the above problem can be realized by using one perceptron. The input and output layers will have one node each ($x$ for input and $\hat{y} = z$ for output):
Weight (w) and bias (b) are the parameters that will get updated when you train the model. They are initialized to some random values or set to 0 and updated as the training progresses.
For each training example $x^{(i)}$, the prediction $\hat{y}^{(i)}$ can be calculated as:

$$z^{(i)} = w x^{(i)} + b,$$
$$\hat{y}^{(i)} = z^{(i)},\tag{2}$$

where $i = 1, \dots, m$.
You can organise all training examples as a vector $X$ of size $(1 \times m)$ and perform scalar multiplication of $X$ $(1 \times m)$ by the scalar $w$, adding $b$, which will be broadcasted to a vector of size $(1 \times m)$:

$$Z = w X + b,$$
$$\hat{Y} = Z.\tag{3}$$
This set of calculations is called forward propagation.
For each training example you can measure the difference between the original value $y^{(i)}$ and the predicted value $\hat{y}^{(i)}$ with the loss function $\mathcal{L}(w, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$. Division by 2 is taken just for scaling purposes; you will see the reason below when calculating the partial derivatives. To compare the resulting vector of predictions $\hat{Y}$ $(1 \times m)$ with the vector $Y$ of original values $y^{(i)}$, you can take an average of the loss function values over the training examples:

$$\mathcal{L}(w, b) = \frac{1}{2m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.\tag{4}$$

This function is called the sum of squares cost function. The aim is to optimize the cost function during training, which will minimize the differences between the original values $y^{(i)}$ and the predicted values $\hat{y}^{(i)}$.
When the weights have just been initialized with some random values and no training has been done yet, you can't expect good results. You need to calculate the adjustments for the weight and bias that minimize the cost function. This process is called backward propagation.
According to the gradient descent algorithm, you can calculate the partial derivatives as:

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)x^{(i)},$$
$$\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right).\tag{5}$$

You can see how the additional division by 2 in equation (4) helped to simplify the results of the partial derivatives. Then update the parameters iteratively using the expressions
$$w = w - \alpha \frac{\partial \mathcal{L}}{\partial w},$$
$$b = b - \alpha \frac{\partial \mathcal{L}}{\partial b},\tag{6}$$
where α is the learning rate. Then repeat the process until the cost function stops decreasing.
The general methodology to build a neural network is to:
Define the neural network structure (# of input units, # of hidden units, etc.).
Initialize the model's parameters
Loop:
Implement forward propagation (calculate the perceptron output),
Implement backward propagation (to get the required corrections for the parameters),
Update parameters.
Make predictions.
You often build helper functions to compute steps 1-3 and then merge them into one function nn_model(). Once you've built nn_model() and learnt the right parameters, you can make predictions on new data.
Load the Kaggle dataset (www.kaggle.com/code/devzohaib/simple-linear-regression/notebook), saved in a file data/tvmarketing.csv. It has two fields: TV marketing expenses (TV) and sales amount (Sales).
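The dataset can be read into a pandas DataFrame; a minimal sketch, assuming the variable name adv used by the normalization code below:

adv = pd.read_csv("data/tvmarketing.csv")
# Preview the first few rows of the dataset.
adv.head()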
The fields TV and Sales have different units. Remember that in the week 2 assignment, to make the gradient descent algorithm efficient, you needed to normalize each of them: subtract the mean value of the array from each of its elements and divide them by the standard deviation.
Column-wise normalization of the dataset can be done for all of the fields at once and is implemented in the following code:
adv_norm = (adv - np.mean(adv))/np.std(adv)
Plotting the data, you can see that it looks similar after normalization, but the values on the axes have changed:
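A minimal sketch of such a side-by-side plot, assuming the column names TV and Sales as above:

# Scatter plots of the original and normalized data for comparison.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
adv.plot(x="TV", y="Sales", kind="scatter", ax=axes[0], title="Original data")
adv_norm.plot(x="TV", y="Sales", kind="scatter", ax=axes[1], title="Normalized data")
plt.show()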
Set up the neural network in a way that will allow you to extend this simple case of a model with a single perceptron and one input node to more complicated structures later.
Define two variables:
n_x: the size of the input layer
n_y: the size of the output layer
using shapes of arrays X and Y.
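The arrays X_norm and Y_norm used below are assumed to be row vectors built from the normalized DataFrame; a minimal sketch:

# Reshape the normalized columns into arrays of shape (1, m),
# matching the (input size, number of examples) convention used below.
X_norm = np.array(adv_norm["TV"]).reshape((1, len(adv_norm)))
Y_norm = np.array(adv_norm["Sales"]).reshape((1, len(adv_norm)))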
def layer_sizes(X, Y):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)
    
    Returns:
    n_x -- the size of the input layer
    n_y -- the size of the output layer
    """
    n_x = X.shape[0]
    n_y = Y.shape[0]
    
    return (n_x, n_y)

(n_x, n_y) = layer_sizes(X_norm, Y_norm)
print("The size of the input layer is: n_x = " + str(n_x))
print("The size of the output layer is: n_y = " + str(n_y))
Implement the function initialize_parameters(), initializing the weights array of shape $(n_y \times n_x) = (1 \times 1)$ with random values and the bias vector of shape $(n_y \times 1) = (1 \times 1)$ with zeros.
def initialize_parameters(n_x, n_y):
    """
    Returns:
    params -- python dictionary containing your parameters:
              W -- weight matrix of shape (n_y, n_x)
              b -- bias value set as a vector of shape (n_y, 1)
    """
    W = np.random.randn(n_y, n_x) * 0.01
    b = np.zeros((n_y, 1))
    
    parameters = {"W": W,
                  "b": b}
    
    return parameters

parameters = initialize_parameters(n_x, n_y)
print("W = " + str(parameters["W"]))
print("b = " + str(parameters["b"]))
def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters (output of initialization function)
    
    Returns:
    Y_hat -- the output
    """
    W = parameters["W"]
    b = parameters["b"]
    
    # Forward propagation to calculate Z.
    Z = np.matmul(W, X) + b
    Y_hat = Z
    
    return Y_hat

Y_hat = forward_propagation(X_norm, parameters)
print("Some elements of output vector Y_hat:", Y_hat[0, 0:5])
Your weights were just initialized with some random values, so the model has not been trained yet.
Define a cost function (4) which will be used to train the model:
$$\mathcal{L}(w, b) = \frac{1}{2m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
def compute_cost(Y_hat, Y):
    """
    Computes the cost function as a sum of squares
    
    Arguments:
    Y_hat -- the output of the neural network of shape (n_y, number of examples)
    Y -- "true" labels vector of shape (n_y, number of examples)
    
    Returns:
    cost -- sum of squares scaled by 1/(2*number of examples)
    """
    # Number of examples.
    m = Y_hat.shape[1]
    
    # Compute the cost function.
    cost = np.sum((Y_hat - Y)**2)/(2*m)
    
    return cost

print("cost = " + str(compute_cost(Y_hat, Y_norm)))
def backward_propagation(Y_hat, X, Y):
    """
    Implements the backward propagation, calculating gradients
    
    Arguments:
    Y_hat -- the output of the neural network of shape (n_y, number of examples)
    X -- input data of shape (n_x, number of examples)
    Y -- "true" labels vector of shape (n_y, number of examples)
    
    Returns:
    grads -- python dictionary containing gradients with respect to different parameters
    """
    m = X.shape[1]
    
    # Backward propagation: calculate partial derivatives denoted as dW, db for simplicity.
    dZ = Y_hat - Y
    dW = 1/m * np.dot(dZ, X.T)
    db = 1/m * np.sum(dZ, axis=1, keepdims=True)
    
    grads = {"dW": dW,
             "db": db}
    
    return grads

grads = backward_propagation(Y_hat, X_norm, Y_norm)
print("dW = " + str(grads["dW"]))
print("db = " + str(grads["db"]))
Update parameters as shown in (6):
$$w = w - \alpha \frac{\partial \mathcal{L}}{\partial w},$$
$$b = b - \alpha \frac{\partial \mathcal{L}}{\partial b}.$$
def update_parameters(parameters, grads, learning_rate=1.2):
    """
    Updates parameters using the gradient descent update rule
    
    Arguments:
    parameters -- python dictionary containing parameters
    grads -- python dictionary containing gradients
    learning_rate -- learning rate parameter for gradient descent
    
    Returns:
    parameters -- python dictionary containing updated parameters
    """
    # Retrieve each parameter from the dictionary "parameters".
    W = parameters["W"]
    b = parameters["b"]
    
    # Retrieve each gradient from the dictionary "grads".
    dW = grads["dW"]
    db = grads["db"]
    
    # Update rule for each parameter.
    W = W - learning_rate * dW
    b = b - learning_rate * db
    
    parameters = {"W": W,
                  "b": b}
    
    return parameters

parameters_updated = update_parameters(parameters, grads)
print("W updated = " + str(parameters_updated["W"]))
print("b updated = " + str(parameters_updated["b"]))
def nn_model(X, Y, num_iterations=10, learning_rate=1.2, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (n_x, number of examples)
    Y -- labels of shape (n_y, number of examples)
    num_iterations -- number of iterations in the loop
    learning_rate -- learning rate parameter for gradient descent
    print_cost -- if True, print the cost every iteration
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to make predictions.
    """
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[1]
    
    parameters = initialize_parameters(n_x, n_y)
    
    # Loop
    for i in range(0, num_iterations):
        # Forward propagation. Inputs: "X, parameters". Outputs: "Y_hat".
        Y_hat = forward_propagation(X, parameters)
        
        # Cost function. Inputs: "Y_hat, Y". Outputs: "cost".
        cost = compute_cost(Y_hat, Y)
        
        # Backpropagation. Inputs: "Y_hat, X, Y". Outputs: "grads".
        grads = backward_propagation(Y_hat, X, Y)
        
        # Gradient descent parameter update. Inputs: "parameters, grads, learning_rate". Outputs: "parameters".
        parameters = update_parameters(parameters, grads, learning_rate)
        
        # Print the cost every iteration.
        if print_cost:
            print("Cost after iteration %i: %f" % (i, cost))

    return parameters
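The trained parameters, referred to below as parameters_simple, can then be obtained by running the model; the number of iterations used here is an assumed choice:

# Train the simple linear regression model on the normalized data.
parameters_simple = nn_model(X_norm, Y_norm, num_iterations=30, learning_rate=1.2, print_cost=True)
print("W = " + str(parameters_simple["W"]))
print("b = " + str(parameters_simple["b"]))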
You can see that after a few iterations the cost function does not change anymore (the model converges).
Note: This is a very simple model. In reality the models do not converge that quickly.
The final model parameters can be used for making predictions, but don't forget about normalization and denormalization.
def predict(X, Y, parameters, X_pred):
    
    # Retrieve each parameter from the dictionary "parameters".
    W = parameters["W"]
    b = parameters["b"]
    
    # Use the same mean and standard deviation of the original training array X.
    if isinstance(X, pd.Series):
        X_mean = np.mean(X)
        X_std = np.std(X)
        X_pred_norm = ((X_pred - X_mean)/X_std).reshape((1, len(X_pred)))
    else:
        X_mean = np.array(np.mean(X)).reshape((len(X.axes[1]), 1))
        X_std = np.array(np.std(X)).reshape((len(X.axes[1]), 1))
        X_pred_norm = ((X_pred - X_mean)/X_std)
        
    # Make predictions.
    Y_pred_norm = np.matmul(W, X_pred_norm) + b
    
    # Use the same mean and standard deviation of the original training array Y.
    Y_pred = Y_pred_norm * np.std(Y) + np.mean(Y)
    
    return Y_pred[0]

X_pred = np.array([50, 120, 280])
Y_pred = predict(adv["TV"], adv["Sales"], parameters_simple, X_pred)

print(f"TV marketing expenses:\n{X_pred}")
print(f"Predictions of sales:\n{Y_pred}")
Let's plot the linear regression line and some predictions. The regression line is red and the predicted points are blue.
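A minimal sketch of this plot, assuming the variables adv, parameters_simple, X_pred and Y_pred defined above:

# Plot the data, the fitted regression line (red) and the predicted points (blue).
W_simple = parameters_simple["W"]
b_simple = parameters_simple["b"]

fig, ax = plt.subplots()
ax.scatter(adv["TV"], adv["Sales"], color="black")

# Build the line in the original units: normalize x, apply the model, denormalize y.
X_line = np.arange(np.min(adv["TV"]), np.max(adv["TV"]) * 1.1, 0.1)
X_line_norm = (X_line - np.mean(adv["TV"])) / np.std(adv["TV"])
Y_line = (W_simple[0, 0] * X_line_norm + b_simple[0, 0]) * np.std(adv["Sales"]) + np.mean(adv["Sales"])

ax.plot(X_line, Y_line, color="red")
ax.scatter(X_pred, Y_pred, color="blue")
ax.set_xlabel("TV")
ax.set_ylabel("Sales")
plt.show()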
You can write the multiple linear regression model as

$$\hat{y} = w_1 x_1 + w_2 x_2 + b = Wx + b,\tag{7}$$

where $Wx$ is the dot product of the input vector $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ and the parameters vector $W = \begin{bmatrix} w_1 & w_2 \end{bmatrix}$, and the scalar parameter $b$ is the intercept. The goal of the training process is to find the "best" parameters $w_1$, $w_2$ and $b$ such that the differences between the original values $y_i$ and the predicted values $\hat{y}_i$ are minimum for the given training examples.
Input Nodes
To describe the multiple regression problem, you can still use a model with one perceptron, but this time you need two input nodes, as shown in the following scheme:
The perceptron output calculation for every training example $x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}$ can be written with a dot product:

$$z^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} + b = W x^{(i)} + b,\tag{8}$$

where the weights are in the vector $W = \begin{bmatrix} w_1 & w_2 \end{bmatrix}$ and the bias $b$ is a scalar. The output layer will have the same single node $\hat{y} = z$.
Organise all training examples in a matrix $X$ of shape $(2 \times m)$, putting $x_1^{(i)}$ and $x_2^{(i)}$ into the columns. Then matrix multiplication of $W$ $(1 \times 2)$ and $X$ $(2 \times m)$ will give a $(1 \times m)$ vector:

$$Z = W X + b,$$
$$\hat{Y} = Z,\tag{9}$$

where $b$ is broadcasted to a vector of size $(1 \times m)$. These are the calculations to perform in the forward propagation step. The cost function will remain the same (see equation (4) in section 1.2):
$$\mathcal{L}(w, b) = \frac{1}{2m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$
To implement the gradient descent algorithm, you can calculate the cost function partial derivatives as:

$$\frac{\partial \mathcal{L}}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)x_1^{(i)},$$
$$\frac{\partial \mathcal{L}}{\partial w_2} = \frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)x_2^{(i)},$$
$$\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right).\tag{10}$$
After performing the forward propagation as shown in (9), the variable $\hat{Y}$ will contain the predictions in an array of size $(1 \times m)$. The original values $y^{(i)}$ will be kept in the array $Y$ of the same size. Thus, $\left(\hat{Y} - Y\right)$ will be a $(1 \times m)$ array containing the differences $\left(\hat{y}^{(i)} - y^{(i)}\right)$. The matrix $X$ of size $(2 \times m)$ has all $x_1^{(i)}$ values in the first row and $x_2^{(i)}$ in the second row. Thus, the sums in the first two equations of (10) can be calculated as matrix multiplication of $\left(\hat{Y} - Y\right)$ of shape $(1 \times m)$ and $X^T$ of shape $(m \times 2)$, resulting in a $(1 \times 2)$ array:
$$\frac{\partial \mathcal{L}}{\partial W} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial w_1} & \frac{\partial \mathcal{L}}{\partial w_2} \end{bmatrix} = \frac{1}{m}\left(\hat{Y} - Y\right)X^T.\tag{11}$$
Similarly for $\frac{\partial \mathcal{L}}{\partial b}$:

$$\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{m}\left(\hat{Y} - Y\right)\mathbf{1},\tag{12}$$

where $\mathbf{1}$ is just a $(m \times 1)$ vector of ones.
See how linear algebra and calculus work together to make the calculations so nice and tidy! You can now update the parameters using the matrix form of $W$:
$$W = W - \alpha \frac{\partial \mathcal{L}}{\partial W},$$
$$b = b - \alpha \frac{\partial \mathcal{L}}{\partial b},\tag{13}$$
where α is a learning rate. Repeat the process in a loop until the cost function stops decreasing.
Let's build a multiple linear regression model for the Kaggle dataset House Prices, saved in a file data/house_prices_train.csv. You will use two fields - ground living area (GrLivArea, square feet) and rates of the overall quality of material and finish (OverallQual, 1-10) - to predict the sales price (SalePrice, dollars).
To open the dataset you can use pandas function read_csv:
df = pd.read_csv('data/house_prices_train.csv')
Select the required fields and save them in the variables X_multi, Y_multi:
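A minimal sketch of this step; the column selection and the reuse of the same column-wise normalization as before are assumptions based on the field names above:

# Select the input features and the target.
X_multi = df[["GrLivArea", "OverallQual"]]
Y_multi = df["SalePrice"]

# Normalize column-wise, as before.
X_multi_norm = (X_multi - np.mean(X_multi)) / np.std(X_multi)
Y_multi_norm = (Y_multi - np.mean(Y_multi)) / np.std(Y_multi)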
Convert the results to NumPy arrays, transpose X_multi_norm to get an array of shape $(2 \times m)$, and reshape Y_multi_norm to bring it to the shape $(1 \times m)$:
X_multi_norm = np.array(X_multi_norm).T
Y_multi_norm = np.array(Y_multi_norm).reshape((1, len(Y_multi_norm)))

print('The shape of X: ' + str(X_multi_norm.shape))
print('The shape of Y: ' + str(Y_multi_norm.shape))
print('I have m = %d training examples!' % (X_multi_norm.shape[1]))
Regression
Now... you do not need to change anything in your neural network implementation! Go through the code in section 2 and see that if you pass the new datasets X_multi_norm and Y_multi_norm, the input layer size $n_x$ will be equal to 2, and the rest of the implementation will remain exactly the same, even the backward propagation!
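The trained parameters, referred to below as parameters_multi, can be obtained in the same way; the number of iterations used here is an assumed choice:

# Train the multiple linear regression model on the new dataset.
parameters_multi = nn_model(X_multi_norm, Y_multi_norm, num_iterations=100, learning_rate=1.2, print_cost=True)
print("W = " + str(parameters_multi["W"]))
print("b = " + str(parameters_multi["b"]))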
X_pred_multi = np.array([[1710, 7], [1200, 6], [2200, 8]]).T
Y_pred_multi = predict(X_multi, Y_multi, parameters_multi, X_pred_multi)

print(f"Ground living area, square feet:\n{X_pred_multi[0]}")
print(f"Rates of the overall quality of material and finish, 1-10:\n{X_pred_multi[1]}")
print(f"Predictions of sales price, $:\n{np.round(Y_pred_multi)}")