Feature Scaling Part 1
Introduction to Feature Scaling
So welcome back. Let's take a look at some techniques that make gradient descent work much better. In this part, you'll see a technique called feature scaling that will enable gradient descent to run much faster.
As a concrete example, let's predict the price of a house using two features: (x_1), the size of the house, and (x_2), the number of bedrooms.
Let's say that (x_1) typically ranges from 300 to 2000 square feet. And (x_2) in the data set ranges from 0 to 5 bedrooms. So for this example, (x_1) takes on a relatively large range of values and (x_2) takes on a relatively small range of values.
Now let's take an example of a house that has a size of 2000 square feet, has five bedrooms, and a price of 500k or $500,000. For this one training example, what do you think are reasonable values for the size of the parameters (w_1) and (w_2)? Well, let's look at one possible set of parameters. Say (w_1) is 50 and (w_2) is 0.1 and (b) is 50 for the purposes of discussion.
So in this case the estimated price is (50 \times 2000 + 0.1 \times 5 + 50) in thousands of dollars, that is, 100,000 plus 0.5 plus 50, which is slightly over 100 million dollars. That's clearly very far from the actual price of $500,000, so this is not a very good set of parameter choices for (w_1) and (w_2).
Now let's take a look at another possibility. Say (w_1) and (w_2) were the other way around: (w_1) is 0.1 and (w_2) is 50, and (b) is still 50. In this choice, (w_1) is relatively small and (w_2) is relatively large; 50 is much bigger than 0.1.
So here the predicted price is (0.1 \times 2000 + 50 \times 5 + 50). The first term becomes (200k), the second term becomes (250k), plus the (50k) from (b). So this version of the model predicts a price of $500,000, which is a much more reasonable estimate and happens to be the same as the true price of the house.
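As a quick check of that arithmetic, here is a minimal Python sketch (the feature values and parameter choices are just the ones from this example) that evaluates (w_1 x_1 + w_2 x_2 + b) for both settings of the parameters.

```python
# Worked example: price predictions (in thousands of dollars) for one house
# with x1 = 2000 square feet and x2 = 5 bedrooms.

def predict(w1, w2, b, x1, x2):
    """Linear model f(x) = w1*x1 + w2*x2 + b."""
    return w1 * x1 + w2 * x2 + b

x1, x2 = 2000, 5  # size in square feet, number of bedrooms

# First choice: the large parameter paired with the large-range feature.
print(predict(50, 0.1, 50, x1, x2))   # 100050.5 -> slightly over $100 million

# Second choice: small w1 with the large-range feature, large w2 with the small one.
print(predict(0.1, 50, 50, x1, x2))   # 500.0 -> $500,000
```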
So hopefully you noticed that when the possible range of values of a feature is large, like the size in square feet, which goes all the way up to 2000, it's more likely that a good model will learn to choose a relatively small parameter value, like 0.1.
Likewise, when the possible values of the feature are small, like the number of bedrooms, then a reasonable value for its parameter will be relatively large, like 50.
Visualization of Features
So how does this relate to gradient descent? Well, let's take a look at a scatter plot of the features, where the size in square feet, (x_1), is on the horizontal axis and the number of bedrooms, (x_2), is on the vertical axis.
If you plot the training data, you notice that the horizontal axis is on a much larger scale or much larger range of values compared to the vertical axis.
Next, let's look at how the cost function might look in a contour plot.
You might see a contour plot where the horizontal axis, (w_1), has a much narrower range, say between zero and one, whereas the vertical axis, (w_2), takes on much larger values, say between 10 and 100. So the contours form ovals or ellipses that are short on one side and longer on the other. This is because a very small change to (w_1) can have a very large impact on the estimated price, and therefore a very large impact on the cost (J), since (w_1) tends to be multiplied by a very large number, the size in square feet. In contrast, it takes a much larger change in (w_2) to change the predictions much, and thus small changes to (w_2) don't change the cost function nearly as much.
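To make that asymmetry concrete, here is a small sketch (using just the single training example above, not the full cost function) that nudges each parameter by the same amount and compares how much the prediction moves.

```python
# Sensitivity sketch: nudge each parameter by the same amount (0.1) and see
# how much the prediction (in thousands of dollars) moves for the single
# training example x1 = 2000 square feet, x2 = 5 bedrooms.

def predict(w1, w2, b, x1=2000, x2=5):
    return w1 * x1 + w2 * x2 + b

base = predict(w1=0.1, w2=50, b=50)          # 500.0

# Changing w1 by 0.1 shifts the prediction by 0.1 * 2000 = 200 (thousand dollars).
print(predict(w1=0.2, w2=50, b=50) - base)   # 200.0

# Changing w2 by 0.1 shifts it by only 0.1 * 5 = 0.5 (thousand dollars).
print(predict(w1=0.1, w2=50.1, b=50) - base) # 0.5
```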
Implications for Gradient Descent
So where does this leave us? This is what might end up happening if you were to run gradient descent using your training data as is. Because the contours are so tall and skinny, gradient descent may end up bouncing back and forth for a long time before it can finally find its way to the global minimum.
In situations like this, a useful thing to do is to scale the features. This means performing some transformation of your training data so that (x_1), say, might now range from 0 to 1 and (x_2) might also range from 0 to 1.
So the data points now look more like this and you might notice that the scale of the plot on the bottom is now quite different than the one on top.
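For illustration only, here is one simple way such a rescaling could be done in Python, dividing each feature by its maximum value so both features end up roughly between 0 and 1. The house sizes other than 2000 are made-up values within the 300-to-2000 range from this example, and the concrete recipes are covered in the next part.

```python
import numpy as np

# Illustrative rescaling sketch: divide each feature column by its maximum
# value so both features end up roughly in the 0-to-1 range.
# The specific rows here are made-up houses within the ranges from the example.
X = np.array([
    [2000, 5],   # size in square feet, number of bedrooms
    [1416, 3],
    [852,  2],
    [300,  1],
], dtype=float)

X_scaled = X / X.max(axis=0)   # column-wise maximums: 2000 sq ft, 5 bedrooms
print(X_scaled)                # e.g. 2000 sq ft -> 1.0, 300 sq ft -> 0.15, 3 bedrooms -> 0.6
```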
The key point is that the re-scaled (x_1) and (x_2) now both take on comparable ranges of values to each other. And if you run gradient descent on a cost function defined on this re-scaled (x_1) and (x_2), using this transformed data, then the contours will look more like this, more like circles and less tall and skinny.
And gradient descent can find a much more direct path to the global minimum.
So to recap, when you have different features that take on very different ranges of values, it can cause gradient descent to run slowly, but re-scaling the different features so they all take on a comparable range of values can speed up gradient descent significantly. How do you actually do this? Let's take a look at that in the next part.