Let's look at how you can implement feature scaling, which takes features with very different ranges of values and scales them so that their ranges are comparable to each other.
How do you actually scale features? Well, if (x_1) ranges from 300 to 2,000, one way to get a scaled version of (x_1) is to take each original (x_1) value and divide by 2,000, the maximum of the range.
The scaled (x_1) will range from 0.15 up to one.
Similarly, since (x_2) ranges from 0 to 5, you can calculate a scaled version of (x_2) by taking each original (x_2) and dividing by five, which is again the maximum.
So the scaled (x_2) will now range from 0 to 1. If you plot the scaled (x_1) and (x_2) on a graph, it might look like this.
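Dividing by the maximum can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the function name `scale_by_max` and the sample values are made up for this example, chosen only so that they span the ranges mentioned above.

```python
def scale_by_max(values):
    """Divide every value of a feature by that feature's largest value."""
    largest = max(values)
    return [v / largest for v in values]


# Hypothetical sample values (assumed for illustration, not real data).
x1 = [300, 850, 1200, 2000]  # house sizes in square feet
x2 = [0, 2, 3, 5]            # number of bedrooms

x1_scaled = scale_by_max(x1)
x2_scaled = scale_by_max(x2)

# Print the smallest and largest scaled value of each feature.
print(min(x1_scaled), max(x1_scaled))
print(min(x2_scaled), max(x2_scaled))
```

With these sample values, the smallest house maps to 300 / 2,000 and the largest maps to one, so both features end up on a comparable scale.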
In addition to dividing by the maximum, you can also do what's called mean normalization.
What this looks like is, you start with the original features and then you re-scale them so that both of them are centered around zero.
Whereas before they only had values greater than zero, now they have both negative and positive values, usually between negative one and plus one.
To calculate the mean normalization of (x_1), first find the average, also called the mean, of (x_1) on your training set, and let's call this mean (\mu_1), where (\mu) is the Greek letter mu. For example, you may find that the average of feature 1, (\mu_1), is 600 square feet.
Then take each (x_1), subtract the mean (\mu_1), and divide by the difference 2,000 - 300, which is the maximum value minus the minimum value.
In contrast, the average for the number of bedrooms may be 3. So similarly, you can calculate the mean normalization for the second feature.
And it would be:
$$
\text{scaled } x_2 = \frac{x_2 - \mu_2}{\max(x_2) - \min(x_2)} = \frac{x_2 - 3}{5 - 0}
$$
And now both features, (x_1) and (x_2), are centered around zero and ready for gradient descent.
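Mean normalization can be sketched the same way: subtract the feature's mean, then divide by the maximum minus the minimum. The function name `mean_normalize` and the sample values below are assumptions made for this illustration, so the means computed here will not match the example means quoted in the text. The sketch also assumes the feature is not constant, since a constant feature would make the denominator zero.

```python
def mean_normalize(values):
    """Center a feature on zero and divide by its range (max - min)."""
    mu = sum(values) / len(values)        # mean of the feature on this sample
    spread = max(values) - min(values)    # assumed non-zero (feature not constant)
    return [(v - mu) / spread for v in values]


# Hypothetical sample values (assumed for illustration, not real data).
x1 = [300, 850, 1200, 2000]  # house sizes in square feet
x2 = [0, 2, 3, 5]            # number of bedrooms

x1_norm = mean_normalize(x1)
x2_norm = mean_normalize(x2)

# Each normalized feature now has negative and positive values.
print(min(x1_norm), max(x1_norm))
print(min(x2_norm), max(x2_norm))
```

Because the mean is subtracted first, each normalized feature averages to zero, and because the divisor is the range, the gap between its largest and smallest value is exactly one.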
So to recap, we looked at two techniques for scaling features. The first was dividing by the maximum value which allows the features to range between zero and one. The second was mean normalization which centers the features around zero. These techniques can help improve the performance of gradient descent and allow the algorithm to converge much more quickly.
In the next part, we'll look at some implementation tips for scaling features in practice.