When running gradient descent, how can you tell if it is converging? Specifically, you want to determine whether it's helping you find parameters close to the global minimum of the cost function. By learning to recognize what a well-running implementation of gradient descent looks like, we will also, in a later section, be better able to choose a good learning rate ( \alpha ) (α). Let's take a closer look.
As a reminder, here's the gradient descent rule. One of the key choices is the choice of the learning rate ( \alpha ) (α). To ensure that gradient descent is functioning correctly, I often plot the cost function ( J ), which is calculated on the training set, against the number of iterations.
I plot the value of ( J ) at each iteration of gradient descent.
In this plot, the horizontal axis represents the number of iterations of gradient descent that you've run so far. It's essential to note that the curve shows the iterations of gradient descent, not a parameter like ( w ) or ( b ). This curve is referred to as a learning curve.
Concretely, if you look at a specific point on the curve:
This point indicates that after running gradient descent for 100 iterations (100 simultaneous updates of the parameters), you have learned values for ( w ) and ( b ). If you compute the cost ( J ) for those parameters obtained after 100 iterations, you will get this corresponding value for ( J ).
This point represents the value of ( J ) for the parameters obtained after 200 iterations of gradient descent. By examining this graph, you can see how your cost ( J ) changes after each iteration of gradient descent.
If gradient descent is working properly, then the cost \( J \) should decrease after every single iteration.
If ( J ) ever increases after one iteration, it usually indicates that either ( \alpha ) (α) is poorly chosen (typically too large), or there may be a bug in the code.
The number of iterations required for gradient descent to converge can vary significantly between different applications. In one scenario, it may converge after just 30 iterations, while in another, it might take 1,000 or even 100,000 iterations. It turns out to be very challenging to determine in advance how many iterations gradient descent needs to converge, which is why creating a learning curve like this is beneficial.
To assist in determining when your model is done training, you can employ an automatic convergence test.
Here is the Greek alphabet epsilon (ε). Let's define epsilon (ε) as a variable representing a small number, such as ( 0.001 ) or ( 10^-3 ). If the cost ( J ) decreases by less than this value of epsilon on one iteration, then you are likely on the flattened part of the curve, and you can declare convergence.
Remember, convergence ideally means that you have found parameters ( w ) and ( b ) that are close to the minimum possible value of ( J ). I usually find that choosing the right threshold for epsilon is quite difficult. Personally, I tend to look at graphs like this one instead of relying solely on automatic convergence tests.
By analyzing the solid figure, you may gain advanced warning if gradient descent is not functioning correctly. You've now seen what a learning curve should look like when gradient descent is operating effectively. Let’s take these insights into the next part and examine how to choose an appropriate learning rate or ( \alpha ) (α).