
Checking Gradient Descent for Convergence

Introduction to Gradient Descent Convergence

When running gradient descent, how can you tell if it is converging? Specifically, you want to determine whether it's helping you find parameters close to the global minimum of the cost function. By learning to recognize what a well-running implementation of gradient descent looks like, we will also, in a later section, be better able to choose a good learning rate \( \alpha \). Let's take a closer look.

Gradient Descent Rule

As a reminder, here's the gradient descent rule. One of the key choices is the learning rate \( \alpha \). To ensure that gradient descent is functioning correctly, I often plot the cost function \( J \), which is calculated on the training set, against the number of iterations.
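For reference, the standard simultaneous-update rule (reconstructed here, since the figure that originally showed it is not reproduced) is:

\[
\begin{aligned}
w &:= w - \alpha \frac{\partial}{\partial w} J(w, b) \\
b &:= b - \alpha \frac{\partial}{\partial b} J(w, b)
\end{aligned}
\]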

Plotting the Cost Function

I plot the value of \( J \) at each iteration of gradient descent.

*Figure CGDC2: the learning curve, plotting the cost \( J \) against the number of iterations of gradient descent.*

In this plot, the horizontal axis represents the number of iterations of gradient descent that you've run so far. It's essential to note that the horizontal axis shows iterations of gradient descent, not a parameter like \( w \) or \( b \). This curve is referred to as a learning curve.
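To make this concrete, here is a minimal Python sketch of how such a learning curve can be produced. The quadratic `compute_cost` and its gradients are toy stand-ins for your actual model's cost function, not part of the original text:

```python
import matplotlib.pyplot as plt

# Toy stand-in for the real cost on your training set (assumption for illustration)
def compute_cost(w, b):
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

def compute_gradients(w, b):
    return 2.0 * (w - 3.0), 2.0 * (b + 1.0)

def gradient_descent_with_history(w, b, alpha, num_iters):
    """Run gradient descent, recording the cost J at every iteration."""
    cost_history = []
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(w, b)
        # Simultaneous update of both parameters
        w, b = w - alpha * dj_dw, b - alpha * dj_db
        cost_history.append(compute_cost(w, b))
    return w, b, cost_history

w, b, cost_history = gradient_descent_with_history(0.0, 0.0, alpha=0.1, num_iters=400)

# The learning curve: cost J versus the number of iterations
plt.plot(cost_history)
plt.xlabel("Iterations of gradient descent")
plt.ylabel("Cost J")
plt.show()
```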

Analyzing the Learning Curve

There are several types of learning curves used in machine learning, which you will see later in this documentation.

Understanding Iteration Points

Concretely, if you look at a specific point on the curve:

*Figure CGDC3: the point on the learning curve after 100 iterations.*

This point indicates that after running gradient descent for 100 iterations (100 simultaneous updates of the parameters), you have learned values for \( w \) and \( b \). If you compute the cost \( J \) for those parameters obtained after 100 iterations, you will get this corresponding value for \( J \).

*Figure CGDC4: the point on the learning curve after 200 iterations.*

This point represents the value of \( J \) for the parameters obtained after 200 iterations of gradient descent. By examining this graph, you can see how your cost \( J \) changes after each iteration of gradient descent.

If gradient descent is working properly, then the cost \( J \) should decrease after every single iteration.

*Figure CGDC5: a learning curve on which \( J \) decreases at every iteration.*

If \( J \) ever increases after one iteration, it usually indicates that either \( \alpha \) is poorly chosen (typically too large), or there may be a bug in the code.
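One simple way to catch this programmatically is to scan the recorded costs for any increase between consecutive iterations. A minimal sketch, reusing the `cost_history` list from the plotting sketch above:

```python
# Flag any iteration where the cost went up instead of down.
for i in range(1, len(cost_history)):
    if cost_history[i] > cost_history[i - 1]:
        print(f"Warning: J increased at iteration {i} "
              f"({cost_history[i - 1]:.6f} -> {cost_history[i]:.6f}); "
              "alpha may be too large, or there may be a bug.")
```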

Identifying Convergence

Another useful insight you can gain from this curve is:

*Figure CGDC4: the learning curve as it begins to level off.*

By the time you reach around 300 iterations, the cost \( J \) appears to level off and stops decreasing significantly.

By 400 iterations, it seems the curve has flattened out, indicating that gradient descent has more or less converged.

*Figure CGDC6: the flattened learning curve once gradient descent has converged.*

Variation in Convergence

The number of iterations required for gradient descent to converge can vary significantly between different applications. In one scenario, it may converge after just 30 iterations, while in another, it might take 1,000 or even 100,000 iterations. It turns out to be very challenging to determine in advance how many iterations gradient descent needs to converge, which is why creating a learning curve like this is beneficial.

Automatic Convergence Test

To assist in determining when your model is done training, you can employ an automatic convergence test.

Here, ε is the Greek letter epsilon. Let's define ε as a variable representing a small number, such as \( 0.001 \) (that is, \( 10^{-3} \)). If the cost \( J \) decreases by less than this value of ε on one iteration, then you are likely on the flattened part of the curve, and you can declare convergence.

*Figure CGDC7: the automatic convergence test with threshold ε.*
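As a rough sketch of how such a test might look in code (reusing the toy `compute_cost` and `compute_gradients` from earlier; the loop structure is an assumption for illustration, not the author's implementation):

```python
epsilon = 1e-3  # small threshold, e.g. 0.001 as in the text

w, b = 0.0, 0.0
prev_cost = compute_cost(w, b)
for i in range(100_000):
    dj_dw, dj_db = compute_gradients(w, b)
    w, b = w - 0.1 * dj_dw, b - 0.1 * dj_db  # simultaneous update
    cost = compute_cost(w, b)
    # Automatic convergence test: declare convergence when the
    # decrease in J on one iteration falls below epsilon.
    if prev_cost - cost < epsilon:
        print(f"Converged after {i + 1} iterations (J = {cost:.6f})")
        break
    prev_cost = cost
```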

Conclusion on Convergence

Remember, convergence ideally means that you have found parameters \( w \) and \( b \) that are close to the minimum possible value of \( J \). I usually find that choosing the right threshold for ε is quite difficult. Personally, I tend to look at graphs like this one instead of relying solely on automatic convergence tests.

By examining the learning curve, you may also get advance warning if gradient descent is not functioning correctly. You've now seen what a learning curve should look like when gradient descent is operating effectively. Let's take these insights into the next part and examine how to choose an appropriate learning rate \( \alpha \).
