Optimization Using Gradient Descent in One Variable
To understand how to optimize functions using gradient descent, start from simple examples - functions of one variable. In this lab, you will implement the gradient descent method for functions with single and multiple minima, experiment with the parameters and visualize the results. This will allow you to understand the advantages and disadvantages of the gradient descent method.
Packages
Run the following cell to load the packages you'll need.
Function with One Global Minimum
The function $f(x)$ is a function of one variable that has only one minimum point (called the global minimum). However, sometimes that minimum point cannot be found analytically, i.e. by solving the equation $\frac{df}{dx}(x)=0$. In such cases it can be found using the gradient descent method.
To implement gradient descent, you need to start from some initial point $x_0$. Aiming to find a point where the derivative equals zero, you want to move "down the hill". Calculate the derivative $\frac{df}{dx}(x_k)$ (called the gradient) and step to the next point using the expression:

$$x_{k+1} = x_k - \alpha \frac{df}{dx}(x_k),$$

where $\alpha > 0$ is a parameter called a learning rate. Repeat the process iteratively. The number of iterations is usually also a parameter.
Subtracting $\alpha \frac{df}{dx}(x_k)$, you move "down the hill", against the increase of the function, toward the minimum point. So the derivative $\frac{df}{dx}(x_k)$ generally defines the direction of movement, while the parameter $\alpha$ serves as a scaling factor.
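As a minimal sketch of this update rule, a single step in Python might look like the following. The quadratic used here is an arbitrary illustrative choice, not the function used later in the lab:

```python
# One gradient descent step: x_next = x_k - alpha * df/dx(x_k).
# The quadratic below is a hypothetical stand-in with its minimum at x = 3.
def dfdx(x):
    return 2 * (x - 3)          # derivative of f(x) = (x - 3)**2

x_k = 1.0                       # current point
alpha = 0.1                     # learning rate
x_next = x_k - alpha * dfdx(x_k)
print(x_next)                   # 1.4 - a step toward the minimum at x = 3
```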
Now it's time to implement the gradient descent method and experiment with the parameters!
First, define the function $f(x)$ and its derivative $\frac{df}{dx}(x)$:
Function $f(x)$ has one global minimum. Let's plot the function:
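The lab's notebook defines and plots its own function; the exact expression is not reproduced here. As a hedged, self-contained illustration, a function with a single global minimum could be defined and plotted like this (the expression and the names `f_example_1` / `dfdx_example_1` are assumptions made for this sketch):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative choice only: e^x - log(x) is smooth, defined for x > 0, and has
# a single global minimum that cannot be written in closed form.
def f_example_1(x):
    return np.exp(x) - np.log(x)

def dfdx_example_1(x):
    return np.exp(x) - 1 / x

x = np.linspace(0.1, 2.5, 200)
plt.plot(x, f_example_1(x))
plt.xlabel("$x$")
plt.ylabel("$f(x)$")
plt.title("A function with one global minimum")
plt.show()
```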
Gradient descent can be implemented in the following function:
Note that there are three parameters in this implementation: `num_iterations`, `learning_rate`, and the initial point `x_initial`. Model parameters for methods such as gradient descent are usually found experimentally. For now, just assume that you know the parameters that will work in this model - the choice of parameters is discussed later. To optimize the function, set up the parameters and call the defined function `gradient_descent`:
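A minimal sketch of such an implementation and the corresponding call is given below. The call signature is an assumption - match it to the `gradient_descent` function actually defined in the notebook; `dfdx_example_1` refers to the illustrative derivative sketched above:

```python
def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=100):
    # Repeat the update x <- x - learning_rate * dfdx(x) a fixed number of times.
    for _ in range(num_iterations):
        x = x - learning_rate * dfdx(x)
    return x

num_iterations = 25
learning_rate = 0.1
x_initial = 1.6
x_min = gradient_descent(dfdx_example_1, x_initial, learning_rate, num_iterations)
print("Point found by gradient descent:", x_min)
```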
The code in the following cell will help you visualize the gradient descent method and understand it more deeply. After the animation ends, you can click on the plot to choose a new initial point and investigate how gradient descent behaves from there.
You can see that it works successfully here, bringing you to the global minimum point!
What if some of the parameters are changed? Will the method always work? Uncomment the lines in the cell below and rerun the code to investigate what happens if other parameter values are chosen. Try to investigate and analyze the results. You can read some comments below.
Notes related to this animation:
- Gradient descent is performed with some pauses between the iterations for visualization purposes. The actual implementation is much faster.
- The animation stops when the minimum point is reached with a certain accuracy (this might take fewer steps than `num_iterations`) to avoid long runs of the code and for teaching purposes (a sketch of such an early stop follows this list).
- Please wait for the end of the animation before making any code changes or rerunning the cell. In case of any issues, you can try to restart the Kernel and rerun the notebook.
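If you want to reproduce that early stop outside the animation, one possible sketch, under the assumption that "accuracy" means the update step becoming small, is:

```python
def gradient_descent_with_stop(dfdx, x, learning_rate, num_iterations, tol=1e-6):
    # Stop early once the update step becomes smaller than the tolerance,
    # mimicking how the animation halts when a certain accuracy is reached.
    for i in range(num_iterations):
        step = learning_rate * dfdx(x)
        x = x - step
        if abs(step) < tol:
            return x, i + 1     # converged before num_iterations was exhausted
    return x, num_iterations
```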
Comments related to the choice of the parameters in the animation above:
- Choosing `num_iterations = 25`, `learning_rate = 0.1`, `x_initial = 1.6`, you get to the minimum point successfully. In fact, it happens a little earlier - on iteration 21 - so for this choice of learning rate and initial point the number of iterations could have been smaller than 25 to save some computation time.
- Increasing the `learning_rate` to `0.3`, you can see that the method converges even faster - you need fewer iterations. But note that the steps are larger, and this may cause some problems.
- Increasing the `learning_rate` further, to `0.5`, the method doesn't converge anymore! You stepped too far away from the minimum point. So be careful: increasing the `learning_rate`, the method may converge significantly faster... or not converge at all (see the sketch after this list).
- To be "safe", you may think, why not decrease the `learning_rate`? Take it as `0.04`, keeping the rest of the parameters the same. The model will not run enough iterations to converge!
- Increasing `num_iterations`, say to `75`, the model will converge, but slowly. This would be more "expensive" computationally.
- What if you get back to the original parameters `num_iterations = 25` and `learning_rate = 0.1`, but choose some other `x_initial`, e.g. `0.05`? The function is steeper at that point, thus the gradient is larger in absolute value, and the first step is larger. But it will still work - you will get to the minimum point.
- If you take `x_initial = 0.03`, the function is even steeper, making the first step significantly larger. You risk "missing" the minimum point.
- Taking `x_initial = 0.02`, the method doesn't converge anymore...
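The comparison above can also be reproduced without the animation. Here is a quick sketch reusing the `gradient_descent` and `dfdx_example_1` sketches from earlier; since the illustrative function differs from the one defined in the notebook, the exact numbers will differ too:

```python
# Sweep the learning rates discussed above and compare the end points.
for learning_rate in [0.04, 0.1, 0.3, 0.5]:
    x_min = gradient_descent(dfdx_example_1, x=1.6,
                             learning_rate=learning_rate, num_iterations=25)
    print(f"learning_rate = {learning_rate}: final x = {x_min}")
```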
This is a very simple example, but hopefully it gives you an idea of how important the choice of the initial parameters is.
Function with Multiple Minima
Now you can take a slightly more complicated example - a function in one variable, but with multiple minima. Such an example was shown in the videos, and you can plot the function with the following code:
Function `f_example_2` and its derivative `dfdx_example_2` are pre-defined and uploaded into this notebook. At this stage, while you are mastering the optimization method, do not worry about the corresponding expressions - just concentrate on the gradient descent and the related parameters for now.
Use the following code to run gradient descent with the same `learning_rate` and `num_iterations`, but with a different starting point:
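A sketch of that comparison, reusing the `gradient_descent` sketch above together with the pre-defined `dfdx_example_2`; the parameter values and starting points below are placeholders, so substitute the values given in the notebook cell:

```python
# Two runs that differ only in the starting point. The values 1.6 and 0.8 are
# hypothetical; use the starting points from the notebook.
num_iterations = 25
learning_rate = 0.1
for x_initial in [1.6, 0.8]:
    x_min = gradient_descent(dfdx_example_2, x_initial, learning_rate, num_iterations)
    print(f"x_initial = {x_initial}: gradient descent ends at x = {x_min}")
```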
The results are different. Both times the point fell into one of the minima, but in the first run it was the global minimum, while in the second run it got "stuck" in a local one. To see a visualization of what is happening, run the code below. You can uncomment the lines to try different sets of parameters, or click on the plot to choose the initial point (after the end of the animation).
You can see that the gradient descent method is efficient - it allows you to optimize a function with a small number of calculations - but it has some drawbacks. The behavior of the method depends a lot on the choice of the initial parameters, and it is a challenge in machine learning applications to choose the "right" set of parameters to train the model!