In this lab you will implement Newton's method to optimize functions in one and two variables. You will also compare it with gradient descent, experiencing the advantages and disadvantages of each method.
Packages
Run the following cell to load the packages you'll need.
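The exact package list is not shown here; a minimal sketch of such a cell, assuming NumPy for the numerics and Matplotlib for the plots, could look like this:

```python
import numpy as np                # numerical computations (exp, log, arrays)
import matplotlib.pyplot as plt   # plotting the functions
```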
You will use Newton's method to optimize a function $f(x)$. Aiming to find a point where the derivative equals zero, you start from some initial point $x_0$, calculate the first and second derivatives ($f'(x_0)$ and $f''(x_0)$), and step to the next point using the expression:
$$x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)},\tag{1}$$
Repeat the process iteratively. The number of iterations $n$ is usually also a parameter.
Let's optimize the function $f(x)=e^x - \log(x)$ (defined for $x>0$) using Newton's method. To implement it in the code, define the function $f(x)=e^x - \log(x)$, its first derivative $f'(x)=e^x - \frac{1}{x}$, and its second derivative $f''(x)=e^x + \frac{1}{x^2}$:
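A minimal sketch of these definitions, assuming NumPy has been imported as np in the packages cell (the function names here are assumptions, not necessarily the ones used in the lab's own cells):

```python
def f_example_1(x):
    # f(x) = e^x - log(x), defined for x > 0
    return np.exp(x) - np.log(x)

def dfdx_example_1(x):
    # f'(x) = e^x - 1/x
    return np.exp(x) - 1/x

def d2fdx2_example_1(x):
    # f''(x) = e^x + 1/x^2
    return np.exp(x) + 1/(x**2)
```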
Plot the function to visualize the global minimum:
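For example, assuming the definitions sketched above, a simple plot over a grid of positive x values:

```python
x_range = np.linspace(0.1, 2.5, 200)   # stay away from x = 0, where log(x) blows up
plt.plot(x_range, f_example_1(x_range))
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("f(x) = exp(x) - log(x)")
plt.show()
```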
Implement Newton's method described above:
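A sketch of such an implementation, directly applying update (1) at each step (the name newtons_method and the per-iteration printout are assumptions):

```python
def newtons_method(dfdx, d2fdx2, x, num_iterations=100):
    # Repeatedly apply update (1): x <- x - f'(x) / f''(x).
    for i in range(num_iterations):
        x = x - dfdx(x) / d2fdx2(x)
        print(f"Iteration {i + 1}: x = {x:.7f}")
    return x
```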
In addition to the first and second derivatives, there are two other parameters in this implementation: the number of iterations num_iterations and the initial point x. To optimize the function, set up the parameters and call the Newton's method function defined above:
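For instance, with the (assumed) names used in the sketches above:

```python
num_iterations_example_1 = 25
x_initial = 1.6
newtons_example_1 = newtons_method(dfdx_example_1, d2fdx2_example_1,
                                   x_initial, num_iterations_example_1)
```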
You can see that, starting from the initial point $x_0=1.6$, Newton's method converges after 6 iterations. You could actually exit the loop when there is no significant change in x from one step to the next (or when the first derivative is close to zero).
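A possible variant of the sketch above with such a stopping condition (the tolerance value is an arbitrary choice):

```python
def newtons_method_with_stopping(dfdx, d2fdx2, x, tolerance=1e-8, max_iterations=100):
    # Stop as soon as the update no longer changes x significantly,
    # which also means f'(x) / f''(x) is close to zero.
    for i in range(max_iterations):
        x_new = x - dfdx(x) / d2fdx2(x)
        if abs(x_new - x) < tolerance:
            return x_new
        x = x_new
    return x
```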
What if gradient descent were used starting from the same initial point?
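A minimal gradient descent sketch for comparison (again, the names are assumptions); note that only the first derivative is needed:

```python
def gradient_descent(dfdx, x, learning_rate=0.1, num_iterations=100):
    # Update rule: x <- x - learning_rate * f'(x); no second derivative required.
    for i in range(num_iterations):
        x = x - learning_rate * dfdx(x)
        print(f"Iteration {i + 1}: x = {x:.7f}")
    return x

gradient_descent(dfdx_example_1, x=1.6, learning_rate=0.1, num_iterations=25)
```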
Gradient descent has an extra parameter, learning_rate. If you set it to 0.1 in this example, the method will start to converge after about 15 iterations (aiming for an accuracy of 4-5 decimal places). If you increase it to 0.2, gradient descent will converge within about 12 iterations, which is still slower than Newton's method.
So, these are the disadvantages of gradient descent in comparison with Newton's method: there is an extra parameter to control, and it converges more slowly. However, it has an advantage - at each step you do not need to calculate the second derivative, which in more complicated cases is quite computationally expensive to find. So, one step of gradient descent is cheaper to perform than one step of Newton's method.
This is the reality of numerical optimization - convergence and the actual result depend on the initial parameters. Also, there is no "perfect" algorithm - every method has its advantages and disadvantages.
In the case of a function of two variables, Newton's method requires even more computations. Starting from the initial point $(x_0, y_0)$, the step to the next point should be done using the expression:
$$\begin{bmatrix} x_1 \\ y_1 \end{bmatrix} = \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} - H^{-1}(x_0, y_0) \nabla f(x_0, y_0),\tag{2}$$
where $H^{-1}(x_0, y_0)$ is the inverse of the Hessian matrix at the point $(x_0, y_0)$ and $\nabla f(x_0, y_0)$ is the gradient at that point.
Let's implement that in the code. Define the function $f(x, y)$ as in the videos, along with its gradient and Hessian:
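The exact function from the videos is not reproduced here; as a hypothetical stand-in, the sketch below shows the shapes these definitions take, with the gradient returned as a 2x1 column vector and the Hessian as a 2x2 matrix:

```python
def f_example_2(x, y):
    # Hypothetical stand-in function (not the one from the videos);
    # it is convex with its minimum at (0, 0).
    return x**4 + 2*x**2 + y**2 + x*y

def grad_f_example_2(x, y):
    # Gradient: column vector of first partial derivatives.
    return np.array([[4*x**3 + 4*x + y],
                     [2*y + x]])

def hessian_f_example_2(x, y):
    # Hessian: matrix of second partial derivatives.
    return np.array([[12*x**2 + 4, 1],
                     [1,           2]])
```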
Newton's method (2) is implemented in the following function:
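A sketch of such a function, assuming the gradient and Hessian are supplied as callables returning a 2x1 vector and a 2x2 matrix, as in the stand-in definitions above:

```python
def newtons_method_2(grad_f, hessian_f, x_y, num_iterations=100):
    # x_y is a 2x1 column vector [[x], [y]]; apply update (2) at each step.
    for i in range(num_iterations):
        x, y = x_y.flatten()
        x_y = x_y - np.linalg.inv(hessian_f(x, y)) @ grad_f(x, y)
        print(f"Iteration {i + 1}: x = {x_y[0, 0]:.7f}, y = {x_y[1, 0]:.7f}")
    return x_y
```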
Now run the following code to find the minimum:
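For example, starting from the point $(4, 4)$ with the stand-in definitions above:

```python
num_iterations_example_2 = 25
x_y_initial = np.array([[4.0], [4.0]])
newtons_example_2 = newtons_method_2(grad_f_example_2, hessian_f_example_2,
                                     x_y_initial, num_iterations_example_2)
```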
In this example, starting from the initial point $(4, 4)$, it converges after about 9 iterations. Now try to compare it with gradient descent:
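A comparable gradient descent sketch in two variables (the learning rate and iteration count here are arbitrary choices for the stand-in function):

```python
def gradient_descent_2(grad_f, x_y, learning_rate=0.1, num_iterations=100):
    # Only the gradient is needed: no Hessian and no matrix inversion.
    for i in range(num_iterations):
        x, y = x_y.flatten()
        x_y = x_y - learning_rate * grad_f(x, y)
        print(f"Iteration {i + 1}: x = {x_y[0, 0]:.7f}, y = {x_y[1, 0]:.7f}")
    return x_y

gradient_descent_2(grad_f_example_2, np.array([[4.0], [4.0]]),
                   learning_rate=0.01, num_iterations=300)
```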
Obviously, gradient descent converges much more slowly than Newton's method, and if you try to increase the learning rate, it might not converge at all. This again illustrates the disadvantages of gradient descent in comparison with Newton's method. However, note that Newton's method required calculating the inverse of the Hessian matrix, which is very computationally expensive when you have, say, a thousand parameters.