What is the conjugate gradient method?
Artificial intelligence has seen a huge rise in the past few decades. Much of this rise is due to data-driven intelligence, also known as machine learning. Machine learning models generally rely on optimization algorithms to minimize the error on the given data. In layman's terms, we can think of a machine learning model as a mountaineer trying to reach the ground, with optimization algorithms giving the mountaineer a direction.
Optimization
In machine learning, optimization is the process of improving a model's performance by minimizing its error. Gradient descent is one such method used in training machine learning models. It works by repeatedly moving in the direction of the negative gradient. This method has a significant drawback: it assumes that following the negative gradient always leads to a lower error, so it can get stuck in local minima. Using our analogy above, the mountaineer cannot see behind a peak and may end up following a path that does not lead to the ground.
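As a rough illustration (not part of the original example), here is a minimal gradient descent sketch; the function, starting point, and learning rate below are made-up choices:

def gradient_descent(grad, x, learning_rate=0.1, steps=100):
    # Repeatedly step in the direction of the negative gradient.
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# Example: minimize f(x) = (x - 3)**2, whose gradient is 2*(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x=0.0))  # converges toward 3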
Conjugate gradient
The conjugate gradient method is an optimization method that works on the gradient principle. It recasts the minimization problem as a linear system, so the method is essentially solving the equation

Ax = b

The function being minimized can be represented as a sum of a quadratic term and a linear term:

f(x) = \frac{1}{2} x^T A x - b^T x

The matrix A is symmetric, i.e., it can be represented as A = A^T (and, for the method to apply, positive-definite).

Building on these assumptions, setting the gradient of f(x) to zero shows that minimizing f(x) is equivalent to solving the linear system:

\nabla f(x) = Ax - b = 0
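To make this concrete, the following is a minimal sketch of the conjugate gradient iteration for a symmetric positive-definite system; the matrix A and vector b are illustrative values, not from the original article.

import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    # Solve Ax = b for symmetric positive-definite A, which is
    # equivalent to minimizing f(x) = 0.5 * x^T A x - b^T x.
    x = np.zeros_like(b)
    r = b - A @ x              # residual, i.e., the negative gradient of f
    p = r.copy()               # first search direction
    rs_old = r @ r
    for _ in range(len(b)):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # optimal step size along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # residual small enough: converged
            break
        p = r + (rs_new / rs_old) * p  # next direction, conjugate to the previous ones
        rs_old = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # symmetric positive-definite
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))  # matches np.linalg.solve(A, b)

Note that, in exact arithmetic, the solution is reached in at most len(b) iterations, which is one of the defining properties of the method.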
Limitations
The conjugate gradient method has proven powerful for solving systems of equations. However, it is not considered the best choice for machine learning models. Firstly, the method tends to overfit, because the goal of machine learning is to generalize rather than to optimize exactly on specific data. Secondly, machine learning problems are usually posed in stochastic settings, where gradients are estimated from noisy samples, and optimization algorithms like stochastic gradient descent are better suited to such noise.
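For contrast, a bare-bones stochastic gradient descent loop might look like the sketch below; the synthetic data, model, and hyperparameters are made-up choices for illustration, not part of the original example.

import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data: y = 2*x plus a little noise.
x_data = rng.normal(size=100)
y_data = 2 * x_data + 0.1 * rng.normal(size=100)

w = 0.0  # weight to be learned
learning_rate = 0.01
for epoch in range(50):
    for i in rng.permutation(len(x_data)):
        # Gradient of the single-sample squared error (w*x_i - y_i)**2.
        grad = 2 * (w * x_data[i] - y_data[i]) * x_data[i]
        w -= learning_rate * grad

print(w)  # approaches the true slope of 2

Each update uses only one noisy sample, which is why this style of method copes well with the stochastic settings described above.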
Note: You can read more about optimization algorithms in this Answer.
Code
The following code finds the minimum value of a simple quadratic function using the numpy and scipy libraries.
import numpy as np
from scipy import optimize

args = (20, 50)  # values for a and b

def function_to_minimize(x, *args):
    a, b = args
    return a*x**2 + b*x  # function to be minimized (ax^2 + bx)

x0 = np.asarray(0)  # initial value of x

result = optimize.fmin_cg(function_to_minimize, x0, args=args, disp=False)
print("The value for x is:", round(result[0], 4))
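With a = 20 and b = 50, the program prints approximately -1.25, which matches the closed-form minimizer x = -b / (2a) of the quadratic ax^2 + bx.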
Code explanation
Line 4: We set the values for the inputs a and b.
Lines 6–8: The conjugate gradient function in scipy.optimize requires the function to be minimized as a parameter. We define that function here.
Line 10: We initialize the starting value for x. This value is updated during the optimization process.
Lines 12–13: We call the scipy conjugate gradient function, which outputs the minimum value of x for our equation ax^2 + bx, and print the result.
Conclusion
The conjugate gradient algorithm is a powerful method for applications involving systems of equations. Despite their drawbacks, gradient-descent-based methods are considered preferable for most machine learning problems.