Understanding the equivalence of SGD with momentum equations
Stochastic Gradient Descent (SGD) with momentum is a popular variant of the basic SGD algorithm, which accelerates the convergence toward the minimum of the loss function, especially in directions with persistent gradients.
To understand how different formulations of SGD with momentum are equivalent, let's first define the basic equations and then delve into their equivalence.
Basic equations of SGD with momentum
The SGD with momentum algorithm updates the parameters using two equations:

Momentum update:

v_t = μ·v_{t−1} − η·∇L(θ)

In this equation, v_t is the current update (the velocity), μ is the momentum coefficient (usually between 0 and 1), η is the learning rate, and ∇L(θ) is the gradient of the loss function with respect to the parameters θ.

Parameter update:

θ = θ + v_t

This updates the parameters in the direction of the negative gradient, accelerated by the momentum.
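As a quick illustration, here is a sketch of these two update equations on a hypothetical one-dimensional loss L(θ) = θ² (a toy example, separate from the demonstration code later in this answer):

```python
def grad(theta):
    # Gradient of the toy loss L(theta) = theta**2
    return 2 * theta

theta = 5.0   # arbitrary starting point
v = 0.0       # velocity, initialized to zero
mu = 0.9      # momentum coefficient
eta = 0.1     # learning rate

for _ in range(200):
    v = mu * v - eta * grad(theta)  # momentum update
    theta = theta + v               # parameter update

print(theta)  # approaches the minimum at theta = 0
```

Because the velocity accumulates past gradients, the iterate overshoots and oscillates around the minimum before settling, which is the characteristic behavior of momentum.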
Equivalence of different formulations
Different formulations of SGD with momentum might look different but are essentially equivalent in functionality. Let's consider two common formulations and show their equivalence:
Formulation 1:

v_t = μ·v_{t−1} − η·∇L(θ)
θ = θ + v_t

Formulation 2:

v_t = μ·v_{t−1} + ∇L(θ)
θ = θ − η·v_t
To understand how these are equivalent, let's expand formulation 2.

The update v_t is calculated as the weighted sum of the previous update and the current gradient. The parameter update step then multiplies v_t with the learning rate η.

Expanding the update step of formulation 2:

θ = θ − η·(μ·v_{t−1} + ∇L(θ)) = θ − η·μ·v_{t−1} − η·∇L(θ)

Now define a rescaled velocity ṽ_t = −η·v_t. Starting from zero velocity, ṽ_t obeys exactly the recursion of formulation 1:

ṽ_t = −η·(μ·v_{t−1} + ∇L(θ)) = μ·ṽ_{t−1} − η·∇L(θ)

and the parameter update becomes θ = θ + ṽ_t, which is the parameter update of formulation 1. This shows that the effect of the learning rate η is identical in both formulations: it can be applied inside the velocity update (formulation 1) or deferred to the parameter update (formulation 2), and the resulting sequence of parameters is the same.
The graph above compares the parameter updates over iterations for the two different formulations of SGD with momentum. In this demonstration:

Formulation 1 uses the equation v_t = μ·v_{t−1} − η·∇L(θ) and then updates the parameter with θ = θ + v_t. Formulation 2 uses v_t = μ·v_{t−1} + ∇L(θ) and updates the parameter with θ = θ − η·v_t.

The graph shows that both formulations result in the same trajectory for the parameter updates over iterations, demonstrating their functional equivalence. The key takeaway is that despite the slight difference in how the learning rate η enters the equations, both formulations produce identical parameter updates.
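This equivalence is easy to verify numerically. The sketch below runs both formulations side by side on a toy quadratic loss L(θ) = 0.5·θ² (an illustrative stand-in, not necessarily the exact setup behind the graph) and confirms the trajectories coincide:

```python
eta, mu = 0.1, 0.9

theta1, v1 = 3.0, 0.0   # formulation 1: v = mu*v - eta*g;  theta += v
theta2, v2 = 3.0, 0.0   # formulation 2: v = mu*v + g;      theta -= eta*v

for _ in range(100):
    g1 = theta1          # gradient of 0.5*theta**2 is theta
    v1 = mu * v1 - eta * g1
    theta1 += v1

    g2 = theta2
    v2 = mu * v2 + g2
    theta2 -= eta * v2

print(theta1, theta2)  # identical at every step (up to float rounding)
```

At every iteration v1 equals −η·v2, so the two parameter sequences agree up to floating-point rounding.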
Demonstration of SGD with momentum
Let's understand SGD with momentum with the help of the following code:
import numpy as np

# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])  # Features
y = np.array([3, 5, 7, 9])  # Labels

# Initialize parameters
theta = np.zeros(X.shape[1])
learning_rate = 0.01
momentum = 0.9
iterations = 1000
velocity = np.zeros_like(theta)

# Stochastic Gradient Descent with Momentum
for epoch in range(iterations):
    for i in range(len(y)):
        # Compute prediction
        prediction = np.dot(X[i], theta)

        # Compute the gradient
        gradient = (prediction - y[i]) * X[i]

        # Update velocity
        velocity = momentum * velocity - learning_rate * gradient

        # Update parameters
        theta += velocity

print("Parameters (theta):", theta)
Code explanation
Lines 4–5: Create a sample dataset for computing the SGD.
Lines 8–12: Initialize the different parameters, including theta, learning_rate, momentum, iterations, and velocity.

Lines 15–27: This segment runs SGD with momentum for iterations epochs. Here, we use formulation 1, where learning_rate is multiplied by the gradient inside the velocity update. However, both formulations yield the same results, as discussed.

Line 29: We print the parameters theta updated after the specified number of iterations.
Conclusion
Both formulations of SGD with momentum are equivalent in how they affect parameter updates. The choice between them often depends on personal preference or specific implementation details in different libraries. The key idea of momentum is to combine the current gradient direction with the previous update direction.
This approach smooths out the updates and can lead to faster convergence.