Understanding the equivalence of SGD with momentum equations
Stochastic Gradient Descent (SGD) with momentum is a popular variant of the basic SGD algorithm, which accelerates the convergence toward the minimum of the loss function, especially in directions with persistent gradients.
To understand how different formulations of SGD with momentum are equivalent, let's first define the basic equations and then delve into their equivalence.
Basic equations of SGD with momentum
The SGD with momentum algorithm updates the parameters using two equations:

Momentum update:

v_t = μ·v_{t−1} − η·∇L(θ)

In this equation, v_t is the current update (the velocity), μ is the momentum coefficient (usually between 0 and 1), η is the learning rate, and ∇L(θ) is the gradient of the loss function with respect to the parameters θ.

Parameter update:

θ = θ + v_t

This updates the parameters in the direction of the negative gradient, accelerated by the momentum.
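As a quick illustration, here is a sketch of these two update equations on a hypothetical one-dimensional loss L(θ) = θ² (a toy example, separate from the demonstration code later in this answer):

```python
def grad(theta):
    # Gradient of the toy loss L(theta) = theta**2
    return 2 * theta

theta = 5.0   # arbitrary starting point
v = 0.0       # velocity, initialized to zero
mu = 0.9      # momentum coefficient
eta = 0.1     # learning rate

for _ in range(200):
    v = mu * v - eta * grad(theta)  # momentum update
    theta = theta + v               # parameter update

print(theta)  # approaches the minimum at theta = 0
```

Because the velocity accumulates past gradients, the iterate overshoots and oscillates around the minimum before settling, which is the characteristic behavior of momentum.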
Equivalence of different formulations
Different formulations of SGD with momentum might look different but are essentially equivalent in functionality. Let's consider two common formulations and show their equivalence:
Formulation 1:

v_t = μ·v_{t−1} − η·∇L(θ)
θ = θ + v_t

Formulation 2:

v_t = μ·v_{t−1} + ∇L(θ)
θ = θ − η·v_t
To understand how these are equivalent, let's expand formulation 2.

The update v_t is calculated as the weighted sum of the previous update and the current gradient. The parameter update step then multiplies v_t with the learning rate η.

Expanding the update step of formulation 2:

θ = θ − η·(μ·v_{t−1} + ∇L(θ)) = θ − η·μ·v_{t−1} − η·∇L(θ)

Now define a rescaled velocity ṽ_t = −η·v_t. Starting from zero velocity, ṽ_t obeys exactly the recursion of formulation 1:

ṽ_t = −η·(μ·v_{t−1} + ∇L(θ)) = μ·ṽ_{t−1} − η·∇L(θ)

and the parameter update becomes θ = θ + ṽ_t, which is the parameter update of formulation 1. This shows that the effect of the learning rate η is identical in both formulations: it can be applied inside the velocity update (formulation 1) or deferred to the parameter update (formulation 2), and the resulting sequence of parameters is the same.
The graph above compares the parameter updates over iterations for the two different formulations of SGD with momentum. In this demonstration:

Formulation 1 uses the equation v_t = μ·v_{t−1} − η·∇L(θ) and then updates the parameter with θ = θ + v_t. Formulation 2 uses v_t = μ·v_{t−1} + ∇L(θ) and updates the parameter with θ = θ − η·v_t.

The graph shows that both formulations result in the same trajectory for the parameter updates over iterations, demonstrating their functional equivalence. The key takeaway is that despite the slight difference in how the learning rate η enters the equations, both formulations produce identical parameter updates.
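This equivalence is easy to verify numerically. The sketch below runs both formulations side by side on a toy quadratic loss L(θ) = 0.5·θ² (an illustrative stand-in, not necessarily the exact setup behind the graph) and confirms the trajectories coincide:

```python
eta, mu = 0.1, 0.9

theta1, v1 = 3.0, 0.0   # formulation 1: v = mu*v - eta*g;  theta += v
theta2, v2 = 3.0, 0.0   # formulation 2: v = mu*v + g;      theta -= eta*v

for _ in range(100):
    g1 = theta1          # gradient of 0.5*theta**2 is theta
    v1 = mu * v1 - eta * g1
    theta1 += v1

    g2 = theta2
    v2 = mu * v2 + g2
    theta2 -= eta * v2

print(theta1, theta2)  # identical at every step (up to float rounding)
```

At every iteration v1 equals −η·v2, so the two parameter sequences agree up to floating-point rounding.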
Demonstration of SGD with momentum
Let's understand SGD with momentum with the help of the following code:
import numpy as np

# Sample data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])  # Features
y = np.array([3, 5, 7, 9])  # Labels

# Initialize parameters
theta = np.zeros(X.shape[1])
learning_rate = 0.01
momentum = 0.9
iterations = 1000
velocity = np.zeros_like(theta)

# Stochastic Gradient Descent with Momentum
for epoch in range(iterations):
    for i in range(len(y)):
        # Compute prediction
        prediction = np.dot(X[i], theta)

        # Compute the gradient
        gradient = (prediction - y[i]) * X[i]

        # Update velocity
        velocity = momentum * velocity - learning_rate * gradient

        # Update parameters
        theta += velocity

print("Parameters (theta):", theta)
Code explanation
Lines 4–5: Create a sample dataset for computing the SGD.
Lines 8–12: Initialize the different parameters, including theta, learning_rate, momentum, iterations, and velocity.

Lines 15–27: This segment runs SGD with momentum for iterations epochs. Here, we use formulation 1, where learning_rate is multiplied by the gradient inside the velocity update. However, both formulations yield the same results, as discussed.

Line 29: We print the parameters theta updated after the specified number of iterations.
Conclusion
Both formulations of SGD with momentum are equivalent in how they affect parameter updates. The choice between them often depends on personal preference or specific implementation details in different libraries. The key idea of momentum is to combine the current gradient direction with the previous update direction.
This approach smooths out the updates and can lead to faster convergence.