What is the Adam algorithm?

Adam, short for adaptive moment estimation, is an optimization algorithm designed to converge quickly to a minimum of the cost function while staying robust to noisy gradients and keeping memory requirements modest. It combines ideas from two other popular optimization algorithms: stochastic gradient descent (SGD) with momentum and root mean square propagation (RMSprop). It does so by keeping exponentially weighted averages of the past gradients (the first moment) and the past squared gradients (the second moment) of each parameter. It then uses these estimates to update the parameters in a way that considers both the direction and the magnitude of the gradient.
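
The update described above can be sketched in plain Python. This is a minimal illustration for a single scalar parameter, not the implementation used by any particular library. The function name adam_step and the toy objective f(w) = (w - 3)^2 are assumptions made for this example, and the default values of beta1, beta2, and eps are the commonly used ones. The bias-correction step is part of the standard algorithm, although the text above does not spell it out.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter and return the new state."""
    # Exponentially weighted average of past gradients (first moment).
    m = beta1 * m + (1 - beta1) * grad
    # Exponentially weighted average of past squared gradients (second moment).
    v = beta2 * v + (1 - beta2) * grad * grad
    # Both averages start at zero, so correct their bias toward zero (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # m_hat gives the direction; dividing by sqrt(v_hat) rescales the step size.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem (an assumption for illustration): minimize f(w) = (w - 3)^2.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3)  # derivative of (w - 3)^2
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)

print(f"w after 2000 steps: {w:.4f}")
```

On this toy objective, w should end up close to the minimum at 3. In a real model, the same update is applied independently to every parameter.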

Why use the Adam optimization algorithm?

For many reasons, the Adam algorithm is a common choice in deep learning:

Low memory requirement: Adam only stores running estimates of the first and second moments of the gradient for each parameter, so its extra memory is fixed at two values per parameter and does not grow with the number of training steps. It never needs to keep a history of past gradients.
Robust to noisy gradients: A noisy gradient is one with a high level of random variation, for example, because it is computed on a small mini-batch. Adam keeps running averages of the gradients and of their squares, which smooths out this noise and scales down the step for parameters whose gradients fluctuate strongly. This can also help the optimization process avoid getting stuck in poor local optima (local optima that are not as good as other local optima or the global optimum). This is why Adam is often used in problems involving noisy gradients.
Dynamic learning rates: Adam adjusts the effective step size of each parameter dynamically, based on the running estimates of its gradients. This is in contrast to using one fixed learning rate for every parameter throughout the optimization process.
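
To make the idea of a dynamic step size concrete, the sketch below compares the size of the very first update for plain SGD and for Adam when the gradient is tiny, moderate, or huge. It is a simplified illustration in plain Python. The helper name first_adam_step and the chosen gradient values are assumptions for this example, and both moment estimates are assumed to start at zero.

```python
import math

def first_adam_step(grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update (t = 1) when both moments start at zero."""
    m = (1 - beta1) * grad          # first moment after one step
    v = (1 - beta2) * grad * grad   # second moment after one step
    m_hat = m / (1 - beta1)         # bias correction with t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

lr = 0.01
adam_steps = []
for grad in (0.001, 1.0, 1000.0):
    sgd_step = lr * grad                  # SGD: step is proportional to the gradient
    adam = first_adam_step(grad, lr=lr)   # Adam: step is rescaled by the gradient size
    adam_steps.append(adam)
    print(f"grad={grad:>8}: SGD step={sgd_step:.5f}, Adam step={adam:.5f}")
```

With SGD, the step grows in proportion to the gradient, while the first Adam step stays close to the learning rate in all three cases. Over later steps, the ratio between the two running averages changes, so the effective step size adapts per parameter.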

Limitations of the Adam algorithm

Although it is widely used, the Adam optimization algorithm has some limitations to consider when choosing an optimizer for a specific task. Some of these limitations are discussed below:

Overfitting on small datasets: A model trained with Adam can sometimes overfit the training dataset, especially when the training data is small. This leads to poor generalization performance on test data.
Sensitivity to hyperparameters: Adam can be sensitive to the choice of hyperparameters, namely the learning rate and the decay rates of the running averages. When these are poorly chosen, the optimization process may converge slowly or not at all. It is, therefore, advisable to tune these values carefully and monitor the training loss while doing so.
High cost of computation: Adam does more work per update than plain SGD because it has to maintain and apply two running averages for every parameter. For large deep learning models, this adds to the computational cost (time, processing power, and memory) of training.
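
The sensitivity to the learning rate can be seen even on a very small problem. The sketch below runs the same number of Adam steps with two different learning rates and reports how far each run ends from the minimum. It is an illustration only: the helper run_adam, the toy objective f(w) = (w - 3)^2, and the two learning rates are assumptions chosen for this example.

```python
import math

def run_adam(lr, steps=500, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize f(w) = (w - 3)^2 with Adam and return the final distance from 3."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2 * (w - 3)                          # derivative of (w - 3)^2
        m = beta1 * m + (1 - beta1) * grad          # running average of gradients
        v = beta2 * v + (1 - beta2) * grad * grad   # running average of squared gradients
        m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return abs(w - 3)

# Same optimizer, same number of steps, different learning rates.
errors = {lr: run_adam(lr) for lr in (0.0001, 0.01)}
for lr, err in errors.items():
    print(f"learning_rate={lr}: distance from minimum after 500 steps = {err:.4f}")
```

With the smaller learning rate, the parameter barely moves in 500 steps, while the larger one gets much closer to the minimum. On real models, the gap can show up as many extra epochs of training.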

Implementing the Adam optimization algorithm

Let’s take a look at the code below, which trains a small model using the Adam optimizer provided by TensorFlow’s Keras API.

import os
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'
import tensorflow as tf
# Generate some random training data
x_train = tf.random.normal(shape=[100, 1])
y_train = 3 * x_train + tf.random.normal(shape=[100, 1], stddev=0.1)
# Define the model architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=1, input_shape=[1])
])
# Define the optimizer and compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mse')
# Train the model for 100 epochs
history = model.fit(x_train, y_train, epochs=100, verbose=0)
# Plot the training loss over time
import matplotlib.pyplot as plt
plt.plot(history.history['loss'])
plt.title('Training loss over time')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

Code explanation

Lines 1–3: We set an environment variable that makes protobuf use its pure-Python implementation, and then import the TensorFlow library to use its machine learning and deep learning capabilities.
Lines 5–6: We generate 100 samples of random training data with one input feature and one output target.
Lines 8–10: We define a simple linear regression model with a single dense layer using the Sequential API from Keras.
Lines 12–13: We specify the Adam optimization algorithm with a learning rate of 0.01 and compile the model using the mean squared error (MSE) loss function.
Line 15: We train the model for 100 epochs on the training data using the fit method, which updates the model’s weights and biases based on the Adam optimizer and the training data.
Lines 17–22: We plot the training loss over time using the history object returned by the fit method and the plot function from the matplotlib library. This provides a visual representation of how well the model fits the training data over the course of the training process.
