The Adam algorithm (adaptive moment estimation) is a popular optimization algorithm used during neural network training to update the network's parameters (its weights and biases). It is widely used in natural language processing and computer vision.
Adam is designed to converge quickly and efficiently toward a minimum of the cost function while remaining robust to noisy gradients and keeping memory requirements low. It combines ideas from two other popular optimization algorithms: stochastic gradient descent (SGD) with momentum and root mean square propagation (RMSprop). It does so by computing exponentially weighted averages of the past gradients and of the past squared gradients of each parameter. It then uses these estimates to update the parameters in a way that accounts for both the direction and the magnitude of the gradient.
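The update described above can be written out explicitly. The notation is an assumption of this sketch rather than something defined in the text: $g_t$ is the gradient at step $t$, $m_t$ and $v_t$ are the running averages of the gradients and the squared gradients (both starting at zero), $\beta_1$ and $\beta_2$ are their decay rates, $\alpha$ is the learning rate, $\theta$ is the parameter being updated, and $\epsilon$ is a small constant that guards against division by zero.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

The hatted terms correct the bias toward zero that comes from initializing $m_0 = v_0 = 0$. The numerator $\hat{m}_t$ supplies the direction of the step, and dividing by $\sqrt{\hat{v}_t}$ scales the step by the recent magnitude of the gradient.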
Several properties make the Adam algorithm a common choice in deep learning:
Low memory requirement: Adam only stores running estimates of the first and second moments of the gradients for each parameter, so its memory cost stays fixed at two extra values per parameter no matter how long training runs. It never needs to keep the full history of gradients.
Robust to noisy gradients: A noisy gradient is one with a high level of random variation. Adam keeps a running estimate of the second moment of the gradients, which dampens the influence of noisy gradients. This can also help prevent the optimization process from getting stuck in poor local optima (local optima that are worse than other local optima or the global optimum). For this reason, Adam is often used in problems involving noisy gradients.
Dynamic learning rates: Adam adjusts the effective learning rate of each parameter dynamically, based on that parameter's gradients, instead of using one fixed learning rate throughout the optimization process.
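The dynamic learning rate can be illustrated with a single Adam step written in plain Python. This is a minimal sketch, not a library implementation: the helper name adam_step, the two example gradient values, and the hyperparameter defaults are all assumptions chosen for illustration. It compares the step Adam takes with the step a fixed-learning-rate gradient descent update would take for two parameters whose gradients differ by six orders of magnitude.

```python
import math

def adam_step(grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (update, m, v) for one scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return update, m, v

# Two hypothetical parameters whose gradients differ by six orders of magnitude.
small_grad, large_grad = 1e-3, 1e3

adam_small, _, _ = adam_step(small_grad, m=0.0, v=0.0, t=1)
adam_large, _, _ = adam_step(large_grad, m=0.0, v=0.0, t=1)

# Plain gradient descent with the same fixed learning rate, for comparison.
sgd_small = 0.01 * small_grad
sgd_large = 0.01 * large_grad

print(f"Adam steps: {adam_small:.6f} and {adam_large:.6f}")
print(f"SGD steps:  {sgd_small:.6f} and {sgd_large:.6f}")
```

On the first step, Adam moves both parameters by roughly the learning rate, while the fixed-rate update scales directly with the size of the gradient. The same sketch also shows the memory cost: the only state carried between steps is one m and one v per parameter.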
Although it is widely used, the Adam optimization algorithm has some limitations that should be considered when choosing an optimizer for a specific task. Some of these limitations are discussed below:
Overfitting on small datasets: Adam can sometimes overfit a model to the training dataset, especially when the training data is small. This leads to poor generalization performance on test data.
Sensitivity to hyperparameters: Adam can be very sensitive to the choice of hyperparameters, namely the learning rate and the decay rates of the running averages. When these are poorly chosen, the optimization process may converge slowly. It is therefore advisable to tune these hyperparameters carefully and to monitor their effect on training.
High cost of computation: Training a deep learning model with Adam can carry a high computational cost (time, processing power, etc.), since the optimizer maintains and updates two running averages for every parameter at each step.
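The sensitivity to the learning rate can be seen on a toy problem. The sketch below is an illustration under stated assumptions, not a benchmark: the function name minimize_with_adam, the loss f(x) = x**2, the starting point, the step budget, and the three learning rates are all hypothetical choices. It runs the same number of Adam steps with each learning rate and records the loss that remains.

```python
import math

def minimize_with_adam(lr, steps=200, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize the toy loss f(x) = x**2 from x = 5.0 and return the final x."""
    x, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * x                              # derivative of x**2
        m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
        v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias corrections
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

final_loss = {}
for lr in (0.001, 0.01, 0.1):
    x_final = minimize_with_adam(lr)
    final_loss[lr] = x_final ** 2
    print(f"learning_rate={lr}: final x = {x_final:.4f}, loss = {final_loss[lr]:.4f}")
```

With the same step budget, the smallest learning rate barely moves away from the starting point, which is the slow convergence described above. In Keras, the corresponding knobs are the learning_rate, beta_1, beta_2, and epsilon arguments of tf.keras.optimizers.Adam.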
Let’s take a look at the code below, which shows how to train a model with the Adam optimization algorithm.
```python
import os
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'
import tensorflow as tf

# Generate some random training data
x_train = tf.random.normal(shape=[100, 1])
y_train = 3 * x_train + tf.random.normal(shape=[100, 1], stddev=0.1)

# Define the model architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=1, input_shape=[1])
])

# Define the optimizer and compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='mse')

# Train the model for 100 epochs
history = model.fit(x_train, y_train, epochs=100, verbose=0)

# Plot the training loss over time
import matplotlib.pyplot as plt
plt.plot(history.history['loss'])
plt.title('Training loss over time')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
```

Lines 1–2: We set the PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION environment variable so that the pure-Python protobuf implementation is used. This must happen before TensorFlow is imported.
Line 3: We import the TensorFlow library to use its machine learning and deep learning capabilities.
Lines 6–7: We generate 100 samples of random training data with one input feature and one output target. The target is three times the input plus a small amount of Gaussian noise.
Lines 10–12: We define a simple linear regression model with a single dense layer using the Sequential API from Keras.
Lines 15–16: We specify the Adam optimization algorithm with a learning rate of 0.01 and compile the model using the mean squared error (MSE) loss function.
Line 19: We train the model for 100 epochs on the training data using the fit method, which updates the model’s weights and biases based on the Adam optimizer and the training data.
Lines 22–27: We plot the training loss over time using the history object returned by the fit method and the plot function from the matplotlib library. This provides a visual representation of how well the model fits the training data over the course of the training process.
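To make the weight updates less opaque, the same kind of regression can be fitted with a hand-written Adam loop that uses only the Python standard library. This is a rough sketch under several assumptions, not a description of what Keras does internally: it uses full-batch gradients, whereas fit uses mini-batches by default. The seed, the learning rate of 0.05, and the 500 steps are illustrative choices, and the data comes from the random module rather than tf.random.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Same kind of data as the Keras example: y = 3x + small Gaussian noise
xs = [random.gauss(0.0, 1.0) for _ in range(100)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

params = [0.0, 0.0]   # [weight, bias]
m = [0.0, 0.0]        # first-moment estimates, one per parameter
v = [0.0, 0.0]        # second-moment estimates, one per parameter
lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
losses = []

for t in range(1, 501):
    w, b = params
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    losses.append(sum(e * e for e in errors) / len(errors))  # MSE before this step
    # Gradients of the MSE with respect to the weight and the bias
    grads = [
        2.0 * sum(e * x for e, x in zip(errors, xs)) / len(xs),
        2.0 * sum(errors) / len(xs),
    ]
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = v[i] / (1 - beta2 ** t)   # bias-corrected second moment
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(f"learned weight = {params[0]:.3f}, bias = {params[1]:.3f}")
print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}")
```

The learned weight should end up close to the slope of 3 used to generate the data, and the recorded losses play the same role as history.history['loss'] in the Keras version.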