
Adam Optimisation

  • Adam adapts each parameter’s learning rate using moving averages of the gradient and its second raw moment.
  • It combines ideas from RMSprop and SGD with momentum to improve convergence.
  • Widely used in deep learning because it is computationally efficient and has low memory requirements.

Adam (Adaptive Moment Estimation) is an optimization algorithm for training deep learning models. It is a variant of stochastic gradient descent (SGD) that maintains moving averages of the gradient (the first moment) and of the squared gradient (the second raw moment) and uses them to compute an adaptive learning rate for each parameter.

Adam updates model parameters by computing the gradient of the loss with respect to each parameter, maintaining running (moving) averages of the gradient and of the squared gradient (the second raw moment), and using those averages to scale the parameter-wise learning rates. Conceptually, Adam combines RMSprop, which scales the learning rate by a moving average of the squared gradient, with SGD with momentum, which smooths the update direction with a moving average of the gradient, so the optimizer adapts step sizes per parameter while retaining momentum-like behavior. This per-parameter scaling keeps update steps at a reasonable magnitude even when the raw gradients are very large or very small, which reduces oscillation and the risk of divergence during training.
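
For intuition, here is a minimal NumPy sketch of a single Adam update step with bias correction, using the commonly cited defaults (beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8). The function name adam_step and the small example vectors are illustrative placeholders, not part of any particular library:

import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving average of the gradient (first moment)
    m = beta1 * m + (1 - beta1) * grad
    # Moving average of the squared gradient (second raw moment)
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias-correct both moments: they start at zero and are biased toward zero early in training
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Scale the step per parameter by the root of the second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Example: one update step on a 3-parameter vector (t counts steps starting at 1)
param = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
grad = np.array([0.1, -0.2, 0.05])  # placeholder gradient
param, m, v = adam_step(param, grad, m, v, t=1)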

Example of Adam optimization in action:

# Initialize the model's parameters
params = initialize_params(n_inputs, n_hidden, n_outputs)
# Set the learning rate and the number of training iterations
learning_rate = 0.01
n_iterations = 1000
# Initialize the Adam optimizer
optimizer = AdamOptimizer(params, learning_rate)
# Train the model for n_iterations
for i in range(n_iterations):
    # Forward propagate the input
    outputs = forward_propagate(inputs, params)
    # Calculate the loss
    loss = calculate_loss(outputs, targets)
    # Backpropagate the error
    gradients = backpropagate(outputs, targets, params)
    # Update the model's parameters
    optimizer.update_params(gradients)
# Evaluate the trained model on the test data
accuracy = evaluate(test_data, params)
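
In practice, Adam is usually taken off the shelf rather than implemented by hand. As a brief sketch, the same kind of training loop using PyTorch's built-in torch.optim.Adam might look like the following; the tiny linear model and random tensors are placeholders standing in for a real dataset and architecture:

import torch
import torch.nn as nn

# Illustrative setup: a one-layer model and random data stand in for a real task
model = nn.Linear(4, 2)
loss_fn = nn.CrossEntropyLoss()
inputs = torch.randn(64, 4)
targets = torch.randint(0, 2, (64,))

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for step in range(100):
    optimizer.zero_grad()             # clear gradients from the previous step
    outputs = model(inputs)           # forward pass
    loss = loss_fn(outputs, targets)  # compute the training loss
    loss.backward()                   # backpropagate to populate parameter gradients
    optimizer.step()                  # Adam applies its adaptive, per-parameter update
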
  • Typically used in deep learning applications to optimize a model’s parameters during training.
  • Adam adapts learning rates to help prevent the model from converging too slowly, oscillating, or diverging.
  • It is noted for computational efficiency and low memory requirements compared to some alternatives.

Related concepts:

  • Stochastic gradient descent (SGD)
  • RMSprop
  • SGD with momentum
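
For comparison with the related methods listed above, here is a rough sketch, under the same illustrative NumPy conventions as before, of the two update rules Adam draws on: SGD with momentum smooths the update direction with a moving average of the gradient, while RMSprop scales the step by a moving average of the squared gradient.

import numpy as np

def momentum_step(param, grad, velocity, lr=0.01, beta=0.9):
    # SGD with momentum: accumulate a moving average of past gradients
    velocity = beta * velocity + grad
    return param - lr * velocity, velocity

def rmsprop_step(param, grad, sq_avg, lr=0.01, beta=0.9, eps=1e-8):
    # RMSprop: divide the step by the root of a moving average of squared gradients
    sq_avg = beta * sq_avg + (1 - beta) * grad**2
    return param - lr * grad / (np.sqrt(sq_avg) + eps), sq_avg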