Adam Optimization
- Adam adapts each parameter’s learning rate using moving averages of the gradient and its second raw moment.
- It combines ideas from RMSprop and SGD with momentum to improve convergence.
- Widely used in deep learning because it is computationally efficient and has low memory requirements.
Definition
Adam (Adaptive Moment Estimation) is an optimization algorithm for training deep learning models. It is a variant of stochastic gradient descent (SGD) that uses moving averages of the gradient and of the second raw moment of the gradient to compute adaptive learning rates for each parameter.
Explanation
Adam updates model parameters by computing the gradient of the loss with respect to each parameter, maintaining running (moving) averages of the gradient and of the squared gradient (the second raw moment), and using those averages to scale the learning rate of each parameter individually. Conceptually, Adam combines RMSprop, which scales the learning rate by a moving average of the squared gradient, with SGD with momentum, which uses a moving average of the gradient to smooth the update direction. The result is an optimizer that adapts step sizes per parameter while retaining momentum-like behavior. This adaptation keeps parameter updates from becoming excessively large or small and reduces oscillation and divergence during training.
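As a concrete illustration, here is a minimal sketch of the per-parameter update Adam performs, assuming the commonly used defaults (learning rate 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8); the function name adam_step and the use of NumPy arrays are illustrative choices, not part of any particular library.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving average of the gradient (first moment estimate)
    m = beta1 * m + (1 - beta1) * grad
    # Moving average of the squared gradient (second raw moment estimate)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both averages start at zero and are biased early in training
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter adaptive step: dividing by sqrt(v_hat) rescales the step size
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

Calling adam_step once per iteration, with t counting from 1 and m and v initialized to zero arrays, reproduces the per-parameter scaling described above.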
Examples
Example of Adam optimization in action:
# Initialize the model's parameters
params = initialize_params(n_inputs, n_hidden, n_outputs)

# Set the learning rate and the number of training iterations
learning_rate = 0.01
n_iterations = 1000

# Initialize the Adam optimizer
optimizer = AdamOptimizer(params, learning_rate)

# Train the model for n_iterations
for i in range(n_iterations):
    # Forward propagate the input
    outputs = forward_propagate(inputs, params)

    # Calculate the loss
    loss = calculate_loss(outputs, targets)

    # Backpropagate the error
    gradients = backpropagate(outputs, targets, params)

    # Update the model's parameters
    optimizer.update_params(gradients)

# Evaluate the trained model on the test data
accuracy = evaluate(test_data, params)
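The helper functions in the example above (initialize_params, forward_propagate, and so on) are placeholders. In practice a framework's built-in Adam implementation is typically used; as a rough equivalent of the same loop, the sketch below uses PyTorch's torch.optim.Adam and assumes that model (a torch.nn.Module), loss_fn, train_loader, and n_epochs are already defined.

import torch

# Assumes model, loss_fn, train_loader, and n_epochs are defined elsewhere.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(n_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()             # clear gradients from the previous step
        outputs = model(inputs)           # forward pass
        loss = loss_fn(outputs, targets)  # compute the loss
        loss.backward()                   # backpropagate to populate parameter gradients
        optimizer.step()                  # Adam update of all parameters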
Use cases
- Typically used in deep learning applications to optimize a model’s parameters during training.
Notes or pitfalls
- Adam adapts learning rates per parameter, which helps keep training from converging too slowly, oscillating, or diverging.
- It is noted for computational efficiency and low memory requirements compared to some alternatives.
Related terms
- Stochastic gradient descent (SGD)
- RMSprop
- SGD with momentum