AdaBelief Optimizer: A Big Improvement over the Adam Optimizer?

AdaBelief more precisely standardizes the moving-averaged gradient before using it to update the parameters. Thus, I expect there to be less variance in the parameters' magnitudes during training and, as a consequence, the trained model to generalize better.

Ty Nguyen
6 min read · Dec 18, 2020

Disclaimer: this blog post is to check my understanding of a new paper. If you find something unclear or incorrect, please let me know in the comment section.

1. What is an Optimizer in a Deep Neural Network and Why Does it Matter?

One of the core engines of modern deep learning is gradient descent, which has been shown to be able to train deep network models for a variety of challenging problems in computer vision and natural language processing. The popularity of this technique stems from two factors: 1) it requires only the first-order derivatives (gradients) of the loss function. Other gradient-based approaches such as Newton's method can converge faster but require higher-order derivatives of the loss function, which might not always be feasible to compute. 2) It is easy to implement with automatic differentiation. Thus, nowadays, researchers need not worry about manually analyzing the loss function to compute its gradients.

In the gradient descent training scheme, the rate at which parameters get updated turns out to be critical to the model's performance. The simplest form of this update can be summarized by the following iterative equation, where θ_t denotes the parameters at step t, η is the learning rate, and ∇L(θ_t) is the gradient of the loss:

θ_{t+1} = θ_t - η * ∇L(θ_t)

SGD uses this form of update and seems to work well for a lot of problems. However, since η is fixed, this scheme does not take into account where a parameter lies in the network or how it contributes to the final output. Therefore, there has been a large body of literature on how to properly do the update. Depending on the form of this update equation, we get different variants of what is called an optimizer in the deep neural network context. As AdaBelief [1] suggests, we can categorize these methods into two groups: adaptive learning rate methods (Adam, Adagrad, …) and accelerated schemes (SGD with momentum, Nesterov).
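To make the fixed-learning-rate update concrete, here is a minimal sketch in plain NumPy; the function name sgd_step and the toy loss are my own illustration, not from any paper:

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """One plain SGD step: theta <- theta - lr * gradient."""
    # the learning rate lr is fixed and shared by every parameter,
    # which is exactly the limitation discussed above
    return params - lr * grads

# toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta
theta = np.array([5.0])
for _ in range(100):
    grad = 2.0 * theta
    theta = sgd_step(theta, grad, lr=0.1)
print(theta)  # close to the optimum at 0
```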

2. Adam Optimizer vs. AdaBelief Optimizer

The Adam optimizer, first introduced in [2], has become hugely popular among practitioners and had gathered more than 60,000 citations at the time this article was published. AdaBelief, introduced in [1], is claimed to improve on Adam in terms of generalization capability. Below is a comparison between the two methods.

2.1. What is the main difference?

As we can see from the two algorithms, the only difference lies in the term that tracks the second moment of the gradient and scales the update.
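In equations: Adam tracks an EMA of the squared gradient, v_t = β2 * v_{t-1} + (1 - β2) * g_t^2, whereas AdaBelief tracks an EMA of the squared deviation of the gradient from its own EMA, s_t = β2 * s_{t-1} + (1 - β2) * (g_t - m_t)^2. Everything else (bias correction, the m̂_t / (√· + ε) step) stays the same. A minimal side-by-side sketch of just these accumulators (variable names are mine):

```python
def moment_updates(m, v, s, g, beta1=0.9, beta2=0.999):
    """Update the shared first moment and the two competing second-moment terms."""
    m = beta1 * m + (1 - beta1) * g             # EMA of the gradient (shared by both methods)
    v = beta2 * v + (1 - beta2) * g ** 2        # Adam: EMA of the squared gradient
    s = beta2 * s + (1 - beta2) * (g - m) ** 2  # AdaBelief: EMA of squared deviation from m
    return m, v, s
```

(If I read Algorithm 2 of [1] correctly, the paper also adds a small ε inside the s_t update for numerical stability; I omit it here for clarity.)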

2.2. How Do the Authors Interpret These Terms?

With Adam

According to the Adam paper [2] and [3], the authors define the ratio m̂_t / √v̂_t appearing in the effective step size as the SNR (the signal-to-noise ratio).

The SNR is a measure that compares the level of the desired signal (here the gradient, i.e., the direction of the objective function curve) to the level of background noise (the second moment of the gradient, i.e., the noise around this direction).

Square-root scaling is taken from the RMSProp (Root Mean Square Propagation) algorithm. The idea is that since the gradients are accumulated over multiple mini-batches, we need the gradient at each step to remain stable. Since one mini-batch can contain data samples radically different from another mini-batch, to limit the variation of the gradient, the EMA of the first moment of the gradient is divided by the square root of the EMA of the squared gradient (an RMS term). One can think of this technique as a sort of normalization whereby the gradient is divided by a sort of standard deviation.
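As a toy illustration of that normalization (numbers invented, and plain batch means standing in for the EMAs), dividing by the RMS damps the step along a noisy coordinate much more than along a stable one:

```python
import numpy as np

# made-up gradients over 4 mini-batches for two parameters:
# parameter 0 gets a stable gradient, parameter 1 a noisy one
grads = np.array([[1.0,  5.0],
                  [1.0, -4.0],
                  [1.0,  6.0],
                  [1.0, -5.0]])

m = grads.mean(axis=0)                  # stand-in for the first-moment EMA
rms = np.sqrt((grads**2).mean(axis=0))  # stand-in for the RMS of the gradients
print(m / rms)
# stable coordinate -> ratio near 1 (full step)
# noisy coordinate  -> ratio near 0 (step heavily damped)
```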

Automatic annealing

With a smaller SNR, the effective step size is closer to zero, meaning that there is a lot of noise compared to the actual signal, and hence greater uncertainty about whether the direction of the first-order gradient corresponds to the direction of the optimum. This is a desirable property: the steps become smaller, which limits divergence from a local optimum.

The SNR typically becomes closer to 0 towards an optimum, leading to small effective steps in parameter space. This enables a more robust and faster convergence to the optimum.

With AdaBelief

The authors of [1] interpret the 1/√s_t factor on the right-hand side as the "belief" in how much the current gradient g_t deviates from the "prediction" of the gradient m_t. I don't quite agree with this interpretation: if it is a belief, shouldn't its value lie between 0 and 1? There is no such constraint here. In Section 4, I'll give my own interpretation of this term.
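Here is a toy calculation (all values invented) of that 1/√s_t factor: when the observed gradient g_t is close to the EMA "prediction" m_t, the squared deviation feeding s_t is tiny and the factor is large, so a big step is taken; when g_t deviates strongly, the factor shrinks. Note also that the factor is nowhere near the [0, 1] range, which is why the "belief" wording feels off to me.

```python
# toy values, not from the paper
beta2, eps = 0.999, 1e-8

def adabelief_scale(g, m, s_prev):
    """Return the 1 / (sqrt(s_t) + eps) factor for one observation."""
    s = beta2 * s_prev + (1 - beta2) * (g - m) ** 2
    return 1.0 / (s ** 0.5 + eps)

print(adabelief_scale(g=1.00, m=1.01, s_prev=1e-4))  # g close to the prediction -> ~100
print(adabelief_scale(g=1.00, m=0.10, s_prev=1e-4))  # g far from the prediction  -> ~33
```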

3. Does AdaBelief Significantly Outperform Adam?

The authors ran thorough experiments in both vision and NLP settings, showing the superior performance of AdaBelief over Adam.
In the image classification experiments, they found that:

Datasets: CIFAR-10, CIFAR-100
Network models: VGG11, ResNet34, DenseNet121
AdaBelief converges as fast as adaptive methods such as Adam while achieving better accuracy than SGD and other methods.

Dataset: ImageNet
Network model: ResNet18
AdaBelief outperforms other adaptive methods and achieves accuracy comparable to SGD (70.08 vs. 70.23), which closes the generalization gap between adaptive methods and SGD. These experiments validate the fast convergence and good generalization performance of AdaBelief.

LSTM on language modeling

Generative adversarial networks

4. My Thoughts

Why is AdaBelief Better than Adam?

First of all, I think that the main difference between Adam and AdaBelief is that AdaBelief more precisely standardizes the gradients before the update. Let's look at the following two equations in AdaBelief:

m_t = β1 * m_{t-1} + (1 - β1) * g_t
s_t = β2 * s_{t-1} + (1 - β2) * (g_t - m_t)^2

The first one is the exponential moving average (EMA) of the gradient, while the latter can be considered an exponential moving variance of the gradient. Therefore, ignoring the bias correction, the term m_t / (√s_t + ε) can be considered as standardizing the moving-averaged gradient. This is consistent with Adam [2], which describes its own scaling as a sort of normalization whereby the gradient is divided by a sort of standard deviation.

In other words, AdaBelief more precisely standardizes the moving-averaged gradient before using it to update the parameters. Thus, I expect there to be less variance in the parameters' magnitudes during training and, as a consequence, the trained model to generalize better.

The authors make a couple of claims to explain why AdaBelief is better, including its ability to use curvature information. While this is not incorrect, I believe the main reason for its superior performance lies in the "more precise" way of standardizing the gradient before the parameter update.
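Putting the pieces together, below is a minimal single-parameter sketch of one AdaBelief step as I understand Algorithm 2 of [1] (names are my own, and I leave out the paper's optional weight decay and rectification):

```python
import numpy as np

def adabelief_step(theta, g, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief update for a single parameter tensor.

    m: EMA of the gradient (the "prediction")
    s: EMA of the squared deviation of g from m (the moving variance)
    t: 1-based step counter, used for bias correction
    """
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * (g - m) ** 2
    m_hat = m / (1 - beta1 ** t)                          # bias-corrected first moment
    s_hat = s / (1 - beta2 ** t)                          # bias-corrected moving variance
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)   # standardized update
    return theta, m, s

# toy usage on f(theta) = theta^2 (gradient 2 * theta)
theta, m, s = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 11):
    theta, m, s = adabelief_step(theta, 2.0 * theta, m, s, t)
print(theta)  # slightly below 5.0, moving towards the optimum at 0
```

Swapping the (g - m) ** 2 line for g ** 2 essentially recovers Adam, which is the sense in which the modification is tiny.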

What could be Improved?

One of my concerns with AdaBelief is that towards the end of the training process, s_t can become very small as the variance of the gradient decreases, causing big steps in the update.
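As a back-of-the-envelope example with invented numbers: with lr = 1e-3, ε = 1e-8, a bias-corrected m̂_t = 0.01, and a nearly vanished ŝ_t = 1e-10, the update is lr * m̂_t / (√ŝ_t + ε) ≈ 1e-3 * 0.01 / 1e-5 ≈ 1.0, i.e., a step of size roughly 1 even though the averaged gradient is only 0.01; plain SGD with the same learning rate would move only 1e-5.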

Should We Start Using AdaBelief instead of Adam?

The answer is simply YES. The modification is tiny while the performance gain is pretty clear.
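For reference, switching is roughly a one-line change. The sketch below assumes the authors' reference implementation is installed as the adabelief_pytorch package and exposes an Adam-like AdaBelief class; please check the official repository for the exact, current API and recommended hyperparameters:

```python
import torch
from adabelief_pytorch import AdaBelief  # assumed package/class name; verify against the official repo

model = torch.nn.Linear(10, 1)

# before: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-8)
optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-8, betas=(0.9, 0.999))

# the rest of the training loop is unchanged
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```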

5. Your Thoughts?

If you find something unclear or incorrect, please let me know in the comment section. Any comment is highly appreciated.

References

[1] Zhuang, Juntang, et al. “AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients.” Advances in Neural Information Processing Systems 33 (2020).

[2] Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).

[3] https://towardsdatascience.com/understanding-adam-how-loss-functions-are-minimized-3a75d36ebdfc


Ty Nguyen

Ph.D. in CS at the University of Pennsylvania. I have been working on deep learning for robotics and machine perception.