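AdaBelief adapts its step size according to its “belief” in the current gradient direction. It treats the exponential moving average of past gradients as a prediction of the next gradient. If the observed gradient is close to that prediction, the direction is trusted and a large update is applied. Otherwise, it’s distrusted and the step size is reduced.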
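To make the idea of "belief" concrete, here is a minimal sketch of the AdaBelief update for a single scalar parameter, written in plain Python. The belief is measured by how far the gradient lands from the running average of past gradients. The function names `adabelief_step` and `last_step_size` and the default hyperparameters are assumptions chosen for illustration, not part of any library.

```python
import math


def adabelief_step(param, grad, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-16):
    """One AdaBelief update for a scalar parameter (t is the 1-based step count)."""
    # Running average of gradients: the "prediction" of the next gradient.
    m = b1 * m + (1 - b1) * grad
    # Running average of the squared gap between the gradient and the prediction.
    s = b2 * s + (1 - b2) * (grad - m) ** 2 + eps
    # Bias correction, as in Adam.
    m_hat = m / (1 - b1 ** t)
    s_hat = s / (1 - b2 ** t)
    # Small gap -> small s_hat -> large step; large gap -> small step.
    update = lr * m_hat / (math.sqrt(s_hat) + eps)
    return param - update, m, s, update


def last_step_size(grads):
    """Run the optimizer over a gradient sequence and return the final step magnitude."""
    param, m, s, update = 0.0, 0.0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        param, m, s, update = adabelief_step(param, g, m, s, t)
    return abs(update)


# Hypothetical gradient sequences: one steady, one flipping sign every step.
step_consistent = last_step_size([1.0] * 20)
step_noisy = last_step_size([1.0, -1.0] * 10)
print(step_consistent > step_noisy)
```

With these two toy sequences, the steady gradients end with the larger step, because the gradient keeps matching its running average. The sign-flipping gradients keep missing it, so the denominator grows and the step shrinks.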

The authors of AdaBelief introduced the optimizer to:

  • Converge quickly, as adaptive methods such as Adam do.
  • Generalize well, as SGD does.
  • Be stable during training.

Let’s look at a Flax training state that applies the AdaBelief optimizer.
