Description
I'm observing sensitivity to the learning-rate restarts in a typical SGDR schedule with cosine annealing, as in Loshchilov & Hutter. RAdam still seems to be doing better than AdamW so far, but the loss jumps at the restart points suggest possible numerical instability at the LR discontinuities.
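For context, the schedule is wired up roughly like this (minimal sketch; the model, `T_0`/`T_mult`, and the RAdam import path are placeholders, not my exact setup):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
from radam import RAdam  # assumed import path for this repo's RAdam implementation

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = RAdam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# SGDR-style cosine annealing with warm restarts (Loshchilov & Hutter).
# T_0 / T_mult here are illustrative, not the values from my run.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

for epoch in range(100):
    # ... one pass over the training data, calling optimizer.step() per batch ...
    scheduler.step()  # LR jumps back to its peak at every restart boundary
```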
Here's the training loss compared to AdamW (the implementation shipped with PyTorch 1.2.0):
What's the recommendation here? Should I use warmup in every cycle rather than just at the beginning (roughly as in the sketch below)? I thought RAdam was supposed to obviate the need for warmup. Is this a bug?
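To be concrete about what I mean by per-cycle warmup, something like this (sketch only; `warmup_steps`, `base_lr`, and the cycle bookkeeping are made up for illustration):

```python
import math

def lr_with_per_cycle_warmup(step_in_cycle, cycle_len, base_lr=1e-3,
                             eta_min=0.0, warmup_steps=500):
    """LR for the current step: linear warmup after each restart, then cosine decay."""
    if step_in_cycle < warmup_steps:
        # linear ramp from ~0 up to base_lr right after the restart
        return base_lr * (step_in_cycle + 1) / warmup_steps
    # cosine annealing over the remainder of the cycle
    t = (step_in_cycle - warmup_steps) / max(1, cycle_len - warmup_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t))

# applied manually each step instead of the built-in scheduler:
# for g in optimizer.param_groups:
#     g["lr"] = lr_with_per_cycle_warmup(step_in_cycle, cycle_len)
```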