
Sensitivity wrt LR restarts #8

Open

Description

@depthwise

I'm observing sensitivity to LR restarts in a typical SGDR schedule with cosine annealing, as in Loshchilov & Hutter. RAdam still seems to be doing better than AdamW so far, but the jumps suggest possible numerical instability at the LR discontinuities.
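For context, here's roughly the setup (a minimal sketch; the model, loss, data loader, and hyperparameters are placeholders, and I'm assuming the `RAdam` class from `radam.py` in this repo):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
from radam import RAdam  # RAdam implementation from this repo (assumed import path)

model = MyModel()  # placeholder model
criterion = torch.nn.CrossEntropyLoss()  # placeholder loss
optimizer = RAdam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# SGDR: cosine annealing with warm restarts (Loshchilov & Hutter).
# T_0 is the cycle length in epochs; T_mult=1 keeps every cycle the same length.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=1, eta_min=1e-6)

for epoch in range(num_epochs):  # num_epochs / train_loader are placeholders
    for i, (x, y) in enumerate(train_loader):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        # fractional epoch so the LR is annealed smoothly within each cycle
        scheduler.step(epoch + i / len(train_loader))
```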

Here's the training loss compared to AdamW (PyTorch 1.2.0 version):
[Training loss plot: radam_jumps]

Here's the validation loss:
[Validation loss plot: radam_val]

What's the recommendation here? Should I use warmup in every cycle rather than just at the beginning? I thought RAdam was supposed to obviate the need for warmup. Is this a bug?
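To be concrete, this is the kind of per-cycle warmup I'm asking about (a hypothetical sketch using `LambdaLR`; `cycle_len`, `warmup_steps`, and `eta_min_ratio` are made-up values, and it assumes fixed-length cycles with the scheduler stepped once per optimizer step):

```python
import math
from functools import partial
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine_cycle(step, cycle_len=1000, warmup_steps=100, eta_min_ratio=0.0):
    """LR multiplier: linear warmup at the start of *every* cycle, then cosine decay."""
    t = step % cycle_len  # position within the current (fixed-length) cycle
    if t < warmup_steps:
        return (t + 1) / warmup_steps  # linear warmup each cycle
    progress = (t - warmup_steps) / (cycle_len - warmup_steps)
    return eta_min_ratio + (1.0 - eta_min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

# stepped once per optimizer step rather than once per epoch
scheduler = LambdaLR(optimizer, lr_lambda=partial(warmup_cosine_cycle,
                                                  cycle_len=1000, warmup_steps=100))
```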

Metadata


Labels

question (Further information is requested)
