Replies: 3 comments 4 replies
-
@amaarora I don't think it can be cleanly contained within an optimizer. It requires two forward passes with manipulation of the gradient in between to calc the perturbation. Since closures don't work with the grad scaler, that breaks the optimizer abstraction and requires a custom train loop. Additionally, there are some other questions I have regarding the grads when using DDP and grad clipping. In the new …
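For concreteness, here is a minimal sketch of the kind of custom train-loop step this implies with native AMP: two forward/backward passes, a normalised-gradient perturbation applied in between, and `GradScaler` driven without a closure. All of `model`, `loss_fn`, `base_optimizer` and `rho` are placeholders rather than timm API, and the first-pass overflow handling plus the DDP / grad-clipping questions above are deliberately left out.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def sam_train_step(model, images, targets, base_optimizer, loss_fn, rho=0.05):
    params = [p for p in model.parameters() if p.requires_grad]

    # 1st forward/backward: gradients at the current weights (scaled by GradScaler).
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(images), targets)
    scaler.scale(loss).backward()

    # Perturbation e = rho * g / ||g||. The loss-scale factor cancels in the
    # normalisation, so the scaled grads can be used directly here
    # (inf/nan handling on this pass is omitted for brevity).
    grads = [p.grad for p in params if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.detach().norm(2) for g in grads]), 2)
    perturbs = []
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            e = p.grad * (rho / (grad_norm + 1e-12))
            p.add_(e)
            perturbs.append((p, e))
    base_optimizer.zero_grad(set_to_none=True)

    # 2nd forward/backward: gradients at the perturbed weights.
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(images), targets)
    scaler.scale(loss).backward()

    # Restore the original weights, then update using the 2nd-pass gradients.
    with torch.no_grad():
        for p, e in perturbs:
            p.sub_(e)
    scaler.step(base_optimizer)  # unscales 2nd-pass grads, skips the step on inf/nan
    scaler.update()
    base_optimizer.zero_grad(set_to_none=True)
    return loss.detach()
```

This is exactly the shape that doesn't fit `optimizer.step(closure)` once `GradScaler` is involved, which is why it leaks into the train loop.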
-
@AlejandroRigau I've been watching related papers and implementations. Overall I'm not too happy with the state of most PyTorch impls in that they tend to ignore proper AMP usage completely (and I'm not going to add anything which will prevent use of AMP and also adds extra overhead itself). GSAM looks like a decent impl to tweak and add AMP support to (I think there is a PR for most of the support, but it needed improvement last I looked; that could be different now) ... https://github.com/juntang-zhuang/GSAM Also, @tmabraham has been looking at this a lot (and trying to convince me to add it); he was going to try some impl soon I think... he was discussing an alternative solution (MESA) with me recently as well.
-
@rwightman Any updates on this topic? I would be happy to work on a PR.
-
@rwightman As I am sure you are already aware - SAM has been at the center of recent papers in CV. ViT, MLPs and NFNets all seem to benefit (as do their BN counterparts).
There's an open-source implementation here - https://github.com/google-research/sam
If you do agree, I am happy to start working on a PR to get this optimizer into TIMM.
References:
Paper: https://arxiv.org/abs/2010.01412v3
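For quick reference, the update proposed in the paper above (p = 2 case, weight-decay term omitted): the inner maximisation of the min-max objective is approximated with one normalised gradient-ascent step, and the base optimizer then uses the gradient taken at that perturbed point - which is exactly the two-pass structure discussed earlier in the thread.

```latex
% Sharpness-aware objective
\min_{w}\; \max_{\|\epsilon\|_2 \le \rho} L_{\text{train}}(w + \epsilon)

% First-order approximation of the inner maximum
\hat{\epsilon}(w) = \rho \, \frac{\nabla_w L_{\text{train}}(w)}{\lVert \nabla_w L_{\text{train}}(w) \rVert_2}

% Gradient actually handed to the base optimizer
g_{\text{SAM}} \approx \nabla_w L_{\text{train}}(w)\big\rvert_{w + \hat{\epsilon}(w)}
```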