Log all grad norms #50
Very useful PR, let's finish it so we can merge it into main soon. @dhia680, could you please fix the distributed optimizer case (or at least add an assertion in validate_args so that the distributed optimizer can't be used when this option is enabled)? Also, as we discussed, the whole code now breaks if we use the distributed optimizer. I fixed this in my fork; if you agree, I can incorporate the fix into this PR directly: AleHD@1e2ff27 (the commit contains a lot of changes in other files, ignore those). Finally, I'm against modifying submit-llama.sh with these experimental features (even more so if we have to disable important stuff like the distributed optimizer). How about we revert those changes and keep this PR shorter?
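For reference, the assertion suggested above could look roughly like the sketch below. It is only an illustration: `log_all_grad_norms` is a hypothetical name for the option this PR adds, `check_grad_norm_logging_args` is a made-up helper rather than the actual validate_args, and `use_distributed_optimizer` is assumed to be the attribute that holds the distributed optimizer setting.

```python
from types import SimpleNamespace


def check_grad_norm_logging_args(args):
    """Reject configurations that combine per-parameter grad-norm logging
    with the distributed optimizer.

    `log_all_grad_norms` is a hypothetical flag name used for illustration;
    `use_distributed_optimizer` is assumed to hold the optimizer setting.
    """
    if getattr(args, "log_all_grad_norms", False):
        assert not getattr(args, "use_distributed_optimizer", False), (
            "Logging all grad norms is not supported together with the "
            "distributed optimizer; disable one of the two options."
        )
    return args


# A configuration that passes the check: logging on, distributed optimizer off.
ok_args = check_grad_norm_logging_args(
    SimpleNamespace(log_all_grad_norms=True, use_distributed_optimizer=False)
)
```

In a real setup a check like this would run once at argument-parsing time, so that an unsupported combination fails early instead of breaking during training.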
Thanks for the comment, @AleHD!
Hey! Could you elaborate about the Regarding the
Add an option for logging individual gradients' norms.
The additional computations use a kernel from TE (Transformer Engine), so they are quite efficient.
A couple of all_reduce operations are also needed (with no significant communication overhead).
We only gather norms, not tensors :)
P.S.: The current implementation does not support using the distributed optimizer.
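To illustrate why only norms (not tensors) need to be communicated, here is a minimal pure-Python sketch. It is not the code from this PR: the ranks are simulated with plain dictionaries, `simulated_all_reduce_sum` stands in for a real SUM all_reduce, and the parameter names and sharding are made up. It assumes each rank holds a shard of every parameter's gradient, computes a local squared norm per parameter, sums those scalars across ranks, and takes the square root.

```python
import math


def local_sq_norms(named_grads):
    """Return {param_name: sum of squared gradient entries} for one rank's shard."""
    return {name: sum(g * g for g in grad) for name, grad in named_grads.items()}


def simulated_all_reduce_sum(per_rank_values):
    """Stand-in for a SUM all_reduce: add matching scalar entries across ranks."""
    total = {}
    for values in per_rank_values:
        for name, value in values.items():
            total[name] = total.get(name, 0.0) + value
    return total


def per_param_grad_norms(per_rank_grads):
    """Combine per-rank squared norms into one global L2 norm per parameter."""
    summed = simulated_all_reduce_sum([local_sq_norms(g) for g in per_rank_grads])
    return {name: math.sqrt(value) for name, value in summed.items()}


# Two hypothetical ranks, each holding a shard of the same two parameters.
rank0 = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}
rank1 = {"layer.weight": [0.0, 4.0], "layer.bias": [0.0]}

norms = per_param_grad_norms([rank0, rank1])
print(norms)  # layer.weight -> 5.0, layer.bias -> 1.0
```

Only one scalar per parameter crosses the (simulated) communication step, which is why the extra all_reduce traffic stays small compared with gathering the gradient tensors themselves.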