Log all grad norms by dhia680 · Pull Request #50 · swiss-ai/Megatron-LM

dhia680 · 2025-02-27T12:33:04Z

Add an option for logging individual gradients' norms.
These additional computations use a kernel from TE --> quite efficient.
A couple of all_reduce operations are also needed (with no significant comm overhead).
We only gather norms, not tensors :)

P.S: Current implementation does not support using a distributed optimizer.

AleHD · 2025-03-27T14:28:00Z

Very useful PR, let's finish it to merge it to main in the near future. @dhia680 could you please fix the distributed optimizer case? (or at least add an assertion in the validate_args to make sure we don't use distributed opt whenever this is enabled). Also, as we discussed now the whole code breaks if we use distributed optimizer. I fixed this in my fork, if you agree I can incorporate the fix to this PR directly: AleHD@1e2ff27 (commit contains a lot of changes in other files, ignore that).

Also, I'm against the idea of modifying the submit-llama.sh with these experimental features (and more so if we have to disable important stuff like the distributed optimizer). What do you say if we revert those changes and keep this PR shorter?

dhia680 · 2025-03-27T14:46:48Z

Thanks for the comment @AleHD !
After looking at your commit, I think we can simply check if --use-distributed-optimizer flag is passed when doing the assertion.
Further, I see you commented the all_reduce over grad_stats_parallel_group. Is that intentional ?
Finally, I agree with keeping job scripts as they originally were. I changed them just to highlight how the feature could be used. I'll reset them.

AleHD · 2025-03-27T15:37:57Z

Hey! Could you elaborate about the --use-distributed-optimizer flag? Do you mean to solve my comment "or at least add an assertion in the validate_args to make sure we don't use distributed opt whenever this is enabled"? Or are you talking to an alternative to my "if ensure_individual_correct" in the fork?

Regarding the all_reduce, yes it was intentional. Doing a single all_reduce(individ_norms_tensor) instead of several all_reduce(individ_norms[i]) should have better performance. Also, I think data_parallel_groups[i] is None for all if we don't use distributed optimizer (so I added that assertion), in which case both codes should be equivalent but the one in the fork faster

haeggee and others added 7 commits February 12, 2025 16:58

dirty start for individ grad norm logs

d72a79d

start correcting

4795a45

handle parallel groups

c329924

force add arg

e866448

now it works. no distrib optim

37376f6

works well (w/o dist optim)

08d76fa

update baselines

af0cfa1

AleHD self-requested a review March 27, 2025 14:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Log all grad norms#50

Log all grad norms#50
dhia680 wants to merge 7 commits into
mainfrom
log-all-grad-norms

dhia680 commented Feb 27, 2025 •

edited

Loading

Uh oh!

AleHD commented Mar 27, 2025

Uh oh!

dhia680 commented Mar 27, 2025 •

edited

Loading

Uh oh!

AleHD commented Mar 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dhia680 commented Feb 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AleHD commented Mar 27, 2025

Uh oh!

dhia680 commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AleHD commented Mar 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dhia680 commented Feb 27, 2025 •

edited

Loading

dhia680 commented Mar 27, 2025 •

edited

Loading