Skip to content

Log all grad norms#50

Open
dhia680 wants to merge 7 commits into
mainfrom
log-all-grad-norms
Open

Log all grad norms#50
dhia680 wants to merge 7 commits into
mainfrom
log-all-grad-norms

Conversation

@dhia680
Copy link
Copy Markdown
Member

@dhia680 dhia680 commented Feb 27, 2025

Add an option for logging individual gradients' norms.
These additional computations use a kernel from TE --> quite efficient.
A couple of all_reduce operations are also needed (with no significant comm overhead).
We only gather norms, not tensors :)

P.S: Current implementation does not support using a distributed optimizer.

@AleHD AleHD self-requested a review March 27, 2025 14:19
@AleHD
Copy link
Copy Markdown
Collaborator

AleHD commented Mar 27, 2025

Very useful PR, let's finish it to merge it to main in the near future. @dhia680 could you please fix the distributed optimizer case? (or at least add an assertion in the validate_args to make sure we don't use distributed opt whenever this is enabled). Also, as we discussed now the whole code breaks if we use distributed optimizer. I fixed this in my fork, if you agree I can incorporate the fix to this PR directly: AleHD@1e2ff27 (commit contains a lot of changes in other files, ignore that).

Also, I'm against the idea of modifying the submit-llama.sh with these experimental features (and more so if we have to disable important stuff like the distributed optimizer). What do you say if we revert those changes and keep this PR shorter?

@dhia680
Copy link
Copy Markdown
Member Author

dhia680 commented Mar 27, 2025

Thanks for the comment @AleHD !
After looking at your commit, I think we can simply check if --use-distributed-optimizer flag is passed when doing the assertion.
Further, I see you commented the all_reduce over grad_stats_parallel_group. Is that intentional ?
Finally, I agree with keeping job scripts as they originally were. I changed them just to highlight how the feature could be used. I'll reset them.

@AleHD
Copy link
Copy Markdown
Collaborator

AleHD commented Mar 27, 2025

Hey! Could you elaborate about the --use-distributed-optimizer flag? Do you mean to solve my comment "or at least add an assertion in the validate_args to make sure we don't use distributed opt whenever this is enabled"? Or are you talking to an alternative to my "if ensure_individual_correct" in the fork?

Regarding the all_reduce, yes it was intentional. Doing a single all_reduce(individ_norms_tensor) instead of several all_reduce(individ_norms[i]) should have better performance. Also, I think data_parallel_groups[i] is None for all if we don't use distributed optimizer (so I added that assertion), in which case both codes should be equivalent but the one in the fork faster

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants