Save O_PROJ on Fuji 70B-v2 for TRN2 #1147

Open. apoorvtintin wants to merge 1 commit into base: main

Conversation

apoorvtintin
Contributor

Saving the out-projection (O_PROJ) output improves training throughput while still fitting in memory on the mesh defined by neuron-(trn2|trn2n).48xlarge-64.
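
For readers unfamiliar with "saving" an activation, here is a minimal JAX sketch of the general idea, not the change in this PR. It assumes a toy single-head attention block (not the Fuji layer) and a hypothetical tag name "o_proj". The output projection is tagged with checkpoint_name, and a save_only_these_names policy keeps that one tensor for the backward pass while everything else is recomputed. The helper names attention_block, loss_fn, make_inputs and grads_with_policy are illustrative only.

```python
import functools

import jax
import jax.numpy as jnp
from jax.ad_checkpoint import checkpoint_name


def attention_block(params, x):
    # Toy single-head self-attention over x of shape [seq, dim].
    q = x @ params["wq"]
    k = x @ params["wk"]
    v = x @ params["wv"]
    probs = jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)
    context = probs @ v
    # Tag the output projection so a remat policy can refer to it by name.
    # The name "o_proj" is an assumption made for this sketch.
    out = checkpoint_name(context @ params["wo"], "o_proj")
    return x + out


def loss_fn(params, x, policy=None):
    # policy=None: nothing is saved, all intermediates are recomputed.
    # With a policy: only the tensors it selects are kept for the backward pass.
    block = jax.checkpoint(attention_block, policy=policy)
    return jnp.sum(block(params, x) ** 2)


def make_inputs(seed=0, seq=8, dim=16):
    keys = jax.random.split(jax.random.PRNGKey(seed), 5)
    names = ("wq", "wk", "wv", "wo")
    params = {n: 0.1 * jax.random.normal(k, (dim, dim)) for n, k in zip(names, keys[:4])}
    x = jax.random.normal(keys[4], (seq, dim))
    return params, x


def grads_with_policy(params, x, policy):
    return jax.grad(functools.partial(loss_fn, policy=policy))(params, x)


# Keep only the tensor tagged "o_proj"; recompute the rest during backprop.
save_o_proj = jax.checkpoint_policies.save_only_these_names("o_proj")
```

Both policies give the same gradients. The difference is the trade between memory held after the forward pass and recomputation during the backward pass, which is the throughput versus memory trade-off the description refers to.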

@apoorvtintin requested review from ruomingp, markblee and a team as code owners on May 1, 2025 22:43
@apoorvtintin
Contributor Author

Rebased the PR to fix failing CI

@apoorvtintin
Contributor Author

CI fails on test TestEvaluateFromFile.test_evaluate_from_eval_set with the error:
#22 453.1 axlearn/open_api/common.py:440: KeyError

This is unrelated to the changes in this PR, and I have already rebased the PR as of 12th May. Can I please get some guidance on how to fix this? Thank you!

@markblee
Contributor

I'm disabling this test here: #1184 cc @gyin94

@apoorvtintin
Contributor Author

Rebased to pick up the change that disables the flaky test.
