Dynamically generate FP8 configs for Fuji #1186

Merged
merged 11 commits into from
May 29, 2025

andersensam
Contributor

This PR simplifies using FP8 with Fuji models. It removes the manual edits previously required to enable FP8 support and instead generates FP8 configs dynamically alongside the single-host configs. This also extends FP8 support beyond GCP instance types.

Usage:

python3 -m axlearn.common.launch_trainer_main \
        --module=text.gpt.c4_trainer --config=fuji-7B-v2-flash-fp8 \
        --trainer_dir=<trainer_dir> \
        --data_dir=gs://axlearn-public/tensorflow_datasets \
        --jax_backend=gpu \
        --mesh_selector=gpu-a3-megagpu-8g-256 \
        --trace_at_steps=5
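
As a rough sketch of what generating FP8 configs alongside existing ones could look like (this is not the code from this PR; `add_fp8_variants`, the registry shape, the base name `fuji-7B-v2-flash`, and the `quantization` field are hypothetical illustrations and not AXLearn APIs), each base config name gets a sibling with an `-fp8` suffix:

```python
from typing import Callable, Dict

# Hypothetical registry type: config name -> zero-arg factory returning a config dict.
ConfigMap = Dict[str, Callable[[], dict]]


def add_fp8_variants(configs: ConfigMap, suffix: str = "-fp8") -> ConfigMap:
    """Returns a copy of `configs` with an FP8 variant registered next to each base config."""
    out = dict(configs)
    for name, make_base in configs.items():
        # Bind make_base through a default argument so each closure keeps its own factory.
        def make_fp8(make_base=make_base) -> dict:
            cfg = dict(make_base())
            cfg["quantization"] = "fp8"  # Placeholder field, not a real AXLearn option.
            return cfg

        out[name + suffix] = make_fp8
    return out


# Assumed base config name and contents, for illustration only.
base = {"fuji-7B-v2-flash": lambda: {"model": "fuji-7B-v2", "attention": "flash"}}
all_configs = add_fp8_variants(base)
print(sorted(all_configs))
```

With a scheme like this, a name such as `fuji-7B-v2-flash-fp8` can be passed to `--config` without hand-editing the base config, since the variant is built on demand from its base.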

@andersensam andersensam requested review from ruomingp, markblee and a team as code owners May 15, 2025 22:52
@andersensam
Contributor Author

Golden configs still need to be updated. I will run additional testing later and update them accordingly.

Member

@hanzhi713 hanzhi713 left a comment

LGTM

Contributor

@markblee markblee left a comment

It would be great if we could limit the number of golden configs.

@markblee markblee enabled auto-merge May 29, 2025 20:23
@markblee markblee added this pull request to the merge queue May 29, 2025
Merged via the queue into apple:main with commit fc19043 May 29, 2025
6 checks passed
4 participants