
Mixture of experts pretraining benchmark#780

Merged
ShriyaRishab merged 15 commits into mlcommons:master from ZhiyuLi-goog:lizhiyu/moe
Feb 3, 2025

Conversation

@ZhiyuLi-goog
Contributor

@ZhiyuLi-goog ZhiyuLi-goog commented Jan 6, 2025

Description

Add MoE benchmark to mlcommons repo.

todo list

TPU

  • docker image verification
  • run workload in small scale
  • run workload in large scale

GPU

General

cc @suexu1025 @ShriyaPalsamudram

@ZhiyuLi-goog ZhiyuLi-goog requested a review from a team as a code owner January 6, 2025 10:57
@github-actions

github-actions bot commented Jan 6, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

ZhiyuLi-goog and others added 7 commits January 10, 2025 03:27
* fix(moe): Added weight decay parameter

* fix(moe): Added proper handling of device count per node

* refactor(moe): Data preprocessing cleanup

* fix(moe): This container has more stable convergence

* fix(gpu): data preprocessing

* build(gpu): Fix container image to specific version of NeMo and Megatron
ShriyaRishab
ShriyaRishab previously approved these changes Jan 10, 2025
@ZhiyuLi-goog
Contributor Author

Thank you @ShriyaPalsamudram for the review.

Could you help merge the PR when you think it is in good shape, since I don't have authorization?

The NeMo 2.0 GPU guides are not yet covered; we can probably add them later in a separate PR:

  • Update the NeMo 2.0 GPU guides: @hXl3s, could you help update them? @JustinPan-goog, could you try them out and review the updated guides? Thank you both!

@ZhiyuLi-goog ZhiyuLi-goog changed the title [Draft] MoE Benchmark MoE Benchmark Jan 10, 2025
@ZhiyuLi-goog ZhiyuLi-goog changed the title MoE Benchmark mixture_of_experts_pretraining Jan 10, 2025
@ZhiyuLi-goog ZhiyuLi-goog changed the title mixture_of_experts_pretraining Mixture of experts pretraining benchmark Jan 10, 2025
@JustinPan-goog

For sure, I will give the current GPU guide a try over the weekend!


ShriyaRishab
ShriyaRishab previously approved these changes Jan 13, 2025
* docs(moe): GPU running and slurm docs

* docs: Fixed markup
```
--output_dir <path to save checkpoint> --hf_token <your token to HF repository>
```

This script will download the specified checkpoint from the Hugging Face repository, preprocess it, and save it into the specified directory.
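Putting the flags above together, a full invocation might look like the following (the script name checkpoint_download.py comes from later discussion in this thread; treat the exact form as an assumption and substitute your own paths and token):

```shell
# Hypothetical invocation; only --output_dir and --hf_token appear in the docs above.
python checkpoint_download.py \
    --output_dir <path to save checkpoint> \
    --hf_token <your token to HF repository>
```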

Is there a step to verify checksums of the converted checkpoint to ensure correctness?

Is this converted checkpoint available for download directly from mlcommons drive? If yes, can those instructions be shared here as well?
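A checksum verification step could be as small as the following sketch; the helper name and chunk size are illustrative, not from the actual scripts:

```python
# Sketch of a checksum step for a converted checkpoint file.
# Streaming in chunks means large checkpoint shards need not fit in memory.
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting digest would then be compared against a published reference value for the artifact.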


There is no checkpoint in any valid form. The one inside the mlcommons drive is neither in raw HF format (maybe that's a good point: should we mirror the HF checkpoint inside the S3 bucket?) nor NeMo compatible.


We should have both the raw HF checkpoint and the NeMo-compatible version in the S3 bucket, so they live on for as long as we need them and submitters can access all artifacts in the same place.

This script will download the specified checkpoint from the Hugging Face repository, preprocess it, and save it into the specified directory.

To preprocess the dataset, use the dataset_preprocessing.py script.
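A hypothetical invocation, for orientation only (the flag names are assumptions; check the script's --help for the real interface):

```shell
# Assumed flags; dataset_preprocessing.py is the script named above.
python dataset_preprocessing.py \
    --input_dir <path to raw dataset> \
    --output_dir <path to save preprocessed dataset>
```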

Same as checkpoint


The preprocessed dataset can be downloaded from the mlcommons bucket. I added an annotation while keeping the manual preprocessing step in the documentation.

@JustinPan-goog
Copy link

The checkpoint_download.py script, when using HFMixtralImporter, seems to hit compatibility issues with certain NeMo/Megatron versions:

  1. The script initially failed with ImportError: cannot import name '__version__' from 'nemo'. A PR was merged yesterday that adds a try-except block: NVIDIA-NeMo/NeMo@7d74e71#diff-e5559e6e42d963c2b10dcbb8c739bd185285a21f8c8f1c038f64529b9cf8aff0.

  2. After checking out commit 7d74e71, a new error emerged: ImportError: cannot import name 'AttnBackend' from 'megatron.core.transformer.enums'. I presume I should update the Megatron version as well, but would like to double-check this with @hXl3s.
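The fix referenced in item 1 wraps the version import in a try-except; a minimal sketch of that pattern follows (the exact symbols guarded in the real patch may differ):

```python
# Defensive version import, in the spirit of the fix in commit 7d74e71.
# Some installs do not expose nemo.__version__, so fall back gracefully.
try:
    from nemo import __version__ as nemo_version
except ImportError:  # also catches ModuleNotFoundError when nemo is absent
    nemo_version = "unknown"

print(nemo_version)
```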

@ShriyaRishab
Contributor

@JustinPan-goog has your issue been resolved?

@JustinPan-goog

@JustinPan-goog has your issue been resolved?

Thanks for asking. @hXl3s has acknowledged the issue, but there is no solution yet.

@ShriyaRishab ShriyaRishab merged commit 8043c9d into mlcommons:master Feb 3, 2025
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Feb 3, 2025