
add checkpoint #945

Open · wants to merge 14 commits into master

Conversation

zhangsmallshark (Contributor)

support checkpoint for domino

@hwchen2017 left a comment

Works as expected. I have one question I'd like to confirm: do we need to save the state of the data loader to avoid reusing data samples?

@GuanhuaWang

Looks good to me.

cc @tjruwase

@GuanhuaWang commented Jan 28, 2025

> Works as expected. I have one question I'd like to confirm: do we need to save the state of the data loader to avoid reusing data samples?

@hwchen2017, this is a good question. If this is standard in PyTorch or Megatron, we should keep it; otherwise we can skip it.

@@ -16,7 +16,7 @@
 from megatron import get_tensorboard_writer
 from megatron.core import mpu, tensor_parallel
 from megatron.arguments import parse_args, validate_args
-from megatron.checkpointing import load_args_from_checkpoint
+# from megatron.checkpointing import load_args_from_checkpoint

Delete instead of comment.

@tjruwase (Contributor)

> Works as expected. I have one question I'd like to confirm: do we need to save the state of the data loader to avoid reusing data samples?

This is one reason why args is saved as part of the checkpoint. I recommend following the pattern in Megatron-DeepSpeed:
https://github.com/microsoft/Megatron-DeepSpeed/blob/f4157bea69f3df8c6cb66f2ebcda66ba03d1288e/megatron/checkpointing.py#L602-L611
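
For context, a minimal sketch of that pattern, assuming Megatron-style counters (args.consumed_train_samples / args.consumed_valid_samples); checkpoint_path and latest_checkpoint_path are hypothetical helpers here, so this illustrates the idea rather than reproducing the linked code:

import torch

def save_checkpoint(iteration, model, optimizer, args):
    # Persist data-loader progress alongside model/optimizer state so a
    # resumed run does not replay samples it has already consumed.
    state_dict = {
        'iteration': iteration,
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'consumed_train_samples': args.consumed_train_samples,
        'consumed_valid_samples': args.consumed_valid_samples,
    }
    torch.save(state_dict, checkpoint_path(args, iteration))  # hypothetical helper

def load_checkpoint(model, optimizer, args):
    state_dict = torch.load(latest_checkpoint_path(args))  # hypothetical helper
    model.load_state_dict(state_dict['model'])
    optimizer.load_state_dict(state_dict['optimizer'])
    # Restoring the counters lets the sampler fast-forward past seen samples.
    args.consumed_train_samples = state_dict.get('consumed_train_samples', 0)
    args.consumed_valid_samples = state_dict.get('consumed_valid_samples', 0)
    return state_dict['iteration']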

@GuanhuaWang

@zhangsmallshark please address the above comments. Thanks!

@zhangsmallshark (Contributor, Author)

@GuanhuaWang I fixed it. Please check it.

@loadams (Contributor) commented Feb 10, 2025

@zhangsmallshark - could you sign off with DCO on this PR? It replaces the CLA we had before. The steps to fix it should be here

@zhangsmallshark (Contributor, Author)

> @zhangsmallshark - could you sign off with DCO on this PR? It replaces the CLA we had before. The steps to fix it should be here

I am working on it.

kfertakis and others added 12 commits February 12, 2025 09:04
* enable reward model offloading option

* fixed code formatting

* more formatting fixes

* Pre-commit formatting fix

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
Not all pretrained LLMs use `<|endoftext|>` as the `eot_token`, so it should not be hard-coded.

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
* add domino

* use transformer from deepspeed

* clean args

* mega opt

* add opt & timer

* add opt

* fix loss

* folder name

* Change argument in pretrain script

* Add readme for domino

* Update readme for domino

* Fixing usage issues

* update dataset

* megatron dependencies

* path

* Update README.md

* remove imports

* update import

* Update README.md

* Minor example script changes

* train bash

* require

* Update README.md

---------

Co-authored-by: chengming-zhang <[email protected]>
Co-authored-by: Zheyu SHEN <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
* add benchmarking for offloading states

* fix api names

Signed-off-by: zhangsmallshark <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
* Add label_smoothing while calculating step2 DPO loss in DeepSpeed-Chat.

* Add training scripts for step2 DPO in DeepSpeed-Chat.

* Remove unused packages and format the code of step2 DPO in DeepSpeed-Chat.

* Update training scripts of step2 DPO in DeepSpeed-Chat.

* Follow upstream fixes.

* Update README.md for Step2 DPO finetuning.

* Add opt 350M training log demo for step 2 dpo finetuning in DeepSpeed-Chat.

* Address the formatting issue in step2 dpo finetuning in DeepSpeed-Chat.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
@zhangsmallshark (Contributor, Author)

I think I fixed it. I have tried:

git rebase HEAD~10 --signoff
git push --force-with-lease origin master

@loadams (Contributor) commented Feb 12, 2025

> I think I fixed it. I have tried:
>
> git rebase HEAD~10 --signoff
> git push --force-with-lease origin master

It looks like you'd need to merge the DeepSpeedExamples master branch back in now. If you can't get it to work, we can override it too if you need to revert your most recent push.
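
For reference, one way to do that merge locally, assuming the main DeepSpeedExamples repo is configured as a remote named upstream (remote names vary per setup):

git fetch upstream
git merge upstream/master
git push origin master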

@zhangsmallshark (Contributor, Author)

You can override it. Thanks.

@GuanhuaWang commented Feb 12, 2025

@zhangsmallshark, please resolve the branch conflict as we discussed, thanks!
