add checkpoint #945
base: master
Conversation
Works as expected. I have one question I'd like to confirm: do we need to save the state of the data loader to avoid reusing data samples?
Looks good to me. cc @tjruwase
@hwchen2017, this is a good question. If this is standard in PyTorch or Megatron, we should keep it; otherwise we can skip it.
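(For reference, a minimal sketch of what saving and restoring the data-loader position could look like; the field names such as `consumed_samples` are illustrative assumptions, not the actual Megatron/DeepSpeed checkpoint keys.)

```python
import random
import numpy as np
import torch

def save_dataloader_state(path, consumed_samples):
    # Persist how many samples have been consumed plus the RNG states, so a
    # resumed run can skip already-seen samples instead of reusing them.
    state = {
        "consumed_samples": consumed_samples,
        "torch_rng_state": torch.get_rng_state(),
        "numpy_rng_state": np.random.get_state(),
        "python_rng_state": random.getstate(),
    }
    torch.save(state, path)

def load_dataloader_state(path):
    # Restore the RNG states and return the sample offset to resume from.
    state = torch.load(path, weights_only=False)
    torch.set_rng_state(state["torch_rng_state"])
    np.random.set_state(state["numpy_rng_state"])
    random.setstate(state["python_rng_state"])
    return state["consumed_samples"]
```

Storing the consumed-sample count plus the RNG states is usually enough to reproduce the data order on resume without replaying samples.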
```diff
@@ -16,7 +16,7 @@
 from megatron import get_tensorboard_writer
 from megatron.core import mpu, tensor_parallel
 from megatron.arguments import parse_args, validate_args
-from megatron.checkpointing import load_args_from_checkpoint
+# from megatron.checkpointing import load_args_from_checkpoint
```
Delete it instead of commenting it out.
This is a reason why
@zhangsmallshark please address the comments above. Thanks!
@GuanhuaWang I fixed it. Please check it.
@zhangsmallshark - could you sign off with DCO on this PR? It replaces the CLA we had before. To fix it, the steps should be here
I am working on it.
* enable reward model offloading option
* fixed code formatting
* more formatting fixes
* Pre-commit formatting fix
---------
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
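(As context for the "reward model offloading option" commit above, a hedged sketch of a DeepSpeed ZeRO-3 config with CPU parameter offload enabled; the actual DeepSpeed-Chat flag that wires this up for the reward model is not shown, and the batch size and dtype values are placeholders.)

```python
# Illustrative DeepSpeed config enabling ZeRO-3 with CPU offload of parameters.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
}
```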
Not all pretrained LLMs use `<|endoftext|>` as the `eot_token`, therefore it's inappropriate to hard-code it.
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
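(A small illustration of the point in that commit message, reading the end-of-text token from the tokenizer instead of hard-coding `<|endoftext|>`; the HuggingFace `AutoTokenizer` usage here is only an example, not the code touched by this PR.)

```python
from transformers import AutoTokenizer

# Different pretrained LLMs use different end-of-text markers, so read the
# token from the tokenizer rather than hard-coding "<|endoftext|>".
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
eot_token = tokenizer.eos_token        # "</s>" for OPT, "<|endoftext|>" for GPT-2
eot_token_id = tokenizer.eos_token_id
```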
* add domino
* use transformer from deepspeed
* clean args
* mega opt
* add opt & timer
* add opt
* fix loss
* folder name
* Change argument in pretrain script
* Add readme for domino
* Update readme for domino
* Fixing usage issues
* update dataset
* megatron dependencies
* path
* Update README.md
* remove imports
* update import
* Update README.md
* Minor example script changes
* train bash
* require
* Update README.md
---------
Co-authored-by: chengming-zhang <[email protected]>
Co-authored-by: Zheyu SHEN <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
…for Domino (deepspeedai#939) Signed-off-by: zhangsmallshark <[email protected]>
* add benchmarking for offloading states
* fix api names
Signed-off-by: zhangsmallshark <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
* Add label_smoothing while calculating step2 DPO loss in DeepSpeed-Chat.
* Add training scripts for step2 DPO in DeepSpeed-Chat.
* Remove unused packages and format the code of step2 DPO in DeepSpeed-Chat.
* Update training scripts of step2 DPO in DeepSpeed-Chat.
* Follow upstream fixes.
* Update README.md for Step2 DPO finetuning.
* Add opt 350M training log demo for step 2 dpo finetuning in DeepSpeed-Chat.
* Address the formatting issue in step2 dpo finetuning in DeepSpeed-Chat.
---------
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
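(For context on the label-smoothing commit above, a minimal sketch of a DPO loss with label smoothing in the conservative-DPO form; the function and variable names are illustrative and this is not necessarily the exact DeepSpeed-Chat implementation.)

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, label_smoothing=0.0):
    # Log-ratio of policy vs. reference model for chosen and rejected responses.
    logits = (policy_chosen_logps - policy_rejected_logps) \
             - (ref_chosen_logps - ref_rejected_logps)
    # With label_smoothing = 0 this is the standard DPO loss; a small positive
    # value assumes a fraction of the preference labels may be noisy/flipped.
    losses = (-F.logsigmoid(beta * logits) * (1 - label_smoothing)
              - F.logsigmoid(-beta * logits) * label_smoothing)
    return losses.mean()
```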
Signed-off-by: zhangsmallshark <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
Signed-off-by: zhangsmallshark <[email protected]>
…pspeedai#954) Signed-off-by: zhangsmallshark <[email protected]>
Force-pushed from 563bc80 to df387f5
I think I fixed it. I have tried: `git rebase HEAD~10 --signoff`
It looks like you'd need to merge the DeepSpeedExamples master branch back in now. If you can't get it to work, we can override it too if you need to revert your most recent push.
You can override it. Thanks.
@zhangsmallshark, please resolve this branch conflict as we discussed. Thanks!
Support checkpoint for Domino