Description
Thanks to the Megatron-LM project for providing the megatron-fsdp parallel mode; we have achieved excellent results on internal small-GBS SFT tasks. For example, on a ~200B model similar to DeepSeek V2 (GBS=128, 32K context or longer), megatron-fsdp mode is roughly twice as fast as the n-D parallel mode, because it avoids the high pipeline-bubble rate that PP incurs at small GBS.
As we know, megatron-fsdp saves checkpoints in fsdp_dtensor format, but our downstream evaluation tasks require checkpoints in torch_dist or HF format. I know tools/checkpoint/checkpoint_inspector.py can convert checkpoints from torch_dist format to fsdp_dtensor format, but is the reverse direction supported? Or is there a tool that converts to HF format?
I have searched the Megatron-LM project, the Megatron-Bridge project, and the internet, but have not found such a method or tool. Can anyone help me or offer some suggestions?
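To make the goal concrete, here is a minimal sketch of the first step I imagine, assuming the fsdp_dtensor checkpoint is a standard torch.distributed.checkpoint (DCP) directory (paths are placeholders, and I am not sure this handles Megatron-specific metadata or key naming correctly):

```python
# Minimal sketch, not a working converter: consolidate an fsdp_dtensor (DCP)
# checkpoint into a single unsharded torch.save file as a first step toward
# torch_dist / HF export. Assumes the checkpoint is a plain DCP directory;
# Megatron-specific metadata and key layout likely need extra handling.
import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

DCP_DIR = "/path/to/fsdp_dtensor_ckpt/iter_0001000"  # placeholder path
OUT_FILE = "/path/to/consolidated_state_dict.pt"     # placeholder path

# Gather all DTensor shards into one unsharded state dict on disk.
dcp_to_torch_save(DCP_DIR, OUT_FILE)

# weights_only=False because the consolidated dict may contain non-tensor
# entries (e.g. args / rng state) on newer PyTorch versions.
state_dict = torch.load(OUT_FILE, map_location="cpu", weights_only=False)

# The missing step: remap Megatron parameter names to HF names (e.g.
# "decoder.layers.0.self_attention.linear_qkv.weight" ->
# "model.layers.0.self_attn...", including QKV splitting), then save via
# transformers' save_pretrained(). This mapping is what I cannot find a tool for.
```

Even if consolidation works this way, the Megatron-to-HF key and weight remapping is the part I have not found any tooling for.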
Related issues or PRs:
- [Dev] docs(megatron-fsdp): add Megatron-FSDP user guide #2397
- [QUESTION] Does custom_fsdp model support finetuned from a non-fsdp checkpoint #1578
- Support Megatron FSDP / fsdp_dtensor checkpoints for exporting to HF NVIDIA-NeMo/Megatron-Bridge#1211
By the way, could someone please review the two PRs I have submitted?