Description
Thanks to the Megatron-LM project for providing the megatron-fsdp parallel mode; we have achieved excellent results on internal small-GBS SFT tasks. For example, on a ~200B model similar to DeepSeek V2 (GBS=128, 32K context or longer), megatron-fsdp mode is roughly twice as fast as the n-D parallel mode, because it avoids the high pipeline-bubble rate that PP incurs at small GBS.
As we know, megatron-fsdp saves checkpoints in fsdp_dtensor format, but our downstream evaluation tasks require checkpoints in torch_dist or HF format. I know tools/checkpoint/checkpoint_inspector.py can convert checkpoints from torch_dist format to fsdp_dtensor format, but is the reverse direction supported? Or is there a tool that converts to HF format?
I have searched the Megatron-LM project, the Megatron-Bridge project, and the internet, but have not found such a method or tool. Can anyone help me or offer some suggestions?
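To make the goal concrete, here is a minimal sketch of the first step I imagine, assuming the fsdp_dtensor checkpoint is a standard torch.distributed.checkpoint (DCP) directory (paths are placeholders, and I am not sure this handles Megatron-specific metadata or key naming correctly):

```python
# Minimal sketch, not a working converter: consolidate an fsdp_dtensor (DCP)
# checkpoint into a single unsharded torch.save file as a first step toward
# torch_dist / HF export. Assumes the checkpoint is a plain DCP directory;
# Megatron-specific metadata and key layout likely need extra handling.
import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

DCP_DIR = "/path/to/fsdp_dtensor_ckpt/iter_0001000"  # placeholder path
OUT_FILE = "/path/to/consolidated_state_dict.pt"     # placeholder path

# Gather all DTensor shards into one unsharded state dict on disk.
dcp_to_torch_save(DCP_DIR, OUT_FILE)

# weights_only=False because the consolidated dict may contain non-tensor
# entries (e.g. args / rng state) on newer PyTorch versions.
state_dict = torch.load(OUT_FILE, map_location="cpu", weights_only=False)

# The missing step: remap Megatron parameter names to HF names (e.g.
# "decoder.layers.0.self_attention.linear_qkv.weight" ->
# "model.layers.0.self_attn...", including QKV splitting), then save via
# transformers' save_pretrained(). This mapping is what I cannot find a tool for.
```

Even if consolidation works this way, the Megatron-to-HF key and weight remapping is the part I have not found any tooling for.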
Related issues or PRs:
- [Dev] docs(megatron-fsdp): add Megatron-FSDP user guide #2397
- [QUESTION] Does custom_fsdp model support finetuned from a non-fsdp checkpoint #1578
- Support Megatron FSDP / fsdp_dtensor checkpoints for exporting to HF NVIDIA-NeMo/Megatron-Bridge#1211
By the way, could someone please review the two PRs I have submitted?