
[FSDP][feature] Support save/load checkpointing #406

Closed

ppraneth wants to merge 52 commits into THUDM:main from ppraneth:support-fsdp_save

Conversation

@ppraneth
Contributor

Hi everyone,

This PR adds a save/load checkpointing feature to the FSDP backend and closes #402.

Here are the main changes I've made:

  • Checkpointing Logic: I added all the logic for saving and loading into a new file, slime/backends/fsdp_utils/checkpoint.py, to keep things organized. This handles saving the model, tokenizer, and optimizer state.

  • Actor Integration: I updated the FSDPTrainRayActor in slime/backends/fsdp_utils/actor.py to use the new functions. It now correctly loads the state from a checkpoint at the start of a run and saves progress when save_model is called.

  • New Arguments: To control this, I added a few command-line arguments to slime/backends/fsdp_utils/arguments.py: --save, --load, --save-safe-serialization and a safety flag --overwrite-checkpoints so you don't accidentally overwrite your work.

  • Bug Fixes: While building this, I also fixed a couple of bugs in the init process to prevent the model from being loaded twice and to make sure the global_step counter is restored correctly.

Please check it out and let me know if any changes are needed. I'm happy to make them! Thanks for taking a look.
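As a rough illustration of the --overwrite-checkpoints guard described above, the path resolution could look like this (a minimal sketch; `resolve_checkpoint_dir` and the `iter_XXXXXXX` directory layout are illustrative, not the actual slime code):

```python
import os


def resolve_checkpoint_dir(save_root: str, iteration: int, overwrite: bool) -> str:
    """Return the directory for this iteration's checkpoint, refusing to
    clobber an existing checkpoint unless overwrite is explicitly allowed."""
    ckpt_dir = os.path.join(save_root, f"iter_{iteration:07d}")
    if os.path.exists(ckpt_dir) and not overwrite:
        raise FileExistsError(
            f"{ckpt_dir} already exists; pass --overwrite-checkpoints to replace it"
        )
    return ckpt_dir
```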

@PopSoda2002
Collaborator

Can you share some test case and result?

@ppraneth
Contributor Author

Can you share some test case and result?

I’ve written a test case, but I’m not sure if it’s correct. Since I don’t have access to GPUs, could you please check it and run it?

@PopSoda2002
Collaborator

Synced in Slack.

@leng-yue

leng-yue commented Oct 3, 2025

Thanks for your great contribution :)

I’d suggest switching to FSDP’s distributed checkpointing, as it could make saving/loading the optimizer and LLM weights noticeably faster and reduce peak memory use. Happy to prototype this if you want.

We can also let FSDP write safetensors directly (see the PyTorch blog on HuggingFace safetensors support) Example: link.

@PopSoda2002
Collaborator

PopSoda2002 commented Oct 5, 2025

Thanks for your great contribution :)

I’d suggest switching to FSDP’s distributed checkpointing, as it could make saving/loading the optimizer and LLM weights noticeably faster and reduce peak memory use. Happy to prototype this if you want.

We can also let FSDP write safetensors directly (see the PyTorch blog on HuggingFace safetensors support) Example: link.

Hi, thanks for your suggestion! I think we should follow this tutorial and example too; otherwise the current logic could OOM for large models. @ppraneth

@leng-yue

leng-yue commented Oct 6, 2025

Thanks for your great contribution :)
I’d suggest switching to FSDP’s distributed checkpointing, as it could make saving/loading the optimizer and LLM weights noticeably faster and reduce peak memory use. Happy to prototype this if you want.
We can also let FSDP write safetensors directly (see the PyTorch blog on HuggingFace safetensors support) Example: link.

Hi thanks for your suggestion! I think we should follow this tutorial and example too, unless there would be OOM for large model with current logic @ppraneth

For reference, with the current loading logic a 235B model is likely to OOM on most machines. The bf16 weights alone are ~470 GB (235B × 2 bytes), and with typical 8 GPUs the effective memory can exceed ~3.7 TB.
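The back-of-the-envelope math checks out:

```python
# Memory estimate for materializing full bf16 weights of a 235B-parameter model.
params = 235e9
bytes_per_param = 2  # bf16

full_copy_gb = params * bytes_per_param / 1e9  # one full copy of the weights
# If all 8 ranks each materialize a full copy (the failure mode of
# non-distributed checkpointing), the combined footprint is enormous.
total_tb = full_copy_gb * 8 / 1e3

print(round(full_copy_gb), round(total_tb, 2))  # prints: 470 3.76
```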

Once this PR lands, I’m happy to follow up with a change that switches us to FSDP distributed checkpointing (writing safetensors directly) to make load/save faster and reduce peak memory.

@ppraneth
Contributor Author

ppraneth commented Oct 6, 2025

Thanks for your great contribution :)
I’d suggest switching to FSDP’s distributed checkpointing, as it could make saving/loading the optimizer and LLM weights noticeably faster and reduce peak memory use. Happy to prototype this if you want.
We can also let FSDP write safetensors directly (see the PyTorch blog on HuggingFace safetensors support) Example: link.

Hi thanks for your suggestion! I think we should follow this tutorial and example too, unless there would be OOM for large model with current logic @ppraneth

I will try to implement it

@ppraneth ppraneth requested a review from leng-yue October 11, 2025 12:08

@leng-yue leng-yue left a comment

LGTM

tests/test1.py Outdated
Collaborator

LGTM! But could you remove tests like tests/test1.py and tests/test2.py from the PR? You can still use them to test locally and paste the results here.

@ppraneth
Contributor Author

[screenshots: test results]

@PopSoda2002 Testing is done

@fzyzcjy
Collaborator

fzyzcjy commented Oct 25, 2025

this is pretty useful for long runs

@ppraneth
Contributor Author

@zhuzilin Can you check this?

@fzyzcjy
Collaborator

fzyzcjy commented Oct 28, 2025

@ppraneth Hi, could you please fix the lint error? Also, is this ready to use (i.e. tested and working), or is there still ongoing work?

if not optimizer_state_dict or not optimizer_state_dict.get("state", {}):
    raise ValueError(f"Optimizer state dictionary is empty for iteration {iteration}")

use_safetensors = getattr(args, "save_safe_serialization", False)
Collaborator

@fzyzcjy fzyzcjy Oct 28, 2025

tiny nit

Suggested change:
- use_safetensors = getattr(args, "save_safe_serialization", False)
+ use_safetensors = args.save_safe_serialization

Collaborator

@fzyzcjy fzyzcjy left a comment

super tiny nits

Comment on lines +37 to +51
if use_safetensors:
    try:
        from torch.distributed.checkpoint import HuggingFaceStorageWriter

        model_writer = HuggingFaceStorageWriter(
            path=model_subdir, fqn_to_index_mapping={k: 0 for k in model_state_dict.keys()}
        )
        dist_cp.save(state_dict=model_state_dict, storage_writer=model_writer)
    except ImportError as e:
        raise ImportError(
            "Safetensors library is required when save_safe_serialization is True, but it is not installed."
        ) from e
else:
    model_writer = dist_cp.FileSystemWriter(model_subdir)
    dist_cp.save(state_dict={"model": model_state_dict}, storage_writer=model_writer)
Collaborator

tiny nit:

if use_safetensors:
  model_writer = HuggingFaceStorageWriter(..)
else:
  model_writer = dist_cp.FileSystemWriter(..)
dist_cp.save(..)

Comment on lines +114 to +129
# Load model
if is_safetensors:
    try:
        from torch.distributed.checkpoint import HuggingFaceStorageReader

        model_storage_reader = HuggingFaceStorageReader(path=model_subdir)
        dist_cp.load(state_dict=model_state_dict, storage_reader=model_storage_reader)
    except ImportError as e:
        raise ImportError(
            "Safetensors library is required to load safetensors checkpoint files, but it is not installed."
        ) from e
else:
    model_state_dict = {"model": model_state_dict}
    model_storage_reader = dist_cp.FileSystemReader(model_subdir)
    dist_cp.load(state_dict=model_state_dict, storage_reader=model_storage_reader)
    model_state_dict = model_state_dict["model"]
Collaborator

(same)

model_state_dict = model_state_dict["model"]

# Load optimizer (always standard format)
optim_state_dict = {"optim": optimizer_state_dict}
Collaborator

nit: wondering whether we can remove that extra "optim" nested key

Comment on lines +160 to +165
# Broadcast to all ranks
state_t = torch.tensor([0, 0], dtype=torch.int64, device="cpu")
if dist.get_rank() == 0:
    state_t[0] = loaded_iteration
    state_t[1] = global_step
dist.broadcast(state_t, src=0)
Collaborator

tiny nit: dist.broadcast_object_list
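The `dist.broadcast_object_list` variant the reviewer suggests would replace the manual tensor packing with something like this (a sketch; the function name is illustrative):

```python
import torch.distributed as dist


def broadcast_resume_state(loaded_iteration: int, global_step: int) -> tuple:
    """Share (iteration, global_step) from rank 0 with all ranks without
    packing them into an int64 tensor by hand."""
    obj = [(loaded_iteration, global_step)] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(obj, src=0)
    return obj[0]
```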

@zhuzilin
Contributor

Closing as solved by #633.

@zhuzilin zhuzilin closed this Oct 30, 2025