
[cleanup] Remove FSDP1 support + make 'fsdp' default to fsdp2 #1659

Open

erictang000 wants to merge 4 commits into NovaSky-AI:main from erictang000:remove_fsdp1

Conversation

@erictang000
Collaborator

Summary

Removes the legacy FSDP1 backend and renames FSDP2 → FSDP, leaving a single FSDP strategy backed by PyTorch's composable fully_shard API. trainer.strategy="fsdp2" is kept as a deprecated alias that emits a DeprecationWarning and normalizes to "fsdp", so existing user scripts and YAMLs continue to work.

Motivation: FSDP2 was already the default everywhere, and the SFT path already rejected FSDP1. The dual-backend code carried a lot of dead weight — branching in FSDPStrategy._fsdp_init_model, an fsdp_version() dispatcher, parallel offload_fsdp_* / offload_fsdp2_* helpers, FSDP1-only LoRA prefixes, three _handle.reshard(True) workarounds in the worker, and ~14 parametrized tests that each ran twice (doubling the CI matrix for no gain).

Changes

Core code (skyrl/backends/skyrl_train/)

  • distributed/fsdp_utils.py — Deleted fsdp_version(), get_fsdp_state_ctx(), offload_fsdp_model_to_cpu(), load_fsdp_model_to_gpu(), get_sharding_strategy(), and get_fsdp_wrap_policy(). Removed FSDP1 imports (FullyShardedDataParallel, _lazy_init). Simplified layered_summon_lora_params() and collect_lora_params() to FSDP2-only paths (no more summon_full_params, no more _fsdp_wrapped_module prefixes).
  • distributed/fsdp_strategy.py — Deleted the if self.fsdp_strategy == "fsdp": FSDP1 init branch and the MixedPrecision / CPUOffload imports. Replaced get_fsdp_state_ctx(...) callsites with direct state_dict calls (FSDP2 returns DTensors natively; a sketch of the gather follows this list). _unwrap_model no longer needs the FSDP1 _fsdp_wrapped_module path. save_hf_model now unconditionally uses fsdp2_get_full_state_dict.
  • workers/fsdp/fsdp_worker.py — Removed three _handle.reshard(True) FSDP1-internal workarounds, two FSDP.set_state_dict_type(...) calls in FSDPWeightExtractor, and the now-unused FSDP1 imports. Strategy assertion tightened to == "fsdp".
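
For intuition, here is a minimal sketch of the DTensor gather that replaces the old get_fsdp_state_ctx machinery. The function name is illustrative, not the PR's actual helper (that is fsdp2_get_full_state_dict); it assumes torch >= 2.4 and an initialized process group, and all ranks must call it together:

```python
import torch
from torch.distributed.tensor import DTensor  # public path as of torch 2.4


def gather_full_state_dict(model: torch.nn.Module) -> dict[str, torch.Tensor]:
    """Materialize a full (unsharded) state dict from an FSDP2 module.

    fully_shard stores parameters as DTensors, so a plain state_dict()
    call already works; each sharded value just needs an all-gather.
    """
    full_sd = {}
    for name, value in model.state_dict().items():
        if isinstance(value, DTensor):
            value = value.full_tensor()  # all-gather shards across the mesh
        full_sd[name] = value.cpu()
    return full_sd
```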

The fsdp2_*-prefixed helpers (apply_fsdp2, fsdp2_load_full_state_dict, fsdp2_get_full_state_dict, fsdp2_clip_grad_norm_, offload_fsdp2_model_to_cpu, load_fsdp2_model_to_gpu) are intentionally kept — their names map directly to the PyTorch torch.distributed.fsdp.fully_shard API surface they wrap.
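
To make the mapping concrete, here is a rough sketch of the composable API those helpers wrap. The block-selection heuristic and TransformerBlock name are stand-ins (the real apply_fsdp2 uses the model's wrap policy), and the snippet assumes torch.distributed is already initialized:

```python
import torch
from torch.distributed.fsdp import fully_shard  # public in torch.distributed.fsdp as of 2.6


def apply_fsdp2_sketch(model: torch.nn.Module, reshard_after_forward: bool = True) -> torch.nn.Module:
    """Shard each transformer block, then the root module (FSDP2 style)."""
    for module in model.modules():
        # A real wrap policy matches the model's block class; the name
        # "TransformerBlock" here is purely illustrative.
        if module.__class__.__name__ == "TransformerBlock":
            fully_shard(module, reshard_after_forward=reshard_after_forward)
    # Shard the root last so remaining parameters form the root group.
    fully_shard(model, reshard_after_forward=reshard_after_forward)
    return model
```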

Strategy normalization & deprecation alias

  • validate_cfg() and validate_sft_cfg() now normalize strategy="fsdp2" → "fsdp" with a DeprecationWarning before any downstream validation runs (sketched after this list).
  • Removed the FSDP1-only cpu_offload assertion in validate_cfg().
  • FSDPBackendOverrides.strategy default and the backend assertion list flipped from "fsdp2" to "fsdp".
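
A minimal, self-contained sketch of that normalization; the standalone function name is illustrative, since the real logic lives inline in validate_cfg() / validate_sft_cfg():

```python
import warnings


def normalize_strategy(strategy: str) -> str:
    """Map the deprecated 'fsdp2' alias onto the canonical 'fsdp' name."""
    if strategy == "fsdp2":
        warnings.warn(
            "trainer.strategy='fsdp2' is deprecated; use 'fsdp' "
            "(FSDP2 is now the only FSDP backend).",
            DeprecationWarning,
            stacklevel=2,
        )
        return "fsdp"
    return strategy
```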

Configs & defaults

  • TrainerConfig.strategy: "fsdp2" → "fsdp"
  • ppo_base_config.yaml: strategy: fsdp2 → strategy: fsdp (also dropped the # fsdp2 only qualifier on reshard_after_forward)
  • sft_config.py: _VALID_STRATEGIES = ("megatron", "fsdp")
  • examples/train/gsm8k/gsm8k-grpo-skypilot.yaml: same flip

Tests

  • Deleted tests/backends/skyrl_train/gpu/gpu_ci/distributed/test_fsdp_strategy.py (only contained test_fsdp1_wrap_policy) and the now-empty directory.
  • Updated 14 parametrized tests in tests/backends/skyrl_train/gpu/: dropped FSDP1 ("fsdp" rows in the old scheme), renamed "fsdp2" rows to "fsdp", updated test IDs.
  • Updated the import_worker() test helper.
  • Bulk-renamed strategy = "fsdp2" assignments and trainer.strategy=fsdp2 overrides across ~10 test files and ~30 example shell/Python scripts.
  • Deleted examples/train/training_backends/fsdp/run_fsdp2.sh (now a duplicate of run_fsdp.sh).
  • Added TestFSDP2StrategyAlias::test_fsdp2_normalized_to_fsdp_with_warning in tests/train/test_sft_config.py to lock in the deprecation alias behavior (a sketch of its shape follows this list).
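
The alias test is roughly of this shape. The assertion details are assumed rather than copied from the PR, and _normalize below is a stand-in for the normalization performed inside validate_sft_cfg(), so the sketch runs standalone:

```python
import warnings

import pytest


def _normalize(strategy: str) -> str:
    # Stand-in for the alias handling inside validate_sft_cfg().
    if strategy == "fsdp2":
        warnings.warn("'fsdp2' is deprecated; use 'fsdp'", DeprecationWarning)
        return "fsdp"
    return strategy


class TestFSDP2StrategyAlias:
    def test_fsdp2_normalized_to_fsdp_with_warning(self):
        # The deprecated alias must warn and resolve to the canonical name.
        with pytest.warns(DeprecationWarning):
            strategy = _normalize("fsdp2")
        assert strategy == "fsdp"
```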

Documentation

  • docs/content/docs/examples/training_backends.mdx: collapsed the "FSDP and FSDP2" section into a single "FSDP" section.
  • docs/content/docs/configuration/config.mdx: "We support three backends: FSDP1, FSDP2, and Megatron" → "two backends: FSDP and Megatron". Kept a "(formerly known as FSDP2)" pointer for searchability.
  • Renamed FSDP2 → FSDP in docs/content/docs/{examples/megatron,recipes/overview,tinker/*}.mdx and in trainer.strategy=fsdp2 snippets across all tutorial/example pages.
  • skyrl-train/README.md: "Training Backends: FSDP, FSDP2, and Megatron" → "FSDP and Megatron".
  • examples/train/sft/README.md: backend description updated.

Out of scope

  • The [project.optional-dependencies] fsdp = [...] extras group in pyproject.toml keeps its name (already correctly aligned with the canonical strategy).
  • File names fsdp_utils.py, fsdp_strategy.py, fsdp_worker.py and the fsdp_config Hydra group are unchanged — they were already correct.
  • No Megatron changes.

Test plan

  • uv run --extra dev --extra skyrl-train python -m pytest tests/train/test_sft_config.py tests/train/test_trainer.py -v — 20 passed (incl. new alias test)
  • Programmatic check: default TrainerConfig.strategy == "fsdp"; strategy="fsdp2" triggers DeprecationWarning in both validate_cfg and validate_sft_cfg
  • All 7 touched modules import cleanly
  • ruff check clean on every modified source file
  • grep -rn "fsdp_version\|FSDP1\|get_fsdp_state_ctx\|get_sharding_strategy\|offload_fsdp_model_to_cpu\|load_fsdp_model_to_gpu" returns no matches in skyrl/, tests/, examples/, docs/
  • GPU CI: gpu_ci_run_skyrl_train.sh (parametrized tests now run only the FSDP path, not duplicated)
  • Smoke train: bash examples/train/gsm8k/run_gsm8k.sh trainer.strategy=fsdp
  • Alias smoke: bash examples/train/gsm8k/run_gsm8k.sh trainer.strategy=fsdp2 — should warn and run
  • Docs build (Vercel preview)

Breaking changes / migration

  • trainer.strategy="fsdp" now means what "fsdp2" used to mean. Users already on FSDP2 need no migration: configs that literally pin strategy=fsdp2 still work (with a deprecation warning) and resolve to FSDP. Users who explicitly relied on FSDP1 will see different behavior and should review the FSDP2 cpu_offload / reshard_after_forward semantics in the updated config docs.

🤖 Generated with Claude Code

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request consolidates the FSDP backends by removing the legacy FSDP1 implementation and renaming the FSDP2 (composable fully_shard API) strategy to "fsdp". The changes include extensive updates to documentation, example scripts, and configuration files to reflect the new naming convention, along with the addition of deprecation warnings for the "fsdp2" alias. Furthermore, the FSDPStrategy and associated utilities were refactored to remove FSDP1-specific logic. Review feedback highlighted potential issues with key matching and prefixing in the LoRA parameter collection logic, as well as a suggestion to update an error message for consistency with the new naming.

Comment thread skyrl/backends/skyrl_train/distributed/fsdp_utils.py
Comment thread skyrl/backends/skyrl_train/distributed/fsdp_utils.py
Comment thread skyrl/backends/skyrl_train/distributed/fsdp_strategy.py
@erictang000 erictang000 changed the title [cleanup] Remove FSDP1 support [cleanup] Remove FSDP1 support + make 'fsdp' default to fsdp2 May 13, 2026