[cleanup] Remove FSDP1 support + make 'fsdp' default to fsdp2 #1659
Open
erictang000 wants to merge 4 commits into
Code Review
This pull request consolidates the FSDP backends by removing the legacy FSDP1 implementation and renaming the FSDP2 (composable fully_shard API) strategy to "fsdp". The changes include extensive updates to documentation, example scripts, and configuration files to reflect the new naming convention, along with the addition of deprecation warnings for the "fsdp2" alias. Furthermore, the FSDPStrategy and associated utilities were refactored to remove FSDP1-specific logic. Review feedback highlighted potential issues with key matching and prefixing in the LoRA parameter collection logic, as well as a suggestion to update an error message for consistency with the new naming.
Summary
Removes the legacy FSDP1 backend and renames FSDP2 → FSDP, leaving a single FSDP strategy backed by PyTorch's composable `fully_shard` API. `trainer.strategy="fsdp2"` is kept as a deprecated alias that emits a `DeprecationWarning` and normalizes to `"fsdp"`, so existing user scripts and YAMLs continue to work.

Motivation: FSDP2 was already the default everywhere, and the SFT path already rejected FSDP1. The dual-backend code carried a lot of dead weight: branching in `FSDPStrategy._fsdp_init_model`, an `fsdp_version()` dispatcher, parallel `offload_fsdp_*`/`offload_fsdp2_*` helpers, FSDP1-only LoRA prefixes, three `_handle.reshard(True)` workarounds in the worker, and ~14 parametrized tests that ran twice (doubling the CI matrix for no gain).

Changes
Core code (`skyrl/backends/skyrl_train/`)

- `distributed/fsdp_utils.py`: Deleted `fsdp_version()`, `get_fsdp_state_ctx()`, `offload_fsdp_model_to_cpu()`, `load_fsdp_model_to_gpu()`, `get_sharding_strategy()`, and `get_fsdp_wrap_policy()`. Removed FSDP1 imports (`FullyShardedDataParallel`, `_lazy_init`). Simplified `layered_summon_lora_params()` and `collect_lora_params()` to FSDP2-only paths (no more `summon_full_params`, no more `_fsdp_wrapped_module` prefixes).
- `distributed/fsdp_strategy.py`: Deleted the `if self.fsdp_strategy == "fsdp":` FSDP1 init branch and the `MixedPrecision`/`CPUOffload` imports. Replaced `get_fsdp_state_ctx(...)` callsites with direct state_dict calls (FSDP2 returns DTensors natively). `_unwrap_model` no longer needs the FSDP1 `_fsdp_wrapped_module` path. `save_hf_model` now unconditionally uses `fsdp2_get_full_state_dict`.
- `workers/fsdp/fsdp_worker.py`: Removed three `_handle.reshard(True)` FSDP1-internal workarounds, two `FSDP.set_state_dict_type(...)` calls in `FSDPWeightExtractor`, and the now-unused FSDP1 imports. Strategy assertion tightened to `== "fsdp"`.

The `fsdp2_*`-prefixed helpers (`apply_fsdp2`, `fsdp2_load_full_state_dict`, `fsdp2_get_full_state_dict`, `fsdp2_clip_grad_norm_`, `offload_fsdp2_model_to_cpu`, `load_fsdp2_model_to_gpu`) are intentionally kept: their names map directly to the PyTorch `torch.distributed.fsdp.fully_shard` API surface they wrap.
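For orientation, here is a minimal sketch of the composable API surface these helpers wrap, assuming a recent PyTorch with `torch.distributed` already initialized; the helper bodies below are simplified stand-ins, not the repo's actual implementations.

```python
# Illustrative sketch only: mirrors the naming of the kept fsdp2_* helpers,
# but the bodies are simplified stand-ins, not skyrl-train's implementations.
import torch.nn as nn
from torch.distributed.fsdp import fully_shard
from torch.distributed.checkpoint.state_dict import StateDictOptions, get_model_state_dict


def apply_fsdp2(model: nn.Module) -> nn.Module:
    # Shard each top-level child, then the root module; the real helper's
    # per-module sharding policy is config-driven.
    for child in model.children():
        fully_shard(child)
    fully_shard(model)
    return model


def fsdp2_get_full_state_dict(model: nn.Module) -> dict:
    # FSDP2 parameters are DTensors; gather them into full tensors on CPU.
    options = StateDictOptions(full_state_dict=True, cpu_offload=True)
    return get_model_state_dict(model, options=options)
```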
Strategy normalization & deprecation alias

- `validate_cfg()` and `validate_sft_cfg()` now normalize `strategy="fsdp2"` → `"fsdp"` with a `DeprecationWarning` before any downstream validation runs.
- Updated the `cpu_offload` assertion in `validate_cfg()` accordingly.
- `FSDPBackendOverrides.strategy` default and the backend assertion list flipped from `"fsdp2"` to `"fsdp"`.
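A minimal sketch of what the normalization amounts to (the function name is a hypothetical stand-in; the real logic lives inside `validate_cfg()`/`validate_sft_cfg()`):

```python
# Hypothetical stand-in for the alias normalization inside validate_cfg()/validate_sft_cfg().
import warnings


def normalize_strategy(strategy: str) -> str:
    if strategy == "fsdp2":
        warnings.warn(
            "trainer.strategy='fsdp2' is deprecated; use 'fsdp' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return "fsdp"
    return strategy


assert normalize_strategy("fsdp2") == "fsdp"  # emits a DeprecationWarning
assert normalize_strategy("fsdp") == "fsdp"   # silent
```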
Configs & defaults

- `TrainerConfig.strategy`: `"fsdp2"` → `"fsdp"`
- `ppo_base_config.yaml`: `strategy: fsdp2` → `strategy: fsdp` (also dropped the `# fsdp2 only` qualifier on `reshard_after_forward`)
- `sft_config.py`: `_VALID_STRATEGIES = ("megatron", "fsdp")`
- `examples/train/gsm8k/gsm8k-grpo-skypilot.yaml`: same flip
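For reference, the shape of the flipped default in YAML; only the keys mentioned above are shown, and both the layout and the `reshard_after_forward` value are illustrative rather than copied from `ppo_base_config.yaml`:

```yaml
trainer:
  strategy: fsdp          # was "fsdp2"; the old spelling still works with a DeprecationWarning

fsdp_config:
  reshard_after_forward: true   # the "# fsdp2 only" qualifier is gone
```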
Tests

- Deleted `tests/backends/skyrl_train/gpu/gpu_ci/distributed/test_fsdp_strategy.py` (it only contained `test_fsdp1_wrap_policy`) and the now-empty directory.
- `tests/backends/skyrl_train/gpu/`: dropped the FSDP1 rows (`"fsdp"` rows in the old scheme), renamed `"fsdp2"` rows to `"fsdp"`, and updated test IDs.
- Updated the `import_worker()` test helper.
- Updated `strategy = "fsdp2"` assignments and `trainer.strategy=fsdp2` overrides across ~10 test files and ~30 example shell/Python scripts.
- Removed `examples/train/training_backends/fsdp/run_fsdp2.sh` (now a duplicate of `run_fsdp.sh`).
- Added `TestFSDP2StrategyAlias::test_fsdp2_normalized_to_fsdp_with_warning` in `tests/train/test_sft_config.py` to lock in the deprecation alias behavior.
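Illustrative shape of the parametrization change in the GPU suite; the test names and exact strategy rows in `tests/backends/skyrl_train/gpu/` differ, this is not the repo's test code:

```python
# Before: rows like ["fsdp", "fsdp2", ...] where "fsdp" meant FSDP1, doubling the matrix.
# After: a single canonical "fsdp" row alongside the other backends.
import pytest


@pytest.mark.parametrize("strategy", ["fsdp", "megatron"])
def test_strategy_row_is_canonical(strategy):
    assert strategy in ("fsdp", "megatron")
```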
Documentation

- `docs/content/docs/examples/training_backends.mdx`: collapsed the "FSDP and FSDP2" section into a single "FSDP" section.
- `docs/content/docs/configuration/config.mdx`: "We support three backends: FSDP1, FSDP2, and Megatron" → "two backends: FSDP and Megatron". Kept a "(formerly known as FSDP2)" pointer for searchability.
- `FSDP2` → `FSDP` in `docs/content/docs/{examples/megatron,recipes/overview,tinker/*}.mdx` and in `trainer.strategy=fsdp2` snippets across all tutorial/example pages.
- `skyrl-train/README.md`: "Training Backends: FSDP, FSDP2, and Megatron" → "FSDP and Megatron".
- `examples/train/sft/README.md`: backend description updated.
Out of scope

- The `[project.optional-dependencies] fsdp = [...]` extras group in `pyproject.toml` keeps its name (already correctly aligned with the canonical strategy).
- `fsdp_utils.py`, `fsdp_strategy.py`, `fsdp_worker.py` and the `fsdp_config` Hydra group keep their names: they were already correct.
Test plan

- `uv run --extra dev --extra skyrl-train python -m pytest tests/train/test_sft_config.py tests/train/test_trainer.py -v`: 20 passed (incl. new alias test)
- Verified `TrainerConfig.strategy == "fsdp"`; `strategy="fsdp2"` triggers `DeprecationWarning` in both `validate_cfg` and `validate_sft_cfg`
- `ruff check` clean on every modified source file
- `grep -rn "fsdp_version\|FSDP1\|get_fsdp_state_ctx\|get_sharding_strategy\|offload_fsdp_model_to_cpu\|load_fsdp_model_to_gpu"` returns no matches in `skyrl/`, `tests/`, `examples/`, `docs/`
- `gpu_ci_run_skyrl_train.sh` (parametrized tests now run only the FSDP path, not duplicated)
- `bash examples/train/gsm8k/run_gsm8k.sh trainer.strategy=fsdp`
- `bash examples/train/gsm8k/run_gsm8k.sh trainer.strategy=fsdp2`: should warn and run
Breaking changes / migration

`trainer.strategy="fsdp"` now means what `"fsdp2"` used to mean. There is no migration for users on FSDP2: their configs work unchanged in spirit, and if they had `strategy=fsdp2` literally pinned, it still works (with a deprecation warning) and resolves to FSDP. Users who explicitly relied on FSDP1 will see different behavior and should review the FSDP2 `cpu_offload`/`reshard_after_forward` semantics in the updated config docs.
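Concretely, the two invocations from the test plan capture the migration surface:

```bash
# Canonical spelling going forward:
bash examples/train/gsm8k/run_gsm8k.sh trainer.strategy=fsdp

# Old spelling: still accepted, emits a DeprecationWarning, and resolves to "fsdp":
bash examples/train/gsm8k/run_gsm8k.sh trainer.strategy=fsdp2
```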
🤖 Generated with Claude Code