Skip to content

fix: use fsspec for DeepSpeed checkpoint path validation to support remote URIs#21636

Open
bhimrazy wants to merge 6 commits into
Lightning-AI:masterfrom
bhimrazy:fix/deepspeed-remote-filesystem-uri
Open

fix: use fsspec for DeepSpeed checkpoint path validation to support remote URIs#21636
bhimrazy wants to merge 6 commits into
Lightning-AI:masterfrom
bhimrazy:fix/deepspeed-remote-filesystem-uri

Conversation

@bhimrazy
Copy link
Copy Markdown
Collaborator

@bhimrazy bhimrazy commented Apr 1, 2026

What does this PR do?

Fixes #21635.

_validate_checkpoint_directory used pathlib.Path to validate DeepSpeed checkpoint paths, which mangles remote URI schemes (s3://s3:/) and only checks the local filesystem — causing FileNotFoundError for any remote path (S3, GCS, HDFS, CFS).

Replaces with get_filesystem() + fs.isdir()/fs.isfile() from cloud_io, the same pattern used by ModelCheckpoint, TorchCheckpointIO, and CheckpointConnector.

…emote URIs

Fixes Lightning-AI#21635. _validate_checkpoint_directory was wrapping paths in
pathlib.Path, which mangles remote URI schemes (e.g. s3:// to s3:/)
and uses local-only .is_dir()/.is_file() checks that always return
False for remote paths like S3, GCS, or HDFS.

Replace with get_filesystem() + fs.isdir()/fs.isfile() from
cloud_io, which is the established pattern used by ModelCheckpoint,
TorchCheckpointIO, and CheckpointConnector.
@github-actions github-actions Bot added the fabric lightning.fabric.Fabric label Apr 1, 2026
@github-actions github-actions Bot added the pl Generic label for PyTorch Lightning package label Apr 13, 2026
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
3 pipeline(s) require an authorized user to comment /azp run to run.

@bhimrazy bhimrazy force-pushed the fix/deepspeed-remote-filesystem-uri branch from ca4deb5 to 805c47d Compare April 13, 2026 07:23
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
3 pipeline(s) require an authorized user to comment /azp run to run.

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
3 pipeline(s) require an authorized user to comment /azp run to run.

Adds a test for Case 2 (user passes a file path inside the checkpoint
subfolder, e.g. s3://bucket/ckpt/checkpoint/model_states.pt) to ensure
the grandparent path suggestion works correctly with remote URIs.
@bhimrazy bhimrazy marked this pull request as ready for review April 13, 2026 07:29
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
3 pipeline(s) require an authorized user to comment /azp run to run.

1 similar comment
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
3 pipeline(s) require an authorized user to comment /azp run to run.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87%. Comparing base (0e20e15) to head (c55666f).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #21636   +/-   ##
=======================================
  Coverage      87%      87%           
=======================================
  Files         270      270           
  Lines       23973    23976    +3     
=======================================
+ Hits        20748    20751    +3     
  Misses       3225     3225           

…versal

Replace fs._parent() (private fsspec API) with posixpath.dirname()
(Python stdlib) for computing parent paths. posixpath handles
/‑separated paths correctly for both local paths and remote URIs
on all platforms, without relying on fsspec internals.

Also adds 3 unit tests covering remote URI validation and a shared
_make_s3_mock_fs() helper to reduce duplication.
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
3 pipeline(s) require an authorized user to comment /azp run to run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fabric lightning.fabric.Fabric pl Generic label for PyTorch Lightning package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DeepSpeed _validate_checkpoint_directory fails with remote filesystem URIs (e.g. HDFS, S3, CFS)

1 participant