Skip to content

Fix _atomic_save staging local checkpoints through $TMPDIR#21744

Open
c-pozzi wants to merge 7 commits into
Lightning-AI:masterfrom
c-pozzi:fix/atomic-save-local-tmpdir-21253
Open

Fix _atomic_save staging local checkpoints through $TMPDIR#21744
c-pozzi wants to merge 7 commits into
Lightning-AI:masterfrom
c-pozzi:fix/atomic-save-local-tmpdir-21253

Conversation

@c-pozzi
Copy link
Copy Markdown
Contributor

@c-pozzi c-pozzi commented May 27, 2026

What does this PR do?

Fixes #21253.

_atomic_save writes local checkpoints through $TMPDIR instead of straight to the target. On SLURM/HPC where $TMPDIR is on a small partition, this fails with "no space left on device" even though the destination disk has space.

The staging happens inside fsspec's LocalFileSystem.transactiontempfile.mkstemp() is called without a dir= argument, so the temp file lands in $TMPDIR. Full trace in the investigation comment.

Fix

For local paths, skip fs.transaction and stage next to the destination instead:

fd, staging = tempfile.mkstemp(dir=os.path.dirname(filepath), ...)
# write, then:
f.flush(); os.fsync(f.fileno())     # contents durable
os.replace(staging, target)          # atomic rename on same filesystem
# best-effort dir fsync so the rename survives a crash

os.replace is atomic on the same filesystem, so this preserves the no-partial-files guarantee that the transaction was providing. os.fsync on the file (and best-effort on the parent dir) make the save crash-durable, not just atomic-at-the-namespace level. Object storage paths (S3, GCS, Azure) are unchanged.

Tests

Four new tests in test_cloud_io.py:

  • staging happens in the destination dir, not $TMPDIR (regression for Directly save checkpoint to specified disk location #21253)
  • staging file is cleaned up if the rename fails
  • a failed save preserves the prior checkpoint (the atomicity guarantee)
  • missing parent dirs still raise FileNotFoundError (parity with current behavior)

Symlink semantics

When the destination path is itself a symlink, os.replace follows POSIX rename(2) and replaces the symlink with the new file (the link target is untouched). This is consistent with what the current code does on same-FS renames; the only case where behavior differs is when the destination was a symlink pointing across filesystems — that's the exact bug being fixed here, so anyone hitting it was already broken. Parent directories that are symlinks continue to work correctly (the OS resolves them transparently).

Before submitting


Co-authored with Claude Code.

@c-pozzi c-pozzi marked this pull request as ready for review May 27, 2026 09:48
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 27, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87%. Comparing base (1120456) to head (fd1177f).
⚠️ Report is 1 commits behind head on master.
✅ All tests successful. No failed tests found.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #21744   +/-   ##
=======================================
  Coverage      87%      87%           
=======================================
  Files         270      270           
  Lines       23975    24000   +25     
=======================================
+ Hits        20750    20773   +23     
- Misses       3225     3227    +2     

The test verified the EXDEV/PermissionError workaround in _atomic_save that handled fsspec's cross-device staging. Since Lightning-AI#21744 stages local checkpoints next to the destination, that code path is unreachable for local files — the test's mocks (os.rename, os.chmod) never fire on the new code path, leaving the test inert.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Directly save checkpoint to specified disk location

3 participants