
[codex] Port HF cache cleanup#49

Merged
ishandhanani merged 1 commit into main from codex/port-hf-cache-cleanup on Apr 20, 2026

Conversation

@ishandhanani
Collaborator

Summary

Ports the HF cache cleanup and single-node model pre-download work from ishandhanani/srt-slurm PR #251 into NVIDIA/srt-slurm.

This adds pre-worker HuggingFace cache handling in SweepOrchestrator:

  • collect HF_* and HUGGING_FACE_* env vars from backend mode environments
  • remove stale .lock files from shared HF_HOME
  • pre-download HF models on one worker node before starting all workers
  • keep pre-download best-effort so workers can still retry at startup
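A minimal sketch of the first two steps above. The function names (`collect_hf_env`, `clean_stale_hf_locks`) are hypothetical, not the actual SweepOrchestrator API; they only illustrate filtering the relevant env vars and sweeping leftover lock files out of a shared cache:

```python
from pathlib import Path


def collect_hf_env(env: dict[str, str]) -> dict[str, str]:
    """Pick out HF_* and HUGGING_FACE_* vars from a backend environment."""
    return {k: v for k, v in env.items()
            if k.startswith(("HF_", "HUGGING_FACE_"))}


def clean_stale_hf_locks(hf_home: str) -> list[Path]:
    """Best-effort removal of leftover *.lock files under a shared HF_HOME."""
    removed: list[Path] = []
    root = Path(hf_home)
    if not root.is_dir():
        return removed
    for lock in root.rglob("*.lock"):
        try:
            lock.unlink()
            removed.append(lock)
        except OSError:
            # Another process may delete or still hold the lock; skip it.
            pass
    return removed
```

Lock removal deliberately swallows `OSError` so that a racing worker deleting the same file cannot abort orchestration.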

Why

HF-backed runs can start multiple workers that all try to populate the same shared model cache at once. That can leave stale locks or trigger lock contention. Pre-caching the model once before worker startup avoids the fan-out download race.
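The best-effort pre-download described above can be sketched as follows. The function name is hypothetical, and `download_fn` stands in for an actual downloader such as `huggingface_hub.snapshot_download`:

```python
def predownload_models(models: list[str], download_fn) -> list[str]:
    """Try to warm the shared cache once, before any workers start.

    Failures are collected rather than raised: each worker can still
    retry the download at startup, so pre-caching must never be fatal.
    Returns the models that could not be pre-downloaded.
    """
    failed: list[str] = []
    for model in models:
        try:
            download_fn(model)
        except Exception:
            # Best-effort: record the failure and keep going.
            failed.append(model)
    return failed
```

Running this on a single node before worker fan-out means that, in the common case, every worker finds the model already cached and never contends for download locks.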

Validation

  • uv run --with ruff ruff check src/srtctl tests/test_hf_cache.py
  • uv run --with pytest pytest tests/test_hf_cache.py tests/test_configs.py::TestHuggingFaceModelSupport tests/test_benchmarks.py -q

@ishandhanani ishandhanani marked this pull request as ready for review April 20, 2026 20:14
@ishandhanani ishandhanani merged commit 9569d9a into main Apr 20, 2026
6 checks passed