
[codex] Port HF cache cleanup#49

Merged
ishandhanani merged 1 commit into main from codex/port-hf-cache-cleanup on Apr 20, 2026

Conversation

@ishandhanani
Collaborator

Summary

Ports the HF cache cleanup and single-node model pre-download work from ishandhanani/srt-slurm PR #251 into NVIDIA/srt-slurm.

This adds pre-worker HuggingFace cache handling in SweepOrchestrator:

  • collect HF_* and HUGGING_FACE_* env vars from backend mode environments
  • remove stale .lock files from shared HF_HOME
  • pre-download HF models on one worker node before starting all workers
  • keep pre-download best-effort so workers can still retry at startup
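A minimal sketch of the first two steps above. The function names (`collect_hf_env`, `clean_stale_hf_locks`) are hypothetical, not the actual SweepOrchestrator API; they only illustrate filtering the relevant env vars and sweeping leftover lock files out of a shared cache:

```python
from pathlib import Path


def collect_hf_env(env: dict[str, str]) -> dict[str, str]:
    """Pick out HF_* and HUGGING_FACE_* vars from a backend environment."""
    return {k: v for k, v in env.items()
            if k.startswith(("HF_", "HUGGING_FACE_"))}


def clean_stale_hf_locks(hf_home: str) -> list[Path]:
    """Best-effort removal of leftover *.lock files under a shared HF_HOME."""
    removed: list[Path] = []
    root = Path(hf_home)
    if not root.is_dir():
        return removed
    for lock in root.rglob("*.lock"):
        try:
            lock.unlink()
            removed.append(lock)
        except OSError:
            # Another process may delete or still hold the lock; skip it.
            pass
    return removed
```

Lock removal deliberately swallows `OSError` so that a racing worker deleting the same file cannot abort orchestration.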

Why

HF-backed runs can start multiple workers that all try to populate the same shared model cache at once. That can leave stale locks or trigger lock contention. Pre-caching the model once before worker startup avoids the fan-out download race.
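The best-effort pre-download described above can be sketched as follows. The function name is hypothetical, and `download_fn` stands in for an actual downloader such as `huggingface_hub.snapshot_download`:

```python
def predownload_models(models: list[str], download_fn) -> list[str]:
    """Try to warm the shared cache once, before any workers start.

    Failures are collected rather than raised: each worker can still
    retry the download at startup, so pre-caching must never be fatal.
    Returns the models that could not be pre-downloaded.
    """
    failed: list[str] = []
    for model in models:
        try:
            download_fn(model)
        except Exception:
            # Best-effort: record the failure and keep going.
            failed.append(model)
    return failed
```

Running this on a single node before worker fan-out means that, in the common case, every worker finds the model already cached and never contends for download locks.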

Validation

  • uv run --with ruff ruff check src/srtctl tests/test_hf_cache.py
  • uv run --with pytest pytest tests/test_hf_cache.py tests/test_configs.py::TestHuggingFaceModelSupport tests/test_benchmarks.py -q

@ishandhanani ishandhanani marked this pull request as ready for review April 20, 2026 20:14
@ishandhanani ishandhanani merged commit 9569d9a into main Apr 20, 2026
6 checks passed