fix: resume downloads across HuggingFace commit hash changes #1536
ianbmacdonald wants to merge 1 commit into
f788eb2 to c895f54
|
Added some testing and addressed some agent review feedback.

1921182 makes orphan detection and directory-based download-status checks recursive for .partial files, which is important for nested paths under a snapshot. 226cfb8 finishes that off by updating the parallel add_model_to_cache() path and fixing the nested-partial regression test so the manifest filename actually matches the nested partial path being resumed.

The new test shape uses a real HF-backed /pull, but it does not depend on a live upstream repo changing commit hash during the test window. Instead it seeds orphaned snapshot state locally and verifies selection / cleanup behavior from there. The main tradeoff is extra test runtime and network use.

One integration note for later: if this branch is rebased onto or merged after #1412, the orphan-resume logic in download_from_huggingface() will need a follow-up pass for #1412's per-file download_path manifest format and multi-repo snapshot layout. The recursive .partial scan changes should carry over directly, but the resume/rename path handling is still written around a single main-repo snapshot, to stay compatible with current main.
|
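The recursive .partial detection described in this comment can be illustrated with a minimal Python sketch. It is not the code from 1921182; the function name find_partial_files and the directory layout are hypothetical, and the only point is the difference between a top-level glob and a recursive one.

```python
from pathlib import Path
from typing import List


def find_partial_files(snapshot_dir: Path) -> List[Path]:
    """Return every *.partial file under snapshot_dir, at any depth.

    Hypothetical helper: a flat glob("*.partial") would miss partials that
    live in nested paths under the snapshot, so rglob is used instead.
    """
    if not snapshot_dir.is_dir():
        return []
    return sorted(p for p in snapshot_dir.rglob("*.partial") if p.is_file())
```

A snapshot would then count as having an in-progress download whenever this list is non-empty, regardless of how deeply the partial file is nested.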
e19e7b1 to 7a39854
|
Now that #1412 has merged and this branch is rebased onto it, the integration notes from the earlier comment have been addressed across several commits:
Multi-repo manifest support (e8e71cf, 073520e):
Stale manifest and content verification (24ae02e):
Snapshot collision handling (a282ba7):
Test coverage (073520e, a282ba7, 70c58d0):
|
jeremyfowers
left a comment
@ianbmacdonald please see the two comments; other than that, good to merge.
cdd069b to 8710666
|
Rebased onto current

The gap they don't cover: an interrupted download resumed after the repo advanced to a new commit. #2066 reuses a completed previous snapshot when artifacts are unchanged, and #1950 verifies content on download, but an interrupted attempt (with a

This adds a small

Scoped to the main repo, where the single download manifest lives. I noted the auxiliary-checkpoint case as a follow-up in the PR description (an interrupted aux snapshot is simply re-downloaded fresh; still safe). +73 lines + two tests.
|
8710666 to c528c3f
|
Rebased onto current
Scope is now 2 files (+245/-0): the migration helper plus two regression tests (resume-across-commit, and not-migrating a completed stale snapshot). Auxiliary-repo (text-encoder/VAE) interrupted resume is documented as a follow-up. |
c528c3f to 615d2f6
|
Rebased onto current Verified on a fresh build (Ubuntu 26.04, glibc 2.43): |
615d2f6 to 328895f
|
Rebased onto current The |
328895f to b6ce38d
The HF API returns the repo's latest commit SHA on every call. If a download is paused and the repo receives new commits before it resumes (common on fast-moving repos like unsloth), the new SHA differs from the one the download started under, so a fresh snapshots/<new_sha>/ is created and everything re-downloads from zero, orphaning the partially-downloaded snapshots/<old_sha>/.

Before creating the new snapshot directory, look for an interrupted prior attempt (a snapshot under a different hash carrying an in-progress .download_manifest.json) and rename it forward to the new commit hash, so download_from_manifest resumes its files in place instead of restarting. The interrupted attempt is located by its manifest marker rather than refs/main, because refs/main is sticky (advanced only on successful completion, per lemonade-sdk#2066) and does not point at an in-progress snapshot.

Carrying bytes forward is safe: per-file SHA verification (added in lemonade-sdk#1950) checks each carried file against the new commit's hash and re-downloads any that changed (including a .partial whose blob changed), so this change does not need its own content reconciliation.

Scope: this targets the main repo, which is where the single download manifest lives. Auxiliary checkpoint repos keep no manifest marker in their own snapshot, so an interrupted auxiliary snapshot is not migrated and is re-downloaded fresh under the new hash (safe; lemonade-sdk#1950 still verifies what is fetched). Resuming interrupted auxiliary snapshots would need recursive .partial detection and is left as a follow-up.

Adds test_030 (interrupted snapshot migrated forward and resumed in place, proven by unchanged mtime) and test_031 (a completed manifest-less snapshot is not migrated).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: GLM-5.1 <noreply@zhipuai.cn>
Co-Authored-By: GPT-5.5 <noreply@openai.com>
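To make the safety argument in the commit message concrete, here is a small Python sketch of per-file verification after carrying files forward. It is an illustration only and not the code from lemonade-sdk#1950: the helper names, the `expected` mapping of relative path to hex digest, and the use of SHA-256 are all assumptions.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files are not read into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_refetch(snapshot_dir: Path, expected: Dict[str, str]) -> List[str]:
    """Return relative paths that are missing or whose hash differs.

    `expected` is a hypothetical mapping of relative path -> SHA-256 hex
    digest for the new commit; anything listed here would be re-downloaded.
    """
    stale = []
    for rel_path, want in expected.items():
        candidate = snapshot_dir / rel_path
        if not candidate.is_file() or sha256_of(candidate) != want:
            stale.append(rel_path)
    return sorted(stale)
```

Under these assumptions, a carried-forward file whose content is unchanged in the new commit is kept, while a file whose content changed upstream is flagged for re-download, which is why the rename itself needs no reconciliation logic.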
b6ce38d to 7c94bdf
Summary

Resume an interrupted HuggingFace download across a commit-hash change instead of restarting from zero. Complements #1950 (SHA-on-download) and #2066 (sticky refs / completed-snapshot reuse), which don't cover the interrupted case.

Problem

The HF API returns the latest commit SHA on every call. If a download is interrupted and the repo advances to a new commit before it resumes (common on fast-moving repos like unsloth), a fresh snapshots/<new_sha>/ is created and everything re-downloads from zero, orphaning the partial snapshots/<old_sha>/.

Approach

Before creating the new snapshot dir, find an interrupted prior attempt (a snapshot under a different hash carrying an in-progress .download_manifest.json) and rename it forward to the new commit hash so download_from_manifest resumes its files in place. The interrupted attempt is located by its manifest marker, not refs/main, because refs/main is sticky (advanced only on completion, per #2066) and does not point at an in-progress snapshot.

Carrying bytes forward is safe: #1950's per-file SHA verification checks each file against the new commit's hash and re-downloads any that changed (including a .partial whose blob changed), so this change needs no content reconciliation of its own.

Scope / follow-up

Targets the main repo, where the single download manifest lives. Auxiliary checkpoint repos (text encoder, VAE, mmproj, …) keep no manifest marker in their own snapshot, so an interrupted auxiliary snapshot is not migrated; it's re-downloaded fresh under the new hash (safe; #1950 still verifies what's fetched). Resuming interrupted auxiliary snapshots would need detecting them by a recursive .partial scan and is left as a follow-up.

Tests

test_030: interrupted snapshot migrated forward and resumed in place (proven by unchanged mtime)
test_031: a completed, manifest-less snapshot is not migrated (only genuinely interrupted downloads are)
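The approach described above can be sketched in Python as follows. This is a hedged illustration, not the PR's implementation: the function names are hypothetical, the snapshots/<sha>/ layout and the .download_manifest.json marker are taken from the description, and picking the most recently modified manifest when several interrupted snapshots exist is an assumption of this sketch.

```python
import os
from pathlib import Path
from typing import Optional

MANIFEST_NAME = ".download_manifest.json"  # marker named in the description


def find_interrupted_snapshot(snapshots_dir: Path, new_sha: str) -> Optional[Path]:
    """Find a snapshot under a different hash that still carries a manifest."""
    if not snapshots_dir.is_dir():
        return None
    candidates = [
        d for d in snapshots_dir.iterdir()
        if d.is_dir() and d.name != new_sha and (d / MANIFEST_NAME).is_file()
    ]
    if not candidates:
        return None
    # Assumption of this sketch: prefer the most recently touched manifest.
    return max(candidates, key=lambda d: (d / MANIFEST_NAME).stat().st_mtime)


def prepare_snapshot_dir(snapshots_dir: Path, new_sha: str) -> Path:
    """Return snapshots/<new_sha>/, renaming an interrupted attempt forward."""
    target = snapshots_dir / new_sha
    if target.exists():
        return target  # already downloading or downloaded under this hash
    interrupted = find_interrupted_snapshot(snapshots_dir, new_sha)
    if interrupted is None:
        target.mkdir(parents=True)  # nothing to resume: start fresh
    else:
        os.rename(interrupted, target)  # carries files and mtimes forward
    return target
```

A completed snapshot has no manifest marker, so it is never selected as a candidate, which mirrors the behavior test_031 is described as checking. Because the directory is renamed rather than copied, file mtimes are preserved, which is the property test_030 is described as relying on.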