Skip to content

fix: resume downloads across HuggingFace commit hash changes#1536

Open
ianbmacdonald wants to merge 1 commit into
lemonade-sdk:mainfrom
ianbmacdonald:fix/download-resume-across-commits
Open

fix: resume downloads across HuggingFace commit hash changes#1536
ianbmacdonald wants to merge 1 commit into
lemonade-sdk:mainfrom
ianbmacdonald:fix/download-resume-across-commits

Conversation

@ianbmacdonald

@ianbmacdonald ianbmacdonald commented Apr 4, 2026

Copy link
Copy Markdown
Collaborator

Summary

Resume an interrupted HuggingFace download across a commit-hash change instead of restarting from zero. Complements #1950 (SHA-on-download) and #2066 (sticky refs / completed-snapshot reuse), which don't cover the interrupted case.

Problem

The HF API returns the latest commit SHA on every call. If a download is interrupted and the repo advances to a new commit before it resumes (common on fast-moving repos like unsloth), a fresh snapshots/<new_sha>/ is created and everything re-downloads from zero, orphaning the partial snapshots/<old_sha>/.

Approach

Before creating the new snapshot dir, find an interrupted prior attempt — a snapshot under a different hash carrying an in-progress .download_manifest.json — and rename it forward to the new commit hash so download_from_manifest resumes its files in place. The interrupted attempt is located by its manifest marker, not refs/main, because refs/main is sticky (advanced only on completion, per #2066) and does not point at an in-progress snapshot.

Carrying bytes forward is safe: #1950's per-file SHA verification checks each file against the new commit's hash and re-downloads any that changed — including a .partial whose blob changed — so this change needs no content reconciliation of its own.

Scope / follow-up

Targets the main repo, where the single download manifest lives. Auxiliary checkpoint repos (text encoder, VAE, mmproj, …) keep no manifest marker in their own snapshot, so an interrupted auxiliary snapshot is not migrated — it's re-downloaded fresh under the new hash (safe; #1950 still verifies what's fetched). Resuming interrupted auxiliary snapshots would need detecting them by a recursive .partial scan and is left as a follow-up.

Tests

  • test_030 — interrupted snapshot migrated forward and resumed in place (proven by unchanged mtime)
  • test_031 — a completed, manifest-less snapshot is not migrated (only genuinely interrupted downloads are)

@ianbmacdonald ianbmacdonald marked this pull request as ready for review April 4, 2026 04:13
@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from f788eb2 to c895f54 Compare April 5, 2026 14:19
@ianbmacdonald

Copy link
Copy Markdown
Collaborator Author

added some testing and addressed some agent review feedback

1921182 makes orphan detection and directory-based download-status checks recursive for .partial files, which is important for nested paths under a snapshot.

226cfb8 finishes that off by updating the parallel add_model_to_cache() path and fixing the nested-partial regression test so the manifest filename actually matches the nested partial path being resumed.

The new test shape uses a real HF-backed /pull, but it does not depend on a live upstream repo changing commit hash during the test window. Instead it seeds orphaned snapshot state locally and verifies selection / cleanup behavior from there. The main change is extra runtime/network.

One integration note for later: if this branch is rebased onto or merged after #1412, the orphan-resume logic in download_from_huggingface() will need a follow-up pass for #1412’s per-file download_path manifest format and multi-repo snapshot layout. The recursive .partial scan changes should carry over directly, but the resume/rename path handling is still written around a single main-repo snapshot to be compatible with current main

@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch 2 times, most recently from e19e7b1 to 7a39854 Compare April 8, 2026 21:34
@ianbmacdonald

Copy link
Copy Markdown
Collaborator Author

Now that #1412 has merged and this branch is rebased onto it, the integration notes from the earlier comment have been addressed across several commits:

Multi-repo manifest support (e8e71cf, 073520e):

  • Orphan-resume path now preserves per-file download_path entries for non-main repos (text_encoder, vae, etc.) — only updates entries that match the old main snapshot path
  • Orphan detection heuristic counts completed file sizes + partial sizes (not just partials) using manifest file entries with their per-repo paths

Stale manifest and content verification (24ae02e):

  • Removed download_from_manifest(existing_manifest) call from orphan-resume entirely — the stale manifest's URLs or file list may not match upstream after a commit change
  • Instead, just relocate salvageable files (completed + partials) and fall through to the normal download path, which builds a fresh manifest from the current HF API response
  • Added size verification in download_from_manifest: when a completed file exists on disk, its size is compared against the manifest and re-downloaded if mismatched

Snapshot collision handling (a282ba7):

  • Merge policy when relocating into an existing current-hash snapshot: keep complete files over partials, keep the larger .partial when both exist, only move what improves on dest

Test coverage (073520e, a282ba7, 70c58d0):

  • test_030: single-repo orphan-resume using refs/main for authoritative hash
  • test_031: multi-repo orphan-resume with per-file download_path entries
  • test_032: collision path where both orphan and current-hash snapshots coexist
  • Tests manipulate disk state and re-pull directly (no /delete calls, which would wipe the seeded orphan state)

Comment thread src/cpp/server/model_manager.cpp Outdated
Comment thread src/cpp/server/model_manager.cpp Outdated

@jeremyfowers jeremyfowers left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ianbmacdonald please see the two comments, other than that good to merge.

@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from cdd069b to 8710666 Compare June 3, 2026 20:27
@ianbmacdonald

Copy link
Copy Markdown
Collaborator Author

Rebased onto current main. In the meantime @fl0rianr's #1950 (SHA-on-download) and #2066 (sticky refs / snapshot reuse) landed and rewrote much of download_from_huggingface, so I reworked this down to a much smaller slice that builds on them.

The gap they don't cover: an interrupted download resumed after the repo advanced to a new commit. #2066 reuses a completed previous snapshot when artifacts are unchanged, and #1950 verifies content on download — but an interrupted attempt (with a .partial) under the old hash still gets orphaned and the new commit's snapshot re-downloads from zero.

This adds a small resume_snapshot_across_commit(): before creating the new snapshot dir, it finds an interrupted prior attempt (a snapshot under a different hash carrying an in-progress .download_manifest.json) and renames it forward to the new hash, so download_from_manifest resumes in place. It locates the interrupted snapshot by the manifest marker rather than refs/main, since #2066 made refs/main sticky (advanced only on completion) so it no longer points at an in-progress snapshot. Carrying bytes forward is safe because #1950's per-file SHA check re-fetches anything whose content changed across the commit — including a .partial whose blob changed.

Scoped to the main repo, where the single download manifest lives — I noted the auxiliary-checkpoint case as a follow-up in the PR description (an interrupted aux snapshot is simply re-downloaded fresh; still safe). +73 lines + two tests.

@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from 8710666 to c528c3f Compare June 6, 2026 15:39
@ianbmacdonald

Copy link
Copy Markdown
Collaborator Author

Rebased onto current main. Both of your review points are addressed in the reworked slice:

  • Simplification — the ~150-line scan-score-walk-merge is gone, replaced with the rename-forward approach you sketched: resume_snapshot_across_commit() locates the interrupted snapshot by its in-progress .download_manifest.json marker and fs::renames it onto the new hash, letting download_from_manifest resume in place. Per-file hash verification (Fix: Adds SHA check on every download #1950) re-fetches anything the new commit changed, so no merge logic is needed.
  • Safe FS helpers — the directory walk now uses safe_exists / safe_dir_options / safe_is_directory with ec checks, consistent with the rest of the file. The old fs::file_size accumulation that could inflate on a set ec is gone entirely with the scan removed.

Scope is now 2 files (+245/-0): the migration helper plus two regression tests (resume-across-commit, and not-migrating a completed stale snapshot). Auxiliary-repo (text-encoder/VAE) interrupted resume is documented as a follow-up.

@ramkrishna2910 ramkrishna2910 added the bug Something isn't working label Jun 6, 2026
@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from c528c3f to 615d2f6 Compare June 15, 2026 19:32
@ianbmacdonald

Copy link
Copy Markdown
Collaborator Author

Rebased onto current main (resolves the merge with the recently-merged /pull/variants tests). The two resume regression tests were renumbered 034/035 to avoid colliding with the new test_030test_033 pull/variants tests; the model_manager.cpp change merged cleanly past the auto re-download missing files work (#2239).

Verified on a fresh build (Ubuntu 26.04, glibc 2.43): test_034 (interrupted snapshot migrated forward and resumed in place) and test_035 (completed stale snapshot left intact) both pass against a live /pull, alongside the existing pull/variants tests.

@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from 615d2f6 to 328895f Compare June 18, 2026 23:43
@ianbmacdonald

Copy link
Copy Markdown
Collaborator Author

Rebased onto current main (31cdc925a) — the only conflict was an import-list addition in test/server_endpoints.py (get_hf_cache_dir_candidates), resolved by keeping upstream's import; model_manager.cpp auto-merged. The PR is green/mergeable again.

The CHANGES_REQUESTED review from April predates several rounds of rework and has been addressed through the subsequent rebases. @jeremyfowers @ramkrishna2910 — would appreciate a re-review when you have a moment.

The HF API returns the repo's latest commit SHA on every call. If a download is
paused and the repo receives new commits before it resumes (common on fast-moving
repos like unsloth), the new SHA differs from the one the download started under,
so a fresh snapshots/<new_sha>/ is created and everything re-downloads from zero —
orphaning the partially-downloaded snapshots/<old_sha>/.

Before creating the new snapshot directory, look for an interrupted prior attempt
(a snapshot under a different hash carrying an in-progress .download_manifest.json)
and rename it forward to the new commit hash, so download_from_manifest resumes its
files in place instead of restarting. The interrupted attempt is located by its
manifest marker rather than refs/main, because refs/main is sticky (advanced only
on successful completion, per lemonade-sdk#2066) and does not point at an in-progress snapshot.

Carrying bytes forward is safe: per-file SHA verification (added in lemonade-sdk#1950) checks
each carried file against the new commit's hash and re-downloads any that changed —
including a .partial whose blob changed — so this change does not need its own
content reconciliation.

Scope: this targets the main repo, which is where the single download manifest
lives. Auxiliary checkpoint repos keep no manifest marker in their own snapshot, so
an interrupted auxiliary snapshot is not migrated and is re-downloaded fresh under
the new hash (safe; lemonade-sdk#1950 still verifies what is fetched). Resuming interrupted
auxiliary snapshots would need recursive .partial detection and is left as a
follow-up.

Adds test_030 (interrupted snapshot migrated forward and resumed in place, proven
by unchanged mtime) and test_031 (a completed manifest-less snapshot is not migrated).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: GLM-5.1 <noreply@zhipuai.cn>
Co-Authored-By: GPT-5.5 <noreply@openai.com>
@ianbmacdonald ianbmacdonald force-pushed the fix/download-resume-across-commits branch from b6ce38d to 7c94bdf Compare June 20, 2026 02:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants