fix(backends): stage and verify backend install before removing the working binary#2315

Open
ianbmacdonald wants to merge 1 commit into
lemonade-sdk:mainfrom
ianbmacdonald:fix/install-atomicity

@ianbmacdonald

Collaborator

What

Makes backend (re)installation crash-safe. BackendUtils::install_from_github no longer deletes the working binary before the replacement is downloaded and verified.

Closes #2312.

Why

On a version-mismatch / missing-version.txt reinstall, install_from_github called fs::remove_all(install_dir) up front — before downloading the new release asset. If that download then failed (slow/unreliable link, going offline mid-flight, a transient GitHub 5xx), the backend was left with no usable binary: the old one was already gone and the new one never arrived.

Surfaced by @ckuethe on #2279 ("not remove and replace the llama binary until the download is verified to be complete" — slow links, e.g. on an airplane). #2279 handled the latest-version lookup failure; this is the complementary download robustness case.

The fix

Stage → verify → atomic swap:

  1. Download + extract into a sibling install_dir + ".staging" directory — the working install is untouched.
  2. Verify the expected executable is present in staging.
  3. Promote with a recoverable backup that never leaves the backend with nothing:
    • rename the existing install aside to install_dir + ".old",
    • rename staging into install_dir,
    • delete .old only once the new tree is in place;
    • if the promotion rename fails, roll .old back.
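
The promotion step above can be sketched roughly as follows. This is a minimal illustration, not the code in install_staging.h: the names promote_staged_install and write_staged_version are hypothetical, and the sketch assumes install_dir has no trailing separator and that staging sits on the same filesystem as install_dir, so that the renames do not need to copy data.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (not the PR's actual code): write version.txt into the
// staging tree and report whether the write succeeded.
inline bool write_staged_version(const fs::path& staging, const std::string& version) {
    std::ofstream out(staging / "version.txt", std::ios::trunc);
    out << version;
    out.flush();
    return out.good();
}

// Hypothetical helper: promote `staging` into `install_dir`.
// Returns the promoted executable's path, or an empty path if staging does
// not contain `exe_name` (the working install is then left untouched).
// Throws std::runtime_error if a rename fails; the old install is rolled back.
inline fs::path promote_staged_install(const fs::path& install_dir,
                                       const fs::path& staging,
                                       const std::string& exe_name) {
    assert(!exe_name.empty());

    // Verify before touching the working install.
    if (!fs::is_regular_file(staging / exe_name)) return {};

    fs::path backup = install_dir;
    backup += ".old";  // sibling "<install_dir>.old"

    std::error_code ec;
    if (fs::exists(backup)) {
        fs::remove_all(backup, ec);  // leftover backup from an earlier run
        if (ec) throw std::runtime_error("cannot clear stale backup: " + ec.message());
    }

    const bool had_install = fs::exists(install_dir);
    if (had_install) {
        fs::rename(install_dir, backup, ec);  // move the working install aside
        if (ec) throw std::runtime_error("cannot move install aside: " + ec.message());
    }

    fs::rename(staging, install_dir, ec);  // promote the new tree
    if (ec) {
        const std::string why = ec.message();
        if (had_install) {
            std::error_code rollback_ec;
            fs::rename(backup, install_dir, rollback_ec);  // restore the old install
        }
        throw std::runtime_error("cannot promote staging: " + why);
    }

    fs::remove_all(backup, ec);  // best effort; the new tree is already live
    return install_dir / exe_name;
}
```

The empty return for a missing executable and the exception for a failed rename are kept separate here so a caller can tell "nothing to promote" apart from "the swap itself failed".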

Additional hardening (from review):

  • A RAII guard removes the staging tree on any pre-swap failure (download/probe/extract), so no half-built tree is left behind.
  • Stale staging that can't be cleared is treated as fatal, so a leftover binary can never be promoted as a mixed install.
  • The staged version.txt write is checked.
  • The swap helper throws a distinct error on a swap failure (vs. the empty-return "executable not found" case), so the caller reports an accurate message.
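
For illustration, the first two hardening points could look something like the sketch below. The names StagingCleanup and prepare_clean_staging are hypothetical stand-ins; the guard in backend_utils.cpp may be shaped differently.

```cpp
#include <cassert>
#include <filesystem>
#include <stdexcept>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Hypothetical RAII guard: removes the staging tree on scope exit unless
// release() was called (for example, after a successful swap).
class StagingCleanup {
public:
    explicit StagingCleanup(fs::path dir) : dir_(std::move(dir)) {
        assert(!dir_.empty());
    }
    StagingCleanup(const StagingCleanup&) = delete;
    StagingCleanup& operator=(const StagingCleanup&) = delete;

    ~StagingCleanup() {
        if (!armed_) return;
        std::error_code ec;
        fs::remove_all(dir_, ec);  // error_code overload: no filesystem_error from the destructor
    }

    // Disarm the guard so the directory survives scope exit.
    void release() noexcept { armed_ = false; }

private:
    fs::path dir_;
    bool armed_ = true;
};

// Hypothetical helper: start from an empty staging directory. A stale tree
// that cannot be removed is fatal, so old files never mix into a new install.
inline void prepare_clean_staging(const fs::path& staging) {
    std::error_code ec;
    if (fs::exists(staging)) {
        fs::remove_all(staging, ec);
        if (ec || fs::exists(staging)) {
            throw std::runtime_error("stale staging could not be cleared: " + staging.string());
        }
    }
    fs::create_directories(staging, ec);
    if (ec) {
        throw std::runtime_error("cannot create staging: " + ec.message());
    }
}
```

A caller would construct the guard right after prepare_clean_staging() and call release() only once the staged tree has been promoted, so every earlier throw cleans up automatically.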

The staging/swap logic is a small header-only helper (lemon/backends/install_staging.h) so it can be unit-tested without the heavier backend_utils.cpp dependencies; find_executable_in_install_dir now delegates to it (single source of truth).

Testing

New ctest InstallAtomicityTest (test/cpp/test_install_atomicity.cpp), network-free, covering:

  • successful swap (no .old/.staging left behind),
  • regression guard: a staging tree missing the executable leaves the old working binary + version.txt byte-for-byte intact,
  • fresh install,
  • (POSIX) a failed filesystem swap preserves the working binary and throws.

RED→GREEN verified: the regression assertions fail against the pre-fix remove-before-verify ordering and pass against the fix.

Validation environment

This is a pure std::filesystem change with no inference path exercised, so backend = none (no model load, no ROCm/Vulkan). The unit test runs on every platform via CI ctest.

| Field | Value |
| --- | --- |
| Change type | backend install/filesystem logic; no inference, backend=none |
| Local | Ubuntu 26.04 (resolute), full cmake --build clean, lemond links, ctest 3/3 green |
| ai3 | Ubuntu 26.04 LTS (resolute), kernel 7.0.0-22-generic, glibc 2.43, g++ 15.2.0; InstallAtomicityTest 17/17 pass, clean -Wall -Wextra |
| Tests | test/cpp/test_install_atomicity.cpp (ctest InstallAtomicityTest) |

Scope / follow-up

install_therock (the ROCm tarball path) has the same remove_all-before-download shape and would benefit from the same staging approach, but is left out of scope to keep this PR focused — happy to follow up in a separate PR.


🤖 Generated with Claude Code

@ianbmacdonald

Collaborator Author

Process note: I opened this from my fork branch (ianbmacdonald:fix/install-atomicity). Since I have push access on this repo, I'm happy to move the branch onto lemonade-sdk/lemonade directly (so you can push to it, rebase, or take it over without the fork-PR edit dance) if that's easier for review — just say the word and I'll re-point it.

Copilot AI left a comment

Contributor

Pull request overview

Makes backend (re)installation crash-safe by staging downloads/extractions and only replacing the currently-working backend install after verification, preventing “failed download leaves no binary” situations in BackendUtils::install_from_github.

Changes:

  • Add a header-only staging + atomic-swap helper (commit_staged_install) and reuse it for executable discovery.
  • Update install_from_github to extract into *.staging, write version.txt into staging, verify the executable exists, then promote via atomic swap.
  • Add a network-free CTest unit (InstallAtomicityTest) validating swap success, verify-fail preservation, fresh install, and POSIX swap-failure rollback.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/cpp/server/backends/backend_utils.cpp | Switch install flow to stage → verify → atomic swap using the new helper. |
| src/cpp/include/lemon/backends/install_staging.h | Introduces header-only executable lookup + commit/promotion helper. |
| test/cpp/test_install_atomicity.cpp | Adds regression/unit tests for atomic install invariants and failure behavior. |
| CMakeLists.txt | Registers the new C++ unit test executable and CTest entry. |

Comment thread test/cpp/test_install_atomicity.cpp
Comment thread src/cpp/server/backends/backend_utils.cpp Outdated
Comment thread src/cpp/server/backends/backend_utils.cpp
@ianbmacdonald

Collaborator Author

Addressed all three Copilot comments in the rebased push (e73f10407, now on top of e6d511bc9):

  1. test_install_atomicity.cpp — added the missing <functional> (std::hash<std::string>) and <iterator> (std::istreambuf_iterator) includes rather than relying on transitive inclusion.
  2. backend_utils.cpp — removed the unused url variable (the download path builds its URL from base_download_url + filename).
  3. backend_utils.cpp — the archive-leak-on-throw is now handled by a ZipGuard RAII (mirroring the existing StagingGuard) that removes the downloaded archive on any scope exit, including a throw from commit_staged_install() on a swap/rename failure. This also consolidates the previously-scattered per-throw fs::remove(zip_path) calls into a single owner.
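
As a rough picture of item 3, a guard that owns the downloaded archive might look like the sketch below. ArchiveCleanup, fake_download and install_flow are hypothetical names used only for this illustration; they are not the ZipGuard code from the push.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Hypothetical stand-in for the archive guard: removes the file on every
// scope exit, whether the scope returns normally or unwinds from a throw.
class ArchiveCleanup {
public:
    explicit ArchiveCleanup(fs::path file) : file_(std::move(file)) {
        assert(!file_.empty());
    }
    ArchiveCleanup(const ArchiveCleanup&) = delete;
    ArchiveCleanup& operator=(const ArchiveCleanup&) = delete;

    ~ArchiveCleanup() {
        std::error_code ec;
        fs::remove(file_, ec);  // single owner of the cleanup; errors are ignored
    }

private:
    fs::path file_;
};

// Placeholder for the real download: only creates a small file at zip_path.
inline void fake_download(const fs::path& zip_path) {
    std::ofstream out(zip_path, std::ios::binary);
    out << "PK";
}

// Demo flow: the archive is gone afterwards on both the success path and
// the path where the swap step throws.
inline void install_flow(const fs::path& zip_path, bool swap_fails) {
    fake_download(zip_path);
    ArchiveCleanup cleanup(zip_path);
    if (swap_fails) {
        throw std::runtime_error("swap failed");
    }
}
```

Unlike a staging guard, this one has no release(): the archive is never wanted after the install attempt, so there is nothing to disarm.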

…orking binary

install_from_github removed the existing install directory up front whenever the
installed version no longer matched the pin (or version.txt was missing), before
the replacement asset was downloaded. A slow or interrupted download (unreliable
link, going offline mid-flight, a transient GitHub 5xx) then left the backend with
no usable binary at all: the old one was already gone and the new one never arrived.

Make the reinstall atomic. The new install is downloaded and extracted into a
sibling staging directory and only swapped into place once the executable is
verified present. The swap (commit_staged_install) keeps a recoverable backup at
all times: the existing install is renamed aside to "<dir>.old", staging is renamed
into place, and the backup is deleted only once the new tree is verified present;
if the promotion rename fails the backup is rolled back, so a failed swap can never
lose both installs. A RAII guard removes the staging tree on any pre-swap failure,
stale staging that cannot be cleared is treated as fatal (so a leftover binary can
never be promoted as a mixed install), and the staged version.txt write is checked.

The staging + atomic-swap logic is factored into a small header-only helper
(lemon/backends/install_staging.h) and covered by a unit test
(test/cpp/test_install_atomicity.cpp, ctest InstallAtomicityTest) that asserts a
failed install -- both a missing-executable extraction and a failed filesystem
swap -- preserves the previously-working binary.

Closes lemonade-sdk#2312
Reported-by: ckuethe (lemonade-sdk#2279)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: GLM-5.2 <noreply@zhipuai.cn>
Co-Authored-By: GPT-5.5 <noreply@openai.com>
@ianbmacdonald force-pushed the fix/install-atomicity branch from e73f104 to 01b716d on June 20, 2026 02:10

Labels

bug (Something isn't working), enhancement (New feature or request)

Development

Successfully merging this pull request may close these issues.

Backend reinstall is not crash-safe: install_from_github deletes the working binary before downloading the replacement

3 participants