fix(backends): stage and verify backend install before removing the working binary#2315
fix(backends): stage and verify backend install before removing the working binary#2315ianbmacdonald wants to merge 1 commit into
Conversation
|
Process note: I opened this from my fork branch ( |
There was a problem hiding this comment.
Pull request overview
Makes backend (re)installation crash-safe by staging downloads/extractions and only replacing the currently-working backend install after verification, preventing “failed download leaves no binary” situations in BackendUtils::install_from_github.
Changes:
- Add a header-only staging + atomic-swap helper (
commit_staged_install) and reuse it for executable discovery. - Update
install_from_githubto extract into*.staging, writeversion.txtinto staging, verify the executable exists, then promote via atomic swap. - Add a network-free CTest unit (
InstallAtomicityTest) validating swap success, verify-fail preservation, fresh install, and POSIX swap-failure rollback.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/cpp/server/backends/backend_utils.cpp |
Switch install flow to stage → verify → atomic swap using the new helper. |
src/cpp/include/lemon/backends/install_staging.h |
Introduces header-only executable lookup + commit/promotion helper. |
test/cpp/test_install_atomicity.cpp |
Adds regression/unit tests for atomic install invariants and failure behavior. |
CMakeLists.txt |
Registers the new C++ unit test executable and CTest entry. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ac6cb28 to
e73f104
Compare
|
Addressed all three Copilot comments in the rebased push (
|
…orking binary install_from_github removed the existing install directory up front whenever the installed version no longer matched the pin (or version.txt was missing), before the replacement asset was downloaded. A slow or interrupted download (unreliable link, going offline mid-flight, a transient GitHub 5xx) then left the backend with no usable binary at all: the old one was already gone and the new one never arrived. Make the reinstall atomic. The new install is downloaded and extracted into a sibling staging directory and only swapped into place once the executable is verified present. The swap (commit_staged_install) keeps a recoverable backup at all times: the existing install is renamed aside to "<dir>.old", staging is renamed into place, and the backup is deleted only once the new tree is verified present; if the promotion rename fails the backup is rolled back, so a failed swap can never lose both installs. A RAII guard removes the staging tree on any pre-swap failure, stale staging that cannot be cleared is treated as fatal (so a leftover binary can never be promoted as a mixed install), and the staged version.txt write is checked. The staging + atomic-swap logic is factored into a small header-only helper (lemon/backends/install_staging.h) and covered by a unit test (test/cpp/test_install_atomicity.cpp, ctest InstallAtomicityTest) that asserts a failed install -- both a missing-executable extraction and a failed filesystem swap -- preserves the previously-working binary. Closes lemonade-sdk#2312 Reported-by: ckuethe (lemonade-sdk#2279) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-Authored-By: GLM-5.2 <noreply@zhipuai.cn> Co-Authored-By: GPT-5.5 <noreply@openai.com>
e73f104 to
01b716d
Compare
What
Makes backend (re)installation crash-safe.
BackendUtils::install_from_githubno longer deletes the working binary before the replacement is downloaded and verified.Closes #2312.
Why
On a version-mismatch / missing-
version.txtreinstall,install_from_githubcalledfs::remove_all(install_dir)up front — before downloading the new release asset. If that download then failed (slow/unreliable link, going offline mid-flight, a transient GitHub 5xx), the backend was left with no usable binary: the old one was already gone and the new one never arrived.Surfaced by @ckuethe on #2279 ("not remove and replace the llama binary until the download is verified to be complete" — slow links, e.g. on an airplane). #2279 handled the latest-version lookup failure; this is the complementary download robustness case.
The fix
Stage → verify → atomic swap:
install_dir + ".staging"directory — the working install is untouched.install_dir + ".old",install_dir,.oldonly once the new tree is in place;.oldback.Additional hardening (from review):
version.txtwrite is checked.The staging/swap logic is a small header-only helper (
lemon/backends/install_staging.h) so it can be unit-tested without the heavierbackend_utils.cppdependencies;find_executable_in_install_dirnow delegates to it (single source of truth).Testing
New ctest
InstallAtomicityTest(test/cpp/test_install_atomicity.cpp), network-free, covering: successful swap (no.old/.stagingleft behind), regression guard — a staging tree missing the executable leaves the old working binary +version.txtbyte-for-byte intact, fresh install, and (POSIX) a failed filesystem swap preserves the working binary and throws.RED→GREEN verified: the regression assertions fail against the pre-fix
remove-before-verifyordering and pass against the fix.Validation environment
This is a pure
std::filesystemchange with no inference path exercised, so backend = none (no model load, no ROCm/Vulkan). The unit test runs on every platform via CI ctest.cmake --buildclean,lemondlinks,ctest3/3 green7.0.0-22-generic, glibc 2.43, g++ 15.2.0 —InstallAtomicityTest17/17 pass, clean-Wall -Wextratest/cpp/test_install_atomicity.cpp(ctestInstallAtomicityTest)Scope / follow-up
install_therock(the ROCm tarball path) has the sameremove_all-before-download shape and would benefit from the same staging approach, but is left out of scope to keep this PR focused — happy to follow up in a separate PR.🤖 Generated with Claude Code