fix(iii-init): re-run worker install after an early boot crash #1879
Draft
guibeira wants to merge 2 commits into
The `/var/.iii-prepared` marker gates the one-time setup/install step and persists with the VM-local dep dirs across restarts. It records only that the install command exited 0, not that the resulting dep tree is complete.

An install that exits 0 but leaves an incomplete `node_modules` (observed in the quickstart: `iii-sdk` present, `@iii-dev/observability` dropped) gets frozen behind the marker. Every later boot reuses the broken deps and the worker crashes on the same missing import, so it never registers its functions and `iii trigger` returns `function_not_found`. A plain restart can't recover; only clearing the worker's artifacts does.

Have PID-1 (iii-init) delete the marker when the worker child exits non-zero within 30s of its initial spawn, so the next boot re-runs install and self-heals. Scoped to real process exits (not signal deaths) so an intentional SIGTERM shutdown is never mistaken for a crash, and timed from the initial spawn so a crash after a source-edit restart (a runtime bug, not bad deps) doesn't needlessly trigger a reinstall. Applies to both legacy and supervised modes.
2a002be to 121c6f5
Problem
A worker can get permanently stuck in `function_not_found`. Hit while following the quickstart: `caller-worker` (TS, owns `math::add_two_numbers`) was crashing on boot on a missing import.

The VM's `node_modules` had `iii-sdk` but not `@iii-dev/observability`, a declared, published dependency. The first `npm install` exited 0 but left an incomplete tree (transient partial install). The worker crashed before registering its functions, so the trigger never resolved.

Why it persists (the real bug)

`build_libkrun_local_script` runs setup/install once, gated by `/var/.iii-prepared`, and the dep dirs live in `/var/iii/deps`, bind-mounted across restarts. The marker records only that install exited 0, not that the dep tree is complete. So a partial-but-exit-0 install gets frozen: every later boot reuses the broken deps and crashes on the same missing import. `iii worker restart` can't recover; only clearing the worker's artifacts does.

The transient install drop itself isn't reliably reproducible (a clean re-prep installs the package fine), so the durable fix targets the persistence, not the trigger.
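To make the freeze concrete, here is a small model of the gating logic described above. It is only a sketch: `VmState`, `Boot`, and `boot` are hypothetical names invented for illustration and do not appear in the PR, and the real gate lives in the script produced by `build_libkrun_local_script`, not in Rust.

```rust
/// Hypothetical model of the state that survives a worker restart.
#[derive(Debug, Default)]
pub struct VmState {
    /// Stands in for the `/var/.iii-prepared` marker.
    pub prepared: bool,
    /// Whether the persisted dep tree contains every declared import.
    pub deps_complete: bool,
}

/// Outcome of one simulated boot.
#[derive(Debug, PartialEq, Eq)]
pub enum Boot {
    Registered,
    CrashedOnImport,
}

/// Simulates one boot. `install_is_complete` says whether an install run on
/// this boot would yield a complete tree; the install exits 0 either way.
pub fn boot(state: &mut VmState, install_is_complete: bool) -> Boot {
    if !state.prepared {
        // Install runs only when the marker is absent.
        state.deps_complete = install_is_complete;
        // The marker records "exit 0", not "tree is complete".
        state.prepared = true;
    }
    if state.deps_complete {
        Boot::Registered
    } else {
        Boot::CrashedOnImport
    }
}
```

In this model, one partial install followed by any number of restarts keeps returning `Boot::CrashedOnImport`, even if a fresh install would now succeed. Clearing `prepared` is the only way back to `Boot::Registered`.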
Fix
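PID-1 (`iii-init`) deletes `/var/.iii-prepared` when the worker child exits non-zero within 30s of its initial spawn, forcing the next boot to re-run install and self-heal. Guards: only real process exits count, not signal deaths, because `kill_for_shutdown` uses SIGTERM and an intentional shutdown must not look like a crash; and the window is timed from the initial spawn, so a crash after a source-edit restart (a runtime bug, not bad deps) doesn't trigger a reinstall.

Applies to both legacy and supervised modes.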
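A minimal sketch of how the classification could look. The function name is inferred from the test name in this PR, but the signature, the `BOOT_FAILURE_WINDOW` constant, and the strict `<` boundary are assumptions, not the actual `iii-init` code. The exit code is passed as `Option<i32>`, mirroring `std::process::ExitStatus::code()`, which returns `None` on Unix when the process was terminated by a signal.

```rust
use std::time::Duration;

/// Window after the initial spawn in which a non-zero exit counts as a boot
/// failure (30s per the description; the constant name is hypothetical).
pub const BOOT_FAILURE_WINDOW: Duration = Duration::from_secs(30);

/// Returns true only for a real, non-zero exit inside the window.
///
/// `exit_code` is `Some(code)` for a normal exit and `None` for a signal
/// death, so a SIGTERM shutdown is never classified as a crash.
pub fn is_boot_failure(exit_code: Option<i32>, since_initial_spawn: Duration) -> bool {
    match exit_code {
        Some(code) => code != 0 && since_initial_spawn < BOOT_FAILURE_WINDOW,
        None => false,
    }
}
```

Keeping the function pure (no clock reads, no process handles) makes the exit-vs-signal, zero-vs-non-zero, and inside-vs-outside cases straightforward to unit test.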
Tradeoff
A worker with a genuine early-startup bug will re-run install on each boot. `npm install` / `pip install` are cheap and idempotent when deps are already complete, and the worker is crash-looping regardless, so this only adds self-heal attempts and never breaks a healthy worker.

Tests

- `is_boot_failure_classifies_exits`: exit vs signal, zero vs non-zero, inside vs outside the window.
- `invalidate_prepared_marker_removes_and_tolerates_missing`: removes an existing marker; a missing marker is a no-op.

`cargo test -p iii-init` green; `cargo fmt` / `clippy` clean.
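For reference, the remove-and-tolerate-missing behaviour can be sketched as below. The function name is inferred from the test name; taking the marker path as a parameter and returning `io::Result<()>` are assumptions made so the sketch can be exercised against a temp file, and the real helper may differ.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Removes the prepared marker so the next boot re-runs install.
/// A marker that is already missing is treated as success (no-op).
pub fn invalidate_prepared_marker(marker: &Path) -> io::Result<()> {
    match fs::remove_file(marker) {
        Ok(()) => Ok(()),
        // Already gone: nothing to invalidate.
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(()),
        // Anything else (e.g. permissions) is surfaced to the caller.
        Err(err) => Err(err),
    }
}
```

Matching on `ErrorKind::NotFound` rather than checking `exists()` first avoids a check-then-remove race and keeps the call idempotent.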