fix(iii-init): re-run worker install after an early boot crash #1879

Draft
guibeira wants to merge 2 commits into main from fix/reprep-on-early-worker-crash

Problem

A worker can get permanently stuck in function_not_found. Hit while following the quickstart:

$ iii trigger math::add_two_numbers a=10 b=20
Error: { "code": "function_not_found", "message": "Function math::add_two_numbers not found" }

caller-worker (TS, owns math::add_two_numbers) was crashing on boot:

Error [ERR_MODULE_NOT_FOUND]: Cannot find package '@iii-dev/observability' imported from /workspace/src/worker.ts

The VM's node_modules had iii-sdk but not @iii-dev/observability — a declared, published dependency. The first npm install exited 0 but left an incomplete tree (transient partial install). The worker crashed before registering its functions, so the trigger never resolved.

Why it persists (the real bug)

build_libkrun_local_script runs setup/install once, gated by /var/.iii-prepared, and the dep dirs live in /var/iii/deps bind-mounted across restarts. The marker records only that install exited 0, not that the dep tree is complete. So a partial-but-exit-0 install gets frozen: every later boot reuses the broken deps and crashes on the same missing import. iii worker restart can't recover — only clearing the worker's artifacts does.
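
To make the failure mode concrete, here is a minimal Rust sketch of marker-gated preparation. It is an illustration only: the real gate lives in the script generated by build_libkrun_local_script, and `prepare_once` and its signature are hypothetical names, not code from this PR. The point it shows is that the marker is written whenever the install step reports success, with no check of the resulting dep tree.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: runs `install` only when `marker` is absent and
/// writes the marker once `install` reports success.
/// Returns Ok(true) if the install step ran, Ok(false) if it was skipped.
fn prepare_once<F>(marker: &Path, install: F) -> io::Result<bool>
where
    F: FnOnce() -> io::Result<()>,
{
    if marker.exists() {
        // A previous boot already "succeeded"; the deps are reused as-is,
        // even if that earlier install left an incomplete tree.
        return Ok(false);
    }
    install()?;
    // The marker records only that install returned Ok, nothing more.
    fs::write(marker, b"")?;
    Ok(true)
}
```

Under this model, a partial install that still returns success is indistinguishable from a complete one on every later boot, which is why removing the marker is the recovery lever.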

The transient install drop itself isn't reliably reproducible (a clean re-prep installs the package fine), so the durable fix targets the persistence, not the trigger.

Fix

PID-1 (iii-init) deletes /var/.iii-prepared when the worker child exits non-zero within 30s of its initial spawn, forcing the next boot to re-run install and self-heal. Guards:

  • Process exits only, not signals: kill_for_shutdown uses SIGTERM, and an intentional shutdown must not look like a crash.
  • Timed from the initial spawn — a crash after a source-edit restart is a runtime bug, not bad deps, so it shouldn't trigger a reinstall.
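
The two guards above can be sketched as a single classifier. This is a hedged approximation rather than the PR's code: the test name is_boot_failure_classifies_exits suggests a function called `is_boot_failure`, but its real signature is not shown here, so the parameters, the `BOOT_FAILURE_WINDOW` constant, and the strict `<` comparison at the 30s boundary are all assumptions.

```rust
use std::process::ExitStatus;
use std::time::Duration;

// Assumption: the 30s window from the description, expressed as a constant.
const BOOT_FAILURE_WINDOW: Duration = Duration::from_secs(30);

/// Returns true only for a non-zero process exit inside the boot window.
/// `since_initial_spawn` is the time elapsed since the worker's first spawn.
fn is_boot_failure(status: ExitStatus, since_initial_spawn: Duration) -> bool {
    match status.code() {
        // On Unix, `code()` is None when the child was terminated by a
        // signal (for example SIGTERM during shutdown): not a boot failure.
        None => false,
        // A clean exit is never a boot failure.
        Some(0) => false,
        // Non-zero exit: a boot failure only if it happened early enough.
        // Whether the boundary is inclusive is an assumption of this sketch.
        Some(_) => since_initial_spawn < BOOT_FAILURE_WINDOW,
    }
}
```

Keeping the classifier a pure function of the exit status and the elapsed time makes each case (exit vs signal, zero vs non-zero, inside vs outside the window) straightforward to unit test.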

Applies to both legacy and supervised modes.

Tradeoff

A worker with a genuine early-startup bug will re-run install on each boot. npm install/pip install are cheap and idempotent when deps are already complete, and the worker is crash-looping regardless — so this only adds self-heal attempts, never breaks a healthy worker.

Tests

  • is_boot_failure_classifies_exits — exit vs signal, zero vs non-zero, inside vs outside the window.
  • invalidate_prepared_marker_removes_and_tolerates_missing — removes an existing marker; a missing marker is a no-op.
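
The behaviour the second test describes (remove an existing marker, treat a missing one as a no-op) could look roughly like the sketch below. The function name is taken from the test name; the signature and the decision to propagate other I/O errors are assumptions, not the PR's actual code.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Removes the prepared marker so the next boot re-runs install.
/// A marker that is already missing is treated as success.
fn invalidate_prepared_marker(marker: &Path) -> io::Result<()> {
    match fs::remove_file(marker) {
        Ok(()) => Ok(()),
        // Nothing to invalidate: the next boot will install anyway.
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        // Assumption: any other error (e.g. permissions) is surfaced.
        Err(e) => Err(e),
    }
}
```

Matching on `ErrorKind::NotFound` instead of checking `exists()` first avoids a check-then-remove race and keeps the call idempotent.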

cargo test -p iii-init green; cargo fmt/clippy clean.

vercel Bot commented Jun 18, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| iii-website | Ready | Preview, Comment | Jun 18, 2026 1:18pm |

coderabbitai Bot commented Jun 18, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ead99a06-399c-4555-a9ff-7b8364ce21ae

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

The `/var/.iii-prepared` marker gates the one-time setup/install step and
persists with the VM-local dep dirs across restarts. It records only that
the install command exited 0, not that the resulting dep tree is complete.

An install that exits 0 but leaves an incomplete `node_modules` (observed
in the quickstart: `iii-sdk` present, `@iii-dev/observability` dropped)
gets frozen behind the marker. Every later boot reuses the broken deps and
the worker crashes on the same missing import, so it never registers its
functions and `iii trigger` returns `function_not_found`. A plain restart
can't recover — only clearing the worker's artifacts does.

Have PID-1 (iii-init) delete the marker when the worker child exits
non-zero within 30s of its initial spawn, so the next boot re-runs install
and self-heals. Scoped to real process exits (not signal deaths) so an
intentional SIGTERM shutdown is never mistaken for a crash, and timed from
the initial spawn so a crash after a source-edit restart (a runtime bug,
not bad deps) doesn't needlessly trigger a reinstall.

Applies to both legacy and supervised modes.
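
As an illustration of the "timed from the initial spawn" guard, the sketch below keeps the first spawn time fixed while restarts are counted separately. `SpawnClock` and its methods are hypothetical names introduced for this example; how iii-init actually tracks the spawn time is not shown in this description.

```rust
use std::time::{Duration, Instant};

/// Hypothetical bookkeeping for one worker: the initial spawn time is
/// captured once and never reset by later restarts.
struct SpawnClock {
    initial_spawn: Instant,
    restarts: u32,
}

impl SpawnClock {
    fn start(now: Instant) -> Self {
        SpawnClock { initial_spawn: now, restarts: 0 }
    }

    /// Called on a source-edit restart. Deliberately leaves
    /// `initial_spawn` untouched so a later crash is measured
    /// against the first boot, not the most recent restart.
    fn record_restart(&mut self) {
        self.restarts += 1;
    }

    /// Elapsed time since the first spawn, as seen at `now`.
    fn since_initial_spawn(&self, now: Instant) -> Duration {
        now.duration_since(self.initial_spawn)
    }
}
```

With this shape, a worker that restarts at 40s and crashes at 45s reports 45s since its initial spawn, which falls outside the 30s window and so would not trigger a reinstall.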