Skip to content

feat(test-suite): replace fhevm bash cli with Bun runtime#2066

Open
Eikix wants to merge 39 commits intomainfrom
codex/bun-fhevm-cli
Open

feat(test-suite): replace fhevm bash cli with Bun runtime#2066
Eikix wants to merge 39 commits intomainfrom
codex/bun-fhevm-cli

Conversation

@Eikix
Copy link
Contributor

@Eikix Eikix commented Mar 9, 2026

Closes https://github.com/zama-ai/fhevm-internal/issues/1042

Summary

Replace the legacy test-suite/fhevm/fhevm-cli bash flow with a Bun-native TypeScript CLI that keeps runtime state under .fhevm and makes local stack workflows explicit, deterministic, resumable, and compatible with the current simple-ACL era.

What This PR Fixes

The old bash flow worked, but too much of the system lived in operator memory.

This PR makes these workflows first-class:

  • run the stack on latest-main, latest-release, an exact supported built SHA, or a GitOps network target
  • swap in local changes for one repo-owned group or a supported runtime service subset
  • run multicopro topologies locally with deterministic generated state and per-instance env/arg overrides
  • exercise the same CLI surface locally and in CI

Concretely, it removes these bash-era pain points:

  • implicit boot order and runtime discovery
  • stale downstream env/config after contract deployment and discovery
  • weak version targeting for “local changes on top of deployed-ish stack” workflows
  • multi-copro topologies requiring manual edits
  • failures that are hard to resume cleanly
  • local reliability depending too much on remembering the right ritual
  • workflows like “clean deploy, rebuild one local code path, then rerun a few suites on the same stack” being manual and flaky

Architecture

flowchart TD
  A["fhevm-cli up"] --> B["Resolve target"]
  B --> B1["latest-main: walk main SHAs until complete image set after 803f104"]
  B --> B2["latest-release: latest stable release"]
  B --> B3["sha: exact repo-owned SHA on main, fail if missing or before 803f104"]
  B --> B4["devnet/testnet/mainnet: GitOps bundles"]
  B1 --> C["Lock resolved bundle"]
  B2 --> C
  B3 --> C
  B4 --> C

  C --> D["Preflight / --dry-run"]
  D --> E["Generate runtime files under .fhevm"]
  E --> E1["env/"]
  E --> E2["compose/"]
  E --> E3["locks/"]
  E --> E4["addresses/"]
  E --> E5["state.json"]

  E --> F["Boot pipeline"]
  F --> F1["base"]
  F1 --> F2["kms-signer"]
  F2 --> F3["gateway-deploy"]
  F3 --> F4["host-deploy"]
  F4 --> F5["discover"]
  F5 --> F6["regenerate + validate"]
  F6 --> F7["coprocessor + kms-connector"]
  F7 --> F8["bootstrap"]
  F8 --> F9["relayer"]
  F9 --> F10["test-suite"]

  G["Local overrides (group or runtime service subset)"] --> E
  H["Multicopro topology + per-instance env/arg overrides"] --> E
  I["Compatibility policy"] --> E
  I --> F7

  E5 --> J["resume / from-step"]
  J --> F

  K["fhevm-cli test"] --> F10
Loading

What Changed

  • replace the bash fhevm-cli entrypoint with a thin Bun launcher and Bun-native TS implementation
  • remove the legacy bash helper scripts
  • add explicit target resolution for:
    • latest-main
    • latest-release
    • sha
    • devnet
    • testnet
    • mainnet
  • resolve latest-main by walking back zama-ai/fhevm/main until a complete short-SHA image bundle exists, but only at or after the simple-ACL floor 803f104
  • make --target sha --sha <git-sha> modern-only too:
    • must be a complete repo-owned image set
    • must be on main
    • must be at or after 803f104
  • persist runtime state, locks, generated env files, generated compose files, and discovered addresses under .fhevm
  • support repo-local overrides for:
    • coprocessor
    • kms-connector
    • gateway-contracts
    • host-contracts
    • test-suite
  • support per-service override filtering only for runtime groups: coprocessor, kms-connector, test-suite
  • add --override all as a thin alias for all full-group local overrides
  • add upgrade <group> for active runtime overrides only: coprocessor, kms-connector, test-suite
  • make upgrade restart runtime services only and skip one-shot DB migration services
  • support resumable named steps with --resume / --from-step
  • support multicopro topologies with deterministic naming plus per-instance env/arg overrides
  • add deploy as a pure alias for up
  • remove doctor in favor of up --dry-run
  • make fhevm-cli test always run with --no-compile
  • remove override build profiles; local overrides always build release images
  • harden GitHub resolution errors for missing gh, missing auth, and rate limiting
  • make bootstrap probing retry-safe across onchain IDs and MinIO material publication
  • regenerate missing .fhevm/env and .fhevm/compose artifacts from state.json before resume/upgrade/teardown
  • keep .fhevm state on disk when teardown fails so cleanup can be retried
  • move the existing e2e workflows to the new CLI instead of introducing a parallel workflow
  • add workflow image_mode=registry|workspace and override=... dispatch semantics
  • keep merge-queue/orchestrated E2E on explicit image_mode=registry
  • after #2072, gate simple-ACL compat explicitly:
    • modern-only latest-main / sha
    • old relayer config rewrite for old relayer versions
    • old tx-sender flags only for old versions
    • no MULTICHAIN_ACL_ADDRESS requirement for full modern workspace protocol overrides
  • fix merge-queue/orchestrated E2E after simple ACL by:
    • assigning a modern relayer for post-803f104 stacks
    • emitting 7-char SHA tags
  • fix mocked-payment approval ordering so the relayer tx-sender approves the real deployed PROTOCOL_PAYMENT_ADDRESS, not a stale pre-deploy address

Important Invariants

  • version target selection is explicit instead of "latest-ish"
  • every resolved bundle is printed and locked
  • latest-main and sha are modern-only after the simple-ACL cutover
  • exact-SHA target selection fails fast unless every repo-owned package is available for that SHA tag
  • gateway/host discovery regenerates downstream env/config before dependent services start
  • dependent services fail fast if discovery is incomplete
  • resume is step-based instead of relying on rerunning ad hoc bash fragments
  • multicopro artifacts are CLI-owned and deterministic
  • local override rebuilds are observable and upgradable on a running stack

Compatibility Handling

This PR keeps explicit compatibility only where supported targets still need it:

  • latest-main and sha are modern-only once resolution enforces the simple-ACL floor
  • older coprocessor images still need legacy API-key and tx-sender CLI args on non-main targets
  • older connector images still need legacy chain-id env shape on non-main targets
  • older relayer versions still need legacy readiness config shape
  • the CLI applies those shims from explicit compatibility policy instead of burying them inline in orchestration code

CI Changes

  • update test-suite-e2e-tests.yml to boot with the Bun CLI
  • update test-suite-e2e-operators-tests.yml to boot with the Bun CLI
  • add workflow dispatch support for:
    • image_mode=registry
    • image_mode=workspace
    • override=...
  • keep merge-queue/orchestrated E2E on registry
  • replace timeout-based log following with one-shot log snapshots on failure
  • feed explicit workflow image versions through *_VERSION environment overrides

Validation

  • bun run check
  • bun test (98 tests)
  • local end-to-end validation on rebased head fcff4739:
    • baseline lifecycle passed
    • override marker test passed
    • upgrade coprocessor rebuilt V1 -> V2 from unstaged local changes and surfaced the new marker in runtime logs
    • targeted functional tests passed
  • manual workflow validation:
    • workspace + override=all passed on supported heads
    • merge-queue-style sacrificial PR #2087 passed after simple-ACL follow-ups

@cla-bot cla-bot bot added the cla-signed label Mar 9, 2026
@zama-ai zama-ai deleted a comment from claude bot Mar 9, 2026
@mergify
Copy link

mergify bot commented Mar 9, 2026

🧪 CI Insights

Here's what we observed from your CI run for 0dbf663.

🟢 All jobs passed!

But CI Insights is watching 👀

@Eikix Eikix force-pushed the codex/bun-fhevm-cli branch from 7408594 to a651dc5 Compare March 9, 2026 19:45
@Eikix Eikix marked this pull request as ready for review March 10, 2026 09:13
@Eikix Eikix requested review from a team as code owners March 10, 2026 09:13
Copy link
Contributor

@eudelins-zama eudelins-zama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small note to say it would be nice to merge this feature branch first to avoid handling conflicts with this one 😇

@Eikix
Copy link
Contributor Author

Eikix commented Mar 10, 2026

@claude /pr-review

@cla-bot cla-bot bot removed the cla-signed label Mar 10, 2026
@PanGan21
Copy link

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Mar 10, 2026
@cla-bot
Copy link

cla-bot bot commented Mar 10, 2026

The cla-bot has been summoned, and re-checked this pull request!

@zama-ai zama-ai deleted a comment from claude bot Mar 10, 2026
@zama-ai zama-ai deleted a comment from cla-bot bot Mar 10, 2026
@zama-ai zama-ai deleted a comment from cla-bot bot Mar 10, 2026
@zama-ai zama-ai deleted a comment from cla-bot bot Mar 10, 2026
@zama-ai zama-ai deleted a comment from cla-bot bot Mar 10, 2026
@zama-ai zama-ai deleted a comment from cla-bot bot Mar 10, 2026
@zama-ai zama-ai deleted a comment from claude bot Mar 10, 2026
@Eikix Eikix force-pushed the codex/bun-fhevm-cli branch from 4b82c14 to 8fce5b0 Compare March 10, 2026 15:14
@zama-ai zama-ai deleted a comment from claude bot Mar 10, 2026
@zama-ai zama-ai deleted a comment from claude bot Mar 10, 2026
@claude
Copy link

claude bot commented Mar 10, 2026

PR Review: feat(test-suite): replace fhevm bash cli with Bun runtime

  • Pre-review validation
  • Gather context (PR diff, source files, workflows)
  • Analyze code for bugs, security, and error handling
  • Validate and deduplicate findings
  • Post final review

Overall Assessment

This is an impressive piece of infrastructure engineering. The Bun-native TypeScript CLI is a significant upgrade over the legacy bash scripts, with clean architecture, strong type safety, and comprehensive test coverage. The code is well-structured with good separation of concerns across runtime.ts, artifacts.ts, versions.ts, compat.ts, and layout.ts.

All 7 issues from the previous review have been addressed:

Previous Finding Status
Empty catch {} swallows regen() errors ✅ Fixed
DI bypass for waitForRpc/ethCallId/ensureMaterialUrl ✅ Fixed — all use deps.fetch
depends_on rewriting breaks shared deps ✅ Fixed — checks clonedServices set
Compat policies skip SHA-tagged images ✅ Fixed — by design, with explicit test
resolveEnvMap silent unresolved circular refs ✅ Fixed — now throws with key names
Dead ternary in lock name ✅ Fixed — lock names are now distinct per target
writeEnvFile JSON.stringify quoting corrupts values ✅ Fixed — uses single quotes first via quoteEnvValue

Additionally, the mergeArgs bare flag over-matching concern now has an explicit test confirming correct behavior (line 113-118 in utils.test.ts).


🟢 No High-Confidence Bugs Found

After thorough analysis across all source files, no runtime bugs were identified at high confidence. Key areas verified:

  • Env file quoting round-trip: quoteEnvValueparseEnv correctly handles JSON strings with embedded quotes via single-quote wrapping
  • Multi-coprocessor isolation: Instance envs are properly shallow-copied before mutation; indices array has 5 elements matching the validated 1-5 range
  • Shell injection: shellEscape correctly escapes single quotes; structured argv arrays prevent injection in Bun.spawn
  • Resume logic: startStep/stateStepIndex correctly handles fresh, resume, and --from-step scenarios
  • Compose interpolation: interpolateComposeValue creates new object trees without mutating templates
  • Race conditions: All operations use sequential await patterns; no concurrent state mutations found

🟡 Nit: discoverSigner error missing log sample

test-suite/fhevm/src/runtime.ts:464-469 — Confidence: 90/100

When the regex fails to match the KMS signer handle, the error message doesn't include a snippet of the actual logs. This makes debugging harder when the log format changes or the container failed silently:

if (!match) {
  throw new Error("Could not extract KMS signer handle");
}

Suggestion: Include a truncated log snippet:

throw new Error(`Could not extract KMS signer handle from logs:\n${(logs.stdout || logs.stderr || "<empty>").slice(0, 500)}`);

🟡 Nit: ensureMaterialUrl timeout error lacks HTTP status

test-suite/fhevm/src/runtime.ts:633-642 — Confidence: 88/100

When the material endpoint times out, the error only says the URL wasn't ready but doesn't include the last HTTP status code:

throw new Error(`Material endpoint not ready: ${url}`);

Including the last status would help distinguish a 404 (material missing) from a 403 (auth issue) or 500 (MinIO down).


🔵 Observation: dockerInspect parses even on failure

test-suite/fhevm/src/runtime.ts:104-117 — Confidence: 75/100

dockerInspect uses allowFailure: true but always tries to parse stdout. When docker inspect fails (container doesn't exist), it falls through to JSON.parse("[]") which returns an empty array — this is actually handled correctly by callers checking if (inspect) after destructuring. So this is not a bug, but the "invalid JSON" error message could be misleading in edge cases where docker returns non-JSON error text to stdout.


✅ What Looks Good

  • Architecture: Clean separation between types, layout, artifacts, versions, compat, and runtime. The RuntimeDeps DI pattern is elegant and enables thorough unit testing
  • Step-based state machine: Deterministic, resumable pipeline with proper step ordering, state persistence, and --from-step support
  • Compatibility handling: Explicit compat policies with version-gated shims is much better than inline conditionals
  • Schema guards: assertSchemaCompatibility prevents subtle DB migration drift when using per-service overrides on latest-release
  • Version resolution: The latest-main floor enforcement (acfa977) and exact-SHA validation are well thought out
  • Testing: 85+ tests covering argument validation, compat edge cases, version resolution, env parsing, and round-trips
  • CI workflows: Proper permissions scoping (permissions: {} at workflow level), SHA-pinned actions, no template injection vulnerabilities
  • Security: Structured argv arrays in Bun.spawn prevent shell injection; shellEscape used for the one sh -lc case; deriveWallet now validates cast output with regex
  • Documentation: ARCHITECTURE.md and CLI README are thorough and match the implementation
  • Env quoting: quoteEnvValue correctly handles Docker Compose env_file semantics with single-quote preference

Summary

This is a well-executed migration from bash to TypeScript. All previous review issues have been resolved. No new bugs were found. The remaining items are minor error-message quality improvements (nits) that are non-blocking.

Verdict: Looks good to merge 🟢


View job

@Eikix
Copy link
Contributor Author

Eikix commented Mar 10, 2026

Opened a sacrificial draft PR to exercise the merge-queue-style E2E orchestration on the same head commit as this PR: #2087

It is draft-only and should not be merged. Once we have the workflow signal, we can close it.

Replace the 500-line bash fhevm-cli with a structured Bun/TypeScript
implementation. The new CLI resolves version bundles from GitHub
releases or GitOps branches, generates compose overrides and env files
under .fhevm/, and boots the stack through a 13-step pipeline with
resume support.

Key modules:
- versions.ts: resolve targets (latest-main, latest-release, gitops)
- artifacts.ts: compose/env generation, local build isolation
- runtime.ts:  boot pipeline, test runner, pause/unpause
- compat.ts:   legacy CLI flag injection for older image versions
- cli.test.ts: unit tests for version resolution, compat, layout

Local builds use :fhevm-local image tags to avoid overwriting registry
images. Services sharing an image tag are deduplicated before buildx.
@Eikix Eikix force-pushed the codex/bun-fhevm-cli branch from aab6a5e to fcff473 Compare March 11, 2026 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants