feat(test-suite): local multi coprocessor deploy + e2e debug mode#1959

Merged
mergify[bot] merged 17 commits into main from codex/multicoproc-local-e2e
Feb 15, 2026

Conversation

@Eikix
Contributor

@Eikix Eikix commented Feb 11, 2026

Summary

  • add local multicoprocessor deploy flags to fhevm-cli:
    • --coprocessors N
    • --coprocessor-threshold T
  • wire deploy env generation so gateway + host are initialized dynamically for NUM_COPROCESSORS and COPROCESSOR_THRESHOLD
  • support additional local coprocessor instances (coprocessor1-*, coprocessor2-*, ...) with per-instance env and DB-migration-first startup
  • update MinIO endpoint rewriting to patch all generated coprocessor env files
  • keep ./fhevm-cli test input-proof as the single input-proof path for both 1/1 and n/t topologies
  • add coprocessor topology bootstrap wait + aggregated topology log collection for local debugging
  • add smoke profiles:
    • multi-2-2
    • multi-3-5
  • document local usage in test-suite/README.md
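
The smoke profile names appear to follow the threshold-first wording (multi-<T>-<N>), so multi-3-5 would correspond to 5 coprocessors with threshold 3. A minimal sketch of that mapping, assuming this naming convention; `profile_to_flags` is a hypothetical helper and not a function taken from fhevm-cli:

```shell
# Hypothetical helper: map a smoke profile name to deploy flags.
# Assumes threshold-first naming, i.e. multi-<T>-<N>.
profile_to_flags() {
  local rest t n
  case "$1" in
    multi-*-*) rest="${1#multi-}" ;;
    *) echo "unknown profile: $1" >&2; return 1 ;;
  esac
  t="${rest%%-*}"   # text before the first dash: threshold
  n="${rest#*-}"    # text after the first dash: coprocessor count
  # Reject anything that is not two plain non-empty integers.
  case "$t$n" in
    ''|*[!0-9]*) echo "unknown profile: $1" >&2; return 1 ;;
  esac
  if [ -z "$t" ] || [ -z "$n" ]; then
    echo "unknown profile: $1" >&2
    return 1
  fi
  echo "--coprocessors $n --coprocessor-threshold $t"
}
```

For example, `./fhevm-cli deploy --local $(profile_to_flags multi-3-5)` would expand to the 5/3 deploy command shown in the runbook.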

Why

We need a concrete local mode to validate e2e flow with NUM_COPROCESSORS > 1 and inspect where topology/consensus-adjacent flows break.

Scope

  • local debugging/validation workflow only
  • supports up to 5 coprocessors in this iteration
  • not production orchestration
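
For illustration, the constraints above (at most 5 coprocessors, and a threshold no larger than the coprocessor count) could be checked before any container is started. This is only a sketch: `validate_topology` and `MAX_COPROCESSORS` are hypothetical names, and the actual checks in fhevm-cli may be structured differently.

```shell
# Hypothetical pre-flight check, not copied from fhevm-cli.
MAX_COPROCESSORS=5  # cap stated in the PR scope

validate_topology() {
  local n="$1" t="$2"
  # Both values must be plain non-negative integers before comparing.
  case "$n" in
    ''|*[!0-9]*) echo "coprocessors must be an integer (got '$n')" >&2; return 1 ;;
  esac
  case "$t" in
    ''|*[!0-9]*) echo "threshold must be an integer (got '$t')" >&2; return 1 ;;
  esac
  if [ "$n" -lt 1 ] || [ "$n" -gt "$MAX_COPROCESSORS" ]; then
    echo "coprocessors must be between 1 and $MAX_COPROCESSORS (got $n)" >&2
    return 1
  fi
  if [ "$t" -lt 1 ] || [ "$t" -gt "$n" ]; then
    echo "threshold must be between 1 and $n (got $t)" >&2
    return 1
  fi
}
```

Failing early here gives a clearer error than letting a contract initialization fail later with an inconsistent signer count.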

Local runbook

cd test-suite/fhevm

# 1) Deploy a topology (examples)
./fhevm-cli deploy --local --coprocessors 2 --coprocessor-threshold 2
# or
./fhevm-cli deploy --local --coprocessors 5 --coprocessor-threshold 3

# 2) Run e2e input proof (same command for 1/1 and n/t)
./fhevm-cli test input-proof

# 3) Optional quick profiles
./fhevm-cli smoke multi-2-2
./fhevm-cli smoke multi-3-5

# 4) Cleanup
./fhevm-cli clean

Validation done

  • syntax:
    • bash -n test-suite/fhevm/fhevm-cli
    • bash -n test-suite/fhevm/scripts/deploy-fhevm-stack.sh
  • local e2e (fresh tmp clone, this branch):
    • ./fhevm-cli clean
    • ./fhevm-cli deploy --local --build
    • ./fhevm-cli test input-proof
    • ./fhevm-cli deploy --local --build --coprocessors 2 --coprocessor-threshold 2
    • ./fhevm-cli test input-proof
  • local e2e (historical run on this branch):
    • ./fhevm-cli deploy --local --coprocessors 5 --coprocessor-threshold 3
    • ./fhevm-cli test input-proof

@cla-bot cla-bot bot added the cla-signed label Feb 11, 2026
@mergify

mergify bot commented Feb 11, 2026

🧪 CI Insights

Here's what we observed from your CI run for e4949a0.

🟢 All jobs passed!

But CI Insights is watching 👀

@Eikix
Contributor Author

Eikix commented Feb 11, 2026

Ran local 2-coprocessor validation end-to-end in this branch.

What I ran:

  • ./test-suite/fhevm/fhevm-cli deploy --local --coprocessors 2 --coprocessor-threshold 2
  • ./test-suite/fhevm/fhevm-cli test multicoproc-input-proof

Observed behavior:

  • Deploy now succeeds with both coprocessor stacks up (coprocessor-* and coprocessor1-*) and host/gateway contracts initialized with 2 signers + threshold 2.
  • First test attempt failed with relayer bad JSON while key bootstrap was still converging.
  • Added a bootstrap gate in multicoproc-input-proof mode that waits for Fetched keyset on all *-sns-worker containers before running tests.
  • Re-ran test after gate: PASS (test user input uint64 (non-trivial)).

Commits added after initial PR draft:

  • 91071500 fix multicoprocessor compose generation issues
  • dcef1dd8 gate multicoprocessor test on key bootstrap readiness

@Eikix
Contributor Author

Eikix commented Feb 11, 2026

Update: I ran local N=5, threshold=3 end-to-end and validated the multicoproc input-proof path.

What I changed

  • Raised local multicoproc cap from 2 -> 5.
  • Derive coprocessor signer/sender keys from gateway mnemonic for all instances (cast), and write per-instance env (coprocessor_i DB + private key).
  • Split additional coprocessor startup in 2 phases:
    • run coprocessor{i}-db-migration first
    • only then start runtime services
      This avoids DB connection storms during startup.
  • Increased local Postgres max_connections to 500 for this mode.
  • Updated fhevm-cli multicoproc key-bootstrap gate to enforce quorum readiness (COPROCESSOR_THRESHOLD) instead of waiting for all SNS workers.
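
A rough sketch of what a quorum-based readiness check can look like, assuming readiness is detected by the `Fetched keyset` log line on the `*-sns-worker` containers as described earlier. The function names are hypothetical, and `fetch_logs` is kept as a separate function so the counting logic can be exercised without Docker:

```shell
# Hypothetical sketch; the gate in fhevm-cli may be implemented differently.
# fetch_logs wraps `docker logs` so it can be replaced by a stub in tests.
fetch_logs() { docker logs "$1" 2>&1; }

# Print how many of the given containers have logged the marker string.
count_ready_workers() {
  local marker="$1" ready=0 c
  shift
  for c in "$@"; do
    if fetch_logs "$c" | grep -F -q -- "$marker"; then
      ready=$((ready + 1))
    fi
  done
  echo "$ready"
}

# Succeed once at least <threshold> of the given workers fetched the keyset.
quorum_ready() {
  local threshold="$1"
  shift
  [ "$(count_ready_workers "Fetched keyset" "$@")" -ge "$threshold" ]
}
```

With `quorum_ready 3 coprocessor-sns-worker coprocessor1-sns-worker ...`, a single lagging worker no longer blocks the test run as long as the threshold is met.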

Validation I ran

  • Deploy: test-suite/fhevm/fhevm-cli deploy --local --coprocessors 5 --coprocessor-threshold 3
  • Test: test-suite/fhevm/fhevm-cli test multicoproc-input-proof
  • Result: PASS (1 passing, uint64 non-trivial input proof), with bootstrap readiness confirmed at 4/5 and threshold 3.

Note

  • I saw a reproducible local quirk where coprocessor-sns-worker (index 0) can stay on an empty LOB while other instances fetch keyset successfully. Quorum-based readiness unblocks testing while we investigate that root cause separately.

@Eikix
Contributor Author

Eikix commented Feb 11, 2026

Follow-up update after implementing next steps:

What I added

  • fhevm-cli smoke profiles:
    • multi-2-2
    • multi-5-3
  • Manual CI workflow for multicoproc smoke runs:
    • .github/workflows/test-suite-multicoproc-smoke.yml
    • matrix over multi-2-2 and multi-5-3
  • Deploy stability guards:
    • Wait for relayer readiness signal (All servers are ready and responding) before launching test-suite.
    • Handle one-shot host deploy completion more robustly in wait logic.
  • setup-kms-signer-address.sh no longer writes the key-id artifact file in repo root (curl -sSf inline fetch).
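
As an illustration of the readiness guard, a bounded retry loop around a log check is one simple way to wait for the relayer signal. This is a sketch under assumptions: `wait_until` and `relayer_ready` are hypothetical names, and the relayer container name is passed in as an argument because it is not stated in this thread.

```shell
# Hypothetical helper: retry a check command until it succeeds
# or the attempt budget is exhausted.
wait_until() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out after $attempts attempts: $*" >&2
  return 1
}

# Check for the readiness line quoted above in a container's logs.
# The container name is supplied by the caller (assumption).
relayer_ready() {
  docker logs "$1" 2>&1 | grep -F -q "All servers are ready and responding"
}

# Example (not executed here): wait_until 60 2 relayer_ready <relayer-container>
```

Bounding the attempts keeps a broken deploy from hanging the smoke run indefinitely.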

Validation I ran locally

  • test-suite/fhevm/fhevm-cli smoke multi-2-2 -> PASS
  • test-suite/fhevm/fhevm-cli smoke multi-5-3 -> PASS
  • test-suite/fhevm/fhevm-cli deploy --local --coprocessors 5 --coprocessor-threshold 3 && test-suite/fhevm/fhevm-cli test multicoproc-input-proof
    • First test attempt on one run hit transient relayer JSON error before readiness guard was added.
    • With readiness guard in place, reruns passed.

Investigation follow-up

@Eikix
Contributor Author

Eikix commented Feb 11, 2026

Addressed your self-review comments in the latest push (c4fe3d7c):

  • switched wording to threshold-first in docs (2 out of 2, i.e. n out of m style)
  • clarified that ./fhevm-cli test input-proof is the default path for both 1/1 and n/t topologies
  • kept multicoproc-input-proof only as a legacy alias (documented as such)

Files touched:

  • test-suite/README.md
  • test-suite/fhevm/fhevm-cli

@antoniupop
Collaborator

@claude /pr-review

@claude

claude bot commented Feb 11, 2026

Claude finished @antoniupop's task in 3m 52s


PR Review Complete ✅

PR: feat(test-suite): local 2-coprocessor deploy + e2e debug mode


Summary

This PR adds local multi-coprocessor deployment support to the fhevm-cli testing tool, enabling local validation of n-of-m coprocessor topologies. The changes are well-structured and the author has done thorough local validation.

Key additions:

  • New CLI flags: --coprocessors N and --coprocessor-threshold T
  • Dynamic env generation for gateway + host contracts with derived signing keys
  • Support for up to 5 coprocessor instances with per-instance DB and env files
  • Key bootstrap waiting with quorum-based readiness
  • Smoke profiles: multi-2-2 and multi-3-5

Bug Analysis

No critical runtime bugs found. The code is well-structured for its stated purpose (local debugging/validation workflow).


Error Handling Review

The review identified several || true patterns that suppress errors silently. However, given that this is explicitly a local debugging/validation workflow (not production orchestration), these are acceptable tradeoffs:

Observations (Non-Blocking)

Location       Pattern                  Assessment
fhevm-cli:108  docker logs ... || true  Acceptable - log collection is best-effort for debugging
fhevm-cli:122  docker ps ... || true    Acceptable - continues polling with appropriate warning
fhevm-cli:142  rg ... || true           Acceptable - falls back to total_count with sensible default

Nit: The cast wallet address/private-key commands in deploy-fhevm-stack.sh:458-459 don't have explicit error handling, but set -e at the top of the script will catch failures. Adding context to these errors would improve debuggability:

# Current
cp_address=$(cast wallet address --mnemonic "$gateway_mnemonic" --mnemonic-index "$account_index")

# Suggested (optional improvement)
if ! cp_address=$(cast wallet address --mnemonic "$gateway_mnemonic" --mnemonic-index "$account_index"); then
  log_error "Failed to derive coprocessor address for index $account_index"
  exit 1
fi

Positive Observations

  1. Good prerequisite validation - The script properly checks for cast availability and MNEMONIC presence before attempting operations
  2. Clear error messages - Error contexts like "Missing MNEMONIC in $gateway_env; cannot derive coprocessor accounts" are actionable
  3. Proper threshold validation - The script validates that threshold ≤ coprocessor count
  4. Quorum-based readiness - Smart approach to wait for threshold rather than all instances
  5. Clean curl improvement - The change in setup-kms-signer-address.sh from file-based to inline curl with -sSf is a bug fix for the original code
  6. Database scaling - max_connections=500 is appropriate for multi-coprocessor local mode
  7. Documentation - README updates with clear examples and the threshold-first notation (2 out of 2)

Minor Notes

  • The sed escaping in set_env_value() handles / and & but not | (the delimiter). This is fine since the values being set (addresses, URLs, private keys) don't contain |.
  • The host-sc-deploy completion marker detection and auto-stop is a pragmatic workaround for one-shot jobs that don't exit cleanly.
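
If values containing | ever need to be supported, the delimiter can be escaped along with & and backslashes. The following is a hypothetical variant for illustration only; the real `set_env_value()` in the PR is not shown in this thread and may differ (for example in how it handles missing keys):

```shell
# Hypothetical variant of set_env_value, not the function from the PR.
# Uses '|' as the sed delimiter and escapes it inside the value.
set_env_value() {
  local file="$1" key="$2" value="$3" escaped
  # Escape backslash, '&' (sed's "matched text") and '|' (the delimiter).
  escaped=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
  if grep -q "^${key}=" "$file"; then
    # Replace the existing assignment in place (portable across GNU/BSD sed).
    sed -i.bak "s|^${key}=.*|${key}=${escaped}|" "$file" && rm -f "$file.bak"
  else
    # Key not present yet: append it unescaped.
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}
```

Because / is not the delimiter here, URLs need no escaping at all.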

Verdict

Approved

This is a well-implemented feature for local multi-coprocessor testing. The author has clearly done extensive local validation (documented in PR comments), and the code follows sensible patterns for a debugging workflow. The scope is appropriately limited to local use only.


Branch: codex/multicoproc-local-e2e

antoniupop previously approved these changes Feb 11, 2026
@Eikix Eikix marked this pull request as ready for review February 11, 2026 16:33
@Eikix Eikix requested a review from a team as a code owner February 11, 2026 16:33
@enitrat
Contributor

enitrat commented Feb 11, 2026

at some point we should really re-write this entire local deployment stack in non-bash

@Eikix
Contributor Author

Eikix commented Feb 12, 2026

Superseding my earlier progress comments with current status:

  • current canonical test path is ./fhevm-cli test input-proof for both 1/1 and n/t
  • smoke profiles in branch are multi-2-2 and multi-3-5
  • multicoproc local deploy/test validated from a fresh tmp clone on this branch with --local --build:
    • default 1/1: deploy + input-proof pass
    • 2/2: deploy + input-proof pass
  • references in older comments to multicoproc-input-proof, multi-5-3, and issue #1963 ("Local multicoproc: one SNS worker can lag keyset bootstrap indefinitely in N=5/T=3") are outdated

Please use PR body + this comment as the up-to-date source of truth.

@Eikix
Contributor Author

Eikix commented Feb 12, 2026

Adding two concrete local runbooks to validate consensus behavior on compute path.

1) Happy path (2/2, all honest)

cd test-suite/fhevm
./fhevm-cli clean
./fhevm-cli deploy --local --coprocessors 2 --coprocessor-threshold 2
./fhevm-cli test input-proof-compute-decrypt

Then verify on-chain (gateway chain, :8546):

  1. AddCiphertextMaterial has two txs for the same ctHandle.
  2. Both txs have identical ciphertextDigest and snsCiphertextDigest.
  3. AddCiphertextMaterialConsensus is emitted for that handle.
  4. getAddCiphertextMaterialConsensusTxSenders(handle) returns 2 addresses.

Example commands:

# 1) list AddCiphertextMaterial logs
cast logs --json --rpc-url http://localhost:8546 \
  --address 0xF0bFB159C7381F7CB332586004d8247252C5b816 \
  "AddCiphertextMaterial(bytes32,uint256,bytes32,bytes32,address)" \
| jq -r '.[] | [.topics[1], ("0x"+(.data[66:130])), ("0x"+(.data[130:194])), ("0x"+(.data[218:258]))] | @tsv'

# 2) consensus events
cast logs --rpc-url http://localhost:8546 \
  --address 0xF0bFB159C7381F7CB332586004d8247252C5b816 \
  "AddCiphertextMaterialConsensus(bytes32,uint256,bytes32,bytes32,address[])"

# 3) sender set for a given handle
cast call --rpc-url http://localhost:8546 \
  0xF0bFB159C7381F7CB332586004d8247252C5b816 \
  "getAddCiphertextMaterialConsensusTxSenders(bytes32)(address[])" \
  <HANDLE>

2) Corrupted path (2/2 with 1 honest + 1 corrupted image)

Start from the same normal 2/2 deploy above, then replace only coprocessor1 runtime services with tag a1179b6:

cd test-suite/fhevm

docker compose -p fhevm -f docker-compose/coprocessor-docker-compose.yml \
  --env-file env/staging/.env.coprocessor.local \
  stop coprocessor1-host-listener coprocessor1-gw-listener coprocessor1-tfhe-worker coprocessor1-sns-worker coprocessor1-transaction-sender

COPROCESSOR_HOST_LISTENER_VERSION=a1179b6 \
COPROCESSOR_GW_LISTENER_VERSION=a1179b6 \
COPROCESSOR_TFHE_WORKER_VERSION=a1179b6 \
COPROCESSOR_SNS_WORKER_VERSION=a1179b6 \
COPROCESSOR_TX_SENDER_VERSION=a1179b6 \
docker compose -p fhevm -f docker-compose/coprocessor-docker-compose.yml \
  --env-file env/staging/.env.coprocessor.local \
  up -d coprocessor1-host-listener coprocessor1-gw-listener coprocessor1-tfhe-worker coprocessor1-sns-worker coprocessor1-transaction-sender

./fhevm-cli test input-proof-compute-decrypt

Expected behavior for corrupted case:

  1. For same ctHandle, two AddCiphertextMaterial txs exist but with different digest(s).
  2. No AddCiphertextMaterialConsensus event for that handle.
  3. getAddCiphertextMaterialConsensusTxSenders(handle) is empty (or below threshold).
  4. User decrypt path times out/fails because threshold consensus is not reached.

This gives a positive proof (happy path) and a negative proof (consensus break) on the same compute flow.
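
To make the digest comparison less manual, the 4-column TSV produced by the jq pipeline in the happy-path commands (handle, ciphertextDigest, snsCiphertextDigest, sender) can be grouped per handle. A small sketch; `check_digest_agreement` is a hypothetical helper and the AGREE/DIVERGE labels are my own:

```shell
# Hypothetical helper: reads TSV rows (handle, ciphertextDigest,
# snsCiphertextDigest, sender) on stdin and reports, per handle,
# the number of submissions and of distinct digest pairs.
check_digest_agreement() {
  awk -F'\t' '
    {
      subs[$1]++
      pair = $1 SUBSEP $2 SUBSEP $3
      if (!(pair in seen)) {
        seen[pair] = 1
        distinct[$1]++
      }
    }
    END {
      for (h in subs) {
        status = (distinct[h] == 1) ? "AGREE" : "DIVERGE"
        printf "%s\t%d\t%d\t%s\n", h, subs[h], distinct[h], status
      }
    }
  '
}
```

Piping the TSV output into `check_digest_agreement` should report AGREE for every handle in the happy path and DIVERGE for handles touched by the corrupted instance. Output order is not guaranteed, since awk's array traversal order is unspecified.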

@Eikix
Contributor Author

Eikix commented Feb 12, 2026

@mergify queue

@mergify

mergify bot commented Feb 12, 2026

Merge Queue Status

Rule: main


This pull request spent 1 hour 2 minutes 36 seconds in the queue, including 2 minutes 20 seconds running CI.

Required conditions to merge
  • check-success = run-e2e-tests / fhevm-e2e-test
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = common-pull-request/lint (bpr)
    • check-neutral = common-pull-request/lint (bpr)
    • check-skipped = common-pull-request/lint (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = coprocessor-cargo-listener-tests/cargo-tests (bpr)
    • check-neutral = coprocessor-cargo-listener-tests/cargo-tests (bpr)
    • check-success = coprocessor-cargo-listener-tests/cargo-tests (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = coprocessor-cargo-test/cargo-tests (bpr)
    • check-neutral = coprocessor-cargo-test/cargo-tests (bpr)
    • check-success = coprocessor-cargo-test/cargo-tests (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = coprocessor-dependency-analysis/dependencies-check (bpr)
    • check-neutral = coprocessor-dependency-analysis/dependencies-check (bpr)
    • check-success = coprocessor-dependency-analysis/dependencies-check (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = gateway-contracts-deployment-tests/sc-deploy (bpr)
    • check-neutral = gateway-contracts-deployment-tests/sc-deploy (bpr)
    • check-success = gateway-contracts-deployment-tests/sc-deploy (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = kms-connector-tests/test-connector (bpr)
    • check-neutral = kms-connector-tests/test-connector (bpr)
    • check-success = kms-connector-tests/test-connector (bpr)

Reason

The merge conditions cannot be satisfied due to failing checks

Failing checks:

Hint

You may have to fix your CI before adding the pull request to the queue again.
If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio requeue comment.

@mergify mergify bot added the queued label Feb 12, 2026
mergify bot added a commit that referenced this pull request Feb 12, 2026
@mergify mergify bot added dequeued and removed queued labels Feb 12, 2026
@Eikix
Contributor Author

Eikix commented Feb 13, 2026

@Mergifyio requeue

@mergify

mergify bot commented Feb 13, 2026

Merge Queue Status

Rule: main


This pull request spent 43 minutes 52 seconds in the queue, including 36 minutes 25 seconds running CI.

Required conditions to merge
  • check-success = run-e2e-tests / fhevm-e2e-test
  • any of [🛡 GitHub branch protection]:
    • check-neutral = kms-connector-tests/test-connector (bpr)
    • check-skipped = kms-connector-tests/test-connector (bpr)
    • check-success = kms-connector-tests/test-connector (bpr)
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = common-pull-request/lint (bpr)
    • check-neutral = common-pull-request/lint (bpr)
    • check-skipped = common-pull-request/lint (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = coprocessor-cargo-listener-tests/cargo-tests (bpr)
    • check-neutral = coprocessor-cargo-listener-tests/cargo-tests (bpr)
    • check-success = coprocessor-cargo-listener-tests/cargo-tests (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = coprocessor-cargo-test/cargo-tests (bpr)
    • check-neutral = coprocessor-cargo-test/cargo-tests (bpr)
    • check-success = coprocessor-cargo-test/cargo-tests (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = coprocessor-dependency-analysis/dependencies-check (bpr)
    • check-neutral = coprocessor-dependency-analysis/dependencies-check (bpr)
    • check-success = coprocessor-dependency-analysis/dependencies-check (bpr)
  • any of [🛡 GitHub branch protection]:
    • check-skipped = gateway-contracts-deployment-tests/sc-deploy (bpr)
    • check-neutral = gateway-contracts-deployment-tests/sc-deploy (bpr)
    • check-success = gateway-contracts-deployment-tests/sc-deploy (bpr)

Reason

The merge conditions cannot be satisfied due to failing checks

Failing checks:

Hint

You may have to fix your CI before adding the pull request to the queue again.
If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio requeue comment.

@mergify mergify bot removed the dequeued label Feb 13, 2026
@Eikix Eikix force-pushed the codex/multicoproc-local-e2e branch from 7a8a4bd to 0254d91 Compare February 13, 2026 14:44
@Eikix
Contributor Author

Eikix commented Feb 13, 2026

@Mergifyio requeue

@mergify

mergify bot commented Feb 13, 2026

requeue

☑️ This pull request is already queued

@Eikix
Contributor Author

Eikix commented Feb 15, 2026

@mergify queue

@mergify

mergify bot commented Feb 15, 2026

queue

☑️ Command queue ignored because it is already running from a previous command.

@mergify mergify bot removed the merge-queued label Feb 15, 2026
mergify bot added a commit that referenced this pull request Feb 15, 2026
@mergify mergify bot merged commit 7eb8bb9 into main Feb 15, 2026
62 checks passed
@mergify mergify bot removed the queued label Feb 15, 2026
@mergify mergify bot deleted the codex/multicoproc-local-e2e branch February 15, 2026 11:23

4 participants