Skip to content

feat(taiko-client): latched ZK→SGX drain/resume for proof submitter#21795

Open
davidtaikocha wants to merge 9 commits into
mainfrom
prover/zk-sgx-drain-resume
Open

feat(taiko-client): latched ZK→SGX drain/resume for proof submitter#21795
davidtaikocha wants to merge 9 commits into
mainfrom
prover/zk-sgx-drain-resume

Conversation

@davidtaikocha

Copy link
Copy Markdown
Collaborator

Summary

Improves the ZK/SGX proof routing added in #21782. Today shouldUseZKProof is a stateless, per-proposal distance check, so the prover can flap between ZK and SGX around the threshold and leaves the ZK backend churning a zk_any backlog nobody is waiting on.

This PR turns that check into a latched two-state machine in the proof submitter:

  • Trigger: the first time proposalID > lastFinalizedProposalID + maxZKProofProposalDistance, the prover latches into SGX-draining mode and fires a one-off POST /v3/prover/clear on the ZK endpoint to discard the stale zk_any backlog.
  • Drain: while latched, every proposal is proven via SGX (ZK is not attempted).
  • Resume: switches back to ZK only when both (A) proposalID ≤ lastFinalizedProposalID + 1 (backlog drained) and (B) GET /v3/prover/status reports data.clean == true. A future breach repeats the cycle.

The two control-plane endpoints come from raiko2 #93.

Design decisions

  • Inline latch on ProofSubmitter, mutex-guarded (each RequestProof runs in its own goroutine); clear fires exactly once per transition via a CAS.
  • data.clean is the single resume gate.
  • No new flag — reuses maxZKProofProposalDistance (default 30; 0 disables the machine entirely, preserving prior behavior).
  • Conservative + degrade: clear is best-effort (bounded background retries); a not-clean /status keeps draining; a /status error or 404 degrades to resuming on condition (A) alone — so the prover never gets stuck on SGX if raiko2 [protocol] fix #84 test:genesis/deploy_L1 workflow errors #93 isn't deployed yet, and the per-proposal zk_any_not_drawn / timeout fallbacks are unchanged and never touch the latch.

/status is only queried once condition (A) holds, so it isn't polled per-proposal during a deep backlog.

Observability

  • prover_zk_backlog_sgx_mode gauge (1 = draining, 0 = ZK)
  • prover_zk_backlog_clear counter

Rollout

Safe to deploy before raiko2 #93 ships: with the endpoints absent, the prover still latches on a breach and resumes once caught up to the tip (condition A) — a strict improvement over per-proposal flapping, minus the upstream-drain wait. Once #93 deploys, clear accelerates draining and data.clean adds the upstream-drain guarantee, with no prover config change.

Test plan

  • go test -count=1 -race ./packages/taiko-client/prover/proof_submitter/ ./packages/taiko-client/prover/proof_producer/ — ✅ both pass, no data races
  • go test -count=1 ./packages/taiko-client/prover/ -run TestNewConfigFromCliContextMaxZKProofProposalDistance — ✅
  • go build ./packages/taiko-client/... and go vet on the changed packages — ✅ clean

New coverage: control-plane client (httptest: method/path, data.clean parse, non-200/404 error, dummy short-circuit); state machine (within-distance, breach latches + clears exactly once incl. a 50-goroutine concurrency test, resume when drained+clean, stay-SGX when not drained / not clean, degrade on /status error, machine disabled, nil-backlog stateless fallback, bounded clear-retry on failure). Existing TestShouldUseZKProof / TestIsProposalOutOfRange preserved.

Note: the prover/proof_submitter/transaction integration suite needs a live L1 devnet (pre-existing) and is not part of this change's verification.

🤖 Generated with Claude Code

davidtaikocha and others added 6 commits June 16, 2026 10:51
Add ZKBacklogController (ClearBacklog / StatusClean) implemented by the ZK
ComposeProofProducer, wrapping raiko2 POST /v3/prover/clear and
GET /v3/prover/status (data.clean). Dummy mode short-circuits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the mutex-guarded SGX-draining latch (markSGXFallback / inSGXFallback /
resumeZK), wire the ZK control-plane client into NewProofSubmitter, and add
prover_zk_backlog_sgx_mode and prover_zk_backlog_clear metrics.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
decideUseZK latches into SGX on the first distance breach (firing a one-off
backlog clear) and resumes ZK once the backlog is drained and the ZK status is
clean, degrading to drained-alone when /status is unavailable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
RequestProof now consults decideUseZK each retry instead of the stateless
distance check, so a backlog breach latches into SGX draining until caught up.
Per-call zk_any_not_drawn / timeout fallbacks are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Store the prover's long-lived context on ProofSubmitter and use it for the
  background backlog-clear goroutine (so it can't inherit a request-scoped ctx).
- Add a test for the ClearBacklog bounded-retry failure posture.
- Update maxZKProofProposalDistance flag help to describe the latched drain/resume.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Consolidate the raiko2 control-plane response types: drop the clear-only
response struct (the body is unused), unexport the status response, and remove
the unused status fields. Fold the submitter ctx into the struct literal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

🐋 DeepSeek Code Review

🔴 Critical Issues

No critical issues found — no fund loss, chain halt, or security vulnerabilities introduced.

🟡 Warnings

  1. Premature ZK resume on any StatusClean error
    In canResumeZK, any error from the status endpoint (including transient 5xx, timeouts, context cancellations) unconditionally resumes ZK. While the PR intends to "degrade to resuming on condition (A) alone" to avoid getting stuck if raiko2 [protocol] fix #84 test:genesis/deploy_L1 workflow errors #93 isn’t deployed, this makes the latch fragile: a flaky endpoint or slow response can erase the SGX-draining protection before the ZK backend is actually idle, reintroducing flapping.
    Consider: retry the status call a few times before falling back, or only degrade on a well‑known “endpoint absent” signal (e.g. 404) — and treat other errors as “not clean” to stay draining.

  2. lastFinalizedProposalID nil safety
    canResumeZK performs new(big.Int).Add(lastFinalizedProposalID, common.Big1) without checking for nil. The state machine only reaches SGX mode after a distance breach, which implies a valid (non‑nil) finalized ID under the current shouldUseZKProof implementation, but a future change or edge case (e.g. chain freshly bootstrapped, no finalized proposals) could trigger a nil addition and panic. Defensive nil check before the arithmetic is cheap and prevents a crash.

🔵 Suggestions

  1. Make StatusClean timeout more predictable
    canResumeZK currently uses the inbound request’s context for StatusClean. If the request context has a long deadline (or none), a slow status endpoint can block the calling goroutine unexpectedly. Wrapping with a short explicit timeout (e.g., 3-5 s) keeps resume checks responsive.

  2. Configurable clearBackoffMaxRetries
    The backlog clear retries are hard‑coded at 5. If the ZK endpoint takes longer to become available, more retries might be necessary. Making this configurable (or tying it to a short deadline instead of a fixed count) would improve flexibility without code changes.

  3. Metric naming consistency
    prover_zk_backlog_sgx_mode is a gauge; a name like prover_zk_backlog_sgx_mode is fine but could be mistaken for a counter. Consider prover_zk_backlog_mode to avoid ambiguity.

🟢 What Looks Good

  • Concurrency: The zkFallback mutex correctly coordinates multiple RequestProof goroutines; CAS semantics guarantee exactly‑once backlog clear and proper latching/ unlatching.
  • Idempotent clear: fireClearAsync fires exactly once per SGX transition, with bounded retries and graceful failure. It uses the submitter’s long‑lived context, so it outlives individual proof requests.
  • Graceful degradation: The absence of a ZK endpoint or the control‑plane interface (ZKBacklogController) cleanly falls back to the stateless per‑proposal distance check, ensuring no regression when raiko2 [protocol] fix #84 test:genesis/deploy_L1 workflow errors #93 is not deployed.
  • Test coverage: The new tests cover all states (machine disabled, nil backend, within‑distance, breach, concurrent breach single clear, resume on drain+clean, stay SGX when not drained/not clean, degrade on status error, and clear retry exhaustion). They use atomic counters and channels to assert exactly‑once semantics.

Automatically triggered on PR update • model: deepseek-v4-pro

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 053c4f0edd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/taiko-client/prover/proof_submitter/proof_submitter.go
@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 84.80000% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 32.44%. Comparing base (a9adcce) to head (ccd20d4).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...s/taiko-client/prover/proof_producer/zk_backlog.go 78.26% 5 Missing and 5 partials ⚠️
...o-client/prover/proof_submitter/proof_submitter.go 0.00% 5 Missing ⚠️
...taiko-client/prover/proof_submitter/zk_fallback.go 94.59% 2 Missing and 2 partials ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
packages/taiko-client/internal/metrics/metrics.go 0.00% <ø> (ø)
...taiko-client/prover/proof_submitter/zk_fallback.go 94.59% <94.59%> (ø)
...o-client/prover/proof_submitter/proof_submitter.go 10.91% <0.00%> (+0.10%) ⬆️
...s/taiko-client/prover/proof_producer/zk_backlog.go 78.26% <78.26%> (ø)

Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 027e612...ccd20d4. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

davidtaikocha and others added 3 commits June 16, 2026 11:12
The zkvmProofProducer.(ZKBacklogController) assertion is a compile-time
capability check, not a runtime probe of the Raiko host. Clarify that the
zkBacklog==nil branch covers the no-ZK-endpoint config (and guards canResumeZK
from a nil deref); a ZK host predating raiko2 #93 returns 404 and the machine
degrades by design.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Restructure the new zk_backlog and zk_fallback tests as suite.Suite (matching
the rest of the prover package), and wrap the long assertion line that tripped
the lll linter.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant