feat(taiko-client): latched ZK→SGX drain/resume for proof submitter by davidtaikocha · Pull Request #21795 · taikoxyz/taiko-mono

davidtaikocha · 2026-06-16T02:58:39Z

Summary

Improves the ZK/SGX proof routing added in #21782. Today shouldUseZKProof is a stateless, per-proposal distance check, so the prover can flap between ZK and SGX around the threshold and leaves the ZK backend churning a zk_any backlog nobody is waiting on.

This PR turns that check into a latched two-state machine in the proof submitter:

Trigger: the first time proposalID > lastFinalizedProposalID + maxZKProofProposalDistance, the prover latches into SGX-draining mode and fires a one-off POST /v3/prover/clear on the ZK endpoint to discard the stale zk_any backlog.
Drain: while latched, every proposal is proven via SGX (ZK is not attempted).
Resume: switches back to ZK only when both (A) proposalID ≤ lastFinalizedProposalID + 1 (backlog drained) and (B) GET /v3/prover/status reports data.clean == true. A future breach repeats the cycle.

The two control-plane endpoints come from raiko2 #93.

Design decisions

Inline latch on ProofSubmitter, mutex-guarded (each RequestProof runs in its own goroutine); clear fires exactly once per transition via a CAS.
data.clean is the single resume gate.
No new flag — reuses maxZKProofProposalDistance (default 30; 0 disables the machine entirely, preserving prior behavior).
Conservative + degrade: clear is best-effort (bounded background retries); a not-clean /status keeps draining; a /status error or 404 degrades to resuming on condition (A) alone — so the prover never gets stuck on SGX if raiko2 [protocol] fix #84 test:genesis/deploy_L1 workflow errors #93 isn't deployed yet, and the per-proposal zk_any_not_drawn / timeout fallbacks are unchanged and never touch the latch.

/status is only queried once condition (A) holds, so it isn't polled per-proposal during a deep backlog.

Observability

prover_zk_backlog_sgx_mode gauge (1 = draining, 0 = ZK)
prover_zk_backlog_clear counter

Rollout

Safe to deploy before raiko2 #93 ships: with the endpoints absent, the prover still latches on a breach and resumes once caught up to the tip (condition A) — a strict improvement over per-proposal flapping, minus the upstream-drain wait. Once #93 deploys, clear accelerates draining and data.clean adds the upstream-drain guarantee, with no prover config change.

Test plan

go test -count=1 -race ./packages/taiko-client/prover/proof_submitter/ ./packages/taiko-client/prover/proof_producer/ — ✅ both pass, no data races
go test -count=1 ./packages/taiko-client/prover/ -run TestNewConfigFromCliContextMaxZKProofProposalDistance — ✅
go build ./packages/taiko-client/... and go vet on the changed packages — ✅ clean

New coverage: control-plane client (httptest: method/path, data.clean parse, non-200/404 error, dummy short-circuit); state machine (within-distance, breach latches + clears exactly once incl. a 50-goroutine concurrency test, resume when drained+clean, stay-SGX when not drained / not clean, degrade on /status error, machine disabled, nil-backlog stateless fallback, bounded clear-retry on failure). Existing TestShouldUseZKProof / TestIsProposalOutOfRange preserved.

Note: the prover/proof_submitter/transaction integration suite needs a live L1 devnet (pre-existing) and is not part of this change's verification.

🤖 Generated with Claude Code

Add ZKBacklogController (ClearBacklog / StatusClean) implemented by the ZK ComposeProofProducer, wrapping raiko2 POST /v3/prover/clear and GET /v3/prover/status (data.clean). Dummy mode short-circuits. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add the mutex-guarded SGX-draining latch (markSGXFallback / inSGXFallback / resumeZK), wire the ZK control-plane client into NewProofSubmitter, and add prover_zk_backlog_sgx_mode and prover_zk_backlog_clear metrics. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

decideUseZK latches into SGX on the first distance breach (firing a one-off backlog clear) and resumes ZK once the backlog is drained and the ZK status is clean, degrading to drained-alone when /status is unavailable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

RequestProof now consults decideUseZK each retry instead of the stateless distance check, so a backlog breach latches into SGX draining until caught up. Per-call zk_any_not_drawn / timeout fallbacks are unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- Store the prover's long-lived context on ProofSubmitter and use it for the background backlog-clear goroutine (so it can't inherit a request-scoped ctx). - Add a test for the ClearBacklog bounded-retry failure posture. - Update maxZKProofProposalDistance flag help to describe the latched drain/resume. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Consolidate the raiko2 control-plane response types: drop the clear-only response struct (the body is unused), unexport the status response, and remove the unused status fields. Fold the submitter ctx into the struct literal. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-16T03:01:35Z

🐋 DeepSeek Code Review

🔴 Critical Issues

No critical issues found — no fund loss, chain halt, or security vulnerabilities introduced.

🟡 Warnings

Premature ZK resume on any StatusClean error
In canResumeZK, any error from the status endpoint (including transient 5xx, timeouts, context cancellations) unconditionally resumes ZK. While the PR intends to "degrade to resuming on condition (A) alone" to avoid getting stuck if raiko2 [protocol] fix #84 test:genesis/deploy_L1 workflow errors #93 isn’t deployed, this makes the latch fragile: a flaky endpoint or slow response can erase the SGX-draining protection before the ZK backend is actually idle, reintroducing flapping.
Consider: retry the status call a few times before falling back, or only degrade on a well‑known “endpoint absent” signal (e.g. 404) — and treat other errors as “not clean” to stay draining.
lastFinalizedProposalID nil safety
canResumeZK performs new(big.Int).Add(lastFinalizedProposalID, common.Big1) without checking for nil. The state machine only reaches SGX mode after a distance breach, which implies a valid (non‑nil) finalized ID under the current shouldUseZKProof implementation, but a future change or edge case (e.g. chain freshly bootstrapped, no finalized proposals) could trigger a nil addition and panic. Defensive nil check before the arithmetic is cheap and prevents a crash.

🔵 Suggestions

Make StatusClean timeout more predictable
canResumeZK currently uses the inbound request’s context for StatusClean. If the request context has a long deadline (or none), a slow status endpoint can block the calling goroutine unexpectedly. Wrapping with a short explicit timeout (e.g., 3-5 s) keeps resume checks responsive.
Configurable clearBackoffMaxRetries
The backlog clear retries are hard‑coded at 5. If the ZK endpoint takes longer to become available, more retries might be necessary. Making this configurable (or tying it to a short deadline instead of a fixed count) would improve flexibility without code changes.
Metric naming consistency
prover_zk_backlog_sgx_mode is a gauge; a name like prover_zk_backlog_sgx_mode is fine but could be mistaken for a counter. Consider prover_zk_backlog_mode to avoid ambiguity.

🟢 What Looks Good

Concurrency: The zkFallback mutex correctly coordinates multiple RequestProof goroutines; CAS semantics guarantee exactly‑once backlog clear and proper latching/ unlatching.
Idempotent clear: fireClearAsync fires exactly once per SGX transition, with bounded retries and graceful failure. It uses the submitter’s long‑lived context, so it outlives individual proof requests.
Graceful degradation: The absence of a ZK endpoint or the control‑plane interface (ZKBacklogController) cleanly falls back to the stateless per‑proposal distance check, ensuring no regression when raiko2 [protocol] fix #84 test:genesis/deploy_L1 workflow errors #93 is not deployed.
Test coverage: The new tests cover all states (machine disabled, nil backend, within‑distance, breach, concurrent breach single clear, resume on drain+clean, stay SGX when not drained/not clean, degrade on status error, and clear retry exhaustion). They use atomic counters and channels to assert exactly‑once semantics.

Automatically triggered on PR update • model: deepseek-v4-pro

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 053c4f0edd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

codecov · 2026-06-16T03:11:11Z

Codecov Report

❌ Patch coverage is 84.80000% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 32.44%. Comparing base (a9adcce) to head (ccd20d4).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
...s/taiko-client/prover/proof_producer/zk_backlog.go	78.26%	5 Missing and 5 partials ⚠️
...o-client/prover/proof_submitter/proof_submitter.go	0.00%	5 Missing ⚠️
...taiko-client/prover/proof_submitter/zk_fallback.go	94.59%	2 Missing and 2 partials ⚠️

Additional details and impacted files

Files with missing lines	Coverage Δ
packages/taiko-client/internal/metrics/metrics.go	`0.00% <ø> (ø)`
...taiko-client/prover/proof_submitter/zk_fallback.go	`94.59% <94.59%> (ø)`
...o-client/prover/proof_submitter/proof_submitter.go	`10.91% <0.00%> (+0.10%)`	⬆️
...s/taiko-client/prover/proof_producer/zk_backlog.go	`78.26% <78.26%> (ø)`

Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 027e612...ccd20d4. Read the comment docs.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

The zkvmProofProducer.(ZKBacklogController) assertion is a compile-time capability check, not a runtime probe of the Raiko host. Clarify that the zkBacklog==nil branch covers the no-ZK-endpoint config (and guards canResumeZK from a nil deref); a ZK host predating raiko2 #93 returns 404 and the machine degrades by design. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Restructure the new zk_backlog and zk_fallback tests as suite.Suite (matching the rest of the prover package), and wrap the long assertion line that tripped the lll linter. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

davidtaikocha and others added 6 commits June 16, 2026 10:51

davidtaikocha requested a review from YoGhurt111 as a code owner June 16, 2026 02:58

chatgpt-codex-connector Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread packages/taiko-client/prover/proof_submitter/proof_submitter.go

davidtaikocha added area.client option.do-not-merge labels Jun 16, 2026

davidtaikocha and others added 3 commits June 16, 2026 11:12

refactor(taiko-client): rename machineSaysZK to backlogAllowsZK

ccd20d4

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

davidtaikocha mentioned this pull request Jun 16, 2026

refactor(taiko-client): use resty for raiko HTTP requests #21796

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(taiko-client): latched ZK→SGX drain/resume for proof submitter#21795

feat(taiko-client): latched ZK→SGX drain/resume for proof submitter#21795
davidtaikocha wants to merge 9 commits into
mainfrom
prover/zk-sgx-drain-resume

davidtaikocha commented Jun 16, 2026

Uh oh!

github-actions Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

codecov Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

davidtaikocha commented Jun 16, 2026

Summary

Design decisions

Observability

Rollout

Test plan

Uh oh!

github-actions Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🐋 DeepSeek Code Review

🔴 Critical Issues

🟡 Warnings

🔵 Suggestions

🟢 What Looks Good

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

codecov Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Jun 16, 2026 •

edited

Loading

codecov Bot commented Jun 16, 2026 •

edited

Loading