feat(taiko-client): latched ZK→SGX drain/resume for proof submitter #21795
davidtaikocha wants to merge 9 commits into
Add ZKBacklogController (ClearBacklog / StatusClean) implemented by the ZK ComposeProofProducer, wrapping raiko2 POST /v3/prover/clear and GET /v3/prover/status (data.clean). Dummy mode short-circuits. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the mutex-guarded SGX-draining latch (markSGXFallback / inSGXFallback / resumeZK), wire the ZK control-plane client into NewProofSubmitter, and add prover_zk_backlog_sgx_mode and prover_zk_backlog_clear metrics. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
decideUseZK latches into SGX on the first distance breach (firing a one-off backlog clear) and resumes ZK once the backlog is drained and the ZK status is clean, degrading to drained-alone when /status is unavailable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
RequestProof now consults decideUseZK each retry instead of the stateless distance check, so a backlog breach latches into SGX draining until caught up. Per-call zk_any_not_drawn / timeout fallbacks are unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Store the prover's long-lived context on ProofSubmitter and use it for the background backlog-clear goroutine (so it can't inherit a request-scoped ctx).
- Add a test for the ClearBacklog bounded-retry failure posture.
- Update maxZKProofProposalDistance flag help to describe the latched drain/resume.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Consolidate the raiko2 control-plane response types: drop the clear-only response struct (the body is unused), unexport the status response, and remove the unused status fields. Fold the submitter ctx into the struct literal. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
🐋 DeepSeek Code Review
🔴 Critical Issues
No critical issues found: no fund loss, chain halt, or security vulnerabilities introduced.
💡 Codex Review
Reviewed commit: 053c4f0edd
The zkvmProofProducer.(ZKBacklogController) assertion is a compile-time capability check, not a runtime probe of the Raiko host. Clarify that the zkBacklog==nil branch covers the no-ZK-endpoint config (and guards canResumeZK from a nil deref); a ZK host predating raiko2 #93 returns 404 and the machine degrades by design. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Restructure the new zk_backlog and zk_fallback tests as suite.Suite (matching the rest of the prover package), and wrap the long assertion line that tripped the lll linter. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Improves the ZK/SGX proof routing added in #21782. Today shouldUseZKProof is a stateless, per-proposal distance check, so the prover can flap between ZK and SGX around the threshold and leave the ZK backend churning a zk_any backlog nobody is waiting on.
This PR turns that check into a latched two-state machine in the proof submitter:
- Latch: when proposalID > lastFinalizedProposalID + maxZKProofProposalDistance, the prover latches into SGX-draining mode and fires a one-off POST /v3/prover/clear on the ZK endpoint to discard the stale zk_any backlog.
- Resume: the prover returns to ZK only when both (A) proposalID ≤ lastFinalizedProposalID + 1 (backlog drained) and (B) GET /v3/prover/status reports data.clean == true. A future breach repeats the cycle.
The two control-plane endpoints come from raiko2 #93.
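The latch and resume rules can be sketched as a small Go model. This is an illustration under stated assumptions, not the PR's code: the zkLatch type, its fields, the statusFn callback, and the decideUseZK signature are hypothetical, a counter stands in for the one-off POST /v3/prover/clear, and the maxZKProofProposalDistance == 0 disable path is left out. The sketch also holds the mutex while calling the status callback for brevity, which a real implementation wrapping an HTTP call would likely avoid.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStatusUnavailable models a /status error or 404 (endpoint not deployed).
var errStatusUnavailable = errors.New("status unavailable")

// statusFn stands in for GET /v3/prover/status: it reports data.clean or an error.
type statusFn func() (clean bool, err error)

func statusClean() (bool, error) { return true, nil }
func statusDirty() (bool, error) { return false, nil }
func statusDown() (bool, error)  { return false, errStatusUnavailable }

// zkLatch is a hypothetical, simplified model of the latch; it is not the PR's type.
type zkLatch struct {
	mu          sync.Mutex
	draining    bool   // true while latched into SGX-draining mode
	maxDistance uint64 // plays the role of maxZKProofProposalDistance
	clears      int    // counts one-off clears (stand-in for POST /v3/prover/clear)
}

// decideUseZK returns true when the proposal should be proven with ZK.
func (l *zkLatch) decideUseZK(proposalID, lastFinalized uint64, status statusFn) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	if !l.draining {
		if proposalID <= lastFinalized+l.maxDistance {
			return true // within distance: stay on ZK
		}
		// First breach: latch and fire the clear once for this transition.
		l.draining = true
		l.clears++
		return false
	}
	// Condition (A): the backlog must be drained before status is consulted.
	if proposalID > lastFinalized+1 {
		return false
	}
	// Condition (B): a reachable, not-clean status keeps draining.
	// An error degrades to resuming on condition (A) alone.
	if clean, err := status(); err == nil && !clean {
		return false
	}
	l.draining = false
	return true
}

func main() {
	l := &zkLatch{maxDistance: 30}
	fmt.Println(l.decideUseZK(10, 0, statusClean))  // within distance
	fmt.Println(l.decideUseZK(40, 0, statusClean))  // breach: latches into SGX
	fmt.Println(l.decideUseZK(41, 0, statusClean))  // latched, backlog not drained
	fmt.Println(l.decideUseZK(41, 40, statusDirty)) // drained but not clean
	fmt.Println(l.decideUseZK(41, 40, statusDown))  // status unavailable: resume on (A)
	fmt.Println("clears:", l.clears)
}
```

Because the breach check and the state flip happen under one lock, concurrent callers that all see the first breach still trigger a single clear in this model.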
Design decisions
- The latch state lives on ProofSubmitter, mutex-guarded (each RequestProof runs in its own goroutine); clear fires exactly once per transition via a CAS.
- data.clean is the single resume gate.
- The threshold is maxZKProofProposalDistance (default 30; 0 disables the machine entirely, preserving prior behavior).
- clear is best-effort (bounded background retries); a not-clean /status keeps draining; a /status error or 404 degrades to resuming on condition (A) alone, so the prover never gets stuck on SGX if raiko2 #93 isn't deployed yet. The per-proposal zk_any_not_drawn / timeout fallbacks are unchanged and never touch the latch.
- /status is only queried once condition (A) holds, so it isn't polled per-proposal during a deep backlog.
Observability
- prover_zk_backlog_sgx_mode gauge (1 = draining, 0 = ZK)
- prover_zk_backlog_clear counter
Rollout
Safe to deploy before raiko2 #93 ships: with the endpoints absent, the prover still latches on a breach and resumes once caught up to the tip (condition A), a strict improvement over per-proposal flapping, minus the upstream-drain wait. Once #93 deploys, clear accelerates draining and data.clean adds the upstream-drain guarantee, with no prover config change.
Test plan
- go test -count=1 -race ./packages/taiko-client/prover/proof_submitter/ ./packages/taiko-client/prover/proof_producer/: ✅ both pass, no data races
- go test -count=1 ./packages/taiko-client/prover/ -run TestNewConfigFromCliContextMaxZKProofProposalDistance: ✅
- go build ./packages/taiko-client/... and go vet on the changed packages: ✅ clean
New coverage: control-plane client (httptest: method/path, data.clean parse, non-200/404 error, dummy short-circuit); state machine (within-distance, breach latches + clears exactly once incl. a 50-goroutine concurrency test, resume when drained+clean, stay-SGX when not drained / not clean, degrade on /status error, machine disabled, nil-backlog stateless fallback, bounded clear-retry on failure). Existing TestShouldUseZKProof / TestIsProposalOutOfRange preserved.
🤖 Generated with Claude Code
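For readers unfamiliar with the httptest style of coverage listed in the test plan, here is a minimal sketch of a control-plane client exercised against a fake host. It is an assumption-laden illustration and not the PR's implementation: backlogClient, statusBody, probe, and the method signatures are hypothetical. Only the two paths, the data.clean field, and the ClearBacklog / StatusClean names come from the description above. The fake host with no routes returns 404 for every request, which models a ZK host without the raiko2 #93 endpoints.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusBody mirrors only the data.clean field named in the description.
type statusBody struct {
	Data struct {
		Clean bool `json:"clean"`
	} `json:"data"`
}

// backlogClient is a hypothetical minimal client for the two endpoints.
type backlogClient struct {
	baseURL string
	hc      *http.Client
}

// do sends a body-less request and treats any non-200 response as an error.
func (c *backlogClient) do(ctx context.Context, method, path string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, method, c.baseURL+path, nil)
	if err != nil {
		return nil, err
	}
	resp, err := c.hc.Do(req)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		resp.Body.Close()
		return nil, fmt.Errorf("%s %s: unexpected status %d", method, path, resp.StatusCode)
	}
	return resp, nil
}

// ClearBacklog issues POST /v3/prover/clear and ignores the response body.
func (c *backlogClient) ClearBacklog(ctx context.Context) error {
	resp, err := c.do(ctx, http.MethodPost, "/v3/prover/clear")
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// StatusClean issues GET /v3/prover/status and returns data.clean.
func (c *backlogClient) StatusClean(ctx context.Context) (bool, error) {
	resp, err := c.do(ctx, http.MethodGet, "/v3/prover/status")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var body statusBody
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return body.Data.Clean, nil
}

// probe runs both calls against a fake host. With deployed == false the mux has
// no routes, so every request gets a 404.
func probe(deployed, clean bool) (gotClean bool, statusErr, clearErr error) {
	mux := http.NewServeMux()
	if deployed {
		mux.HandleFunc("/v3/prover/status", func(w http.ResponseWriter, _ *http.Request) {
			fmt.Fprintf(w, `{"data":{"clean":%t}}`, clean)
		})
		mux.HandleFunc("/v3/prover/clear", func(w http.ResponseWriter, _ *http.Request) {
			w.WriteHeader(http.StatusOK)
		})
	}
	srv := httptest.NewServer(mux)
	defer srv.Close()

	c := &backlogClient{baseURL: srv.URL, hc: srv.Client()}
	ctx := context.Background()
	gotClean, statusErr = c.StatusClean(ctx)
	clearErr = c.ClearBacklog(ctx)
	return gotClean, statusErr, clearErr
}

func main() {
	fmt.Println(probe(true, true))   // endpoints present, backend clean
	fmt.Println(probe(true, false))  // endpoints present, backend not clean
	fmt.Println(probe(false, false)) // endpoints absent: both calls return errors
}
```

Surfacing the 404 as an ordinary error lets the caller apply a single rule for "status unavailable", which matches the degrade-to-condition-(A) posture described in the design decisions.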