[Fix] Accept proof submission even it has been timeout #1764

noel2004 · 2025-11-25T13:01:14Z

We faced a dilemma in the collection_time setting (i.e. the time limit after which an assigned proving task is considered timed out): if the value is too small, a time-consuming task may never complete, since every submission would be rejected as timed out; on the other hand, if the value is too large, it takes too long to reassign a task when the connection with the assignee (prover) is lost.

This PR proposes accepting a proof submission even if it has timed out: there is no good reason to reject a result that can be verified. With this fix we can shorten the reassignment interval without worrying about the permanent failure of an occasional long-running task. Timeout failures are still counted.

Summary by CodeRabbit

Bug Fixes
- Improved timeout handling: timed-out submissions are now processed through validation, tracked, logged, and their status synchronized instead of being rejected.
- Added last-chance recovery to ensure assigned prover tasks are retried or reconciled on error.
- Synchronized prover task status updates to prevent mismatches and lost state during processing.
New Files / Configurations
- Added testnet/prover end-to-end configs, genesis/state files, and new prover runtime/configuration manifest files.


coderabbitai · 2025-11-25T13:01:27Z

Walkthrough

HandleZkProof's signature changes to a named error return with a defer-based recovery; timeout handling in validation is converted from an early-reject path to an overridden timeout flow that flips task status and records metrics; updateProofStatus results are synced back into the local proverTask. Other added files are test/config artifacts.

Changes

Cohort / File(s)	Summary
Proof receiver logic `coordinator/internal/logic/submitproof/proof_receiver.go`	Changed `HandleZkProof` to return a named error `(rerr error)` and added a deferred recovery (`proofRecover`) triggered when `rerr` is non-nil and task remains `Assigned`. Reworked timeout handling: removed early timeout-return path, now overrides timeout to mark task as `Assigned`, increments timeout metric, logs timing, and continues validation flow. `updateProofStatus` call now propagates the prover task’s current `FailureType`, and local `proverTask.ProvingStatus`/`FailureType` are synchronized from update results. Consolidated proof-time calculations and removed duplicate/early-exit timeout branches.
Prover e2e — mainnet feynman (tests/configs) `tests/prover-e2e/mainnet-feynman/.make.env`, `tests/prover-e2e/mainnet-feynman/config.json`, `tests/prover-e2e/mainnet-feynman/config.template.json`, `tests/prover-e2e/mainnet-feynman/genesis.json`	Added environment file and multiple configuration files for a mainnet-feynman prover e2e setup: `.make.env` (BEGIN/END_BLOCK, SCROLL_FORK_NAME), `config.json` (db/fetch flags), `config.template.json` (prover_manager, db, l2, auth, sequencer templates), and a comprehensive `genesis.json` with allocations, codes, and storage entries for testnet initialization.
zkvm-prover configs `zkvm-prover/config_remote.json`, `zkvm-prover/target.json`	Added `config_remote.json` (sdk_config, coordinator/prover settings, circuit entries with asset detours) and `target.json` (chunk hash and empty batches/bundles) for zkvm-prover runtime configuration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Focus areas:
- HandleZkProof defer/recovery: ensure named rerr semantics don't mask errors and recovery side-effects are correct.
- Timeout override: verify the decision to flip status to Assigned, metric increment, and log/timing correctness.
- updateProofStatus synchronization: check concurrency/atomicity and that persisted vs in-memory state remain consistent.
- Large data files (genesis.json): validate bulk additions for accidental corruptions or mis-encoded fields.

Suggested reviewers

georgehao
Thegaram
lispc

Poem

🐇 I hopped through code at break of dawn,

A timeout met — I stretched and yawned.
I nudged the task, kept status bright,
Tuned logs and metrics through the night.
A little hop, and proofs move on.

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The PR description explains the problem, rationale, and solution, but lacks the conventional commits format checklist and deployment/breaking change sections required by the template.	Add the required template sections including PR title format confirmation (fix: prefix is correct), deployment tag versioning checkbox, and breaking change label checkbox to fully comply with the repository template.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title '[Fix] Accept proof submission even it has been timeout' clearly describes the main change: accepting proof submissions that occur after timeout, which aligns with the core logic modification in proof_receiver.go.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between edb5123 and e1e5278.

📒 Files selected for processing (1)

coordinator/internal/logic/submitproof/proof_receiver.go (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

coordinator/internal/logic/submitproof/proof_receiver.go (1)

common/types/db.go (2)

ProverTaskFailureType (99-99)

ProverTaskFailureTypeTimeout (105-105)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: check
GitHub Check: tests
GitHub Check: tests

coordinator/internal/logic/submitproof/proof_receiver.go

codecov-commenter · 2025-11-25T13:13:58Z

Codecov Report

❌ Patch coverage is 20.83333% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.47%. Comparing base (27dd62e) to head (1347762).

Files with missing lines	Patch %	Lines
...nator/internal/logic/submitproof/proof_receiver.go	20.83%	16 Missing and 3 partials ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1764      +/-   ##
===========================================
- Coverage    36.48%   36.47%   -0.02%     
===========================================
  Files          248      248              
  Lines        21314    21328      +14     
===========================================
+ Hits          7777     7779       +2     
- Misses       12709    12720      +11     
- Partials       828      829       +1

Flag	Coverage Δ
coordinator	`32.69% <20.83%> (-0.07%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

coordinator/internal/logic/submitproof/proof_receiver.go (1)
326-339: Timeout override now correctly precedes the submit‑twice guard; add tests and tidy comment/log text

Placing this timeout override before the ProverProofValid/ProverProofInvalid guard fixes the previously reported issue: late proofs for tasks marked ProverProofInvalid + ProverTaskFailureTypeTimeout are now allowed through validator instead of being rejected as “cannot submit twice”. The in‑memory flip back to ProverAssigned is sufficient because updateProofStatus later persists the final Valid/Undefined state.

A few follow‑ups:

This behavior is important but currently untested (Codecov shows 0% patch coverage here). It’s worth adding at least one unit/integration test that:

Creates a ProverTask with ProvingStatus=ProverProofInvalid and FailureType=ProverTaskFailureTypeTimeout.

Calls validator with StatusOk.

Asserts err == nil, validateFailureProverTaskTimeout increments, and proverTask.ProvingStatus becomes ProverAssigned.

The comments/logs still mention SessionInfoFailureTimeout and read a bit awkwardly. You could tighten them up as:
-	// Internally we overide the timeout failure:
-	// if prover task FailureType is SessionInfoFailureTimeout, the submit proof is timeout, but we still accept it
+	// Internally we override timeout failures:
+	// if the prover task FailureType is ProverTaskFailureTypeTimeout, the submission has timed out but we still accept it.
@@
-		log.Warn("proof submit proof have timeout", "hash", proofParameter.TaskID, "taskType", proverTask.TaskType,
+		log.Warn("proof submission timed out but will be verified", "hash", proofParameter.TaskID, "taskType", proverTask.TaskType,
 			"proverName", proverTask.ProverName, "proverPublicKey", pk, "proofTime", proofTimeSec)
Note that validateFailureProverTaskTimeout now counts timed‑out tasks whose proofs you still accept; if dashboards assumed it was a strict failure counter, you may want to adjust their interpretation.

🧹 Nitpick comments (2)

coordinator/internal/logic/submitproof/proof_receiver.go (2)

158-185: Defer-based last‑chance recovery looks sound; only minor polishing nits

The named return + deferred proofRecover nicely closes gaps where we’d previously exit with an error while the task was still assigned (e.g. hardForkName/unmarshal failures), without double‑recovering once updateProofStatus has run.

Two small nits you may want to address:

The log message has a typo: "last chance proof recover triggerred" → "last chance proof recover triggered".

Consider using a more specific failureType than ProverTaskFailureTypeUndefined if you want to distinguish these “unexpected error” recoveries (e.g. a dedicated ServerError/“unexpected” code), but this is optional.

Overall the control flow and conditions (rerr != nil and ProverAssigned) look correct and non‑intrusive.
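The named-return-plus-defer pattern discussed in this comment can be sketched as follows. This is a minimal illustration, not the code in proof_receiver.go: `Task`, `recoverTask`, `handle`, and `errDecode` are hypothetical stand-ins for the prover task, `proofRecover`, `HandleZkProof`, and an arbitrary early failure.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is an illustrative stand-in for the prover task status enum.
type Status int

const (
	Assigned Status = iota
	Valid
	Invalid
)

// Task is a hypothetical in-memory copy of a prover task.
type Task struct {
	Status    Status
	Recovered bool
}

// errDecode simulates an early failure such as an unmarshal error.
var errDecode = errors.New("decode failed")

// recoverTask stands in for proofRecover: it releases the task for reassignment.
func recoverTask(t *Task) {
	t.Status = Invalid
	t.Recovered = true
}

// handle uses a named return value so the deferred function can see the final error.
func handle(t *Task, step func(*Task) error) (rerr error) {
	defer func() {
		// Last chance: recover only when we are failing and the task is still assigned.
		if rerr != nil && t.Status == Assigned {
			recoverTask(t)
		}
	}()
	if err := step(t); err != nil {
		return fmt.Errorf("handle proof: %w", err)
	}
	t.Status = Valid
	return nil
}

func main() {
	leaked := &Task{Status: Assigned}
	err := handle(leaked, func(*Task) error { return errDecode })
	fmt.Println("error path:", err, "recovered:", leaked.Recovered)

	ok := &Task{Status: Assigned}
	err = handle(ok, func(*Task) error { return nil })
	fmt.Println("success path:", err, "recovered:", ok.Recovered)
}
```

The status check inside the defer is what prevents double recovery: once a step has already moved the task out of `Assigned`, the deferred branch does nothing.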

464-466: Status/failureType sync into proverTask is necessary and correct

Syncing status and failureType back into proverTask after a successful transaction is important so the deferred last‑chance recovery in HandleZkProof can reliably detect whether the task is still ProverAssigned or already transitioned to Valid/Invalid. This prevents double calls to proofRecover/closeProofTask while still allowing recovery when updateProofStatus fails.

Tiny nit: the comment has a typo (failture → failure), but functionally this change is good.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1e5278 and 055f76e.

📒 Files selected for processing (1)

coordinator/internal/logic/submitproof/proof_receiver.go (5 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

coordinator/internal/logic/submitproof/proof_receiver.go (2)

coordinator/internal/types/submit_proof.go (1)

SubmitProofParameter (4-13)

common/types/db.go (7)

ProverProveStatus (72-72)

ProvingStatus (140-140)

ProverAssigned (78-78)

ProverTaskFailureTypeUndefined (103-103)

ProverProofInvalid (82-82)

ProverTaskFailureType (99-99)

ProverTaskFailureTypeTimeout (105-105)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: tests
GitHub Check: check
GitHub Check: tests

coordinator/internal/logic/submitproof/proof_receiver.go

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

coordinator/internal/logic/submitproof/proof_receiver.go (1)
319-355: Timeout override logic achieves late‑proof acceptance; minor text/metrics nits

The timeout override in validator:

Correctly detects the “timed‑out task” case (ProverProofInvalid + ProverTaskFailureTypeTimeout), increments the timeout metric, and resets ProvingStatus to ProverAssigned before the double‑submit guard. This ensures late proofs for timed‑out tasks are no longer rejected by the “cannot submit twice” check and can flow through verification as intended.

Leaves FailureType untouched so that, on success, closeProofTask can propagate the existing timeout marker into storage, which aligns with the “still count timeouts” requirement.

Two optional cleanups (one of which is already called out in a previous review):
The comment still refers to SessionInfoFailureTimeout, but the actual enum is ProverTaskFailureTypeTimeout, and the log message "proof submit proof have timeout" is hard to read. The earlier suggested wording would improve clarity:
-	// Internally we overide the timeout failure:
-	// if prover task FailureType is SessionInfoFailureTimeout, the submit proof is timeout, but we still accept it
+	// Internally override timeout failures:
+	// if prover task FailureType is ProverTaskFailureTypeTimeout, the submission has timed out, but we still accept it.
@@
-		log.Warn("proof submit proof have timeout", "hash", proofParameter.TaskID, "taskType", proverTask.TaskType,
+		log.Warn("proof submission timed out but will be verified", "hash", proofParameter.TaskID, "taskType", proverTask.TaskType,
 			"proverName", proverTask.ProverName, "proverPublicKey", pk, "proofTime", proofTimeSec)
Semantically, validateFailureProverTaskTimeout now counts “submissions whose task had previously timed out”, regardless of whether this particular submission ultimately succeeds. If your dashboards previously interpreted it strictly as a validation failure metric, it’s worth sanity‑checking that this new meaning is acceptable.
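The ordering described in this comment (override first, double-submit guard second) can be sketched roughly as below. The types, the `validate` function, the `timeoutSubmissions` counter, and the shortened constant names are assumptions made for illustration; they mirror the identifiers mentioned in the review but are not the coordinator's actual definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// ProvingStatus and FailureType are simplified local stand-ins for the enums in common/types/db.go.
type ProvingStatus int
type FailureType int

const (
	ProverAssigned ProvingStatus = iota
	ProverProofValid
	ProverProofInvalid
)

const (
	FailureUndefined FailureType = iota
	FailureTimeout
)

// ProverTask carries only the two fields the override looks at.
type ProverTask struct {
	ProvingStatus ProvingStatus
	FailureType   FailureType
}

var errSubmitTwice = errors.New("cannot submit twice")

// timeoutSubmissions is a plain counter standing in for the timeout metric.
var timeoutSubmissions int

// validate applies the timeout override before the double-submit guard.
func validate(t *ProverTask) error {
	// A task closed only because it timed out is treated as still assigned,
	// so the late proof can go on to verification. FailureType is left untouched.
	if t.ProvingStatus == ProverProofInvalid && t.FailureType == FailureTimeout {
		timeoutSubmissions++
		t.ProvingStatus = ProverAssigned
	}
	// Any other closed task is rejected as a repeated submission.
	if t.ProvingStatus == ProverProofValid || t.ProvingStatus == ProverProofInvalid {
		return errSubmitTwice
	}
	return nil
}

func main() {
	late := &ProverTask{ProvingStatus: ProverProofInvalid, FailureType: FailureTimeout}
	fmt.Println("late proof:", validate(late), "timeouts counted:", timeoutSubmissions)

	done := &ProverTask{ProvingStatus: ProverProofValid}
	fmt.Println("already valid:", validate(done))
}
```

The checks in `main` follow the shape of the test suggested in the earlier review: a task with an invalid status and a timeout failure type should pass the guard, bump the counter, and end up assigned again.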

🧹 Nitpick comments (3)

coordinator/internal/logic/submitproof/proof_receiver.go (1)

155-299: Last‑chance proofRecover behavior and scope

The new named return + defer correctly ensures that any non‑nil error path after loading a proverTask will trigger a final proofRecover when the task is still ProverAssigned, which helps close tasks that previously leaked (e.g. hard‑fork lookup errors, unwrap/marshal failures). It also correctly avoids double‑recovering once updateProofStatus has run, thanks to the ProverProveStatus(proverTask.ProvingStatus) check.

Two follow‑ups you might want to consider:

The doc comment above HandleZkProof (“only proving/verifying error will lead to setting status as skipped. db/unmarshal errors will not…”) is now outdated, since these errors will also end up calling proofRecover via the defer. Updating the comment to reflect the new behavior would avoid confusion.

If you still want to cover actual panics (as mentioned in the earlier discussion), this defer currently doesn’t help because rerr remains nil on panic. You’d need a recover() branch inside the same defer to mark the task failed on panic as well, if that’s desired.

Functionally the current behavior looks consistent with the PR goal; above are polish/semantics clarifications.
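If panic coverage is wanted, one possible shape for the `recover()` branch mentioned above is sketched here. It is only an illustration under assumptions: `Task`, `release`, and `handle` are hypothetical names, and whether a panic should be converted into an error rather than re-raised is a design decision this sketch does not settle.

```go
package main

import "fmt"

// Task is a hypothetical stand-in for the in-memory prover task.
type Task struct {
	Assigned bool
	Released bool
}

// release stands in for proofRecover.
func release(t *Task) {
	t.Assigned = false
	t.Released = true
}

// handle converts a panic into the named error so the same
// last-chance check covers both error returns and panics.
func handle(t *Task, step func()) (rerr error) {
	defer func() {
		// recover must be called directly in the deferred function to take effect.
		if r := recover(); r != nil {
			rerr = fmt.Errorf("panic while handling proof: %v", r)
		}
		if rerr != nil && t.Assigned {
			release(t)
		}
	}()
	step()
	return nil
}

func main() {
	t := &Task{Assigned: true}
	err := handle(t, func() { panic("unexpected nil proof") })
	fmt.Println("error:", err, "released:", t.Released)
}
```

Without the `recover()` call the named result would still be nil when the deferred function runs during a panic, which is why the original defer does not cover that case.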

tests/prover-e2e/mainnet-feynman/config.json (1)

1-14: Consider replacing hard‑coded DB credentials with placeholders or env‑driven values

This e2e config bakes postgres://dev:dev@localhost:5432/scroll?sslmode=disable directly into dsn. Even if this is only for local/dev use, it will keep static‑analysis tools flagging “credentials in repo” and may encourage copy‑paste into other environments.

If feasible, consider either:

Switching to a placeholder DSN (e.g. "postgres://<user>:<password>@localhost:5432/scroll?sslmode=disable"), or

Documenting that this file is an example and is normally overridden by env‑specific config.
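One way to combine both options is to prefer an environment value and treat a placeholder DSN in the file as "not configured". The sketch below is an assumption-heavy illustration: the variable name `SCROLL_E2E_DB_DSN` and the helper `resolveDSN` are invented for this example and do not exist in the repository.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// dsnEnvVar is a hypothetical variable name chosen for this sketch.
const dsnEnvVar = "SCROLL_E2E_DB_DSN"

var errMissingDSN = errors.New("database DSN is not set")

// resolveDSN prefers the environment value and falls back to the config file value.
// A file value that is empty or still contains "<" is treated as an unfilled placeholder.
func resolveDSN(lookup func(string) (string, bool), fromFile string) (string, error) {
	if v, ok := lookup(dsnEnvVar); ok && v != "" {
		return v, nil
	}
	if fromFile == "" || strings.Contains(fromFile, "<") {
		return "", errMissingDSN
	}
	return fromFile, nil
}

func main() {
	placeholder := "postgres://<user>:<password>@localhost:5432/scroll?sslmode=disable"
	dsn, err := resolveDSN(os.LookupEnv, placeholder)
	fmt.Println("dsn:", dsn, "err:", err)
}
```

Passing the lookup function as a parameter keeps the helper testable without touching the process environment.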

tests/prover-e2e/mainnet-feynman/config.template.json (1)

1-41: Template placeholders & typo

This template is clear and matches the expected structure for prover manager, DB, L2, auth, and sequencer configs. Two small suggestions:

To avoid secret‑scanner noise and make it obvious these are placeholders, consider changing values like the DSN, "prover secret key", and "not need" to explicit placeholder tokens (e.g. "postgres://<user>:<password>@localhost/scroll?sslmode=disable", "<prover_auth_secret>", "<sequencer_decryption_key_if_needed>").

Minor typo: "endpoint": "<serach a public rpc endpoint like alchemy>" → "endpoint": "<search a public RPC endpoint like Alchemy>".

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 055f76e and 1e85070.

📒 Files selected for processing (7)

coordinator/internal/logic/submitproof/proof_receiver.go (6 hunks)
tests/prover-e2e/mainnet-feynman/.make.env (1 hunks)
tests/prover-e2e/mainnet-feynman/config.json (1 hunks)
tests/prover-e2e/mainnet-feynman/config.template.json (1 hunks)
tests/prover-e2e/mainnet-feynman/genesis.json (1 hunks)
zkvm-prover/config_remote.json (1 hunks)
zkvm-prover/target.json (1 hunks)

✅ Files skipped from review due to trivial changes (2)

tests/prover-e2e/mainnet-feynman/.make.env
zkvm-prover/target.json

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-04-15T08:52:44.176Z

Learnt from: colinlyguo
Repo: scroll-tech/scroll PR: 1645
File: rollup/proposer-tool-config.json:34-40
Timestamp: 2025-04-15T08:52:44.176Z
Learning: In configuration files like `rollup/proposer-tool-config.json`, placeholders such as `<mainnet read db config>` are intentionally left as-is to be replaced by users with their own configuration values when deploying the tool.

Applied to files:

tests/prover-e2e/mainnet-feynman/config.json
tests/prover-e2e/mainnet-feynman/config.template.json

🧬 Code graph analysis (1)

coordinator/internal/logic/submitproof/proof_receiver.go (3)

coordinator/internal/types/submit_proof.go (1)

SubmitProofParameter (4-13)

common/types/db.go (8)

ProverProveStatus (72-72)

ProvingStatus (140-140)

ProverAssigned (78-78)

ProverTaskFailureTypeUndefined (103-103)

ProverProofInvalid (82-82)

ProverTaskFailureType (99-99)

ProverTaskFailureTypeTimeout (105-105)

ProverProofValid (80-80)

coordinator/internal/types/auth.go (1)

ProverName (18-18)

🪛 Checkov (3.2.334)

tests/prover-e2e/mainnet-feynman/config.json

[medium] 4-5: Basic Auth Credentials

(CKV_SECRET_4)

tests/prover-e2e/mainnet-feynman/config.template.json

[medium] 22-23: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: check
GitHub Check: tests
GitHub Check: tests

🔇 Additional comments (3)

coordinator/internal/logic/submitproof/proof_receiver.go (1)
399-476: Syncing proverTask status after updateProofStatus is a good safety net

Propagating the status and failureType arguments back into the in‑memory proverTask after the transaction:
proverTask.ProvingStatus = int16(status)
proverTask.FailureType = int16(failureType)
nicely ensures that any subsequent logic (including the HandleZkProof defer) sees the up‑to‑date state and won’t accidentally treat a closed task as still Assigned. Likewise, having closeProofTask pass through the current proverTask.FailureType instead of hard‑coding Undefined lets you preserve “timeout” markers on eventually‑successful proofs, which matches the PR’s intent of still recording timeouts.

No functional issues spotted here.
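A rough sketch of the sync-after-update idea follows, assuming the in-memory task is only updated once the persistence step has succeeded. `updateProofStatus` here is a simplified stand-in with a hypothetical `persist` callback in place of the real database transaction, and the numeric status values are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// ProverTask keeps only the two fields discussed above; int16 matches the casts quoted in the review.
type ProverTask struct {
	ProvingStatus int16
	FailureType   int16
}

// Illustrative values only; they are not the real enum values.
const (
	statusAssigned int16 = 1
	statusValid    int16 = 2
)

var errTxFailed = errors.New("transaction failed")

// updateProofStatus persists first and mirrors the result into memory only on success,
// so a later check still sees the task as assigned when the update failed.
func updateProofStatus(t *ProverTask, status, failureType int16, persist func(status, failureType int16) error) error {
	if err := persist(status, failureType); err != nil {
		return fmt.Errorf("update proof status: %w", err)
	}
	t.ProvingStatus = status
	t.FailureType = failureType
	return nil
}

func main() {
	task := &ProverTask{ProvingStatus: statusAssigned}

	failing := func(int16, int16) error { return errTxFailed }
	err := updateProofStatus(task, statusValid, 0, failing)
	fmt.Println("failed update:", err, "status:", task.ProvingStatus)

	ok := func(int16, int16) error { return nil }
	err = updateProofStatus(task, statusValid, 0, ok)
	fmt.Println("successful update:", err, "status:", task.ProvingStatus)
}
```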
zkvm-prover/config_remote.json (1)

1-36: Remote prover configuration looks structurally sound

The sdk_config and circuits sections are well‑structured, with clear separation of coordinator, prover, and per‑hard‑fork circuit URLs/paths. No hard‑coded secrets are present; everything here looks appropriate for a committed remote config.

tests/prover-e2e/mainnet-feynman/genesis.json (1)

1-109: Genesis state file appears consistent with a mainnet‑like test setup

The chain config, scroll section, and pre‑allocs (including the 0x5300… system contracts with code/storage) all look coherent for a mainnet‑like test environment. There are no embedded private keys—only public addresses and bytecode—so committing this genesis file is reasonable assuming it intentionally mirrors the target network’s public genesis.

accept timeout submission

e1e5278

noel2004 requested a review from Thegaram November 25, 2025 13:01

noel2004 requested a review from georgehao November 25, 2025 13:01

coderabbitai bot reviewed Nov 25, 2025

coordinator/internal/logic/submitproof/proof_receiver.go (outdated, resolved)

noel2004 added 4 commits November 26, 2025 16:21

Merge remote-tracking branch 'origin/develop' into fix/accept_timeout…

d19a3fb

…_submission

Merge remote-tracking branch 'origin/develop' into fix/accept_timeout…

ab46c98

…_submission

Merge remote-tracking branch 'origin/develop' into fix/accept_timeout…

8344914

…_submission

Fixing and handling unexpected return routine

055f76e

coderabbitai bot reviewed Dec 11, 2025

georgehao reviewed Dec 11, 2025

coordinator/internal/logic/submitproof/proof_receiver.go (resolved)

georgehao reviewed Dec 11, 2025

coordinator/internal/logic/submitproof/proof_receiver.go (resolved)

Keep failure type in close task

1e85070

coderabbitai bot reviewed Dec 11, 2025

georgehao previously approved these changes Dec 11, 2025

noel2004 requested a review from lispc December 11, 2025 08:12

lispc previously approved these changes Dec 11, 2025

purge stuffs commited by mistake

1347762

noel2004 dismissed stale reviews from lispc and georgehao via 1347762 December 11, 2025 10:07

noel2004 requested review from georgehao and lispc December 11, 2025 10:26

georgehao approved these changes Dec 11, 2025

lispc approved these changes Dec 12, 2025

noel2004 merged commit 7de388e into develop Dec 12, 2025
6 checks passed

noel2004 deleted the fix/accept_timeout_submission branch December 12, 2025 03:18


5 participants
