feat(coprocessor): drift detection in gw-listener#2096
Conversation
637dca2 to
a76cffe
Compare
Compare local ciphertext digests against on-chain consensus from the CiphertextCommits contract. Enabled via --ciphertext-commits-address. Adds early-warning logging when peer submissions diverge and structured warn logs when local digest mismatches consensus. Four new Prometheus counters provide observability. Detection-only — no recovery action.
…ft checks - P1: Propagate DB errors from handle_consensus instead of swallowing them, so the block is retried via the outer backoff loop. - P2: Queue unresolved consensus events (local digest not yet computed) in a bounded pending queue (10k cap) and retry each block tick, so late-arriving local digests are still checked. - Refactor comparison logic into try_resolve_consensus shared by both the initial check and the retry path.
515f6a5 to
56d267f
Compare
🧪 CI InsightsHere's what we observed from your CI run for 26ddfd4. 🟢 All jobs passed!But CI Insights is watching 👀 |
- Clarify drift counter description in metrics.rs - Document all 5 drift metrics in docs/metrics/metrics.md - Bump drift timeout defaults from 50/10 to 3000/3000 blocks (~5 min at 100ms block time) to accommodate coprocessors stuck for minutes - Replace manual bail with clap requires on ciphertext_commits_address
There was a problem hiding this comment.
Thank you for working on it so quickly! I know it's been rushed and I understand the reasoning and the extensive use of AI.
That said, it's been a bit hard for me to verify all works as expected.
I am approving it, knowing that this is just for detection and won't change coprocessor state in any way. If we observe issues, we can fix them, but it won't affect coprocessor functionality.
Thanks @Eikix, good job!
P.S. Maybe the things that could affect the coprocessor are changes in gw_listener.rs, we could double check these, but they do look ok.
The field name now reflects what the detector is *doing* (replaying historical logs) rather than a side-effect toggle. Inverted semantics: `replaying: true` suppresses warns and DB queries, `false` is normal operation. Also removes redundant `if self.alerts_enabled` guard in try_check_local_consensus — already short-circuited by the early return at the top of the function.
- Add `type CiphertextDigest = FixedBytes<32>` to avoid literal 32 - Document implicit upper bound on open_handles - Remove redundant early-return guard in flush_metrics - fetch_expected_coprocessor_tx_senders now requires gateway_config_address parameter; caller decides whether to call based on config
Use domain-meaningful names: HandleOutcome describes what happened to a handle, not an internal mechanism. Variants renamed: - KeepWaiting → Pending - NoConsensusTimeout → GatewayNeverReachedConsensus - ConsensusUncheckedTimeout → LocalDigestNeverAppeared - MissingSubmissionPostGrace → NotAllCoprocessorsSubmitted
…prove clarity - Fix verify_proof metrics being incremented both inside the function and at the call site (bug: double-counting) - Remove unnecessary .clone() on owned event field - Replace unreachable Pending return with unreachable!() - Replace cross-function .unwrap() with let-else in evict_stale - Collapse two identical "local digests not ready" debug branches - Use < instead of != in classify_handle (invariant is always <) - Rename fetch_expected_coprocessor_tx_senders to fetch_expected_senders - Remove redundant replaying guard in finalize_completed_without_consensus
…tener All drift detection code paths now log-and-continue instead of propagating errors to run_get_logs. A transient DB or RPC failure during drift checking should not interrupt processing of VerifyProofRequest, ActivateKey, or ActivateCrs events.
Eikix
left a comment
There was a problem hiding this comment.
[AUTOMATED]
Code Review
Reviewed for bugs, guideline compliance, and error handling. Found 2 issues.
Summary
- Bug (low severity): Histogram metrics are observed eagerly, breaking the deferred-metric invariant stated in the main loop comment.
- Bug (low severity):
finalize_completed_without_consensususes==instead of>=, missing handles when unexpected senders submit.
No guideline violations found. Error handling for drift detection paths correctly uses log-and-continue.
Changed Lines CoverageCoverage of added/modified lines: 86.0% Per-file breakdownDiff CoverageDiff: origin/main...HEAD, staged and unstaged changes
Summary
coprocessor/fhevm-engine/gw-listener/src/bin/gw_listener.rscoprocessor/fhevm-engine/gw-listener/src/drift_detector.rscoprocessor/fhevm-engine/gw-listener/src/gw_listener.rs |
dartdart26
left a comment
There was a problem hiding this comment.
Just put one comment re grace periods, please let me know what you think of it.
…lock Duration The gateway chain only produces blocks when there are transactions, making block count an unreliable proxy for elapsed time. Switch drift detection from block-number arithmetic to std::time::Instant/Duration: - ConfigSettings and CLI args use Duration (humantime format, e.g. "5m") - EventContext carries observed_at: Instant - HandleState/ConsensusState carry first_seen_at/received_at: Instant - classify_handle compares wall-clock elapsed time against Duration thresholds - Block numbers retained for logging/metrics (still useful diagnostics) - Removes get_block_number RPC call from rebuild_drift_detector (no longer needed)
Resolve conflict in test-suite/fhevm/fhevm-cli by taking main's version (tab indentation, negative-acl test type) and re-adding our ciphertext-drift test type.
…fy Default is test-only
dartdart26
left a comment
There was a problem hiding this comment.
Added some nits, will let you decide if you want to address them or not. Otherwise, approved.
- Prefix DriftDetector fields with `drift_` for consistency with ConfigSettings - Move classify_handle from free function to &self method on DriftDetector
|
@Mergifyio queue |
Merge Queue Status
Required conditions to enter a queue
|
Merge Queue Status
This pull request spent 2 hours 49 minutes 13 seconds in the queue, including 1 hour 48 minutes 22 seconds running CI. Required conditions to merge
|
Summary
gw-listenerby pollingCiphertextCommitseventsGatewayConfigwhen drift detection is enabledCiphertextCommitsblock and rebuild open-handle state from chain logs after restartciphertext_digestrow, retrying later if local digests are not ready yetdrift_detected,consensus_timeout,missing_submissionScope Notes
CiphertextCommitsblockGatewayConfigupdates refresh the current expected sender set for new handles; existing open handles keep their original membership snapshotTest Plan
SQLX_OFFLINE=true cargo build -p gw-listener --manifest-path coprocessor/fhevm-engine/Cargo.tomlSQLX_OFFLINE=true cargo test -p gw-listener --manifest-path coprocessor/fhevm-engine/Cargo.toml./test-suite/fhevm/fhevm-cli test ciphertext-drifttest-suite-e2e-testswithdeploy-build=truegw-listenerwhile a handle is still open and verify tracking resumes from the persisted watermark/metricsexposesdrift_detected,consensus_timeout, andmissing_submissionE2E Drift Scenario
ciphertext_digestciphertextthe first time a row becomes fully readytransaction-senderthen submits the already-polluted digest, which should trigger the Rust drift detectorcloses https://github.com/zama-ai/fhevm-internal/issues/1147