[consensus/simplex] Infer parent certification from notarization#3613
[consensus/simplex] Infer parent certification from notarization#36130xAysh wants to merge 2 commits into
Conversation
|
handle_certification calls sync_journal on every call. In the inferred ancestor walk, this means one fsync per ancestor in the chain, O(n) fsyncs instead of one. In the normal automaton path this was never an issue since handle_certification was called once per notarization, but the loop changes that. Is there a way to batch the journal syncs across multiple views, or should the inferred path append all ancestor certification entries first and sync once at the end? |
| let artifact = Artifact::Notarization(notarization.clone()); | ||
| let (added, equivocator) = self.state.add_notarization(notarization); | ||
| let mut inferred = Vec::new(); | ||
| if added { |
There was a problem hiding this comment.
infer_parent() only walks up one level
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit cf35574. Configure here.
45fb11b to
a07ee17
Compare
6db4899 to
c6a805b
Compare
When a notarization for view N arrives, the f+1 signers must have certified N's parent before they could vote, so any uncertified ancestors with a notarization can be certified locally without dispatching to the automaton. - Add `is_certify_ready` to `Round` as the inference eligibility predicate - Add `infer_ancestors` to `State`: walks the ancestor chain, certifies Ready ancestors with notarizations, removes them from `certification_candidates`, returns list for journaling and signaling - Add `notarized_views_descending` to `State` for the post-replay pass - Refactor `handle_certification` to append-only; callers own syncing - Add inference walk in `try_broadcast_notarization`: after notarization is durable, infer ancestors, journal all, single `sync_all`, then signal - Add post-replay inference pass after journal replay to recover certifications that were not written before a crash - 8 unit tests for `infer_ancestors` covering all stop conditions - Integration test `test_post_replay_inference_certifies_ancestor` verifying crash-recovery path across all 6 scheme variants
|
this PR is ready for review. Bugbot did not run after my latest push. I checked the PR checks and do not see a Bugbot run. Could anyone please take a look, or tell me if I should retrigger it from my side? |

Fixes #3433
Summary
When a notarization for view N arrives, the f+1 signers had to certify N's parent before they could vote. If this node has not certified that parent yet, it can stall and wait for automaton responses that are not needed. This PR infers that certification from the notarization and keeps progress moving. Also, a quick note: this PR was noisy earlier while I cleaned up the approach and re-opened it with only the final design and fixes.
Design decisions
try_broadcast_notarization), not on receive.handle_certificationis append-only, and broadcast inference doessync_journal(view)then inferred appends then onesync_all().Inferred certifications use normal signaling (
resolver.certified(..., true)andActivity::Certification) and remove inferred views fromcertification_candidates().Changes
round.rs: Addedis_certify_ready()to check if a view can be inferred.state.rs: Addedinfer_ancestors(view). It walks backward from the parent ofview, certifies ancestor views that areReadyand notarized, removes those views fromcertification_candidates, and returnsVec<(View, Notarization)>so the caller can journal and signal them. It stops at<= last_finalized, missing round entry, non-Readystate, or missing notarization.actor.rs:handle_certificationis now append-only. It no longer calls sync.sync_journalafterhandle_certification.try_broadcast_notarization, aftersync_journalmakes the notarization durable, it callsinfer_ancestors, journals each inferred ancestor, runs onesync_all, then sendsresolver.certifiedandreporter.report(Activity::Certification)for each inferred view.Ready. This restores inferred certifications that were not written before a crash.Crash safety
Sync order is
sync_journal(notarization)->append_journal(ancestors)->sync_all(). This guarantees the notarization is durable before inferred ancestor certifications. If a crash happens between those sync points, the post-replay pass derives the missing certifications again.Tests
infer_ancestorsto cover all stop conditions andcertification_candidatescleanup.test_post_replay_inference_certifies_ancestor: first run usesCertifier::Pending(view 3 staysOutstanding), notarization for view 4 arrives (inference stops), then crash. Second run usesCertifier::Cancel, and certification must come from post-replay inference. The test verifiesresolver.certified(view_3, true)is emitted. It runs on all 6 scheme variants.only_finalization_rescues_validatorstays unchanged.Certifier::CancelgivesOutstanding, which is a valid stop condition for inference.