
[consensus/simplex] Infer parent certification from notarization#3613

Open
0xAysh wants to merge 2 commits into
commonwarexyz:mainfrom
0xAysh:consensus/infer-certification

Conversation

@0xAysh
Contributor

@0xAysh 0xAysh commented Apr 16, 2026

Fixes #3433

Summary

When a notarization for view N arrives, the f+1 signers must have certified N's parent before they could vote. If this node has not yet certified that parent, it can stall waiting for automaton responses it does not need. This PR infers that certification from the notarization so progress continues. A quick note: this PR was noisy earlier while I cleaned up the approach, so I re-opened it with only the final design and fixes.

Design decisions

  • Inference runs on first notarization broadcast (try_broadcast_notarization), not on receive.
  • Inference only certifies ancestors that are Ready and notarized, and stops at clear boundaries (<= last_finalized, missing round, non-Ready, missing notarization).
  • Sync is batched: handle_certification is append-only, and the broadcast inference path calls sync_journal(view), appends the inferred certifications, then calls sync_all() once.
  • A post-replay inference pass recovers inferred certifications that were not written before crash.
  • Inferred certifications use normal signaling (resolver.certified(..., true) and Activity::Certification), and inferred views are removed from certification_candidates().

Changes

  • round.rs: Added is_certify_ready() to check if a view can be inferred.
  • state.rs: Added infer_ancestors(view). It walks backward from the parent of view, certifies ancestor views that are Ready and notarized, removes those views from certification_candidates, and returns Vec<(View, Notarization)> so the caller can journal and signal them. It stops at <= last_finalized, missing round entry, non-Ready state, or missing notarization.
  • actor.rs:
    • handle_certification is now append-only. It no longer calls sync.
    • The certify-wait handler now calls sync_journal after handle_certification.
    • In try_broadcast_notarization, after sync_journal makes the notarization durable, it calls infer_ancestors, journals each inferred ancestor, runs one sync_all, then sends resolver.certified and reporter.report(Activity::Certification) for each inferred view.
    • Added one post-replay inference pass after journal replay. On startup the certify pool is empty, so views are Ready. This restores inferred certifications that were not written before a crash.
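
The backward walk described for infer_ancestors can be sketched as below. This is a minimal illustration, not the code in this PR: the `Round` fields, the `CertifyState` enum, and the `State` layout are hypothetical stand-ins, and the sketch returns only the inferred views, whereas the description says the real function returns Vec<(View, Notarization)>.

```rust
use std::collections::{BTreeMap, BTreeSet};

type View = u64;

/// Hypothetical stand-in for a round's certification status.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CertifyState {
    Ready,       // no certification request in flight
    Outstanding, // waiting on the automaton
    Certified,
}

/// Hypothetical, simplified round: only the fields the walk needs.
#[derive(Clone, Debug)]
struct Round {
    parent: View,
    state: CertifyState,
    notarized: bool,
}

impl Round {
    /// Eligibility predicate (assumed here to mean "state is Ready").
    fn is_certify_ready(&self) -> bool {
        self.state == CertifyState::Ready
    }
}

struct State {
    last_finalized: View,
    rounds: BTreeMap<View, Round>,
    certification_candidates: BTreeSet<View>,
}

impl State {
    /// Walks backward from the parent of `view` and certifies every ancestor
    /// that is Ready and notarized. Returns the inferred views, nearest first.
    fn infer_ancestors(&mut self, view: View) -> Vec<View> {
        let mut inferred = Vec::new();
        let Some(mut current) = self.rounds.get(&view).map(|r| r.parent) else {
            return inferred;
        };
        loop {
            // Stop condition: at or below the last finalized view.
            if current <= self.last_finalized {
                break;
            }
            // Stop condition: missing round entry.
            let Some(round) = self.rounds.get_mut(&current) else {
                break;
            };
            // Stop conditions: non-Ready state or missing notarization.
            if !round.is_certify_ready() || !round.notarized {
                break;
            }
            let parent = round.parent;
            round.state = CertifyState::Certified;
            self.certification_candidates.remove(&current);
            inferred.push(current);
            current = parent;
        }
        inferred
    }
}
```

In this sketch, with views 2 through 4 all Ready and notarized and last_finalized set to 1, infer_ancestors(4) returns [3, 2] and empties certification_candidates. If view 3 is Outstanding instead, the walk stops immediately and returns nothing.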

Crash safety

Sync order is sync_journal(notarization) -> append_journal(ancestors) -> sync_all(). This guarantees the notarization is durable before inferred ancestor certifications. If a crash happens between those sync points, the post-replay pass derives the missing certifications again.
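
To make the ordering argument concrete, here is a small model of the sequence, assuming a hypothetical in-memory `Journal` whose unsynced entries are lost on crash. The names `Journal`, `Entry`, `broadcast_with_inference`, and `missing_after_replay` are illustrative only and do not correspond to the actual journal API.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
enum Entry {
    Notarization(u64),
    Certification(u64),
}

/// Hypothetical journal model: `pending` is lost on crash, `durable` survives.
#[derive(Default)]
struct Journal {
    pending: Vec<Entry>,
    durable: Vec<Entry>,
}

impl Journal {
    fn append(&mut self, entry: Entry) {
        self.pending.push(entry);
    }

    /// Moves every pending entry to durable storage (stands in for an fsync).
    fn sync(&mut self) {
        self.durable.append(&mut self.pending);
    }

    /// Simulates a crash: only durable entries remain.
    fn crash(self) -> Vec<Entry> {
        self.durable
    }
}

/// Mirrors the described order: sync the notarization, append the inferred
/// ancestor certifications, then sync once.
fn broadcast_with_inference(
    journal: &mut Journal,
    view: u64,
    ancestors: &[u64],
    crash_before_final_sync: bool,
) {
    journal.append(Entry::Notarization(view));
    journal.sync(); // notarization is durable first
    for &ancestor in ancestors {
        journal.append(Entry::Certification(ancestor));
    }
    if crash_before_final_sync {
        return; // appended certifications never reach durable storage
    }
    journal.sync(); // single sync for all inferred certifications
}

/// Stand-in for the post-replay pass: ancestors whose certification is not
/// in the durable log and would be inferred again on startup.
fn missing_after_replay(durable: &[Entry], ancestors: &[u64]) -> Vec<u64> {
    ancestors
        .iter()
        .copied()
        .filter(|a| !durable.contains(&Entry::Certification(*a)))
        .collect()
}
```

In this model, a crash before the final sync leaves only the notarization durable, so the replay pass sees every inferred ancestor as missing and derives it again. No durable certification can exist without its notarization.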

Tests

  • Added 8 unit tests for infer_ancestors to cover all stop conditions and certification_candidates cleanup.
  • Added test_post_replay_inference_certifies_ancestor: first run uses Certifier::Pending (view 3 stays Outstanding), notarization for view 4 arrives (inference stops), then crash. Second run uses Certifier::Cancel, and certification must come from post-replay inference. The test verifies resolver.certified(view_3, true) is emitted. It runs on all 6 scheme variants.
  • Existing only_finalization_rescues_validator stays unchanged. Certifier::Cancel gives Outstanding, which is a valid stop condition for inference.

Comment thread consensus/src/simplex/actors/voter/actor.rs Outdated
@0xAysh 0xAysh marked this pull request as draft April 16, 2026 22:03
@0xAysh 0xAysh marked this pull request as ready for review April 16, 2026 22:11
@0xAysh 0xAysh changed the title [consensus/simplex] Infer Certification For Notarization [consensus/simplex] Infer parent certification from notarization Apr 16, 2026
@0xAysh
Contributor Author

0xAysh commented Apr 16, 2026

handle_certification calls sync_journal on every call. In the inferred ancestor walk, this means one fsync per ancestor in the chain, O(n) fsyncs instead of one. In the normal automaton path this was never an issue since handle_certification was called once per notarization, but the loop changes that. Is there a way to batch the journal syncs across multiple views, or should the inferred path append all ancestor certification entries first and sync once at the end?
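
The difference between the two options can be illustrated with a counting model. This is a sketch under assumptions: `CountingJournal`, `certify_sync_each`, and `certify_batched` are hypothetical names, and the journal here only counts operations rather than writing anything.

```rust
/// Hypothetical journal that only counts appends and syncs.
#[derive(Default)]
struct CountingJournal {
    appended: usize,
    syncs: usize,
}

impl CountingJournal {
    fn append(&mut self) {
        self.appended += 1;
    }

    fn sync(&mut self) {
        self.syncs += 1;
    }
}

/// Sync inside the per-ancestor call: one sync per ancestor.
fn certify_sync_each(journal: &mut CountingJournal, ancestors: usize) {
    for _ in 0..ancestors {
        journal.append();
        journal.sync();
    }
}

/// Append-only per ancestor, with the caller syncing once at the end.
fn certify_batched(journal: &mut CountingJournal, ancestors: usize) {
    for _ in 0..ancestors {
        journal.append();
    }
    if ancestors > 0 {
        journal.sync();
    }
}
```

For a chain of 5 ancestors, the first variant records 5 syncs and the second records 1, with the same 5 appends in both cases.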

let artifact = Artifact::Notarization(notarization.clone());
let (added, equivocator) = self.state.add_notarization(notarization);
let mut inferred = Vec::new();
if added {
Contributor Author

@0xAysh 0xAysh Apr 16, 2026


infer_parent() only walks up one level

Comment thread consensus/src/simplex/actors/voter/actor.rs Outdated
Comment thread consensus/src/simplex/actors/voter/state.rs Outdated
Comment thread consensus/src/simplex/actors/voter/actor.rs Outdated
@0xAysh 0xAysh marked this pull request as draft April 17, 2026 02:15
@0xAysh 0xAysh marked this pull request as ready for review April 18, 2026 01:55
@0xAysh 0xAysh marked this pull request as draft April 18, 2026 02:03

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit cf35574.

Comment thread consensus/src/simplex/actors/voter/actor.rs Outdated
@0xAysh 0xAysh force-pushed the consensus/infer-certification branch from 45fb11b to a07ee17 Compare April 18, 2026 07:06
@0xAysh 0xAysh closed this Apr 20, 2026
@0xAysh 0xAysh force-pushed the consensus/infer-certification branch from 6db4899 to c6a805b Compare April 20, 2026 07:47
When a notarization for view N arrives, the f+1 signers must have
certified N's parent before they could vote, so any uncertified ancestors
with a notarization can be certified locally without dispatching to the
automaton.

- Add `is_certify_ready` to `Round` as the inference eligibility predicate
- Add `infer_ancestors` to `State`: walks the ancestor chain, certifies
  Ready ancestors with notarizations, removes them from
  `certification_candidates`, returns list for journaling and signaling
- Add `notarized_views_descending` to `State` for the post-replay pass
- Refactor `handle_certification` to append-only; callers own syncing
- Add inference walk in `try_broadcast_notarization`: after notarization
  is durable, infer ancestors, journal all, single `sync_all`, then signal
- Add post-replay inference pass after journal replay to recover
  certifications that were not written before a crash
- 8 unit tests for `infer_ancestors` covering all stop conditions
- Integration test `test_post_replay_inference_certifies_ancestor`
  verifying crash-recovery path across all 6 scheme variants
@0xAysh 0xAysh reopened this Apr 20, 2026
@0xAysh
Contributor Author

0xAysh commented Apr 20, 2026

This PR is ready for review. Bugbot did not run after my latest push: I checked the PR checks and do not see a Bugbot run. Could anyone please take a look, or tell me if I should retrigger it from my side?

@0xAysh 0xAysh marked this pull request as ready for review April 20, 2026 08:14

Development

Successfully merging this pull request may close these issues.

[consensus/simplex] Infer Certification

1 participant