run restore on all stf periodically by sapinb · Pull Request #145 · alpenlabs/mosaic

sapinb · 2026-04-02T06:08:16Z

Description

Run stf::restore() periodically on all state machines based on configured time interval.

This is a catch-all to resolve any lost actions that prevent the stf from continuing execution.
Equivalent to periodically restarting the mosaic node.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature/Enhancement (non-breaking change which adds functionality or enhances an existing one)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactor
New or updated tests
Dependency update
Security fix

Notes to Reviewers

Job scheduler does not currently dedupe actions. With periodic restores, this can potentially cause duplicate job runs, although it will not impact the correctness of the process, just some inefficiency.

Checklist

I have performed a self-review of my code.
I have commented my code where necessary.
I have updated the documentation if needed.
My changes do not introduce new warnings.
I have added tests that prove my changes are effective or that my feature works.
New and existing tests pass with my changes.

Related Issues

AaronFeickert · 2026-04-02T18:47:01Z

What's the worst-case scenario for the efficiency hit due to job duplication? If this ends up meaningfully affecting runtime, we should understand the extent.

This is a catch-all to resolve any lost actions that prevent the stf from continuing execution. Equivalent to periodically restarting the mosaic node.

Zk2u

If we are going to do this, we should clear the job queue then restore into it. This needs to ensure that the clearing operation is done properly, doesn't have race conditions or violate any of the other guarantees the job scheduler makes to the other components

Zk2u · 2026-04-09T12:38:34Z

Blocking issue in crates/sm-executor/src/lib.rs inside run_inner.

This PR still rebuilds self.job_handle.recv(), self.net_client.recv(), and self.command_rx.recv() directly inside monoio::select! on every loop iteration. That is the cancel-unsafe pattern we just fixed in #167: if another branch wins after a direct handoff has matched a waiting kanal receiver, the dropped receive future can drop the delivered message.

So as written, this PR reintroduces the same completion-loss class we traced for the GeneratingPolynomialCommitments wedge. Before this can merge, it needs to be rebased onto current main and the periodic restore tick integrated with the persistent pinned receive futures pattern from #167, rather than the old per-iteration recv() futures.

This is on the assumption this is actually a good idea, which i'm not sure it is and we should discuss on slack first

sapinb requested review from AaronFeickert, Zk2u and manishbista28 April 2, 2026 06:08

sapinb force-pushed the feat/periodic-stf-restore branch from 7c90d38 to 80ac2b9 Compare April 2, 2026 07:47

AaronFeickert force-pushed the feat/periodic-stf-restore branch from 80ac2b9 to c81a8af Compare April 2, 2026 18:44

AaronFeickert approved these changes Apr 3, 2026

View reviewed changes

feat: add run restore on all stf periodically

2b77426

This is a catch-all to resolve any lost actions that prevent the stf from continuing execution. Equivalent to periodically restarting the mosaic node.

AaronFeickert force-pushed the feat/periodic-stf-restore branch from c81a8af to 2b77426 Compare April 3, 2026 16:40

Zk2u requested changes Apr 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

run restore on all stf periodically#145

run restore on all stf periodically#145
sapinb wants to merge 1 commit intomainfrom
feat/periodic-stf-restore

sapinb commented Apr 2, 2026 •

edited

Loading

Uh oh!

AaronFeickert commented Apr 2, 2026

Uh oh!

Zk2u left a comment

Uh oh!

Zk2u commented Apr 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sapinb commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

Notes to Reviewers

Checklist

Related Issues

Uh oh!

AaronFeickert commented Apr 2, 2026

Uh oh!

Zk2u left a comment

Choose a reason for hiding this comment

Uh oh!

Zk2u commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sapinb commented Apr 2, 2026 •

edited

Loading

Zk2u commented Apr 9, 2026 •

edited

Loading