Skip to content

run restore on all stf periodically#145

Open
sapinb wants to merge 1 commit intomainfrom
feat/periodic-stf-restore
Open

run restore on all stf periodically#145
sapinb wants to merge 1 commit intomainfrom
feat/periodic-stf-restore

Conversation

@sapinb
Copy link
Copy Markdown
Collaborator

@sapinb sapinb commented Apr 2, 2026

Description

Run stf::restore() periodically on all state machines based on configured time interval.

This is a catch-all to resolve any lost actions that prevent the stf from continuing execution.
Equivalent to periodically restarting the mosaic node.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature/Enhancement (non-breaking change which adds functionality or enhances an existing one)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactor
  • New or updated tests
  • Dependency update
  • Security fix

Notes to Reviewers

Job scheduler does not currently dedupe actions. With periodic restores, this can potentially cause duplicate job runs, although it will not impact the correctness of the process, just some inefficiency.

Checklist

  • I have performed a self-review of my code.
  • I have commented my code where necessary.
  • I have updated the documentation if needed.
  • My changes do not introduce new warnings.
  • I have added tests that prove my changes are effective or that my feature works.
  • New and existing tests pass with my changes.

Related Issues

@sapinb sapinb force-pushed the feat/periodic-stf-restore branch from 7c90d38 to 80ac2b9 Compare April 2, 2026 07:47
@AaronFeickert AaronFeickert force-pushed the feat/periodic-stf-restore branch from 80ac2b9 to c81a8af Compare April 2, 2026 18:44
@AaronFeickert
Copy link
Copy Markdown
Collaborator

What's the worst-case scenario for the efficiency hit due to job duplication? If this ends up meaningfully affecting runtime, we should understand the extent.

This is a catch-all to resolve any lost actions that prevent the stf from continuing
execution. Equivalent to periodically restarting the mosaic node.
@AaronFeickert AaronFeickert force-pushed the feat/periodic-stf-restore branch from c81a8af to 2b77426 Compare April 3, 2026 16:40
Copy link
Copy Markdown
Collaborator

@Zk2u Zk2u left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to do this, we should clear the job queue then restore into it. This needs to ensure that the clearing operation is done properly, doesn't have race conditions or violate any of the other guarantees the job scheduler makes to the other components

@Zk2u
Copy link
Copy Markdown
Collaborator

Zk2u commented Apr 9, 2026

Blocking issue in crates/sm-executor/src/lib.rs inside run_inner.

This PR still rebuilds self.job_handle.recv(), self.net_client.recv(), and self.command_rx.recv() directly inside monoio::select! on every loop iteration. That is the cancel-unsafe pattern we just fixed in #167: if another branch wins after a direct handoff has matched a waiting kanal receiver, the dropped receive future can drop the delivered message.

So as written, this PR reintroduces the same completion-loss class we traced for the GeneratingPolynomialCommitments wedge. Before this can merge, it needs to be rebased onto current main and the periodic restore tick integrated with the persistent pinned receive futures pattern from #167, rather than the old per-iteration recv() futures.

This is on the assumption this is actually a good idea, which i'm not sure it is and we should discuss on slack first

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants