Contributor

@gzalz gzalz commented Nov 10, 2025

Problem: Internal and external operators have seen transient errors crash the tip router operator. These errors are generally related to timed-out RPC requests. While examining loop_stages, I found errors that are not handled the way we would like.

Solution:

  • Wait for the epoch info and schedule RPC requests to come back, and log any failures. The operator cannot do anything useful without this state. Because the fetch is periodic, a failed request should be handled gracefully instead of taking the operator down.
  • Do not handle submit_to_ncn in CastVote with ?; log the error instead. This gives the operator a chance to vote again and recover from whatever RPC issue caused the failure.
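
The first bullet can be sketched as a small retry helper. This is a minimal illustration and not the code from this PR: `RpcError`, `EpochInfo`, and `fetch_with_retry` are hypothetical names, the RPC call is simulated by a closure, and a real operator would probably sleep or back off between attempts.

```rust
use std::fmt;

// Hypothetical stand-ins; none of these names come from the PR.
#[derive(Debug)]
struct RpcError(String);

impl fmt::Display for RpcError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "rpc error: {}", self.0)
    }
}

#[derive(Debug, Clone, PartialEq)]
struct EpochInfo {
    epoch: u64,
}

/// Calls `fetch` up to `max_attempts` times, logging each failure.
/// Returns `None` if every attempt failed, so the caller can skip this
/// pass and try again on the next periodic tick instead of crashing.
fn fetch_with_retry<F>(mut fetch: F, max_attempts: u32) -> Option<EpochInfo>
where
    F: FnMut() -> Result<EpochInfo, RpcError>,
{
    for attempt in 1..=max_attempts {
        match fetch() {
            Ok(info) => return Some(info),
            Err(e) => eprintln!("epoch info attempt {attempt}/{max_attempts} failed: {e}"),
        }
    }
    None
}

fn main() {
    // Simulated RPC that times out twice before succeeding.
    let mut calls = 0;
    let result = fetch_with_retry(
        || {
            calls += 1;
            if calls < 3 {
                Err(RpcError("request timed out".into()))
            } else {
                Ok(EpochInfo { epoch: 42 })
            }
        },
        5,
    );
    println!("{result:?}");
}
```

Returning `Option` rather than `Result` is a deliberate choice in this sketch: the failures have already been logged, and the caller only needs to know whether usable state is available for this pass.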

OperatorState::CastVote => {
    let meta_merkle_tree_path =
        meta_merkle_tree_path(epoch_to_process, &cli.get_save_path());

Contributor Author

This is the big one. Any panic that killed the main process on an external validator came from here. The tokio task in main.rs that submits to the NCN handles its errors properly; this call does not.
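
The difference between propagating a failed submission with `?` and logging it can be sketched as follows. This is an illustration under assumptions, not the operator's code: `submit_vote`, `run_propagating`, and `run_logging` are hypothetical names, and the submission always fails here to simulate a timed-out RPC request.

```rust
// Hypothetical stand-in for the real submission call; it always fails
// here to simulate a timed-out RPC request.
fn submit_vote(epoch: u64) -> Result<(), String> {
    Err(format!("RPC timeout while voting for epoch {epoch}"))
}

/// Propagates the failure with `?`: the first error ends the loop, and
/// if the caller is `main`, the whole process exits.
fn run_propagating(epochs: &[u64]) -> Result<usize, String> {
    let mut processed = 0;
    for &epoch in epochs {
        submit_vote(epoch)?;
        processed += 1;
    }
    Ok(processed)
}

/// Logs the failure and keeps looping, so a later pass can vote again.
fn run_logging(epochs: &[u64]) -> usize {
    let mut visited = 0;
    for &epoch in epochs {
        if let Err(e) = submit_vote(epoch) {
            eprintln!("vote submission failed, will retry later: {e}");
        }
        visited += 1;
    }
    visited
}

fn main() {
    let epochs = [1, 2, 3];
    // With `?`, the first failure aborts the run.
    assert!(run_propagating(&epochs).is_err());
    // With logging, every epoch is still visited despite the failures.
    assert_eq!(run_logging(&epochs), 3);
}
```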

mrmizz previously approved these changes Nov 10, 2025
Contributor

@mrmizz mrmizz left a comment

Heat

@gzalz gzalz merged commit cfee297 into master Nov 10, 2025
6 checks passed
@gzalz gzalz deleted the fix/stability branch November 10, 2025 22:06

3 participants