Skip to content

fix(node): gracefully clean up iota-node validator components #6831

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

semenov-vladyslav
Copy link
Contributor

Description of change

After a validator node leaves the committee, a few resources in validator components are leaked including metrics and grpc server. If the node will try to re-join the committee it will crash due to resources already being in use.

This PR:

  • changes the way validator metrics are registered: a dedicated registry is used instead of default one;
  • adds a graceful shutdown of validator grpc server and metrics registry clean up during reconfiguration when a validator leaves the committee;
  • adds a reproducer test.

Notes

There are a few ways to handle metrics issue:

  • ignore AlreadyReg error while trying to re-register metrics with the default registry; requires special unwrap handling for each metric;
  • unregister individual metrics with the default registry; the existing components do not provide such functionality; requires unregistering all individual metrics;
  • use a dedicated registry and remove it entirely from the registry service once node is not validator anymore; this seems to be the easiest and cleanest way.

⚠️ There may be other undetected resources leaking that do not cause error if validator components are re-created.

Links to any relevant issues

Fixes #6277.

Type of change

  • Bug fix (a non-breaking change which fixes an issue)

How the change has been tested

scripts/simtest/cargo-simtest simtest --package iota-e2e-tests --test "reconfiguration_tests" --profile ci -- test_reconfig_with_same_validator

Change checklist

  • I have followed the contribution guidelines for this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that new and existing unit tests pass locally with my changes

Release Notes

  • Protocol:
  • Nodes (Validators and Full nodes): Fixed a bug causing iota-node to crash after re-joining the consensus committee as validator.
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:

@semenov-vladyslav semenov-vladyslav requested review from a team as code owners May 9, 2025 09:34
Copy link

vercel bot commented May 9, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

4 Skipped Deployments
Name Status Preview Comments Updated (UTC)
apps-backend ⬜️ Ignored (Inspect) Visit Preview May 22, 2025 8:32am
apps-ui-kit ⬜️ Ignored (Inspect) Visit Preview May 22, 2025 8:32am
rebased-explorer ⬜️ Ignored (Inspect) Visit Preview May 22, 2025 8:32am
wallet-dashboard ⬜️ Ignored (Inspect) Visit Preview May 22, 2025 8:32am

@iota-ci iota-ci added core-protocol node Issues related to the Core Node team labels May 9, 2025
Comment on lines +1764 to 1769
if validator_server_cancel_handle.send(()).is_ok() {
debug!("Validator grpc server cancelled");
} else {
warn!("Failed to cancel validator grpc server");
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on the grpc server cancellation!! 🤯

@alexsporn alexsporn added this to the v1.2.x - protocol v8 milestone May 19, 2025
@muXxer muXxer force-pushed the node/fix-6277-revalidator-crash branch from 3d14434 to f7d54b1 Compare May 22, 2025 08:28
Copy link
Contributor

@piotrm50 piotrm50 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

) -> Result<(
tokio::task::JoinHandle<Result<()>>,
tokio::sync::oneshot::Sender<()>,
)> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to make the return type a struct ? so both return values can be named ? (just for clarity & knowing what they are, wouldn't change functionality)

.bind(config.network_address())
.await
.map_err(|err| anyhow!(err.to_string()))?;
let cancel_handle = server.take_cancel_handle().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose unwrapping is okay here since this only gets run once at node startup ? Also should we use expect instead of unwrap to add error context ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, cancel_handle seems to always contain a handle (at least until using take() the first time):
https://vscode.dev/github/iotaledger/iota/blob/node/fix-6277-revalidator-crash/crates/iota-network-stack/src/server.rs#L210

Would it then make sense to implement cancel() on the server itself ? e.g. grpc_server.cancel() or grpc_server.shutdown() ? then the server can handle the logic of take()-ing its own cancel handle ,and just emit a warning if it's already received a cancel order previously ?

Again this wouldn't really change the logic, it's just for clarity, so it's also fine with me if we don't want to spend time on changing this

Copy link
Contributor

@bingyanglin bingyanglin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-protocol node Issues related to the Core Node team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validator node crashes when it was a validator node before, and becomes a validator/committee member again
9 participants