-
Notifications
You must be signed in to change notification settings - Fork 34
fix(node): gracefully clean up iota-node validator components #6831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 4 Skipped Deployments
|
if validator_server_cancel_handle.send(()).is_ok() { | ||
debug!("Validator grpc server cancelled"); | ||
} else { | ||
warn!("Failed to cancel validator grpc server"); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch on the grpc server cancellation!! 🤯
3d14434
to
f7d54b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
) -> Result<( | ||
tokio::task::JoinHandle<Result<()>>, | ||
tokio::sync::oneshot::Sender<()>, | ||
)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to make the return type a struct ? so both return values can be named ? (just for clarity & knowing what they are, wouldn't change functionality)
.bind(config.network_address()) | ||
.await | ||
.map_err(|err| anyhow!(err.to_string()))?; | ||
let cancel_handle = server.take_cancel_handle().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose unwrapping is okay here since this only gets run once at node startup ? Also should we use expect instead of unwrap to add error context ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, cancel_handle seems to always contain a handle (at least until using take() the first time):
https://vscode.dev/github/iotaledger/iota/blob/node/fix-6277-revalidator-crash/crates/iota-network-stack/src/server.rs#L210
Would it then make sense to implement cancel() on the server itself ? e.g. grpc_server.cancel() or grpc_server.shutdown() ? then the server can handle the logic of take()-ing its own cancel handle ,and just emit a warning if it's already received a cancel order previously ?
Again this wouldn't really change the logic, it's just for clarity, so it's also fine with me if we don't want to spend time on changing this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Description of change
After a validator node leaves the committee, a few resources in validator components are leaked including metrics and grpc server. If the node will try to re-join the committee it will crash due to resources already being in use.
This PR:
Notes
There are a few ways to handle metrics issue:
AlreadyReg
error while trying to re-register metrics with the default registry; requires special unwrap handling for each metric;Links to any relevant issues
Fixes #6277.
Type of change
How the change has been tested
scripts/simtest/cargo-simtest simtest --package iota-e2e-tests --test "reconfiguration_tests" --profile ci -- test_reconfig_with_same_validator
Change checklist
Release Notes