fix: seed peer connections kept open#7687
Conversation
Summary of Changes (Gemini Code Assist): This pull request addresses several issues related to peer connectivity management, particularly concerning seed nodes. It introduces a new feature to automatically disconnect long-lived seed peer connections to improve network health and resource utilization. It also rectifies bugs in idle peer identification and seed peer database queries, ensuring more accurate peer management, and updates configuration and persistence mechanisms to support these improvements.
Changelog
Activity
Code Review
This pull request introduces several important fixes and a new feature for managing seed peer connections. It corrects the logic for identifying idle peers, fixes an issue where the wrong cleanup interval was used by the ConnectivityManager, and resolves a bug in seed peer detection that failed with combined flags. A new feature is added to forcibly disconnect seed peers after a configurable grace period (max_seed_peer_age), provided minimum connectivity is maintained. Additionally, peer flags and features are now persisted to the database. The changes are well-implemented and include a new test for the seed peer disconnection logic. I have one minor suggestion to improve code readability.
Looks good. The seed peer disconnection logic is solid: the min_connectivity guard prevents over-disconnection, and the proactive dialer re-includes seeds when connectivity drops below the threshold. Nice test coverage for the happy path too.
I've got three clarification questions inline. None of them are blockers, just want to make sure I'm reading the intent right.
```diff
     user_agent: self.user_agent.clone(),
 })
-.with_connection_pool_refresh_interval(config.dht.connectivity.random_pool_refresh_interval)
+.with_connection_pool_refresh_interval(config.dht.connectivity.update_interval)
```
This swaps random_pool_refresh_interval (default 2h) for update_interval (default 2min). That's a 60x increase in how often the connection pool refresh runs, which now includes seed disconnection, inactive reaping, and proactive dialing.
Was this intentional? If the goal was to make seed cleanup happen more often, it might be worth a dedicated timer for that instead of cranking the whole pool maintenance cycle. If it is intentional, probably worth a note in the changelog since it'll change behavior for anyone running default configs.
This was intentional.
I think binding `random_pool_refresh_interval` (default 2 h) to this timer is not correct.
The internal default is 60 seconds!
See comms/core/src/connectivity/config.rs:
`connection_pool_refresh_interval: Duration::from_secs(60)`

If the node lost its connection, it would wait 2 hours before proactive dialling kicked in. This restores the responsiveness (as it was initially intended, I guess).
I looked for a changelog, but it seems changelogs are auto-generated here.
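For reference, the intervals discussed in this thread can be contrasted in a small sketch. The struct and field names below are illustrative, not the actual config types; the durations are the values quoted above:

```rust
use std::time::Duration;

// Illustrative stand-in for the connectivity config; values from the thread.
struct ConnectivitySettings {
    update_interval: Duration,                  // DHT default: 2 minutes
    random_pool_refresh_interval: Duration,     // DHT default: 2 hours
    connection_pool_refresh_interval: Duration, // comms internal default: 60 s
}

fn main() {
    let s = ConnectivitySettings {
        update_interval: Duration::from_secs(2 * 60),
        random_pool_refresh_interval: Duration::from_secs(2 * 60 * 60),
        connection_pool_refresh_interval: Duration::from_secs(60),
    };
    // The swap makes pool maintenance run 60x more often than before.
    assert_eq!(
        s.random_pool_refresh_interval.as_secs() / s.update_interval.as_secs(),
        60
    );
}
```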
```diff
 pub fn get_inactive_outbound_connections_mut(&mut self, min_age: Duration) -> Vec<&mut PeerConnection> {
     self.filter_connections_mut(|conn| {
-        conn.age() > min_age && conn.handle_count() <= 1 && conn.substream_count() > 2
+        conn.age() > min_age && conn.handle_count() <= 1 && conn.substream_count() < 3
```
This flips the condition from substream_count() > 2 to substream_count() < 3. These aren't equivalent: the old version matched connections with 3+ substreams, while the new one matches connections with 0-2 substreams.
I'm assuming this is a bug fix? The function name says "inactive" connections, and fewer substreams = more inactive makes sense. Just want to confirm the old logic was wrong.
Yes, a bug fix.
This feature was from PR #4607, which says:
> only reap connections that have less than 3 substreams
So it was a bug.
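To make the inversion concrete, here is a minimal sketch of the two predicates. The `Conn` struct is an illustrative stand-in, not the real `PeerConnection` API:

```rust
use std::time::Duration;

// Illustrative stand-in for PeerConnection; not the real type.
struct Conn {
    age: Duration,
    handle_count: usize,
    substream_count: usize,
}

// Old (buggy) predicate: matched busy connections with 3+ substreams.
fn is_inactive_old(c: &Conn, min_age: Duration) -> bool {
    c.age > min_age && c.handle_count <= 1 && c.substream_count > 2
}

// Fixed predicate: matches quiet connections with fewer than 3 substreams.
fn is_inactive_new(c: &Conn, min_age: Duration) -> bool {
    c.age > min_age && c.handle_count <= 1 && c.substream_count < 3
}

fn main() {
    let min_age = Duration::from_secs(60);
    let quiet = Conn { age: Duration::from_secs(120), handle_count: 1, substream_count: 0 };
    let busy = Conn { age: Duration::from_secs(120), handle_count: 1, substream_count: 5 };

    // The old logic reaped the busy connection and kept the quiet one.
    assert!(is_inactive_old(&busy, min_age) && !is_inactive_old(&quiet, min_age));
    // The fix reaps the quiet connection and leaves the busy one alone.
    assert!(is_inactive_new(&quiet, min_age) && !is_inactive_new(&busy, min_age));
}
```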
```diff
 .inner_join(multi_addresses::table.on(multi_addresses::peer_id.eq(peers::peer_id)))
-.filter(peers::flags.eq(PeerFlags::SEED.to_i32()))
+.filter(diesel::dsl::sql::<diesel::sql_types::Bool>(&format!(
+    "flags & {} != 0",
```
Good fix, using bitwise AND instead of exact equality means peers with SEED | OTHER_FLAG won't get missed. Since PeerFlags::SEED.to_i32() is a compile-time constant (0x01), the format string is safe here.
Minor thought: if Diesel ever adds a bitwise AND expression, it'd be nice to move off raw SQL. Not a big deal for now though.
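The difference between the two checks is easy to demonstrate with plain integer flags. The constants below are hypothetical stand-ins for `PeerFlags` (only `SEED = 0x01` is stated in the review):

```rust
// Hypothetical flag constants mirroring PeerFlags; COMPACT's value is assumed.
const SEED: i32 = 0x01;
const COMPACT: i32 = 0x02;

// Exact-equality check (the old bug): misses peers with combined flags.
fn is_seed_eq(flags: i32) -> bool {
    flags == SEED
}

// Bitwise check (the fix): matches any peer whose SEED bit is set.
fn is_seed_bit(flags: i32) -> bool {
    flags & SEED != 0
}

fn main() {
    let combined = SEED | COMPACT; // 0x03

    assert!(is_seed_eq(SEED));      // plain seed: both checks agree
    assert!(!is_seed_eq(combined)); // combined flags: equality misses it
    assert!(is_seed_bit(combined)); // bitwise AND still finds the seed bit
}
```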
Description
Bugs fixed:
- Fixed the inactive-connection check (`substream_count() > 2` -> `substream_count() < 3`)
- `ConnectivityManager` was using `random_pool_refresh_interval` (default 2 hrs) instead of the intended `update_interval` (default 2 minutes) to fire cleanup logic
- `get_seed_peers` checked exact equality (`flags == SEED`). If a peer had multiple flags (`SEED | COMPACT`), it would be ignored.

New features / changes:

- Seed peer connections are forcibly disconnected once a grace period has elapsed, provided connectivity stays above the minimum (`> min_connectivity`)
- New config option `max_seed_peer_age` (default 15 minutes)
- `update_peer_sql` will persist `flags` and `features` as well

Closes #7558
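The seed-disconnect rule (forcibly drop seed peers older than `max_seed_peer_age`, while keeping connectivity above the minimum) can be sketched as follows. The types, function, and the exact headroom calculation are illustrative assumptions, not the PR's actual implementation:

```rust
use std::time::{Duration, Instant};

// Illustrative seed-peer record; not the real PeerConnection type.
struct SeedConn {
    connected_at: Instant,
}

// How many aged seed connections to disconnect, without letting the total
// connection count fall below min_connectivity. The exact guard semantics
// in the PR may differ slightly; this is a sketch of the idea.
fn seeds_to_disconnect(
    seeds: &[SeedConn],
    total_connections: usize,
    min_connectivity: usize,
    max_seed_peer_age: Duration,
    now: Instant,
) -> usize {
    let aged = seeds
        .iter()
        .filter(|s| now.duration_since(s.connected_at) > max_seed_peer_age)
        .count();
    let headroom = total_connections.saturating_sub(min_connectivity);
    aged.min(headroom)
}

fn main() {
    let now = Instant::now();
    let seeds = vec![
        SeedConn { connected_at: now - Duration::from_secs(20 * 60) }, // 20 min old
        SeedConn { connected_at: now - Duration::from_secs(60) },      // 1 min old
    ];
    let max_age = Duration::from_secs(15 * 60); // default max_seed_peer_age

    // Plenty of headroom: only the aged seed is disconnected.
    assert_eq!(seeds_to_disconnect(&seeds, 10, 3, max_age, now), 1);
    // Already at min_connectivity: nothing is disconnected.
    assert_eq!(seeds_to_disconnect(&seeds, 3, 3, max_age, now), 0);
}
```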
Motivation and Context
It was noticed that seed connections were kept open for a long time.
How Has This Been Tested?
Added a test.
And tested manually: verified that seed connections were disconnected when 15 minutes elapsed.
What process can a PR reviewer use to test or verify this change?
Breaking Changes