feat(network): define NetworkTransport trait + TcpTransport (7/n) #15575

shreyan-gupta merged 3 commits into master
Conversation
Codecov Report ❌ — coverage diff and impacted files:

```
@@            Coverage Diff             @@
##           master   #15575      +/-   ##
==========================================
- Coverage   69.48%   69.46%   -0.03%
==========================================
  Files         940      942       +2
  Lines      214797   214963     +166
  Branches   214797   214963     +166
==========================================
+ Hits       149262   149322      +60
- Misses      59610    59720     +110
+ Partials     5925     5921       -4
```
```rust
fn broadcast_message(&self, tier: tcp::Tier, msg: Arc<PeerMessage>) {
```
Given that only T2 broadcast is supported, I suggest we just remove the tier argument.
```rust
clock: &time::Clock,
edge: Edge,
info: PeerConnectionInfo,
transport: Arc<dyn NetworkTransport>,
```
Does it make sense to introduce transport as a field in NetworkState instead of passing it to its functions? It seems like it fits better there than in PeerActor.
I had tried that but I think it's not possible due to circular dependencies and ownership issues. I can take a deeper look.
The transport that NetworkState methods receive comes from PeerManagerActor. PMA constructs it via TcpTransport::new(state.clone()) — so NetworkState is built first, then TcpTransport wraps an Arc of it (it needs Arc<NetworkState> to spawn PeerActors). PMA holds both and passes transport into the methods that need it.
Storing transport on NetworkState would close a ref cycle (NetworkState → transport → state) and leak both. We could break the cycle with Weak<dyn NetworkTransport> on NetworkState (via Arc::new_cyclic), but every call would then need self.transport.upgrade().expect(...).<method>() — extra boilerplate for an upgrade that can only fail during teardown.
I'd leave transport as a parameter, but happy to revisit if you have a structuring in mind that I'm not seeing.
🤖 Response from Claude Code on behalf of Shreyan
That makes sense, but in this case I find it weird that NetworkState both needs a NetworkTransport instance and also itself implements it.
Let me get back to this problem after all the refactoring. I have some ideas about introducing an interface for the upward communication from the transport layer to the network graph layer (NetworkState), and this could fall under that work.
That said, better to unblock as-is for now.
Pull request overview

Introduces a transport abstraction layer in chain/network by defining a dyn-safe NetworkTransport trait and wiring message delivery paths through it, with an initial production implementation (TcpTransport) wrapping the existing pool-based networking.

Changes:
- Add `NetworkTransport` trait and `ConnectHandle` future type.
- Implement `TcpTransport` as the production transport using existing `NetworkState` tier pools.
- Thread `Arc<dyn NetworkTransport>` through `PeerManagerActor`, `PeerActor`, and `NetworkState` message/broadcast APIs; update affected tests/helpers.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| chain/network/src/peer_manager/tests/tier1.rs | Update tier1 tests to pass a NetworkTransport when sending messages. |
| chain/network/src/peer_manager/testonly.rs | Update test-only ping helper to pass a NetworkTransport. |
| chain/network/src/peer_manager/tcp_transport.rs | Add production TcpTransport implementing NetworkTransport. |
| chain/network/src/peer_manager/peer_manager_actor.rs | Store Arc<dyn NetworkTransport> in PMA and use it for shutdown + routing sends. |
| chain/network/src/peer_manager/network_transport.rs | Define NetworkTransport, ConnectHandle, and ConnectError. |
| chain/network/src/peer_manager/network_state/routing.rs | Route routing-table broadcasts and account announcements via NetworkTransport. |
| chain/network/src/peer_manager/network_state/mod.rs | Update message-send APIs to take &dyn NetworkTransport / Arc<dyn ...>; pass transport through connect/disconnect flows. |
| chain/network/src/peer_manager/mod.rs | Export new network_transport and tcp_transport modules. |
| chain/network/src/peer/testonly.rs | Update peer test harness to provide a NetworkTransport to PeerActor::spawn. |
| chain/network/src/peer/peer_actor.rs | Add transport field to PeerActor and thread transport through edge/routing/account flows. |
```rust
// before
let new_edge =
    edge.remove_edge(this.config.node_id(), &this.config.node_key);
this.add_edges(&clock, EdgesWithSource::Local(vec![new_edge.clone()]))
    .await
    .unwrap()

// after
let transport = TcpTransport::new(this.clone());
this.add_edges(
    &clock,
    EdgesWithSource::Local(vec![new_edge.clone()]),
    transport,
)
```
fix_local_edges creates a new TcpTransport inside NetworkState to satisfy the add_edges(..., transport) API. This couples NetworkState to the production transport and will break the goal of having NetworkState be transport-agnostic (e.g. for a future TestLoop transport). Consider changing fix_local_edges to accept/capture an Arc<dyn NetworkTransport> from the caller (PeerManagerActor) instead of constructing TcpTransport here, and remove the TcpTransport dependency from network_state.
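A hedged sketch of the refactor suggested in the comment above, with hypothetical stand-in types (the real `add_edges`/`fix_local_edges` signatures differ): `NetworkState` takes any `&dyn NetworkTransport` from its caller, so a TestLoop-style transport slots in without `NetworkState` ever naming `TcpTransport`.

```rust
// Stand-in types for illustration only; not the real nearcore API.
trait NetworkTransport {
    fn broadcast(&self, msg: &str) -> String;
}

struct NetworkState;

impl NetworkState {
    // The caller (PeerManagerActor in the PR) supplies the transport,
    // instead of this method constructing a TcpTransport itself.
    fn fix_local_edges(&self, transport: &dyn NetworkTransport) -> String {
        transport.broadcast("local edge update")
    }
}

// A non-TCP transport now works without touching NetworkState.
struct TestLoopTransport;
impl NetworkTransport for TestLoopTransport {
    fn broadcast(&self, msg: &str) -> String {
        format!("testloop: {msg}")
    }
}

fn main() {
    let state = NetworkState;
    assert_eq!(
        state.fix_local_edges(&TestLoopTransport),
        "testloop: local edge update"
    );
}
```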
```rust
let transport: Arc<dyn NetworkTransport> = TcpTransport::new(state.clone());
async move {
```
A TcpTransport instance is created for the listener loop here, but another TcpTransport::new(state.clone()) is created later when constructing PeerManagerActor. Since the PR description notes that Pools will move into TcpTransport in a follow-up, having multiple transport instances risks splitting/duplicating transport-owned state. Prefer creating a single Arc<dyn NetworkTransport> once and cloning it into both the listener loop and the actor struct.
```rust
let transport = TcpTransport::new(network_state.clone());
let (addr, handshake_signal) =
    Self::spawn(clock, actor_system, stream, network_state, transport)?;
```
spawn_and_handshake constructs a fresh TcpTransport internally, which hard-codes the production transport and also increases the number of TcpTransport instances created per connection attempt. If TcpTransport becomes stateful (as planned when Pools move into it), this pattern can cause subtle divergence. Consider taking Arc<dyn NetworkTransport> as a parameter (or otherwise sourcing it from the caller) so this helper can work with non-TCP transports and reuse the shared transport instance.
```rust
/// Transitional: wraps NetworkState's Pools. In PR 9, Pools move to
/// TcpTransport directly.
pub(crate) struct TcpTransport {
    pub state: Arc<NetworkState>,
```
TcpTransport exposes its state field as pub, which makes it easy for other modules to bypass the NetworkTransport abstraction and reach directly into NetworkState/Pools. Since this type is meant to encapsulate transport behavior, consider making the field private (and exposing only the trait methods), to keep the abstraction boundary intact.
Suggested change:

```diff
-    pub state: Arc<NetworkState>,
+    state: Arc<NetworkState>,
```
…hread transport

Define the 6-method NetworkTransport trait (send_message,
broadcast_message, connect_to_peer, disconnect_peer, shutdown,
get_connection_info) and create TcpTransport as a transitional wrapper
delegating to NetworkState's Pools.

Thread transport through all NetworkState methods that send or
broadcast. PMA holds Arc<dyn NetworkTransport>; PeerActor holds
Arc<dyn NetworkTransport>. stop_actor uses
transport.broadcast_message(T2, Disconnect) + transport.shutdown().

Methods updated: send_message_to_peer, send_message_to_account,
send_pong, send_ping, broadcast_routing_table_update, add_edges,
add_accounts, finalize_edge, on_peer_connected, on_peer_disconnected,
register, unregister, fix_local_edges.

All delivery now goes through the trait. Pools remain on NetworkState
(move to TcpTransport in PR 9).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- broadcast_message: replace the all-three-tier match with explicit
  unreachable!() for T1/T3. The previous debug_assert + match masked the
  fact that broadcast is only valid on T2; the unreachable arms document
  and enforce the trait contract directly.
- disconnect_peer: drop the early `return` after the first matching
  tier. A peer can be connected on multiple tiers simultaneously (e.g.
  T1 + T2 for a validator); stop the connection on every tier we find it
  on, otherwise a partial disconnect leaves a stale entry.

Addresses VanBarbascu's review on #15575.
Only T2 has broadcast semantics — T1 (direct validator routing) and T3 (direct state sync) do not. The tier argument was effectively a constant at every call site and the unreachable! arms documented a capability we do not have. Drop the parameter from the trait, simplify TcpTransport's impl, and update both call sites (peer_manager_actor stop_actor and network_state::broadcast_routing_table_update).
## Background
`pytest sanity/rpc_view_history.py` has been failing intermittently in
the merge queue. Initial diagnosis (see PR description history)
suggested a master-state regression, but a closer look at the test0 logs
from a failing run showed the test was racing on doomslug's first-block
timing.
## What's actually wrong
The test queries account state at `block=1` to verify historical RPC
works. In a 3-validator dev cluster with
`min_block_production_delay=1s`, doomslug can skip the very first
height — endorsements for block 1 don't arrive in time, the cluster goes
straight from height 0 to height 2, and the `BlockHeight` column has no
entry for height 1.
Sequence from the failing test0 log:

```
19:49:20.865 producing optimistic block validator=test0 height=1 prev_height=0
19:49:21.745 collect_block_approval target_height=1 (test1 endorsement)
19:49:22.888 considering blocks for production start_height=1 end_height=2
19:49:22.889 doomslug: not ready to produce block target_height=1
19:49:22.892 produce_block{next_height=2}   ← jumped to 2
```
When the test then queries `block=1`, `get_block_hash_by_height(1)`
returns `DBNotFoundErr` → `UnknownBlock` → `handle_unknown_block`
returns HTTP 422.
This is a flake in the test, not a regression. The same flake hit:
- [run 4557](https://nayduck.nearone.org/#/run/4557) (PR near#15478, master `6c6c99a`) — non-nightly variant
- [run 4561](https://nayduck.nearone.org/#/run/4561), [run 4562](https://nayduck.nearone.org/#/run/4562) (PR near#15575, master `a9a453b`) — nightly variant

A successful retry of PR near#15478 ([run 4560](https://nayduck.nearone.org/#/run/4560)) passed the same test on the same SHA `a9a453b`, confirming the flake.
## Fix
Capture the block height after `wait_for_blocks(target=3)` via
`get_latest_block().height` — that block is already on the canonical
chain, so it's safe to query. Replace the hardcoded `block=1` with
this captured height.
## Verification
[NayDuck run near#4565](https://nayduck.nearone.org/#/run/4565) — 15× each
of the stable and nightly variants (30 total) on this branch.
- `NetworkTransport` trait — 5-method dyn-safe trait: `send_message(tier, peer_id, msg)`, `broadcast_message(tier, msg)`, `connect_to_peer(clock, peer_info, tier) -> ConnectHandle` (async, default no-op), `disconnect_peer(peer_id, ban_reason)` (default no-op), `shutdown()` (default no-op); `transport_info()` is added in PR 9
- `TcpTransport` — production impl wrapping existing Pool-based code
- `ConnectHandle` — implements `Future<Output = Result<(), ConnectError>>`
- `Arc<dyn NetworkTransport>` is threaded alongside `Arc<NetworkState>`
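A self-contained sketch of a trait with this shape. The types `Tier`, `PeerMessage`, and `PeerId` below are illustrative stand-ins (the real ones are nearcore's `tcp::Tier` etc.), and the async `connect_to_peer`/`ConnectHandle` method is elided to avoid pulling in a runtime. The point being demonstrated is dyn-safety: the trait can be stored and called as `Arc<dyn NetworkTransport>`.

```rust
use std::sync::Arc;

// Illustrative stand-ins, not the real nearcore types.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Tier {
    T1,
    T2,
    T3,
}

struct PeerMessage(String);
type PeerId = String;

trait NetworkTransport: Send + Sync {
    fn send_message(&self, tier: Tier, peer_id: &PeerId, msg: Arc<PeerMessage>) -> bool;
    fn broadcast_message(&self, tier: Tier, msg: Arc<PeerMessage>);
    // Per the overview, disconnect_peer and shutdown default to no-ops.
    fn disconnect_peer(&self, _peer_id: &PeerId) {}
    fn shutdown(&self) {}
}

// Minimal impl used only to exercise the trait object below.
struct NoopTransport;
impl NetworkTransport for NoopTransport {
    fn send_message(&self, _tier: Tier, _peer_id: &PeerId, _msg: Arc<PeerMessage>) -> bool {
        false // nothing is connected, so delivery "fails"
    }
    fn broadcast_message(&self, tier: Tier, _msg: Arc<PeerMessage>) {
        assert_eq!(tier, Tier::T2, "only T2 supports broadcast");
    }
}

fn main() {
    // Dyn-safety check: this is exactly how PMA/PeerActor hold the transport.
    let t: Arc<dyn NetworkTransport> = Arc::new(NoopTransport);
    assert!(!t.send_message(Tier::T1, &"peer0".to_string(), Arc::new(PeerMessage("m".into()))));
    t.broadcast_message(Tier::T2, Arc::new(PeerMessage("d".into())));
    t.disconnect_peer(&"peer0".to_string());
    t.shutdown();
}
```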