[Perf] Dedicated sync streams #4232

Draft
ljedrz wants to merge 3 commits into staging from perf/dedicated_sync_stream2

Conversation

@ljedrz ljedrz commented Apr 24, 2026

Collaborator

This PR is a draft implementation of dedicated sync streams listed in the node hardening issue, as an alternative to the general chunking proposal. Initially, it applies only to the validators, which will benefit from it the most.

The main reasons for the proposal are:

  • increasing the performance of consensus (by reducing Head-of-Line blocking caused by large network messages)
  • reducing DoS surface (by greatly reducing the maximum network message size)

Benefits over chunking:

  • simpler non-sync network plumbing
  • the potential to reduce the maximum network message size to a much greater degree

The general approach is as follows: the Gateway's Event gains 2 new variants, SyncRequest and SyncResponse (the existing BlockRequest and BlockResponse stay in place in order to temporarily maintain backward compatibility). When a node receives a SyncRequest, it responds with the address of a dedicated sync stream and a short-lived access token that must be used in order to establish the connection. Once the connection is established, the responder sends BlockResponse messages through the dedicated stream, and the existing BlockSync plumbing handles the rest. Once the maximum number of responses per sync stream has been sent (if such a limit is desired), a new sync stream needs to be opened in order to receive more blocks.
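To make the "short-lived access token" step concrete, here is a minimal sketch of how a responder might track outstanding tokens. Only the `SyncToken` newtype comes from this PR; `TokenRegistry`, `issue`, `redeem`, and the TTL parameter are hypothetical names chosen for illustration, and the token bytes are assumed to be supplied by a CSPRNG elsewhere.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Mirrors the `SyncToken` newtype from the PR; the rest of this sketch is hypothetical.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct SyncToken([u8; 32]);

/// Hypothetical responder-side registry of outstanding access tokens.
pub struct TokenRegistry {
    /// How long a token stays valid after being issued (assumed to be a const in practice).
    ttl: Duration,
    /// Issue time of every token that has not been redeemed yet.
    issued: HashMap<SyncToken, Instant>,
}

impl TokenRegistry {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, issued: HashMap::new() }
    }

    /// Records a token issued at `now`. The bytes are assumed to come from a CSPRNG.
    pub fn issue(&mut self, bytes: [u8; 32], now: Instant) -> SyncToken {
        let token = SyncToken(bytes);
        self.issued.insert(token, now);
        token
    }

    /// Consumes a token. Returns `true` only if it was issued and has not expired;
    /// the token is removed either way, so it cannot be replayed.
    pub fn redeem(&mut self, token: &SyncToken, now: Instant) -> bool {
        match self.issued.remove(token) {
            Some(issued_at) => now.saturating_duration_since(issued_at) <= self.ttl,
            None => false,
        }
    }
}
```

In this sketch a token is single-use: the responder would call `issue` when answering a SyncRequest and `redeem` when a connection arrives on the sync stream listener, dropping the connection if `redeem` returns `false`.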

The rough list of code changes, enumerated for simpler referencing if need be:

  1. Some of the network messages are moved to snarkos-node-network to avoid circular dependencies (it will also make sense for the currently Gateway-only messages to reside there once this syncing is extended to non-validators).
  2. New structs, SyncToken and SyncResponse, are introduced.
  3. The Event is extended with 2 new variants, SyncRequest (holding a BlockRequest) and SyncResponse (which holds the new SyncResponse struct).
  4. A SyncStreams object is introduced; it is essentially a node that requires no special peer handling or address resolution, and has a trivial handshake. Just like the Gateway, it contains a clone of sync-related Senders, and the LedgerService. Using a node makes stream handling a lot simpler, and - compared to ad-hoc streams - reduces potential NAT/firewall issues (since only a single listener port is involved). The Tcp node is very lightweight, and so are its connections.
  5. The BlockSync plumbing is adjusted to account for the new logic.
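Items 2 and 3 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: the field names (`stream_addr`, `token`), the `u32` heights, and the `handle_event` helper are hypothetical, and the existing Event variants are elided.

```rust
use std::net::SocketAddr;

/// Simplified stand-in for the existing block request (field types are assumed).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct BlockRequest {
    pub start_height: u32,
    pub end_height: u32,
}

/// The access token newtype shown in the PR diff.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct SyncToken(pub [u8; 32]);

/// New struct from the PR; these two fields are an assumption based on the description.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SyncResponse {
    pub stream_addr: SocketAddr,
    pub token: SyncToken,
}

/// Only the two new variants are shown; the existing ones are elided.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Event {
    SyncRequest(BlockRequest),
    SyncResponse(SyncResponse),
}

/// Hypothetical responder-side dispatch: a SyncRequest is answered with the
/// address of the dedicated sync stream and a fresh token; a SyncResponse
/// needs no reply (the requester would connect to the stream instead).
pub fn handle_event(event: Event, stream_addr: SocketAddr, fresh_token: SyncToken) -> Option<Event> {
    match event {
        Event::SyncRequest(_request) => {
            Some(Event::SyncResponse(SyncResponse { stream_addr, token: fresh_token }))
        }
        Event::SyncResponse(_) => None,
    }
}
```

Keeping the reply small (an address plus a 32-byte token) is what would let the Gateway's maximum message size shrink, since the blocks themselves travel over the dedicated stream.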

The current state of the PR: nodes can successfully establish dedicated sync streams and send/receive blocks, but I'm running into design details of the current BlockSync setup that are incompatible with the new approach; the problematic spots can be seen commented out in block_sync.rs. The issues I've identified thus far are:

  • the sync requests are matched with addresses associated with BlockLocators (as opposed to the dedicated sync streams)
  • the sync responses are currently distributed in a somewhat "fanout" fashion; instead, we should now also be prepared to send many BlockResponse (or even just Block) messages to single peers via a single stream; the expected block ranges should also be much larger, so as to minimize the number of SyncRequest messages that need to be handled by the Gateway

Once these are resolved, the related tests will also need some adjustment.

@kaimast please let me know if you have ideas on how the aforementioned BlockSync integration issues can be solved while maintaining backward compatibility, or suggest how this logic could be delegated elsewhere; feel free to commit to this PR if you'd like to integrate these changes with BlockSync in a way that's aligned with your design and future plans.

Open questions:

  1. Do we want to "cycle" through several streams while syncing? It might be unnecessary, since at that point we're already past twofold authentication (the validator handshake + the access token). Since the sync streams are not used for anything else and are lightweight, I see little harm in it.
  2. Should we limit the requestor to a single sync stream? This would be a problem if we wanted high syncing performance while cycling streams (as we couldn't begin a new stream before concluding the existing one).
  3. Do we want to limit the number of requested blocks? This needs to be weighed with the desired syncing redundancy factor and the performance implications it has for the providers. Note: dedicated sync streams are by design more robust than singular requests for blocks, so we may not need as much redundancy anymore.
  4. The values for some of the consts.

@ljedrz ljedrz requested review from kaimast and vicsn April 24, 2026 14:47

@vicsn vicsn left a comment

Collaborator

1./2. I would optimize for simplicity of the implementation. With the foundations in place, we can more formally balance robustness and performance under various scenarios in the future.

3./4. Can you take a first stab, in a Google Doc, at the limits we should choose? We could try to allocate, say, roughly 10 GB at any given time for all peers combined, assuming the current average block size if needed.

};

#[derive(Clone, PartialEq, Eq, Hash)]
pub struct SyncToken([u8; 32]);

Collaborator

Can you document what this is and how it works? Are we exposed to worse MITM attacks compared to the Gateway handshake?

Collaborator Author

The token is sent in plaintext and could be observed or tampered with by an inline MITM, yes. However, it is strictly an anti-DoS mechanism to prevent unauthorized resource consumption on the validator, not a cryptographic session key. The integrity of the downloaded blocks is guaranteed by the hashes and signatures, not by the transport layer, so any MITM tampering would be instantly caught and rejected.

Comment thread node/sync/src/node.rs
#[async_trait]
impl<N: Network> OnConnect for SyncStreams<N> {
    async fn on_connect(&self, peer_addr: SocketAddr) {
        // Check if we're the ones who provide the sync.

Collaborator

Can two peers sync from each other concurrently? Should we add a test for these edge cases?

Collaborator Author

This behavior is controlled by the BlockSync logic, and remains unchanged. I don't see how such a scenario would be useful, so I'm pretty sure it is disallowed. As for the comment, this check is only required due to the modular nature of the Tcp plumbing - at the point of OnConnect::on_connect, we don't readily know the side of the connection (though this could be exposed by the Tcp if needed), so we can look up the applicable block request (which we would need anyway) in order to check it.

Contributor

I think it could happen if for some reason a block request gets delayed. However, it is very unlikely.

Comment on lines 118 to +124
  // Retrieve the start (inclusive) and end (exclusive) block height.
- let candidate_start_height = self.first().map(|b| b.height()).unwrap_or(0);
- let candidate_end_height = 1 + self.last().map(|b| b.height()).unwrap_or(0);
+ // let candidate_start_height = self.first().map(|b| b.height()).unwrap_or(0);
+ // let candidate_end_height = 1 + self.last().map(|b| b.height()).unwrap_or(0);
  // Check that the range matches the block request.
- if start_height != candidate_start_height || end_height != candidate_end_height {
-     bail!("Peer '{peer_ip}' sent an invalid block response (range does not match block request)")
- }
+ // if start_height != candidate_start_height || end_height != candidate_end_height {
+ //     bail!("Peer '{peer_ip}' sent an invalid block response (range does not match block request)")
+ // }

Collaborator Author

if we are to reuse the existing BlockRequest, this check is no longer correct, as the request can span a much greater range of blocks than the responses
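One possible relaxation, sketched here purely as an illustration and not as what the PR does, would be to require that each response's range is non-empty and contained in the requested range rather than equal to it. The function name, the `u32` heights, and the `String` error type are assumptions; both ranges are treated as start-inclusive and end-exclusive, as in the original check.

```rust
/// Hypothetical containment check: the response range `[resp_start, resp_end)`
/// must be non-empty and lie within the request range `[req_start, req_end)`.
pub fn check_response_range(
    req_start: u32,
    req_end: u32,
    resp_start: u32,
    resp_end: u32,
) -> Result<(), String> {
    // Reject empty or inverted response ranges.
    if resp_start >= resp_end {
        return Err(format!("invalid block response (empty range {resp_start}..{resp_end})"));
    }
    // Reject responses that fall outside of what was requested.
    if resp_start < req_start || resp_end > req_end {
        return Err(format!(
            "invalid block response (range {resp_start}..{resp_end} is outside request {req_start}..{req_end})"
        ));
    }
    Ok(())
}
```

A containment check alone does not detect gaps or duplicates across successive responses on the same stream; that bookkeeping would have to live elsewhere.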

Contributor

Wasn't the plan to change block requests to individual blocks?

Collaborator Author

yes, ultimately I'd want to do so, but I wanted to minimize the diff, at least for now; this would also be easier once the backward compatibility is no longer needed

Comment thread cli/src/commands/start.rs
Comment thread cli/src/commands/start.rs Outdated
Comment thread cli/src/commands/start.rs
Comment thread node/network/src/block_request.rs
Comment thread node/bft/events/src/lib.rs
Comment thread node/sync/src/helpers/sync_channel.rs Outdated
kaimast commented Apr 30, 2026

Contributor

I left a bunch of nitpicky comments, but overall the PR looks great so far!

ljedrz commented Apr 30, 2026

Collaborator Author

@kaimast thanks for the comments so far; also highlighting the request from the description, as I wouldn't want to break the current setup due to its incompatibilities with the new design:

> please let me know if you have ideas on how the aforementioned BlockSync integration issues can be solved while maintaining backward compatibility, or suggest how this logic could be delegated elsewhere; feel free to commit to this PR if you'd like to integrate these changes with BlockSync in a way that's aligned with your design and future plans.

ljedrz added 3 commits May 4, 2026 12:11
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
@ljedrz ljedrz force-pushed the perf/dedicated_sync_stream2 branch from 0428a37 to 33e19aa Compare May 4, 2026 10:18
ljedrz commented May 4, 2026

Collaborator Author

Rebased to fix a conflict.

kaimast commented May 5, 2026

Contributor

Does this branch already work? It would be useful to see whether the performance regression noted in #4224 is also present here.

I am not sure if you have permission to trigger the benchmark workflow. If you do, please trigger when the initial implementation is done.

ljedrz commented May 5, 2026

Collaborator Author

No, as I'm not sure how to plug these changes into BlockSync while maintaining backward compatibility - I was hoping you could suggest how to work around the locator address matching and range limits.


3 participants