
feat: Defer packet signature verification in gossip #5977


Draft · wants to merge 2 commits into master

Conversation


@nyetwurk nyetwurk commented Apr 24, 2025

Moves expensive `par_verify` from initial packet handling (`verify_packet`) to later in the processing pipeline (`process_packets`), after cheaper checks.

This avoids unnecessary cryptographic work for packets that might be dropped, improving performance under load.

Problem

Profiling the gossip code revealed that the majority of ingress packet processing was dedicated to calling par_verify on packets that were going to be discarded later anyway.

Summary of Changes

Defer `par_verify` until the majority of ingress discards have already been completed.
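The reordering can be sketched in isolation. Everything below is a hypothetical stand-in, not agave code: `Packet`, `cheap_checks`, and `expensive_verify` are illustrative names for the real types and checks.

```rust
// Hypothetical sketch of the reordering, not agave code: run the cheap,
// non-cryptographic checks first, and only pay for signature verification
// on the packets that survive them.
struct Packet {
    shred_version: u16,
    payload: Vec<u8>,
}

fn cheap_checks(p: &Packet, expected_version: u16) -> bool {
    // Stand-in for the inexpensive ingress discards (e.g. shred-version
    // and size checks): O(1), no crypto involved.
    p.shred_version == expected_version && !p.payload.is_empty()
}

fn expensive_verify(_p: &Packet) -> bool {
    // Stand-in for par_verify-style signature verification.
    true
}

fn process(packets: Vec<Packet>, expected_version: u16) -> Vec<Packet> {
    packets
        .into_iter()
        .filter(|p| cheap_checks(p, expected_version)) // discard early, cheaply
        .filter(|p| expensive_verify(p)) // crypto only on the survivors
        .collect()
}
```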

DO NOT MERGE YET

[eta] This change alters the relative speed at which the packet queues in the pipeline are processed. One queue, the channel between the run_socket_consume and run_listen threads, can now grow without bound: its producer (run_socket_consume) may be faster than its consumer (run_listen, which now performs par_verify inside process_packets), depending on the machine and on the network/CPU load. This probably merits a ticket.

Even worse, this is likely because the patch de-parallelizes par_verify(), which was definitely unintended.

@mergify mergify bot requested a review from a team April 24, 2025 20:34
Moves expensive `par_verify` from initial packet handling (`verify_packet`) to
later in the processing pipeline (`process_packets`) after cheaper checks.

This avoids unnecessary cryptographic work for packets that might be dropped,
improving performance under load.
@nyetwurk nyetwurk force-pushed the defer-gossip-par_verify branch from 1d4b5da to 50a8fa3 Compare April 24, 2025 20:34
@nyetwurk
Author

Can refactor this to eschew `continue`, if folks like it better:

```rust
for (from_addr, packet) in packets.drain(..).flatten().filter_map(|(from_addr, packet)| {
    if packet.par_verify() {
        Some((from_addr, packet)) // Keep verified packets
    } else {
        debug!("Deferred verification failed for packet from {}", from_addr);
        None // Discard unverified packets
    }
}) {
    // ... handle the verified packet ...
}
```

@nyetwurk nyetwurk mentioned this pull request Apr 24, 2025
@nyetwurk
Author

nyetwurk commented Apr 25, 2025

Do not merge yet, still testing.

It seems stable in master but I am seeing memory leaks on some machines in my v2.2 branch backport. This could use another pair of eyes and/or another tester.

@nyetwurk
Author

nyetwurk commented Apr 25, 2025

Best guess:

On slow machines, run_socket_consume is now significantly faster than run_listen, leading to an ever-growing backlog in the crossbeam_channel; MAX_GOSSIP_TRAFFIC only limits the length of the VecDeque, not the channel.

On fast machines, run_listen is fast enough to keep up with run_socket_consume.

Doing validation earlier bakes in built-in backpressure that prevents unbounded buffering, but at the cost of CPU resources.
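The backpressure difference can be illustrated with std channels (illustrative only; gossip actually uses crossbeam): a bounded channel pushes back on the producer once full, while an unbounded one simply keeps growing when the consumer falls behind.

```rust
use std::sync::mpsc;

// Illustrative only (gossip uses crossbeam, not std::sync::mpsc): a bounded
// channel rejects or blocks the producer once full -- the built-in
// backpressure the old ordering effectively provided -- whereas an unbounded
// channel just keeps growing when the consumer falls behind.
fn bounded_producer_sees_backpressure() -> bool {
    let (tx, _rx) = mpsc::sync_channel::<u32>(2);
    tx.send(1).unwrap();
    tx.send(2).unwrap();
    // Capacity reached: try_send fails instead of letting the queue grow.
    tx.try_send(3).is_err()
}
```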

@alexpyattaev

> Doing validation earlier bakes in built-in backpressure to prevent buffer overflow, but at the cost of CPU resources.

The best strategy in such cases is to use EvictingSender from streamer crate (or equivalent logic). Straight up bounded channel has the downside that it will drop the freshest packets, and we would rather drop the oldest ones if we are going to drop anything.
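The drop-oldest semantics can be sketched with a plain `VecDeque`. `EvictingQueue` below is a hypothetical illustration of the idea only, not the actual `EvictingSender` API from the streamer crate:

```rust
use std::collections::VecDeque;

// Hypothetical illustration of drop-oldest semantics (not the real
// EvictingSender API): when the queue is at capacity, the oldest entry is
// evicted so the freshest packets are the ones that survive.
struct EvictingQueue<T> {
    buf: VecDeque<T>,
    cap: usize,
}

impl<T> EvictingQueue<T> {
    fn new(cap: usize) -> Self {
        Self { buf: VecDeque::with_capacity(cap), cap }
    }

    /// Push a new item, returning the evicted oldest item if we were full.
    fn push(&mut self, item: T) -> Option<T> {
        let evicted = if self.buf.len() == self.cap {
            self.buf.pop_front() // drop the oldest, not the freshest
        } else {
            None
        };
        self.buf.push_back(item);
        evicted
    }

    fn pop(&mut self) -> Option<T> {
        self.buf.pop_front()
    }
}
```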

@nyetwurk
Author

> Doing validation earlier bakes in built-in backpressure to prevent buffer overflow, but at the cost of CPU resources.

> we would rather drop the oldest ones if we are going to drop anything.

I see you are a man of culture :)

@nyetwurk
Author

Confirmed: this affects all branches, master and v2.2. The queuing problem is inherent to the current network stack, a fundamental flaw we have simply been fortunate enough to avoid so far. Any re-allocation of CPU resources to different parts of the network stack will likely result in dramatically different queuing behavior.

@nyetwurk
Author

nyetwurk commented Apr 25, 2025

I do not have the resources, expertise, or ability to make it function properly without significant help.

The main sticking point is that this change de-parallelizes verify, making its use single-threaded and exacerbating the already-bad queuing behavior noted above.

The correct solution is to parallelize all of the easy checks (e.g. shred version checking) that can be done before verify so that they are parallel as well.

@alexpyattaev mentioned the cheap checks don't actually have to be parallel, and he's probably right.

I think just adding a different discard/backpressure approach is the easy fix, but I'm unsure whether it has other side effects.
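For reference, the "parallelize the easy checks" idea might look roughly like this. This is a sketch with std scoped threads and a dummy predicate; the function name and the even-number check are stand-ins, and the real code would presumably use rayon (as `par_verify` does) rather than hand-rolled threads:

```rust
use std::thread;

// Sketch only: fan the cheap checks out across threads so the filtering
// stage can keep pace with run_socket_consume. The even-number predicate is
// a dummy stand-in for a real cheap check (e.g. shred-version matching).
fn parallel_cheap_filter(packets: Vec<u64>, workers: usize) -> Vec<u64> {
    let workers = workers.max(1);
    let chunk_size = ((packets.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = packets
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk.iter().copied().filter(|p| p % 2 == 0).collect::<Vec<_>>()
                })
            })
            .collect();
        // Joining in spawn order preserves the original packet order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```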
