feat: Defer packet signature verification in gossip #5977
Conversation
Moves expensive `par_verify` from initial packet handling (`verify_packet`) to later in the processing pipeline (`process_packets`) after cheaper checks. This avoids unnecessary cryptographic work for packets that might be dropped, improving performance under load.
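For illustration, a minimal sketch of the reordering, with a stub `Packet` type and a hypothetical `cheap_checks` helper standing in for the real gossip code:

```rust
/// Illustrative stand-in for the real packet type (not the actual gossip code).
struct Packet;

impl Packet {
    /// Expensive signature verification (stub).
    fn par_verify(&self) -> bool {
        true
    }
}

/// Cheap pre-verification discard, e.g. shred-version/size checks (stub).
fn cheap_checks(_p: &Packet) -> bool {
    true
}

/// Before: every packet pays for signature verification up front.
fn handle_packets_before(packets: Vec<Packet>) -> Vec<Packet> {
    packets
        .into_iter()
        .filter(|p| p.par_verify()) // expensive crypto first
        .filter(|p| cheap_checks(p)) // cheap discards second
        .collect()
}

/// After: run the cheap discards first, then verify only the survivors.
fn handle_packets_after(packets: Vec<Packet>) -> Vec<Packet> {
    packets
        .into_iter()
        .filter(|p| cheap_checks(p)) // most discards happen here
        .filter(|p| p.par_verify()) // crypto only for the remaining packets
        .collect()
}
```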
1d4b5da to 50a8fa3
Can refactor this to eschew the separate verification pass by filtering inline:

```rust
for (from_addr, packet) in packets.drain(..).flatten().filter_map(|(from_addr, packet)| {
    if packet.par_verify() {
        Some((from_addr, packet)) // keep verified packets
    } else {
        debug!("Deferred verification failed for packet from {}", from_addr);
        None // discard unverified packets
    }
}) {
    // ... handle the verified packet
}
```
Do not merge yet, still testing. It seems stable on master, but I am seeing memory leaks on some machines in my v2.2 backport branch. This could use another pair of eyes and/or another tester.
Best guess: doing validation earlier bakes in built-in backpressure that prevents buffer overflow, but at the cost of CPU resources, so the behavior differs between slow and fast machines.
The best strategy in such cases is to use `EvictingSender` from the streamer crate (or equivalent logic). A straight-up bounded channel has the downside that it drops the freshest packets; if we are going to drop anything, we would rather drop the oldest ones.
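For illustration, a minimal sketch of the equivalent logic over a bounded crossbeam channel; the type and method names here are hypothetical, not the streamer crate's actual API:

```rust
use crossbeam_channel::{bounded, Receiver, Sender, TrySendError};

/// Hypothetical sketch of an evicting sender: when the bounded channel is
/// full, drop the *oldest* queued item to make room for the newest one.
struct EvictingSenderSketch<T> {
    sender: Sender<T>,
    receiver: Receiver<T>, // kept so we can evict the head on overflow
}

impl<T> EvictingSenderSketch<T> {
    /// `capacity` must be at least 1, or `send` below would spin forever.
    fn new(capacity: usize) -> Self {
        let (sender, receiver) = bounded(capacity);
        Self { sender, receiver }
    }

    /// Send `item`, evicting the oldest queued entries while the channel is full.
    fn send(&self, mut item: T) {
        loop {
            match self.sender.try_send(item) {
                Ok(()) => return,
                Err(TrySendError::Full(rejected)) => {
                    // Evict the oldest entry instead of dropping the freshest.
                    let _evicted = self.receiver.try_recv();
                    item = rejected;
                }
                Err(TrySendError::Disconnected(_)) => return,
            }
        }
    }
}
```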
I see you are a man of culture :)
Confirmed, this affects all branches, master and v2.2. The queuing problem is inherent to the current network stack; it is a fundamental flaw that we have simply been fortunate enough to avoid. Any re-allocation of CPU resources to different parts of the network stack will likely result in dramatically different queuing behavior.
I do not have the resources, expertise, or ability to make it function properly without significant help. The main sticking point is that this change de-parallelizes verify and makes it single-threaded, exacerbating the already bad queuing behavior noted above. The correct solution is to parallelize all of the easy checks (e.g. shred-version checking) that can be done before verify so that they run in parallel as well; see the sketch below. @alexpyattaev mentioned the cheap checks don't actually have to be parallel, and he's probably right. I think just adding a different discard/backpressure approach is the easy fix, but I'm unsure whether it has other side effects.
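A minimal sketch of that idea, assuming rayon; the packet type and check names are illustrative stand-ins for the real gossip code:

```rust
use rayon::prelude::*;

/// Illustrative stand-in for a gossip packet (not the real type).
struct GossipPacket {
    shred_version: u16,
    // ...
}

impl GossipPacket {
    /// Cheap pre-verification discard, e.g. shred-version matching (stub).
    fn passes_cheap_checks(&self, expected_shred_version: u16) -> bool {
        self.shred_version == expected_shred_version
    }

    /// Expensive signature verification (stub).
    fn par_verify(&self) -> bool {
        true
    }
}

/// Run the cheap discards on the rayon pool too, so that deferring the
/// signature check does not leave a single-threaded stage in front of it.
fn filter_then_verify(packets: Vec<GossipPacket>, shred_version: u16) -> Vec<GossipPacket> {
    packets
        .into_par_iter()
        .filter(|p| p.passes_cheap_checks(shred_version))
        .filter(|p| p.par_verify())
        .collect()
}
```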
Problem

Profiling the gossip code revealed that the majority of ingress packet processing was dedicated to calling `par_verify` on packets that were going to be discarded later anyway.

Summary of Changes

Defer `par_verify` until the majority of ingress discards have already been completed.

DO NOT MERGE YET
[eta] This change alters the relative speed at which the various packet queues are processed. As a result, one of the queues in the pipeline (specifically, the channel between the `run_socket_consume` and `run_listen` threads) can grow without bound: the producer for that queue (the `run_socket_consume` thread) may be faster than the consumer (the `run_listen` thread, which now performs `par_verify` inside `process_packets`), depending on the type of machine it is running on and the network/CPU load. This probably merits a ticket. Even worse, this is likely because the patch de-parallelizes `par_verify()`, which was definitely unintended.
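A minimal, self-contained sketch of that failure mode, assuming crossbeam-channel; the thread roles and rates are invented for illustration and are not the actual gossip threads:

```rust
use crossbeam_channel::unbounded;
use std::{thread, time::Duration};

fn main() {
    // Unbounded channel standing in for the run_socket_consume -> run_listen queue.
    let (tx, rx) = unbounded::<Vec<u8>>();

    // Producer: ~10k "packets" per second (stand-in for run_socket_consume).
    thread::spawn(move || loop {
        tx.send(vec![0u8; 1232]).unwrap();
        thread::sleep(Duration::from_micros(100));
    });

    // Consumer: ~1k packets per second, now paying the verification cost
    // on this side (stand-in for run_listen doing par_verify).
    let rx_consumer = rx.clone();
    thread::spawn(move || loop {
        let _pkt = rx_consumer.recv().unwrap();
        thread::sleep(Duration::from_millis(1)); // stand-in for par_verify cost
    });

    // Watchdog: the queue depth climbs steadily instead of plateauing.
    loop {
        println!("queue depth: {}", rx.len());
        thread::sleep(Duration::from_secs(1));
    }
}
```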