fix(unix): split oversized GSO sends into smaller chunks#2653
Conversation
FTR: I don't mind this in principle and I see you've marked it as draft for now. Let us know when you've reviewed it yourself and done some testing on it. I am glad this is being worked on. We do this on top of |
| state: &UdpSocketState, | ||
| io: SockRef<'_>, | ||
| transmit: &Transmit<'_>, | ||
| max_per_chunk: usize, |
There was a problem hiding this comment.
is there any practical value to making this dynamic rather than simply hardcoding this to 1?
There was a problem hiding this comment.
Yeah it's hard to follow, working on renaming stuff and avoiding some recursion now.
max_per_chunk should be called gso_segments or something. It's the maximum number of segments that can be contained in a single packet, initially starting at >1 (depending on the OS) and shrinking down to 1 on an error.
If it shrinks down to 1, we call send again to repeat this logic, but gso_segments will be 1 now.
| ) -> io::Result<()> { | ||
| let segment_size = transmit | ||
| .effective_segment_size() | ||
| .expect("caller verified segment_size is set"); |
There was a problem hiding this comment.
If there isn't, can this not safely create a single chunk and send it instead of having to panic? The invariants being held so far from each other often ends up in tears sometimes. I feel like we need to try harder to avoid these things.
There was a problem hiding this comment.
This should be passed in as an argument instead of recomputing it anyway.
e7cc1fb to
ae31e3a
Compare
|
kk I simplified the logic and removed the recursion. It's still untested (on my Mac and the customer has the failing Android anyway). Some of the comments are still AI generated, but I've closely reviewed everything. |
| // `send_batched` halting GSO mid-syscall. | ||
| if e.raw_os_error() == Some(libc::EINVAL) | ||
| && !state.sendmsg_einval() | ||
| && transmit.effective_segment_size().is_none() |
There was a problem hiding this comment.
This fixes an existing bug, where a GSO failure could also disable ECN. Can split into a separate PR if you'd prefer.
thomaseizinger
left a comment
There was a problem hiding this comment.
Left two initial comments.
I also think we should have integration-level test coverage for this.
| #[cfg(any(target_os = "linux", target_os = "android"))] | ||
| if let Some(segment_size) = transmit.effective_segment_size() { | ||
| // Get the (GSO) segment size, and compute the number of segments we could fit into a UDP payload. | ||
| let segment_count = max_segments_per_send(state.max_gso_segments(), segment_size); | ||
| return send_batched(state, io, transmit, segment_size, segment_count); | ||
| } | ||
|
|
||
| // Fall back to sending each packet individually. | ||
| send_packet(state, &io, transmit) |
There was a problem hiding this comment.
I don't really understand this. There should not be a need for branching here because a single packet is a batch of size one with segment_size == packet_len, right?
There was a problem hiding this comment.
It's mostly a micro-optimization to avoid running the send_batched loop/logic if GSO isn't enabled/supported.
You're right that we could effective_segment_size instead of transmit.segment_size. It would end up being a batch of 1 packet.
Lemme see if I can combine the code paths to avoid the branching.
| /// The kernel rejects any GSO send whose total `contents` length exceeds | ||
| /// `u16::MAX`, regardless of `max_gso_segments`. | ||
| #[cfg(any(target_os = "linux", target_os = "android", test))] | ||
| const MAX_SEND_BYTES: usize = u16::MAX as usize; |
There was a problem hiding this comment.
As I've already pointed out, it is not that easy. You need to take into account IP and UDP headers here. See the link I've left in the previous comment.
36af237 to
26ef777
Compare
|
@thomaseizinger better? |
When a `Transmit` exceeds what one `sendmsg` syscall can accept — either because it has more segments than `max_gso_segments` allows, or because its total size exceeds the kernel's per-call limit (`u16::MAX` bytes) — split it upfront into smaller batches dispatched via separate `sendmsg` calls. This catches three cases that today silently drop datagrams or rely on the QUIC layer's loss timers to recover: * Stale in-flight GSO transmits left over from a GSO halt (quinn-rs#2399). After `EIO` halts GSO via `max_gso_segments.store(1)`, any transmits queued while GSO was still enabled now violate the new limit and need to be re-batched as individual sends rather than dropped. * The first GSO send that triggers the halt itself (quinn-rs#2399). The `EIO` arm now re-dispatches the failing transmit through `send` after the halt, where the upfront chunking check splits it into individual datagrams without waiting for a PTO. * Callers that pass oversized `Transmit::contents` (partial quinn-rs#2201) — e.g. >65535 bytes total, which the kernel rejects with `EMSGSIZE`. These are now broken into kernel-acceptable batches, and each chunk still uses GSO when its segment count is > 1. The public contract of `max_gso_segments` is unchanged — callers should still query it and size their `Transmit`s accordingly. This path is a defensive net so that callers which don't (or which raced a GSO halt) don't silently drop datagrams. Cellular Android, where the kernel reports GSO support but the radio driver returns `EIO` on the first real send, is the original motivation: recovery now completes inside the `sendmsg` call instead of waiting for a ~1s handshake PTO. Firefox is hot-fixing the same case in D265812 and Firezone hit it independently (firezone/firezone#10279). Fixes quinn-rs#2399. Refs quinn-rs#2201.
26ef777 to
a24b09a
Compare
thomaseizinger
left a comment
There was a problem hiding this comment.
Overall better yes but I am not sure we are going into the right direction in general. Some questions to think about:
- Shouldn't this logic apply to all code-paths, i.e. Windows + Apple as well? The way I implemented this in Firezone, it is running above
quinn-udp, regardless of whether GSO / apple-fast path is active. - What about integration-level tests? Whether or not we are doing the slicing correctly is only surfaced if we actually hit a kernel and pass it a giant
Transmit, i.e. one with a content that is larger thanu16::MAX.
If we don't apply this loop globally, then I think we should inline this into the current send function. We already have a loop there and we just need to keep looping and call prepare_msg, right?
(untested, and mostly AI generated)
When a
Transmitexceeds what onesendmsgsyscall can accept — either because it has more segments thanmax_gso_segmentsallows, or because its total size exceeds the kernel's per-call limit (u16::MAXbytes) — split it upfront into smaller batches dispatched via separatesendmsgcalls. This catches three cases that today silently drop datagrams or rely on the QUIC layer's loss timers to recover:Stale in-flight GSO transmits left over from a GSO halt (udp: retry failed GSO sendmsg on EIO without GSO #2399). After
EIOhalts GSO viamax_gso_segments.store(1), any transmits queued while GSO was still enabled now violate the new limit and need to be re-batched as individual sends rather than dropped.The first GSO send that triggers the halt itself (udp: retry failed GSO sendmsg on EIO without GSO #2399). The
EIOarm now re-dispatches the failing transmit throughsendafter the halt, where the upfront chunking check splits it into individual datagrams without waiting for a PTO.Callers that pass oversized
Transmit::contents(partial Shouldquinn-udpautomatically chunkTransmit::contents? #2201) — e.g. >65535 bytes total, which the kernel rejects withEMSGSIZE. These are now broken into kernel-acceptable batches, and each chunk still uses GSO when its segment count is > 1.The public contract of
max_gso_segmentsis unchanged — callers should still query it and size theirTransmits accordingly. This path is a defensive net so that callers which don't (or which raced a GSO halt) don't silently drop datagrams.Cellular Android, where the kernel reports GSO support but the radio driver returns
EIOon the first real send, is the original motivation: recovery now completes inside thesendmsgcall instead of waiting for a ~1s handshake PTO. Firefox is hot-fixing the same case in D265812 and Firezone hit it independently (firezone/firezone#10279).Fixes #2399.
Refs #2201.