Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Basic GSO support #2532

Open
wants to merge 22 commits into
base: main
Choose a base branch
from
Open

Conversation

larseggert
Copy link
Collaborator

@larseggert larseggert commented Mar 27, 2025

This simply collects batches of same-size, same-marked datagrams to the same destination together by copying. In essence, we trade more memory copies for fewer system calls. Let's see if this matters at all.

This simply collects batches of same-size, same-marked datagrams to the same destination together by copying. In essence, we trade more memory copies for fewer system calls. Let's see it this matters at all.
@larseggert
Copy link
Collaborator Author

All QNS tests are failing. I see this in the logs:

server  | 1.021 INFO `libc::sendmsg` failed with Input/output error (os error 5); halting segmentation offload
server  | Error: IoError(Os { code: 5, kind: Uncategorized, message: "Input/output error" })

Copy link

github-actions bot commented Mar 27, 2025

Benchmark results

Performance differences relative to e72ae0c.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.
       time:   [462.94 ms 469.38 ms 475.79 ms]
       thrpt:  [210.18 MiB/s 213.05 MiB/s 216.01 MiB/s]
change:
       time:   [-34.939% -34.000% -32.994%] (p = 0.00 < 0.05)
       thrpt:  [+49.240% +51.515% +53.701%]
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: 💔 Performance has regressed.
       time:   [392.73 ms 395.72 ms 398.70 ms]
       thrpt:  [25.082 Kelem/s 25.270 Kelem/s 25.463 Kelem/s]
change:
       time:   [+11.883% +12.780% +13.833%] (p = 0.00 < 0.05)
       thrpt:  [-12.152% -11.332% -10.621%]

Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) low mild
1 (1.00%) high mild

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected.
       time:   [25.875 ms 26.032 ms 26.192 ms]
       thrpt:  [38.179  elem/s 38.415  elem/s 38.647  elem/s]
change:
       time:   [-0.4132% +0.4540% +1.3449%] (p = 0.31 > 0.05)
       thrpt:  [-1.3270% -0.4519% +0.4149%]
1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.
       time:   [1.6377 s 1.6606 s 1.6844 s]
       thrpt:  [59.369 MiB/s 60.218 MiB/s 61.062 MiB/s]
change:
       time:   [-30.319% -29.092% -27.911%] (p = 0.00 < 0.05)
       thrpt:  [+38.718% +41.028% +43.510%]
decode 4096 bytes, mask ff: No change in performance detected.
       time:   [12.006 µs 12.050 µs 12.101 µs]
       change: [-0.0256% +0.4012% +0.9031%] (p = 0.09 > 0.05)

Found 16 outliers among 100 measurements (16.00%)
2 (2.00%) low severe
3 (3.00%) low mild
11 (11.00%) high severe

decode 1048576 bytes, mask ff: Change within noise threshold.
       time:   [2.9423 ms 2.9428 ms 2.9433 ms]
       change: [-0.9268% -0.5577% -0.2400%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

decode 4096 bytes, mask 7f: No change in performance detected.
       time:   [19.999 µs 20.054 µs 20.117 µs]
       change: [-0.2866% +0.0175% +0.3048%] (p = 0.91 > 0.05)

Found 17 outliers among 100 measurements (17.00%)
4 (4.00%) low mild
5 (5.00%) high mild
8 (8.00%) high severe

decode 1048576 bytes, mask 7f: No change in performance detected.
       time:   [4.7918 ms 4.8036 ms 4.8170 ms]
       change: [-0.4027% -0.0148% +0.3371%] (p = 0.93 > 0.05)

Found 14 outliers among 100 measurements (14.00%)
14 (14.00%) high severe

decode 4096 bytes, mask 3f: No change in performance detected.
       time:   [6.3096 µs 6.3280 µs 6.3554 µs]
       change: [-0.2744% +0.0386% +0.3358%] (p = 0.81 > 0.05)

Found 14 outliers among 100 measurements (14.00%)
2 (2.00%) low severe
2 (2.00%) low mild
5 (5.00%) high mild
5 (5.00%) high severe

decode 1048576 bytes, mask 3f: No change in performance detected.
       time:   [2.1515 ms 2.1599 ms 2.1697 ms]
       change: [-0.4846% +0.0953% +0.6765%] (p = 0.73 > 0.05)

Found 13 outliers among 100 measurements (13.00%)
4 (4.00%) high mild
9 (9.00%) high severe

1 streams of 1 bytes/multistream: No change in performance detected.
       time:   [70.208 µs 70.433 µs 70.661 µs]
       change: [-2.8055% -1.1618% -0.1219%] (p = 0.08 > 0.05)
1000 streams of 1 bytes/multistream: No change in performance detected.
       time:   [25.364 ms 25.403 ms 25.442 ms]
       change: [-0.1763% +0.0601% +0.2874%] (p = 0.60 > 0.05)
10000 streams of 1 bytes/multistream: No change in performance detected.
       time:   [1.7053 s 1.7070 s 1.7088 s]
       change: [-0.1800% -0.0367% +0.1168%] (p = 0.63 > 0.05)

Found 15 outliers among 100 measurements (15.00%)
5 (5.00%) low mild
9 (9.00%) high mild
1 (1.00%) high severe

1 streams of 1000 bytes/multistream: No change in performance detected.
       time:   [72.150 µs 72.799 µs 73.877 µs]
       change: [-0.4061% +0.5686% +2.0806%] (p = 0.49 > 0.05)

Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
1 (1.00%) high mild
1 (1.00%) high severe

100 streams of 1000 bytes/multistream: No change in performance detected.
       time:   [3.3787 ms 3.3852 ms 3.3922 ms]
       change: [-0.5080% -0.2223% +0.0604%] (p = 0.14 > 0.05)

Found 22 outliers among 100 measurements (22.00%)
22 (22.00%) high severe

1000 streams of 1000 bytes/multistream: Change within noise threshold.
       time:   [145.52 ms 145.60 ms 145.68 ms]
       change: [-0.1841% -0.1054% -0.0275%] (p = 0.01 < 0.05)
coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [94.978 ns 95.348 ns 95.716 ns]
       change: [-1.7661% -0.4432% +0.4581%] (p = 0.56 > 0.05)

Found 11 outliers among 100 measurements (11.00%)
9 (9.00%) high mild
2 (2.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
       time:   [112.97 ns 113.26 ns 113.58 ns]
       change: [-0.1930% +0.1205% +0.4916%] (p = 0.48 > 0.05)

Found 13 outliers among 100 measurements (13.00%)
1 (1.00%) low severe
1 (1.00%) low mild
3 (3.00%) high mild
8 (8.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [112.55 ns 113.00 ns 113.53 ns]
       change: [-1.1088% -0.3812% +0.2357%] (p = 0.29 > 0.05)

Found 21 outliers among 100 measurements (21.00%)
5 (5.00%) low severe
4 (4.00%) low mild
3 (3.00%) high mild
9 (9.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [93.573 ns 94.028 ns 94.546 ns]
       change: [-1.3954% -0.0776% +1.2639%] (p = 0.91 > 0.05)

Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe

RxStreamOrderer::inbound_frame(): Change within noise threshold.
       time:   [117.37 ms 117.43 ms 117.48 ms]
       change: [+0.3430% +0.4077% +0.4741%] (p = 0.00 < 0.05)

Found 11 outliers among 100 measurements (11.00%)
4 (4.00%) low mild
7 (7.00%) high mild

SentPackets::take_ranges: No change in performance detected.
       time:   [8.2999 µs 8.5470 µs 8.7770 µs]
       change: [-2.6528% +1.1789% +5.2061%] (p = 0.57 > 0.05)

Found 22 outliers among 100 measurements (22.00%)
8 (8.00%) low severe
9 (9.00%) low mild
2 (2.00%) high mild
3 (3.00%) high severe

transfer/pacing-false/varying-seeds: Change within noise threshold.
       time:   [35.447 ms 35.517 ms 35.586 ms]
       change: [+0.2619% +0.5151% +0.7639%] (p = 0.00 < 0.05)

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild

transfer/pacing-true/varying-seeds: Change within noise threshold.
       time:   [35.518 ms 35.580 ms 35.642 ms]
       change: [-0.5468% -0.3062% -0.0626%] (p = 0.01 < 0.05)

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild

transfer/pacing-false/same-seed: Change within noise threshold.
       time:   [35.561 ms 35.623 ms 35.687 ms]
       change: [+0.6737% +0.9010% +1.1372%] (p = 0.00 < 0.05)
transfer/pacing-true/same-seed: No change in performance detected.
       time:   [35.728 ms 35.773 ms 35.819 ms]
       change: [-0.0864% +0.0948% +0.2767%] (p = 0.30 > 0.05)

Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild

Client/server transfer results

Performance differences relative to e72ae0c.

Transfer of 33554432 bytes over loopback, 30 runs. All unit-less numbers are in milliseconds.

Client Server CC Pacing Mean ± σ Min Max MiB/s ± σ Δ main Δ main
neqo neqo reno on 258.0 ± 46.3 209.9 424.1 124.1 ± 0.7 💚 -161.7 -38.5%
neqo neqo reno 275.2 ± 108.8 207.6 722.3 116.3 ± 0.3 💚 -156.5 -36.2%
neqo neqo cubic on 241.2 ± 25.6 210.8 355.3 132.7 ± 1.2 💚 -163.5 -40.4%
neqo neqo cubic 234.9 ± 12.8 210.3 265.0 136.2 ± 2.5 💚 -173.0 -42.4%
google neqo reno on 757.2 ± 131.5 509.0 1007.3 42.3 ± 0.2 -2.1 -0.3%
google neqo reno 750.6 ± 125.1 492.9 876.2 42.6 ± 0.3 -10.3 -1.4%
google neqo cubic on 855.3 ± 177.3 541.5 1072.4 37.4 ± 0.2 💔 86.8 11.3%
google neqo cubic 834.5 ± 174.4 520.7 1143.3 38.3 ± 0.2 💔 75.6 10.0%
google google 565.6 ± 14.5 546.6 616.3 56.6 ± 2.2 💚 -14.1 -2.4%
neqo msquic reno on 265.7 ± 17.6 243.2 333.1 120.5 ± 1.8 -6.5 -2.4%
neqo msquic reno 264.5 ± 14.2 245.3 309.2 121.0 ± 2.3 -13.0 -4.7%
neqo msquic cubic on 269.6 ± 34.5 241.4 437.9 118.7 ± 0.9 6.0 2.3%
neqo msquic cubic 260.6 ± 12.4 242.6 296.1 122.8 ± 2.6 -7.4 -2.8%
msquic msquic 175.0 ± 21.9 149.7 234.9 182.9 ± 1.5 -3.1 -1.7%

⬇️ Download logs

Copy link

github-actions bot commented Mar 28, 2025

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to e72ae0c.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

@larseggert larseggert marked this pull request as ready for review March 28, 2025 15:23
Copy link
Collaborator

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early benchmarks look promising. That said, I am not sure whether we will see similar improvements when benchmarked through Firefox with connection latency and bandwidth limit.

As discussed out-of-band, I would favor a more integrated implementation, moving all batching logic into neqo-transport::Connection. Connection can be more efficient at batching, having access to all known information of the connection, and being able to allocate all batcheable datagrams at once. In addition, this would allow a single batching implementation, then used by neqo-client, neqo-server, mozilla-central/http3server and lastly of course Firefox.

For others, past draft of the above mentioned integrated implementation: f25b0b7

@larseggert what are the next steps? I would suggest applying the same non-integrated optimization to neqo_glue/src/lib.rs. You can easily use a custom neqo-* version through a mozilla/central/Cargo.toml override. We can then either test Firefox upload speed against a local HTTP3 server, or using Andrew's upload automation (MacOS) for more reproducible results, using a real-world connection to Google's infrastructure instead of a localhost setup.

Comment on lines +115 to +116
/// When a datagram is pushed that does not match the meta data of the batch,
/// it is stored in `next` and a send indication is returned.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mechanism of next is not intuitive to me. Why doesn't push simply return the Datagram when it doesn't match?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because then the different users of the type would need to implement their own method for storing it and switching to it. I thought it would be simpler if the type handled that for the caller.

@larseggert
Copy link
Collaborator Author

larseggert commented Mar 31, 2025

I have started to do a version of this in the glue code. It's a bit challenging because the current mainline of neqo has picked up a bunch of dependencies beyond that of Firefox, and I need to figure out how to upgrade those...

Am wondering if we should cut a neqo release soon before there is more divergence.

@mxinden
Copy link
Collaborator

mxinden commented Mar 31, 2025

Am wondering if we should cut a neqo release soon before there is more divergence.

I was planning to cut a new release once #2492 is merged. @larseggert I am happy to cut a new release beforehand if you like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants