
feat: Basic GSO support #2532

Closed: larseggert wants to merge 34 commits

larseggert (Collaborator) commented Mar 27, 2025

This simply collects batches of same-size, same-marked datagrams to the same destination together by copying. In essence, we trade more memory copies for fewer system calls. Let's see if this matters at all.
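
The grouping rule described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `Datagram` and `Batch` types, the `batch` function, and the `max_segments` limit are all hypothetical names, and "same-marked" is assumed to mean an identical IP TOS byte.

```rust
use std::net::SocketAddr;

/// Hypothetical stand-in for an outgoing datagram (not neqo's actual type).
#[derive(Debug, Clone, PartialEq)]
struct Datagram {
    dst: SocketAddr,
    tos: u8, // assumed marking: the IP TOS byte (DSCP + ECN)
    data: Vec<u8>,
}

/// A run of equal-size datagrams copied into one contiguous buffer,
/// suitable for handing to the kernel in a single send call.
#[derive(Debug, PartialEq)]
struct Batch {
    dst: SocketAddr,
    tos: u8,
    segment_size: usize,
    data: Vec<u8>,
}

impl Batch {
    /// Number of datagrams copied into this batch.
    fn segments(&self) -> usize {
        self.data.len() / self.segment_size
    }
}

/// Group consecutive datagrams that share destination, marking and size.
/// Each datagram is copied once; empty datagrams are skipped.
fn batch(datagrams: &[Datagram], max_segments: usize) -> Vec<Batch> {
    let mut out: Vec<Batch> = Vec::new();
    for d in datagrams {
        if d.data.is_empty() {
            continue;
        }
        if let Some(b) = out.last_mut() {
            let compatible = b.dst == d.dst
                && b.tos == d.tos
                && b.segment_size == d.data.len()
                && b.segments() < max_segments;
            if compatible {
                // The extra memory copy that buys fewer system calls.
                b.data.extend_from_slice(&d.data);
                continue;
            }
        }
        // Start a new batch whenever any of the three properties changes.
        out.push(Batch {
            dst: d.dst,
            tos: d.tos,
            segment_size: d.data.len(),
            data: d.data.clone(),
        });
    }
    out
}
```

With this rule, three 1200-byte datagrams to one peer followed by a 600-byte datagram produce two batches, so four sends become two.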

larseggert (Collaborator, Author) commented

All QNS tests are failing. I see this in the logs:

server  | 1.021 INFO `libc::sendmsg` failed with Input/output error (os error 5); halting segmentation offload
server  | Error: IoError(Os { code: 5, kind: Uncategorized, message: "Input/output error" })
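
The log suggests that segmentation offload is halted after the failed `sendmsg`, while the error still appears to end the server run. One possible way to tolerate such a failure is to disable batching and resend the same data one datagram at a time. The sketch below only illustrates that idea: `Sender`, `send_batch`, and `send_one` are hypothetical names, and falling back on every error kind is a simplifying assumption.

```rust
use std::io;

/// Hypothetical sender state (illustrative, not an existing API).
struct Sender {
    gso_enabled: bool,
}

impl Sender {
    /// Send `data` as segments of `segment_size` bytes.
    /// `send_batch` stands in for one offloaded system call,
    /// `send_one` for a plain per-datagram send.
    fn send<B, O>(
        &mut self,
        data: &[u8],
        segment_size: usize,
        mut send_batch: B,
        mut send_one: O,
    ) -> io::Result<()>
    where
        B: FnMut(&[u8], usize) -> io::Result<()>,
        O: FnMut(&[u8]) -> io::Result<()>,
    {
        if data.is_empty() || segment_size == 0 {
            return Ok(());
        }
        if self.gso_enabled {
            match send_batch(data, segment_size) {
                Ok(()) => return Ok(()),
                // Assumption: any failure disables offload for good.
                // A real implementation would likely inspect the error.
                Err(_) => self.gso_enabled = false,
            }
        }
        // Fallback path: one send per segment, last one may be shorter.
        for segment in data.chunks(segment_size) {
            send_one(segment)?;
        }
        Ok(())
    }
}
```

The trade-off is that a failed batch costs one wasted system call before the fallback, after which the sender behaves like the unbatched code path.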

github-actions bot commented Mar 27, 2025

Benchmark results

Performance differences relative to 06c007e.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.
       time:   [278.77 ms 282.23 ms 285.68 ms]
       thrpt:  [350.04 MiB/s 354.33 MiB/s 358.71 MiB/s]
change:
       time:   [-69.967% -68.844% -67.628%] (p = 0.00 < 0.05)
       thrpt:  [+208.91% +220.96% +232.97%]
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: 💔 Performance has regressed.
       time:   [388.79 ms 393.18 ms 398.26 ms]
       thrpt:  [25.110 Kelem/s 25.434 Kelem/s 25.721 Kelem/s]
change:
       time:   [+15.492% +16.873% +18.319%] (p = 0.00 < 0.05)
       thrpt:  [-15.483% -14.437% -13.414%]

Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: 💔 Performance has regressed.
       time:   [26.470 ms 26.595 ms 26.736 ms]
       thrpt:  [37.403  elem/s 37.601  elem/s 37.779  elem/s]
change:
       time:   [+2.5803% +3.3466% +4.1706%] (p = 0.00 < 0.05)
       thrpt:  [-4.0036% -3.2382% -2.5154%]

Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.
       time:   [1.1900 s 1.2022 s 1.2144 s]
       thrpt:  [82.345 MiB/s 83.180 MiB/s 84.033 MiB/s]
change:
       time:   [-34.210% -33.125% -32.145%] (p = 0.00 < 0.05)
       thrpt:  [+47.372% +49.532% +51.999%]
decode 4096 bytes, mask ff: No change in performance detected.
       time:   [12.072 µs 12.105 µs 12.145 µs]
       change: [-0.9610% -0.1559% +0.4897%] (p = 0.72 > 0.05)

Found 19 outliers among 100 measurements (19.00%)
3 (3.00%) low severe
5 (5.00%) low mild
2 (2.00%) high mild
9 (9.00%) high severe

decode 1048576 bytes, mask ff: No change in performance detected.
       time:   [3.1327 ms 3.1428 ms 3.1543 ms]
       change: [-0.5210% -0.0295% +0.5089%] (p = 0.91 > 0.05)

Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low mild
10 (10.00%) high severe

decode 4096 bytes, mask 7f: No change in performance detected.
       time:   [20.152 µs 20.203 µs 20.260 µs]
       change: [-0.8010% -0.3795% +0.0250%] (p = 0.08 > 0.05)

Found 21 outliers among 100 measurements (21.00%)
4 (4.00%) low severe
3 (3.00%) low mild
2 (2.00%) high mild
12 (12.00%) high severe

decode 1048576 bytes, mask 7f: No change in performance detected.
       time:   [5.2468 ms 5.2583 ms 5.2714 ms]
       change: [-0.4201% -0.0685% +0.2965%] (p = 0.72 > 0.05)

Found 14 outliers among 100 measurements (14.00%)
14 (14.00%) high severe

decode 4096 bytes, mask 3f: No change in performance detected.
       time:   [7.0353 µs 7.0775 µs 7.1224 µs]
       change: [-1.8397% -0.2449% +1.1385%] (p = 0.78 > 0.05)

Found 18 outliers among 100 measurements (18.00%)
3 (3.00%) low severe
2 (2.00%) low mild
13 (13.00%) high severe

decode 1048576 bytes, mask 3f: No change in performance detected.
       time:   [1.7921 ms 1.7978 ms 1.8048 ms]
       change: [-0.4674% +0.0096% +0.5485%] (p = 0.92 > 0.05)

Found 6 outliers among 100 measurements (6.00%)
6 (6.00%) high severe

1000 streams of 1 bytes/multistream: 💔 Performance has regressed.
       time:   [24.129 ms 24.154 ms 24.180 ms]
       change: [+2.0800% +2.2246% +2.3872%] (p = 0.00 < 0.05)

Found 69 outliers among 500 measurements (13.80%)
65 (13.00%) high mild
4 (0.80%) high severe

1000 streams of 1000 bytes/multistream: Change within noise threshold.
       time:   [140.51 ms 140.55 ms 140.58 ms]
       change: [+0.0462% +0.0825% +0.1196%] (p = 0.00 < 0.05)

Found 13 outliers among 500 measurements (2.60%)
13 (2.60%) high mild

coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [94.665 ns 94.974 ns 95.290 ns]
       change: [-0.6382% -0.1315% +0.4650%] (p = 0.64 > 0.05)

Found 11 outliers among 100 measurements (11.00%)
9 (9.00%) high mild
2 (2.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
       time:   [112.70 ns 113.01 ns 113.34 ns]
       change: [-0.0477% +0.3468% +0.7348%] (p = 0.09 > 0.05)

Found 15 outliers among 100 measurements (15.00%)
1 (1.00%) low mild
6 (6.00%) high mild
8 (8.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [112.13 ns 112.65 ns 113.24 ns]
       change: [-0.3830% +0.1121% +0.6339%] (p = 0.67 > 0.05)

Found 17 outliers among 100 measurements (17.00%)
4 (4.00%) low severe
3 (3.00%) low mild
2 (2.00%) high mild
8 (8.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [93.091 ns 95.261 ns 99.710 ns]
       change: [-1.0510% +4.7836% +15.533%] (p = 0.51 > 0.05)

Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) high mild
4 (4.00%) high severe

RxStreamOrderer::inbound_frame(): Change within noise threshold.
       time:   [117.44 ms 117.50 ms 117.56 ms]
       change: [+0.5315% +0.5961% +0.6604%] (p = 0.00 < 0.05)

Found 16 outliers among 100 measurements (16.00%)
1 (1.00%) low severe
6 (6.00%) low mild
5 (5.00%) high mild
4 (4.00%) high severe

SentPackets::take_ranges: No change in performance detected.
       time:   [8.2525 µs 8.4959 µs 8.7178 µs]
       change: [-4.0732% -0.5984% +3.1432%] (p = 0.75 > 0.05)

Found 20 outliers among 100 measurements (20.00%)
9 (9.00%) low severe
9 (9.00%) low mild
2 (2.00%) high mild

transfer/pacing-false/varying-seeds: Change within noise threshold.
       time:   [35.206 ms 35.274 ms 35.342 ms]
       change: [-2.4515% -2.1790% -1.9193%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) low mild

transfer/pacing-true/varying-seeds: Change within noise threshold.
       time:   [36.298 ms 36.401 ms 36.504 ms]
       change: [-2.4249% -2.0121% -1.6339%] (p = 0.00 < 0.05)
transfer/pacing-false/same-seed: Change within noise threshold.
       time:   [35.134 ms 35.181 ms 35.229 ms]
       change: [-2.2420% -2.0430% -1.8495%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

transfer/pacing-true/same-seed: Change within noise threshold.
       time:   [36.650 ms 36.714 ms 36.778 ms]
       change: [-2.9347% -2.7334% -2.4915%] (p = 0.00 < 0.05)
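
The `time` and `thrpt` change lines in the benchmarks above are two views of the same measurement. For a fixed amount of work, throughput is inversely proportional to elapsed time, so a relative time change t corresponds to a throughput change of 1 / (1 + t) - 1. The helper below is a hypothetical illustration of that conversion, not part of the benchmark tooling.

```rust
/// Convert a relative change in elapsed time (for example -0.68844 for
/// -68.844%) into the relative change in throughput for the same work.
fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

/// Format a ratio as a signed percentage for display.
fn as_percent(ratio: f64) -> String {
    format!("{:+.2}%", ratio * 100.0)
}
```

Applied to the Download benchmark, a time change of -68.844% maps to a throughput change of roughly +221%, in line with the reported +220.96%. Because the relation is inverse, the bounds swap: the upper time bound (-67.628%) yields the lower throughput bound (+208.91%).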

Client/server transfer results

Performance differences relative to 06c007e.

Transfer of 33554432 bytes over loopback, 30 runs. All unit-less numbers are in milliseconds.

| Client | Server | CC | Pacing | Mean ± σ | Min | Max | MiB/s ± σ | Δ main (ms) | Δ main (%) |
|:---|:---|:---|:---|---:|---:|---:|---:|---:|---:|
| neqo | neqo | reno | on | 225.5 ± 71.0 | 175.3 | 412.3 | 141.9 ± 0.5 | 💚 -132.3 | -37.0% |
| neqo | neqo | reno | | 283.1 ± 223.1 | 176.8 | 1163.2 | 113.0 ± 0.1 | 💚 -104.5 | -27.0% |
| neqo | neqo | cubic | on | 211.7 ± 51.1 | 178.9 | 400.0 | 151.2 ± 0.6 | 💚 -131.9 | -38.4% |
| neqo | neqo | cubic | | 210.4 ± 56.5 | 175.7 | 449.6 | 152.1 ± 0.6 | 💚 -134.1 | -38.9% |
| google | neqo | reno | on | 726.8 ± 120.8 | 441.1 | 988.5 | 44.0 ± 0.3 | -55.9 | -7.1% |
| google | neqo | reno | | 713.5 ± 120.5 | 448.1 | 959.5 | 44.8 ± 0.3 | -51.9 | -6.8% |
| google | neqo | cubic | on | 717.0 ± 115.3 | 487.2 | 970.1 | 44.6 ± 0.3 | -46.5 | -6.1% |
| google | neqo | cubic | | 710.6 ± 108.4 | 469.5 | 934.9 | 45.0 ± 0.3 | 💚 -55.0 | -7.2% |
| google | google | | | 591.7 ± 69.3 | 547.2 | 864.4 | 54.1 ± 0.5 | 19.4 | 3.4% |
| neqo | msquic | reno | on | 271.7 ± 43.8 | 241.8 | 445.1 | 117.8 ± 0.7 | -3.9 | -1.4% |
| neqo | msquic | reno | | 272.8 ± 44.0 | 239.7 | 438.0 | 117.3 ± 0.7 | 5.7 | 2.1% |
| neqo | msquic | cubic | on | 265.5 ± 36.0 | 240.0 | 446.4 | 120.5 ± 0.9 | -3.7 | -1.4% |
| neqo | msquic | cubic | | 272.1 ± 45.3 | 243.0 | 474.1 | 117.6 ± 0.7 | 0.2 | 0.1% |
| msquic | msquic | | | 186.1 ± 26.4 | 158.9 | 291.3 | 172.0 ± 1.2 | -10.1 | -5.2% |
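
The derived columns in the results above can be checked from the mean transfer time alone. The sketch below assumes that the MiB/s figure is the 33554432-byte transfer divided by the mean time, and that the relative Δ main is the absolute difference divided by the baseline mean (mean minus Δ). Both assumptions reproduce the reported numbers, but the function names are hypothetical and not taken from the benchmark scripts.

```rust
/// Transfer size stated above the results: 33554432 bytes (32 MiB).
const TRANSFER_BYTES: f64 = 33_554_432.0;

/// Throughput in MiB/s for a transfer that took `mean_ms` milliseconds.
fn mib_per_sec(mean_ms: f64) -> f64 {
    let mib = TRANSFER_BYTES / (1024.0 * 1024.0);
    mib / (mean_ms / 1000.0)
}

/// Relative change against main, given the current mean and the absolute
/// difference to main (both in milliseconds).
fn relative_change(mean_ms: f64, delta_ms: f64) -> f64 {
    let baseline_ms = mean_ms - delta_ms; // assumed mean on main
    delta_ms / baseline_ms
}
```

For the first row (neqo to neqo, reno, pacing on), a mean of 225.5 ms gives about 141.9 MiB/s, and a difference of -132.3 ms against an implied baseline of 357.8 ms gives about -37.0%.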

github-actions bot commented Mar 28, 2025

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to 06c007e.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

larseggert marked this pull request as ready for review March 28, 2025 15:23
mxinden (Member) left a comment

Early benchmarks look promising. That said, I am not sure whether we will see similar improvements when benchmarked through Firefox with connection latency and a bandwidth limit.

As discussed out-of-band, I would favor a more integrated implementation that moves all batching logic into neqo-transport::Connection. Connection can batch more efficiently, since it has access to all known information about the connection and can allocate all batchable datagrams at once. In addition, this would allow a single batching implementation that is then used by neqo-client, neqo-server, mozilla-central/http3server and, of course, Firefox.

For others, a past draft of the above-mentioned integrated implementation: f25b0b7

@larseggert what are the next steps? I would suggest applying the same non-integrated optimization to neqo_glue/src/lib.rs. You can easily use a custom neqo-* version through a mozilla-central/Cargo.toml override. We can then test Firefox upload speed either against a local HTTP3 server or with Andrew's upload automation (macOS), which gives more reproducible results over a real-world connection to Google's infrastructure instead of a localhost setup.

larseggert (Collaborator, Author) commented Mar 31, 2025

I have started to do a version of this in the glue code. It's a bit challenging because the current mainline of neqo has picked up a bunch of dependencies beyond those of Firefox, and I need to figure out how to upgrade those...

Am wondering if we should cut a neqo release soon before there is more divergence.

mxinden (Member) commented Mar 31, 2025

> Am wondering if we should cut a neqo release soon before there is more divergence.

I was planning to cut a new release once #2492 is merged. @larseggert I am happy to cut a new release beforehand if you like.

larseggert marked this pull request as draft April 14, 2025 05:12

larseggert (Collaborator, Author) commented

Making this a draft PR, since we decided to test this in the glue first.

mxinden mentioned this pull request Apr 18, 2025

larseggert (Collaborator, Author) commented

Closing here in favor of #2593.

larseggert closed this May 20, 2025
2 participants