@kazuho kazuho commented Jun 16, 2025

Up until now, the on_send_emit callback has been invoked once for each STREAM frame being built. This has become a bottleneck, for two reasons:

  • Applications might have a high fixed cost for generating each payload. For example, they might be calling pread for each invocation of on_send_emit.
  • Running the accounting and prioritization logic for each packet being built is also expensive.

To mitigate this issue, this PR refactors the quicly_send_stream function to generate STREAM frames for as many as 10 packets at once.

This PR keeps using the on_send_emit callback that already exists, and scatters the data being read by calling memmove.

There are two alternatives that we might consider:

  • Introduce a new callback that reads the payload into a vector of buffers (i.e., like readv) matching the payload sections of the multiple STREAM frames being generated.
  • Let the application provide a pointer to a contiguous temporary buffer that holds the data to be sent, and scatter that.

It might turn out that we want to try these alternatives, but they require changes to the API. Therefore, as a first cut, we are trying the memmove-based approach.
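To make the intended data flow concrete, below is a minimal, self-contained sketch of the read-once-then-scatter idea. It is not the actual quicly code: scatter_stream_payload, emit_fn, and the offsets in main are made up for illustration, and it assumes the packet headers around each payload slot are filled in afterwards.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* stand-in for the application's on_send_emit callback (hypothetical type) */
typedef void (*emit_fn)(uint8_t *dst, size_t len);

/* Fill a contiguous region with one callback invocation, then scatter it into
 * the payload slots of the individual packets. Moving the last frame first
 * guarantees that no source bytes are overwritten before being read, because
 * every destination lies at or above its source. */
static void scatter_stream_payload(uint8_t *buf, emit_fn emit, const size_t *payload_off,
                                   const size_t *payload_len, size_t num_packets)
{
    size_t total = 0;
    for (size_t i = 0; i < num_packets; ++i)
        total += payload_len[i];

    /* 1. one read into a contiguous region starting at the first slot */
    emit(buf + payload_off[0], total);

    /* 2. scatter the tail of the region forward into the later slots */
    size_t src = payload_off[0] + total;
    for (size_t i = num_packets; i-- > 1;) {
        src -= payload_len[i];
        memmove(buf + payload_off[i], buf + src, payload_len[i]);
    }
}

/* toy callback that fills the region with consecutive byte values */
static void fill_bytes(uint8_t *dst, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        dst[i] = (uint8_t)i;
}

int main(void)
{
    uint8_t buf[256] = {0};
    /* three toy packets: payload slots at offsets 10, 60, 110, 40 bytes each,
     * with the gaps in between reserved for headers */
    size_t off[] = {10, 60, 110}, len[] = {40, 40, 40};
    scatter_stream_payload(buf, fill_bytes, off, len, 3);
    printf("first byte of 3rd payload: %u (expect 80)\n", (unsigned)buf[110]);
    return 0;
}
```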

@kazuho kazuho force-pushed the kazuho/scatter-stream branch from 0148c52 to 423532e on June 16, 2025 07:10
kazuho commented Dec 4, 2025

Performance analysis for using memmove:

A tiny benchmark on Zen 3 (Ryzen 7 5700G) tells us that each combination of copy size and method needs the following number of clocks:

| copy | clocks |
| --- | --- |
| 1400B * 4, rep movsb | 294 |
| 1400B, rep movsb | 92 |
| 1400B, memmove [a] | 74 |

Assume we are building 4 datagrams at once. If we interpret these numbers naively, the copying overhead of using read and memmove is 516 clocks combined (294 clocks for one contiguous 4 * 1400B copy, plus 74 * 3 clocks for moving three of the four payloads into place), while that of readv is 368 clocks (92 * 4).

However, if we convert these numbers to per-byte overhead, the difference is (516 - 368) / (4 * 1400) ≈ 0.026 clock / byte, which is pretty small, if not negligible.

Also, rep movsb, the instruction sequence used by the Linux kernel for read and readv, has a performance issue that is not visible in this benchmark: it becomes 30x slower if the source and destination are on different pages but the delta between their in-page offsets is below 32 bytes [b]; see https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515.

To paraphrase, the difference is small, and there are unknowns that cause hesitation to change the API.

note a: To emulate the use case, we measured the throughput of memmove doing backward copies with tiny distances between the destination and the source addresses.
note b: The bug report does not clarify the maximum delta for which the slowdown is observed, but my benchmarks show that it is when the delta is below 32 bytes.
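For reference, the kind of microbenchmark behind note a could be sketched roughly as follows. This is a hypothetical reconstruction, not the benchmark actually used: it assumes x86-64 with __rdtsc, a fixed 1400-byte copy, and a small delta between source and destination, and it omits the warm-up, pinning, and serialization that a careful measurement would add.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h> /* __rdtsc */

#define COPY_LEN 1400 /* packet-sized copy, as in the table above */
#define DELTA 64      /* tiny distance; dst just above src forces a backward copy */
#define ITERS 1000000

int main(void)
{
    static uint8_t buf[COPY_LEN + DELTA];
    uint8_t *src = buf, *dst = buf + DELTA;

    memset(buf, 0xa5, sizeof(buf));

    uint64_t start = __rdtsc();
    for (long i = 0; i < ITERS; ++i) {
        memmove(dst, src, COPY_LEN);
        __asm__ volatile("" ::: "memory"); /* keep the compiler from hoisting the copy */
    }
    uint64_t clocks = __rdtsc() - start;

    printf("%.1f clocks per %d-byte memmove\n", (double)clocks / ITERS, COPY_LEN);
    return 0;
}
```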

kazuho commented Jan 14, 2026

> Introduce a new callback that reads the payload into a vector of buffers (i.e., like readv) matching the payload sections of the multiple STREAM frames being generated.

FWIW, we did try this in the kazuho/scatter-stream2 branch; however, it turned out to be slower, most likely because the overhead of readv doing scattered reads exceeds the cost of quicly memmove-ing the payload.
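For context, the vector-of-buffers callback tried there could look roughly like the sketch below. The names and signature are hypothetical, written only to illustrate the shape of the API; the actual code in the kazuho/scatter-stream2 branch may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* one entry per STREAM frame payload being built */
typedef struct {
    uint8_t *base;
    size_t len;
} quicly_payload_vec_t; /* hypothetical name */

/* hypothetical readv-style variant of on_send_emit: invoked once per batch,
 * the application fills every vector entry, e.g. with a single preadv(2),
 * instead of being called once per STREAM frame */
typedef void (*quicly_on_send_emit_vec_cb)(void *stream, uint64_t off,
                                           quicly_payload_vec_t *vecs, size_t num_vecs);
```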

```c
dst += hp - header;
len = s->dst_end - dst;
/* if sending in 1-RTT, generate stream payload past the end of the current datagram, then move it when building subsequent datagrams */
if (get_epoch(s->current.first_byte) == QUICLY_EPOCH_1RTT) {
```

Do we want a per-stream flag to toggle this behavior on and off?

When the application has its send data buffered in memory, it might make more sense to generate the datagrams one by one.
