@kazuho kazuho commented Jun 16, 2025

Up until now, the on_send_emit callback has been invoked once for each STREAM frame being built. This has become a bottleneck, for two reasons:

  • Applications might have a high fixed cost for generating each payload. For example, they might be calling pread for each invocation of on_send_emit.
  • Running the accounting and prioritization logic for each packet being built is also expensive.

To mitigate this issue, this PR refactors the quicly_send_stream function to generate STREAM frames for as many as 10 packets at once.

This PR keeps using the on_send_emit callback that already exists, and scatters the data being read by calling memmove.

There are two alternatives that we might consider:

  • Introduce a new callback that reads the payload into a vector of buffers (i.e., like readv) matching the payload sections of the multiple STREAM frames being generated.
  • Let the application provide a pointer to a contiguous temporary buffer that holds the data to be sent, and scatter that.

It might turn out that we want to try these alternatives, but they require changes to the API. Therefore, as a first cut, we are trying the memmove-based approach.
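To make the intended data flow concrete, below is a minimal, self-contained sketch of the read-once-then-scatter idea. It is not the actual quicly code: scatter_stream_payload, emit_fn, and the offsets in main are made up for illustration, and it assumes the packet headers around each payload slot are filled in afterwards.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* stand-in for the application's on_send_emit callback (hypothetical type) */
typedef void (*emit_fn)(uint8_t *dst, size_t len);

/* Fill a contiguous region with one callback invocation, then scatter it into
 * the payload slots of the individual packets. Moving the last frame first
 * guarantees that no source bytes are overwritten before being read, because
 * every destination lies at or above its source. */
static void scatter_stream_payload(uint8_t *buf, emit_fn emit, const size_t *payload_off,
                                   const size_t *payload_len, size_t num_packets)
{
    size_t total = 0;
    for (size_t i = 0; i < num_packets; ++i)
        total += payload_len[i];

    /* 1. one read into a contiguous region starting at the first slot */
    emit(buf + payload_off[0], total);

    /* 2. scatter the tail of the region forward into the later slots */
    size_t src = payload_off[0] + total;
    for (size_t i = num_packets; i-- > 1;) {
        src -= payload_len[i];
        memmove(buf + payload_off[i], buf + src, payload_len[i]);
    }
}

/* toy callback that fills the region with consecutive byte values */
static void fill_bytes(uint8_t *dst, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        dst[i] = (uint8_t)i;
}

int main(void)
{
    uint8_t buf[256] = {0};
    /* three toy packets: payload slots at offsets 10, 60, 110, 40 bytes each,
     * with the gaps in between reserved for headers */
    size_t off[] = {10, 60, 110}, len[] = {40, 40, 40};
    scatter_stream_payload(buf, fill_bytes, off, len, 3);
    printf("first byte of 3rd payload: %u (expect 80)\n", (unsigned)buf[110]);
    return 0;
}
```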

@kazuho kazuho force-pushed the kazuho/scatter-stream branch from 0148c52 to 423532e on June 16, 2025 07:10
kazuho commented Dec 4, 2025

Performance analysis for using memmove:

A tiny benchmark on Zen 3 (Ryzen 7 5700G) tells us that each combination of copy size and method needs the following number of clocks:

| copy | clocks |
| --- | --- |
| 1400B * 4, rep movsb | 294 |
| 1400B, rep movsb | 92 |
| 1400B, memmove [a] | 74 |

Assume we are building 4 datagrams at once. If we interpret these numbers naively, the copying overhead of using read and memmove is 516 clocks combined (294 clocks for one contiguous 4 * 1400B copy, plus 74 * 3 clocks for moving three of the four payloads into place), while that of readv is 368 clocks (92 * 4).

However, if we convert these numbers to per-byte overhead, the difference is (516 - 368) / (4 * 1400) ≈ 0.026 clock / byte, which is pretty small, if not negligible.

Also, rep movsb, the instruction sequence used by the Linux kernel for read and readv, has a performance issue that is not visible in this benchmark: it becomes 30x slower if the source and destination are on different pages but the delta between their in-page offsets is below 32 bytes [b]; see https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515.

To paraphrase, the difference is small, and there are unknowns that cause hesitation to change the API.

note a: To emulate the use case, we measured the throughput of memmove doing backward copies with tiny distances between the destination and the source addresses.
note b: The bug report does not clarify the maximum delta for which the slowdown is observed, but my benchmarks show that it is when the delta is below 32 bytes.
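For reference, the kind of microbenchmark behind note a could be sketched roughly as follows. This is a hypothetical reconstruction, not the benchmark actually used: it assumes x86-64 with __rdtsc, a fixed 1400-byte copy, and a small delta between source and destination, and it omits the warm-up, pinning, and serialization that a careful measurement would add.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h> /* __rdtsc */

#define COPY_LEN 1400 /* packet-sized copy, as in the table above */
#define DELTA 64      /* tiny distance; dst just above src forces a backward copy */
#define ITERS 1000000

int main(void)
{
    static uint8_t buf[COPY_LEN + DELTA];
    uint8_t *src = buf, *dst = buf + DELTA;

    memset(buf, 0xa5, sizeof(buf));

    uint64_t start = __rdtsc();
    for (long i = 0; i < ITERS; ++i) {
        memmove(dst, src, COPY_LEN);
        __asm__ volatile("" ::: "memory"); /* keep the compiler from hoisting the copy */
    }
    uint64_t clocks = __rdtsc() - start;

    printf("%.1f clocks per %d-byte memmove\n", (double)clocks / ITERS, COPY_LEN);
    return 0;
}
```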

kazuho commented Jan 14, 2026

> Introduce a new callback that reads the payload into a vector of buffers (i.e., like readv) matching the payload sections of the multiple STREAM frames being generated.

FWIW, we did try this in the kazuho/scatter-stream2 branch; however, it turned out to be slower, most likely because the overhead of readv doing scattered reads exceeds the cost of quicly memmove-ing the payload.
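For context, the vector-of-buffers callback tried there could look roughly like the sketch below. The names and signature are hypothetical, written only to illustrate the shape of the API; the actual code in the kazuho/scatter-stream2 branch may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* one entry per STREAM frame payload being built */
typedef struct {
    uint8_t *base;
    size_t len;
} quicly_payload_vec_t; /* hypothetical name */

/* hypothetical readv-style variant of on_send_emit: invoked once per batch,
 * the application fills every vector entry, e.g. with a single preadv(2),
 * instead of being called once per STREAM frame */
typedef void (*quicly_on_send_emit_vec_cb)(void *stream, uint64_t off,
                                           quicly_payload_vec_t *vecs, size_t num_vecs);
```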

```c
dst += hp - header;
len = s->dst_end - dst;
/* if sending in 1-RTT, generate stream payload past the end of the current datagram, then move it when building subsequent datagrams */
if (get_epoch(s->current.first_byte) == QUICLY_EPOCH_1RTT) {
```

Do we want a per-stream flag to toggle this behavior on and off?

When the application has its send data buffered in memory, it might make more sense to generate the datagrams one by one.
