Generate payload for multiple datagrams at once #609
base: master
Conversation
Force-pushed from 0148c52 to 423532e.
Performance analysis for using memmove: a tiny benchmark on Zen 3 (Ryzen 7 5700G) tells us how many clocks are needed for each type of copy size and method.

Assume we are building 4 datagrams at once. Interpreted naively, these numbers suggest a noticeable copying overhead for using `memmove`. However, if we convert them to per-byte overhead, the difference is 0.026 clocks / byte, which is pretty small, if not negligible. To paraphrase, the difference is small, and there are unknowns that cause hesitation to change the API.

note a: To emulate the use case, we measured the throughput of `memmove` doing backward copies with tiny distances between the destination and the source addresses.
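As an illustration of the methodology in note a, here is a minimal sketch of such a micro-benchmark. The buffer size, copy distance, iteration count, and timing method are assumptions, not the actual harness used for the numbers above:

```c
/* Sketch of a memmove micro-benchmark: backward copies (dst below src) with a
 * tiny distance between destination and source, as described in note a.
 * COPY_SIZE, DISTANCE, and ITERATIONS are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define COPY_SIZE 1200   /* roughly one QUIC datagram payload (assumption) */
#define DISTANCE 16      /* tiny gap between source and destination (assumption) */
#define ITERATIONS 1000000L

int main(void)
{
    static uint8_t buf[COPY_SIZE + DISTANCE];
    struct timespec start, end;

    memset(buf, 0xa5, sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < ITERATIONS; ++i) {
        /* backward copy: destination is DISTANCE bytes below the source */
        memmove(buf, buf + DISTANCE, COPY_SIZE);
        __asm__ volatile("" ::: "memory"); /* keep the copy from being elided */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.3f ns / byte (%.1f ns / copy)\n",
           ns / ((double)ITERATIONS * COPY_SIZE), ns / ITERATIONS);
    return 0;
}
```

Dividing the per-copy time by the copy size yields a per-byte figure comparable in kind to the clocks / byte numbers cited above (converting nanoseconds to clocks requires the measured core's frequency).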
…frame space can be allocated speculatively
FWIW, we did try this in the kazuho/scatter-stream2 branch; however, it turned out to be slower, most likely due to the added overhead.
```c
dst += hp - header;
len = s->dst_end - dst;
/* if sending in 1-RTT, generate stream payload past the end of current datagram and move them to build datagrams */
if (get_epoch(s->current.first_byte) == QUICLY_EPOCH_1RTT) {
```
Do we want a per-stream flag to toggle on/off this behavior?
When the application has its send data buffered in memory, it might make more sense to generate the datagrams one by one.
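A minimal sketch of what such a toggle could look like, assuming a hypothetical options struct; none of these names exist in quicly:

```c
/* Hypothetical per-stream toggle (the field, struct, and wiring are
 * assumptions, not part of quicly). A stream whose send data is already
 * buffered in memory could opt out of multi-datagram payload generation. */
typedef struct st_stream_send_opts_t {
    unsigned emit_multiple_datagrams : 1; /* 1: generate payload for multiple
                                             datagrams at once (this PR's
                                             behavior); 0: one at a time */
} stream_send_opts_t;

/* the condition shown in the diff above might then become something like:
 *   if (get_epoch(s->current.first_byte) == QUICLY_EPOCH_1RTT &&
 *       opts->emit_multiple_datagrams) { ... }
 */
```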
Up until now, the `on_send_emit` callback has been invoked once for each STREAM frame being built. This has become a bottleneck, due to two reasons, one of them being that `pread` is issued for each call to `on_send_emit`.

To mitigate the issue, this PR refactors the `quicly_send_stream` function to generate STREAM frames for as many as 10 packets at once. This PR calls the `on_send_emit` callback that already exists, and scatters the data being read by calling `memmove`.

There are two alternatives that we might consider, one of them being to let the callback fill a vector of buffers (i.e., `readv`) that matches the payload sections of the multiple STREAM frames being generated. It might turn out that we'd want to try these alternatives, but they require changes to the API. Therefore, as the first cut, we are trying the approach using `memmove`; a sketch of that scatter follows.
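Below is an illustrative sketch, not quicly's actual code, of what a `memmove`-based scatter can look like: payload for several datagrams is first emitted contiguously (standing in for one bulk `on_send_emit` call), then each chunk is shifted into its final per-datagram position so that headers can be written in front of it. All names and sizes here are assumptions.

```c
/* Illustrative memmove-based scatter (not quicly's actual layout or code).
 * On entry, buf holds num_datagrams * PAYLOAD_SIZE payload bytes contiguously
 * at buf + HEADER_SIZE; buf must have room for
 * num_datagrams * (HEADER_SIZE + PAYLOAD_SIZE) bytes. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HEADER_SIZE 32   /* space reserved per datagram for headers (assumption) */
#define PAYLOAD_SIZE 1200

static void scatter_payload(uint8_t *buf, size_t num_datagrams)
{
    if (num_datagrams < 2)
        return; /* chunk 0 is already in place at buf + HEADER_SIZE */

    /* Move later chunks into position, iterating from the last chunk so that
     * no source bytes are overwritten before they are copied; memmove also
     * tolerates the overlap within each individual copy. */
    for (size_t i = num_datagrams - 1; i > 0; --i) {
        uint8_t *src = buf + HEADER_SIZE + i * PAYLOAD_SIZE;
        uint8_t *dst = buf + i * (HEADER_SIZE + PAYLOAD_SIZE) + HEADER_SIZE;
        memmove(dst, src, PAYLOAD_SIZE);
    }
    /* each datagram now occupies HEADER_SIZE + PAYLOAD_SIZE bytes, with room
     * in front of every payload chunk for packet and frame headers */
}
```

The per-copy distances stay small (a multiple of the reserved header space), which is the regime the `memmove` benchmark above was designed to emulate.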