Optimize read buffer compaction and reduce copying #294
Merged
zuiderkwast merged 4 commits into valkey-io:main on Mar 19, 2026
Conversation
Previously, sdsrange() was called in valkeyReaderGetReply() to compact
the read buffer after consuming >= 1024 bytes. With pipelining, GetReply
is called many times per Feed call (once per reply), causing repeated
memmove() operations on the remaining buffer data.
Move the compaction to valkeyReaderFeed(), right before appending new
data. This ensures the memmove happens at most once per network read
instead of once per parsed reply.
Profiling
---------
Profiled using valkey-benchmark on macOS (Apple M4 Pro) with:
valkey-benchmark -P 32 -d 1024 -r 1000000 -n 3000000 -c 50 -t set,get
Server ran with --io-threads 4 to avoid being server-bottlenecked.
Flamegraph sampling (macOS `sample` tool, 10s at 1ms intervals) of the
valkey-benchmark process shows the following top userspace hotspots:
Before (top samples in valkey-benchmark):
250 valkeyReaderGetReply > sdsrange > _platform_memmove ← #1 hotspot
246 __recvfrom (kernel)
188 __sendto (kernel)
38 valkeyReaderFeed > sdscatlen > _platform_memmove
25 createStringObject > _platform_memmove
24 sdscatlen > _platform_memmove (write path)
After:
0 valkeyReaderGetReply > sdsrange ← eliminated
266 __recvfrom (kernel)
210 __sendto (kernel)
43 valkeyReaderFeed > sdscatlen > _platform_memmove
28 sdscatlen > _platform_memmove (write path)
27 createStringObject > _platform_memmove
The sdsrange memmove in valkeyReaderGetReply, previously the #1
userspace hotspot (250 samples), is completely eliminated. The
compaction cost is absorbed into valkeyReaderFeed with negligible
overhead increase (38 -> 43 samples).
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
valkeyBufferRead() previously read network data into a 16KB stack buffer,
then valkeyReaderFeed() copied it into the reader's sds buffer via
sdscatlen(). This memcpy showed up as a hotspot in flamegraph profiling
of pipelined workloads.
Add valkeyReaderGetReadBuf() and valkeyReaderCommitRead() to the public
reader API. valkeyReaderGetReadBuf() compacts consumed data, ensures
sufficient writable space, and returns a pointer directly into the
reader's internal buffer. The caller can then recv() into it and call
valkeyReaderCommitRead() to advance the buffer length.
valkeyBufferRead() now uses this API for the standard TCP read path,
eliminating the intermediate stack buffer and the memcpy entirely.
The existing valkeyReaderFeed() API is unchanged for external users.
Profiling (valkey-benchmark -P 32 -d 1024 -r 1000000, Apple M4 Pro):
Before: 43 samples in valkeyReaderFeed > sdscatlen > memmove
After: 0 samples (eliminated)
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Force-pushed from 78e3930 to 270f569.
zuiderkwast (Collaborator, Author):
I added another optimization: direct read into the allocated reply buffer instead of via a static buffer.
The sdscatlen copy in the read path (previously 38-43 samples) is now zero: data goes directly from the kernel into the reader's sds buffer. The two optimizations combined eliminate the top two userspace hotspots in the read path. What remains is dominated by syscalls.
bjosv reviewed Mar 17, 2026
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Replace hardcoded allocation counts in the sync OOM tests with runtime discovery loops. Each test section now iterates from 0 allocations upward until the operation succeeds, removing the need to manually find correct counts after any change to internal allocation patterns.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
michael-grunder approved these changes Mar 18, 2026
bjosv approved these changes Mar 18, 2026
Do compaction of the read buffer (memmove) before feeding more data to the reader, instead of after each parsed command.
Previously, sdsrange() was called in valkeyReaderGetReply() to compact the read buffer after consuming >= 1024 bytes. With pipelining, GetReply is called many times per Feed call (once per reply), causing repeated memmove() operations on the remaining buffer data.
Move the compaction to valkeyReaderFeed(), right before appending new data. This ensures the memmove happens at most once per network read instead of once per parsed reply.
Additionally, make space in the allocated read buffer before the read() call and read directly into the allocated reply buffer instead of copying it via a static buffer. A public API for this is added (valkeyReaderGetReadBuf + valkeyReaderCommitRead).
Profiling
Profiled using valkey-benchmark on macOS (Apple M4 Pro) with:
valkey-benchmark -P 32 -d 1024 -r 1000000 -n 3000000 -c 50 -t set,get
Server ran with --io-threads 4 to avoid being server-bottlenecked.
Flamegraph sampling (macOS `sample` tool, 10s at 1ms intervals) of the valkey-benchmark process shows the top userspace hotspots. After the first change, the sdsrange memmove (250 samples) is eliminated and the compaction cost is absorbed into valkeyReaderFeed with negligible overhead increase (38 -> 43 samples). After the second change, the sdscatlen -> memmove is also eliminated.