Optimize read buffer compaction and reduce copying#294

Merged
zuiderkwast merged 4 commits into valkey-io:main from zuiderkwast:optimize-sdsrange-calls
Mar 19, 2026

Conversation

@zuiderkwast
Collaborator

@zuiderkwast zuiderkwast commented Mar 17, 2026

Do compaction of the read buffer (memmove) before feeding more data to the reader, instead of after each parsed command.

Previously, sdsrange() was called in valkeyReaderGetReply() to compact the read buffer after consuming >= 1024 bytes. With pipelining, GetReply is called many times per Feed call (once per reply), causing repeated memmove() operations on the remaining buffer data.

Move the compaction to valkeyReaderFeed(), right before appending new data. This ensures the memmove happens at most once per network read instead of once per parsed reply.
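The idea can be illustrated with a minimal, self-contained sketch (the struct and function names here are illustrative stand-ins, not libvalkey's actual types): compaction of the consumed prefix happens once, at feed time, before new data is appended.

```c
#include <stdlib.h>
#include <string.h>

/* Toy reader: growable buffer `buf`, parse position `pos`, length `len`. */
typedef struct {
    char *buf;
    size_t pos;   /* bytes already consumed by the parser */
    size_t len;   /* total bytes held in buf */
    size_t cap;   /* allocated capacity */
} reader;

/* Compaction now happens at most once per feed (i.e. once per network
 * read), instead of after every parsed reply. */
void reader_feed(reader *r, const char *data, size_t n) {
    if (r->pos > 0) {
        /* single memmove per feed, no matter how many replies were parsed */
        memmove(r->buf, r->buf + r->pos, r->len - r->pos);
        r->len -= r->pos;
        r->pos = 0;
    }
    if (r->len + n > r->cap) {
        r->cap = (r->len + n) * 2;
        r->buf = realloc(r->buf, r->cap);
    }
    memcpy(r->buf + r->len, data, n);
    r->len += n;
}
```

With pipelining, the parser may advance `pos` many times between feeds, but only the next `reader_feed()` pays for one memmove.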

Additionally, make space in the reader's buffer before the read() call and read directly into it, instead of copying the data in via an intermediate stack buffer. A public API for this is added (valkeyReaderGetReadBuf + valkeyReaderCommitRead).

Profiling

Profiled using valkey-benchmark on macOS (Apple M4 Pro) with:

valkey-benchmark -P 32 -d 1024 -r 1000000 -n 3000000 -c 50 -t set,get

Server ran with --io-threads 4 to avoid being server-bottlenecked.

Flamegraph sampling (macOS sample tool, 10s at 1ms intervals) of the valkey-benchmark process shows the following top userspace hotspots:

| Read path hotspot | Original | After compaction fix | After direct read |
| --- | --- | --- | --- |
| sdsrange → memmove (GetReply) | 250 | 0 ← eliminated | 0 |
| valkeyReaderFeed → sdscatlen → memmove | 38 | 43 | 0 ← eliminated |
| __recvfrom (kernel) | 246 | 266 | 295 |
| createStringObject → memmove | 25 | 27 | 28 |

After the first change, the sdsrange memmove (250 samples) is eliminated and the compaction cost is absorbed into valkeyReaderFeed with a negligible overhead increase (38 → 43 samples). After the second change, the sdscatlen → memmove copy is also eliminated.

@zuiderkwast zuiderkwast requested review from bjosv and michael-grunder and removed request for michael-grunder March 17, 2026 01:14

Previously, sdsrange() was called in valkeyReaderGetReply() to compact
the read buffer after consuming >= 1024 bytes. With pipelining, GetReply
is called many times per Feed call (once per reply), causing repeated
memmove() operations on the remaining buffer data.

Move the compaction to valkeyReaderFeed(), right before appending new
data. This ensures the memmove happens at most once per network read
instead of once per parsed reply.

Profiling
---------

Profiled using valkey-benchmark on macOS (Apple M4 Pro) with:

    valkey-benchmark -P 32 -d 1024 -r 1000000 -n 3000000 -c 50 -t set,get

Server ran with --io-threads 4 to avoid being server-bottlenecked.

Flamegraph sampling (macOS `sample` tool, 10s at 1ms intervals) of the
valkey-benchmark process shows the following top userspace hotspots:

Before (top samples in valkey-benchmark):

    250  valkeyReaderGetReply > sdsrange > _platform_memmove  ← #1 hotspot
    246  __recvfrom (kernel)
    188  __sendto (kernel)
     38  valkeyReaderFeed > sdscatlen > _platform_memmove
     25  createStringObject > _platform_memmove
     24  sdscatlen > _platform_memmove (write path)

After:

      0  valkeyReaderGetReply > sdsrange  ← eliminated
    266  __recvfrom (kernel)
    210  __sendto (kernel)
     43  valkeyReaderFeed > sdscatlen > _platform_memmove
     28  sdscatlen > _platform_memmove (write path)
     27  createStringObject > _platform_memmove

The sdsrange memmove in valkeyReaderGetReply, previously the #1
userspace hotspot (250 samples), is completely eliminated. The
compaction cost is absorbed into valkeyReaderFeed with negligible
overhead increase (38 -> 43 samples).

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

valkeyBufferRead() previously read network data into a 16KB stack buffer,
then valkeyReaderFeed() copied it into the reader's sds buffer via
sdscatlen(). This memcpy showed up as a hotspot in flamegraph profiling
of pipelined workloads.

Add valkeyReaderGetReadBuf() and valkeyReaderCommitRead() to the public
reader API. valkeyReaderGetReadBuf() compacts consumed data, ensures
sufficient writable space, and returns a pointer directly into the
reader's internal buffer. The caller can then recv() into it and call
valkeyReaderCommitRead() to advance the buffer length.

valkeyBufferRead() now uses this API for the standard TCP read path,
eliminating the intermediate stack buffer and the memcpy entirely.
The existing valkeyReaderFeed() API is unchanged for external users.
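
A toy model of this two-call pattern (the real valkeyReaderGetReadBuf/valkeyReaderCommitRead signatures may differ; the types and names below are illustrative): one function compacts and returns a writable pointer into the reader's own buffer, the other commits the number of bytes the recv() actually wrote.

```c
#include <stdlib.h>
#include <string.h>

/* Toy reader: growable buffer, parse position, length, capacity. */
typedef struct {
    char *buf;
    size_t pos, len, cap;
} reader;

/* Compact consumed data, ensure `want` writable bytes, and return a
 * pointer into the reader's internal buffer for recv() to fill. */
char *reader_get_read_buf(reader *r, size_t want) {
    if (r->pos > 0) {                       /* compact once, up front */
        memmove(r->buf, r->buf + r->pos, r->len - r->pos);
        r->len -= r->pos;
        r->pos = 0;
    }
    if (r->len + want > r->cap) {           /* ensure writable space */
        r->cap = r->len + want;
        r->buf = realloc(r->buf, r->cap);
    }
    return r->buf + r->len;                 /* recv() writes here */
}

/* After the recv(), advance the length; the data is already in place. */
void reader_commit_read(reader *r, size_t nread) {
    r->len += nread;
}
```

A caller would recv() directly into the returned pointer and commit the byte count, so the kernel copies data straight into the reader's buffer with no intermediate memcpy.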

Profiling (valkey-benchmark -P 32 -d 1024 -r 1000000, Apple M4 Pro):

    Before:  43 samples in valkeyReaderFeed > sdscatlen > memmove
    After:    0 samples (eliminated)

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

@zuiderkwast zuiderkwast force-pushed the optimize-sdsrange-calls branch from 78e3930 to 270f569 on March 17, 2026 01:53
@zuiderkwast
Collaborator Author

I added another optimization: reading directly into the reader's allocated buffer instead of going via an intermediate stack buffer.

| Read path hotspot | Original | After compaction fix | After direct read |
| --- | --- | --- | --- |
| sdsrange → memmove (GetReply) | 250 | 0 ← eliminated | 0 |
| valkeyReaderFeed → sdscatlen → memmove | 38 | 43 | 0 ← eliminated |
| __recvfrom (kernel) | 246 | 266 | 295 |
| createStringObject → memmove | 25 | 27 | 28 |

The sdscatlen copy in the read path (previously 38-43 samples) is now zero: data goes directly from the kernel into the reader's sds buffer. The kernel syscall samples increased proportionally, since recv() now accounts for a larger share of the remaining work.

The two optimizations combined eliminate the top two userspace hotspots in the read path. What remains is dominated by syscalls and the inherent cost of creating reply objects.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Replace hardcoded allocation counts in the sync OOM tests with runtime
discovery loops. Each test section now iterates from 0 allocations upward
until the operation succeeds, removing the need to manually find correct
counts after any change to internal allocation patterns.
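The discovery-loop pattern can be sketched as follows (a self-contained toy model; the allocator hook and function names are illustrative, not the actual test harness): allow N allocations, fail the (N+1)th, and raise N from zero until the operation under test succeeds.

```c
#include <stdlib.h>

/* Fault-injecting allocator: permit `allowed` allocations, fail the rest. */
static int allowed, used;

static void *try_malloc(size_t n) {
    if (used >= allowed)
        return NULL;    /* simulated OOM */
    used++;
    return malloc(n);
}

/* Example operation under test: needs exactly 3 allocations to succeed. */
static int op(void) {
    void *a = try_malloc(8), *b = try_malloc(8), *c = try_malloc(8);
    int ok = (a && b && c);
    free(a); free(b); free(c);
    return ok;
}

/* Iterate from 0 allocations upward until the operation succeeds,
 * discovering the required allocation count at runtime. */
int discover(void) {
    for (int n = 0;; n++) {
        allowed = n;
        used = 0;
        if (op())
            return n;   /* first count at which op() succeeds */
    }
}
```

Each iteration below the true count exercises a distinct OOM path, so the loop both finds the count and covers every partial-failure case along the way.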

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

@zuiderkwast zuiderkwast changed the title Move reader buffer compaction from GetReply to Feed Optimize read buffer compaction and reduce copying Mar 19, 2026
@zuiderkwast zuiderkwast merged commit b9285b5 into valkey-io:main Mar 19, 2026
48 checks passed
@zuiderkwast zuiderkwast deleted the optimize-sdsrange-calls branch March 20, 2026 12:51