Optimize read buffer compaction and reduce copying#294

Merged
zuiderkwast merged 4 commits into valkey-io:main from zuiderkwast:optimize-sdsrange-calls
Mar 19, 2026

Conversation

@zuiderkwast
Collaborator

@zuiderkwast zuiderkwast commented Mar 17, 2026

Do compaction of the read buffer (memmove) before feeding more data to the reader, instead of after each parsed command.

Previously, sdsrange() was called in valkeyReaderGetReply() to compact the read buffer after consuming >= 1024 bytes. With pipelining, GetReply is called many times per Feed call (once per reply), causing repeated memmove() operations on the remaining buffer data.

Move the compaction to valkeyReaderFeed(), right before appending new data. This ensures the memmove happens at most once per network read instead of once per parsed reply.
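The idea can be illustrated with a minimal, self-contained sketch (the struct and function names here are illustrative stand-ins, not libvalkey's actual types): compaction of the consumed prefix happens once, at feed time, before new data is appended.

```c
#include <stdlib.h>
#include <string.h>

/* Toy reader: growable buffer `buf`, parse position `pos`, length `len`. */
typedef struct {
    char *buf;
    size_t pos;   /* bytes already consumed by the parser */
    size_t len;   /* total bytes held in buf */
    size_t cap;   /* allocated capacity */
} reader;

/* Compaction now happens at most once per feed (i.e. once per network
 * read), instead of after every parsed reply. */
void reader_feed(reader *r, const char *data, size_t n) {
    if (r->pos > 0) {
        /* single memmove per feed, no matter how many replies were parsed */
        memmove(r->buf, r->buf + r->pos, r->len - r->pos);
        r->len -= r->pos;
        r->pos = 0;
    }
    if (r->len + n > r->cap) {
        r->cap = (r->len + n) * 2;
        r->buf = realloc(r->buf, r->cap);
    }
    memcpy(r->buf + r->len, data, n);
    r->len += n;
}
```

With pipelining, the parser may advance `pos` many times between feeds, but only the next `reader_feed()` pays for one memmove.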

Additionally, make space in the reader's buffer before the read() call and read directly into it, instead of copying the data in via an intermediate stack buffer. A public API for this is added (valkeyReaderGetReadBuf + valkeyReaderCommitRead).

Profiling

Profiled using valkey-benchmark on macOS (Apple M4 Pro) with:

valkey-benchmark -P 32 -d 1024 -r 1000000 -n 3000000 -c 50 -t set,get

Server ran with --io-threads 4 to avoid being server-bottlenecked.

Flamegraph sampling (macOS sample tool, 10s at 1ms intervals) of the valkey-benchmark process shows the following top userspace hotspots:

| Read path hotspot | Original | After compaction fix | After direct read |
| --- | --- | --- | --- |
| sdsrange → memmove (GetReply) | 250 | 0 ← eliminated | 0 |
| valkeyReaderFeed → sdscatlen → memmove | 38 | 43 | 0 ← eliminated |
| __recvfrom (kernel) | 246 | 266 | 295 |
| createStringObject → memmove | 25 | 27 | 28 |

After the first change, the sdsrange memmove (250 samples) is eliminated and the compaction cost is absorbed into valkeyReaderFeed with a negligible overhead increase (38 → 43 samples). After the second change, the sdscatlen → memmove copy is also eliminated.

@zuiderkwast zuiderkwast requested review from bjosv and michael-grunder and removed request for michael-grunder March 17, 2026 01:14

Previously, sdsrange() was called in valkeyReaderGetReply() to compact
the read buffer after consuming >= 1024 bytes. With pipelining, GetReply
is called many times per Feed call (once per reply), causing repeated
memmove() operations on the remaining buffer data.

Move the compaction to valkeyReaderFeed(), right before appending new
data. This ensures the memmove happens at most once per network read
instead of once per parsed reply.

Profiling
---------

Profiled using valkey-benchmark on macOS (Apple M4 Pro) with:

    valkey-benchmark -P 32 -d 1024 -r 1000000 -n 3000000 -c 50 -t set,get

Server ran with --io-threads 4 to avoid being server-bottlenecked.

Flamegraph sampling (macOS `sample` tool, 10s at 1ms intervals) of the
valkey-benchmark process shows the following top userspace hotspots:

Before (top samples in valkey-benchmark):

    250  valkeyReaderGetReply > sdsrange > _platform_memmove  ← #1 hotspot
    246  __recvfrom (kernel)
    188  __sendto (kernel)
     38  valkeyReaderFeed > sdscatlen > _platform_memmove
     25  createStringObject > _platform_memmove
     24  sdscatlen > _platform_memmove (write path)

After:

      0  valkeyReaderGetReply > sdsrange  ← eliminated
    266  __recvfrom (kernel)
    210  __sendto (kernel)
     43  valkeyReaderFeed > sdscatlen > _platform_memmove
     28  sdscatlen > _platform_memmove (write path)
     27  createStringObject > _platform_memmove

The sdsrange memmove in valkeyReaderGetReply, previously the #1
userspace hotspot (250 samples), is completely eliminated. The
compaction cost is absorbed into valkeyReaderFeed with negligible
overhead increase (38 -> 43 samples).

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

valkeyBufferRead() previously read network data into a 16KB stack buffer,
then valkeyReaderFeed() copied it into the reader's sds buffer via
sdscatlen(). This memcpy showed up as a hotspot in flamegraph profiling
of pipelined workloads.

Add valkeyReaderGetReadBuf() and valkeyReaderCommitRead() to the public
reader API. valkeyReaderGetReadBuf() compacts consumed data, ensures
sufficient writable space, and returns a pointer directly into the
reader's internal buffer. The caller can then recv() into it and call
valkeyReaderCommitRead() to advance the buffer length.

valkeyBufferRead() now uses this API for the standard TCP read path,
eliminating the intermediate stack buffer and the memcpy entirely.
The existing valkeyReaderFeed() API is unchanged for external users.
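
A toy model of this two-call pattern (the real valkeyReaderGetReadBuf/valkeyReaderCommitRead signatures may differ; the types and names below are illustrative): one function compacts and returns a writable pointer into the reader's own buffer, the other commits the number of bytes the recv() actually wrote.

```c
#include <stdlib.h>
#include <string.h>

/* Toy reader: growable buffer, parse position, length, capacity. */
typedef struct {
    char *buf;
    size_t pos, len, cap;
} reader;

/* Compact consumed data, ensure `want` writable bytes, and return a
 * pointer into the reader's internal buffer for recv() to fill. */
char *reader_get_read_buf(reader *r, size_t want) {
    if (r->pos > 0) {                       /* compact once, up front */
        memmove(r->buf, r->buf + r->pos, r->len - r->pos);
        r->len -= r->pos;
        r->pos = 0;
    }
    if (r->len + want > r->cap) {           /* ensure writable space */
        r->cap = r->len + want;
        r->buf = realloc(r->buf, r->cap);
    }
    return r->buf + r->len;                 /* recv() writes here */
}

/* After the recv(), advance the length; the data is already in place. */
void reader_commit_read(reader *r, size_t nread) {
    r->len += nread;
}
```

A caller would recv() directly into the returned pointer and commit the byte count, so the kernel copies data straight into the reader's buffer with no intermediate memcpy.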

Profiling (valkey-benchmark -P 32 -d 1024 -r 1000000, Apple M4 Pro):

    Before:  43 samples in valkeyReaderFeed > sdscatlen > memmove
    After:    0 samples (eliminated)

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

@zuiderkwast zuiderkwast force-pushed the optimize-sdsrange-calls branch from 78e3930 to 270f569 on March 17, 2026 01:53
@zuiderkwast
Collaborator Author

I added another optimization: reading directly into the reader's allocated buffer instead of going via an intermediate stack buffer.

| Read path hotspot | Original | After compaction fix | After direct read |
| --- | --- | --- | --- |
| sdsrange → memmove (GetReply) | 250 | 0 ← eliminated | 0 |
| valkeyReaderFeed → sdscatlen → memmove | 38 | 43 | 0 ← eliminated |
| __recvfrom (kernel) | 246 | 266 | 295 |
| createStringObject → memmove | 25 | 27 | 28 |

The sdscatlen copy in the read path (previously 38-43 samples) is now zero: data goes directly from the kernel into the reader's sds buffer. The kernel syscall samples increased proportionally, since recv() now accounts for a larger share of the remaining work.

The two optimizations combined eliminate the top two userspace hotspots in the read path. What remains is dominated by syscalls and the inherent cost of creating reply objects.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Replace hardcoded allocation counts in the sync OOM tests with runtime
discovery loops. Each test section now iterates from 0 allocations upward
until the operation succeeds, removing the need to manually find correct
counts after any change to internal allocation patterns.
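The discovery-loop pattern can be sketched as follows (a self-contained toy model; the allocator hook and function names are illustrative, not the actual test harness): allow N allocations, fail the (N+1)th, and raise N from zero until the operation under test succeeds.

```c
#include <stdlib.h>

/* Fault-injecting allocator: permit `allowed` allocations, fail the rest. */
static int allowed, used;

static void *try_malloc(size_t n) {
    if (used >= allowed)
        return NULL;    /* simulated OOM */
    used++;
    return malloc(n);
}

/* Example operation under test: needs exactly 3 allocations to succeed. */
static int op(void) {
    void *a = try_malloc(8), *b = try_malloc(8), *c = try_malloc(8);
    int ok = (a && b && c);
    free(a); free(b); free(c);
    return ok;
}

/* Iterate from 0 allocations upward until the operation succeeds,
 * discovering the required allocation count at runtime. */
int discover(void) {
    for (int n = 0;; n++) {
        allowed = n;
        used = 0;
        if (op())
            return n;   /* first count at which op() succeeds */
    }
}
```

Each iteration below the true count exercises a distinct OOM path, so the loop both finds the count and covers every partial-failure case along the way.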

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

@zuiderkwast zuiderkwast changed the title Move reader buffer compaction from GetReply to Feed Optimize read buffer compaction and reduce copying Mar 19, 2026
@zuiderkwast zuiderkwast merged commit b9285b5 into valkey-io:main Mar 19, 2026
48 checks passed
@zuiderkwast zuiderkwast deleted the optimize-sdsrange-calls branch March 20, 2026 12:51