Prefetch keys in MSET and MGET #2013

Draft · wants to merge 2 commits into base: unstable
Conversation

@zuiderkwast (Contributor) commented Apr 27, 2025

For multikey commands, prefetching the keys from the hash table to L1 cache can speed up the commands.

This change executes MGET and MSET in batches of 16 keys. For each batch, the keys are prefetched.

With the following benchmark, I got 13% higher throughput for MSET and 17% for MGET, when MSET and MGET are called with 10 keys each time.

taskset -c 0 ./valkey-server --save ""
taskset -c 1-6 ./valkey-benchmark --csv -r 1000000 -n 10000000 --threads 6 -t mset,mget

The benchmark test for MGET is added in #2015.

@zuiderkwast zuiderkwast requested a review from uriyage April 27, 2025 09:18
codecov bot commented Apr 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.10%. Comparing base (249495a) to head (89e5298).
Report is 1 commit behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2013      +/-   ##
============================================
+ Coverage     71.00%   71.10%   +0.09%     
============================================
  Files           123      123              
  Lines         66103    66133      +30     
============================================
+ Hits          46939    47022      +83     
+ Misses        19164    19111      -53     
Files with missing lines Coverage Δ
src/memory_prefetch.c 15.64% <100.00%> (+12.52%) ⬆️
src/t_string.c 96.82% <100.00%> (+0.04%) ⬆️

... and 19 files with indirect coverage changes


@zuiderkwast zuiderkwast force-pushed the prefetch-mset-mget branch 3 times, most recently from 591e36a to 9c0905b Compare April 27, 2025 19:28
Signed-off-by: Viktor Söderqvist <[email protected]>
int i, arg;
for (i = 0, arg = first; i < n && arg < c->argc; i++, arg += step) {
sds key = c->argv[arg]->ptr;
int slot = server.cluster_enabled ? getKeySlot(key) : 0;
Member
Shouldn't the slot be -1 when cluster mode is disabled?

Member
slot must be 0; -1 will break kvstoreGetHashtable (accessing array[-1]).

I think getKVStoreIndexForKey can be used here.

@zuiderkwast (Contributor Author) commented Apr 28, 2025

getKVStoreIndexForKey is static in db.c so either we make it non-static or I can just rename slot to kv_index?

Member
Not worth changing db.c. I'd just rename slot.

@xbasel (Member) commented Apr 28, 2025

I ran a quick test, MGET with 1k keys, and observed a degradation.
Perf shows the prefetch version shifts some work from findBucket into hashtableIncrementalFindStep, but the total mgetCommand CPU cost increased slightly.

@zuiderkwast, can you please double-check the scenario below?

Hardware: ARM (Graviton 3)
Results:

Before:
  100000 requests completed in 16.41 seconds
  50 parallel clients
  23017 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no
...

Summary:
  throughput summary: 6094.59 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        7.330     2.136     7.319     8.591     9.071    10.431



After:
  100000 requests completed in 19.68 seconds
  50 parallel clients
  23017 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no
...
Summary:
  throughput summary: 5080.27 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        8.934     1.192     8.943     9.735    10.031    15.151

Perf without prefetch:

               main
               aeMain
               |
               |--95.12%--connSocketEventHandler
               |          |
               |           --95.09%--readQueryFromClient
               |                     |
               |                     |--89.20%--processInputBuffer
               |                     |          |
               |                     |          |--56.24%--processCommand
               |                     |          |          |
               |                     |          |           --56.12%--call
               |                     |          |                     |
               |                     |          |                      --55.82%--mgetCommand
               |                     |          |                                |
               |                     |          |                                |--21.55%--findBucket.lto_priv.0
               |                     |          |                                |          |
               |                     |          |                                |          |--5.67%--hashtableSdsKeyCompare
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --2.99%--sdscmp
               |                     |          |                                |          |                     |
               |                     |          |                                |          |                      --1.32%--memcmp
               |                     |          |                                |          |
               |                     |          |                                |           --3.61%--hashtableObjectGetKey
               |                     |          |                                |
               |                     |          |                                |--14.65%--addReplyBulk
               |                     |          |                                |          |
               |                     |          |                                |          |--4.31%--addReply
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --1.20%--prepareClientToWrite
               |                     |          |                                |          |
               |                     |          |                                |          |--3.95%--_addReplyToBufferOrList.part.0
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --1.44%--__GI___memcpy_simd
               |                     |          |                                |          |
               |                     |          |                                |          |--2.23%--prepareClientToWrite
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --0.51%--clientHasPendingReplies
               |                     |          |                                |          |
               |                     |          |                                |          |--1.85%--_addReplyLongLongWithPrefix.lto_priv.0
               |                     |          |                                |          |
               |                     |          |                                |          |--0.74%--stringObjectLen
               |                     |          |                                |          |
               |                     |          |                                |           --0.57%--_addReplyToBufferOrList
               |                     |          |                                |
               |                     |          |                                |--3.02%--_addReplyToBufferOrList.part.0
               |                     |          |                                |          |
               |                     |          |                                |           --1.07%--__GI___memcpy_simd
               |                     |          |                                |
               |                     |          |                                |--2.74%--siphash
               |                     |          |                                |
               |                     |          |                                |--1.38%--expireIfNeededWithDictIndex.lto_priv.0
               |                     |          |                                |
               |                     |          |                                 --0.57%--_addReplyToBufferOrList

MGET with prefetch:

               main
               aeMain
               |
               |--96.17%--connSocketEventHandler
               |          |
               |           --96.13%--readQueryFromClient
               |                     |
               |                     |--90.71%--processInputBuffer
               |                     |          |
               |                     |          |--63.56%--processCommand
               |                     |          |          |
               |                     |          |           --63.46%--call
               |                     |          |                     |
               |                     |          |                      --63.08%--mgetCommand
               |                     |          |                                |
               |                     |          |                                |--15.45%--findBucket.lto_priv.0
               |                     |          |                                |          |
               |                     |          |                                |          |--4.99%--hashtableSdsKeyCompare
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --2.87%--sdscmp
               |                     |          |                                |          |                     |
               |                     |          |                                |          |                      --1.19%--memcmp
               |                     |          |                                |          |
               |                     |          |                                |           --1.13%--hashtableObjectGetKey
               |                     |          |                                |
               |                     |          |                                |--14.61%--mgetCommand
               |                     |          |                                |
               |                     |          |                                |--12.52%--hashtableIncrementalFindStep
               |                     |          |                                |          |
               |                     |          |                                |          |--3.96%--hashtableSdsKeyCompare
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --1.94%--sdscmp
               |                     |          |                                |          |                     |
               |                     |          |                                |          |                      --0.66%--memcmp
               |                     |          |                                |          |
               |                     |          |                                |           --1.24%--hashtableObjectGetKey
               |                     |          |                                |
               |                     |          |                                |--11.34%--addReplyBulk
               |                     |          |                                |          |
               |                     |          |                                |          |--3.25%--addReply
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --0.98%--prepareClientToWrite
               |                     |          |                                |          |
               |                     |          |                                |          |--3.16%--_addReplyToBufferOrList.part.0
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --0.98%--__GI___memcpy_simd
               |                     |          |                                |          |
               |                     |          |                                |          |--1.85%--prepareClientToWrite
               |                     |          |                                |          |
               |                     |          |                                |          |--1.32%--_addReplyLongLongWithPrefix.lto_priv.0
               |                     |          |                                |          |
               |                     |          |                                |           --0.56%--stringObjectLen
               |                     |          |                                |
               |                     |          |                                |--3.99%--siphash
               |                     |          |                                |
               |                     |          |                                |--2.40%--_addReplyToBufferOrList.part.0
               |                     |          |                                |          |
               |                     |          |                                |           --0.63%--__GI___memcpy_simd
               |                     |          |                                |
               |                     |          |                                |--1.20%--expireIfNeededWithDictIndex.lto_priv.0
               |                     |          |                                |
               |                     |          |                                |--0.68%--hashtableSdsHash
               |                     |          |                                |
               |                     |          |                                 --0.57%--_addReplyToBufferOrList

Command used:

src/valkey-benchmark -h <ip> -n 1000000    $(cat mget_command.txt)

mget_command.txt contains an MGET command with 1k keys.
Each value is 3 bytes.

@xbasel xbasel self-requested a review April 28, 2025 13:42
@zuiderkwast (Contributor Author)

@xbasel That's surprising. I modified valkey-benchmark MSET and MGET tests (from #2015) to run with 1000 keys. I got much better throughput with prefetching, but I'm just running on a laptop...

unstable

$ taskset -c 1-6 ./valkey-benchmark --csv -r 1000000 -n 10000 --threads 6 -t mset,mget
"test","rps","avg_latency_ms","min_latency_ms","p50_latency_ms","p95_latency_ms","p99_latency_ms","max_latency_ms"
"MSET (1000 keys)","1592.36","28.423","3.520","27.231","32.079","85.567","159.231"
"MGET (1000 keys)","2210.43","19.773","5.416","20.015","21.903","22.863","24.671"

prefetch

$ taskset -c 1-6 ./valkey-benchmark --csv -r 1000000 -n 10000 --threads 6 -t mset,mget
"test","rps","avg_latency_ms","min_latency_ms","p50_latency_ms","p95_latency_ms","p99_latency_ms","max_latency_ms"
"MSET (1000 keys)","1987.68","22.103","4.304","21.023","24.831","68.415","150.655"
"MGET (1000 keys)","3599.71","11.182","3.512","11.175","13.871","16.063","19.807"

@xbasel (Member) commented Apr 28, 2025

> @xbasel That's surprising. I modified valkey-benchmark MSET and MGET tests (from #2015) to run with 1000 keys. I got much better throughput with prefetching, but I'm just running on a laptop... [benchmark results quoted above]

Your input is random; I think this could be related. I'll test with random input.

@xbasel (Member) commented Apr 28, 2025

> Your input is random; I think this could be related. I'll test with random input.

I tested with random input and saw an improvement. When the data is already cached, prefetching is just overhead; the data set I used was too small.

@@ -606,10 +606,14 @@ void getrangeCommand(client *c) {
}

void mgetCommand(client *c) {
const int batch_size = 16;
Contributor
Should we make this global to avoid multiple declarations?

Contributor Author
I considered using the existing config prefetch-batch-max-size but according to documentation, it's specifically for IO threading.

I can change it to a define, but I don't know if we should use the same number in all multi-key commands. I guess it depends on how many other operations happen in between that can evict the prefetched keys from the L1 cache.

Member
I don't think the batch size should be global - it should be per-command/unit, possibly hardcoded, since the optimal value likely varies across commands and code paths.

Contributor
True, it should be per command, but must MGET and MSET have the same batch_size?

/* A simple way to prefetch the keys for a client command execution. This can be
* used for optimizing multi-key commands like mget, mset, etc. */
void prefetchKeys(client *c, int first, int step, int count) {
/* Skip this for IO threads. Keys are already batch-prefetched. */
Member
Not all keys are prefetched in the IO-threads context; only up to 16 keys are.

Contributor Author
That's right. What do you suggest? Move the check to the caller and skip prefetching the first 16 if IO threads are used?

Contributor Author
I wanted to optimize for ~10 keys. MGET with thousands of keys isn't a good pattern since it starves other clients, so that's less important IMO.

@xbasel (Member) Apr 30, 2025
Maybe MGET with a large number of keys isn't a good pattern; however, clients still send large key batches in practice.

I suggest either skipping the first 16 keys if IO threads are enabled, or avoiding lookupKey entirely; see my other comment.

@xbasel (Member) commented Apr 30, 2025

This change effectively does this (at least for MGET; I only looked at MGET):

for (key in keys) {
    prefetch_16_keys_every_16_iterations();
    val = lookupKeyRead(key);
}

The prefetch phase already does the heavy lifting - walking the hashtable and locating entries. That work is then repeated by lookupKeyRead.

Yes, some (likely most) data stays in L1, but there's still wasted effort and potential stalls for uncached or evicted entries, or just due to the code path.

I don't see why we can't fetch the actual values during prefetch and reuse them. The current "scan and discard" pattern makes sense in the IO-threads context (pre-processing), where the lookup happens later, but here it would be cleaner and more efficient to return the data directly.

I profiled the execution and found:

--33.62%--mgetCommand
          |
          |--16.80%--prefetchKeys
          |          |
          |          |--13.97%--hashtableIncrementalFindStep
          |          |          |
          |          |          |--6.21%--compareKeys (inlined)
          |          |          |          |
          |          |          |           --6.06%--hashtableSdsKeyCompare
          |          |          |                     |
          |          |          |                      --5.49%--sdscmp
          |          |          |                                |
          |          |          |                                 --4.79%--__memcmp_evex_movbe
          |          |          |
          |          |           --2.55%--entryGetKey (inlined)
          |          |                     |
          |          |                      --2.45%--hashtableObjectGetKey
          |          |                                |
          |          |                                 --2.44%--objectGetKey (inlined)
          |          |
          |           --1.75%--hashtableIncrementalFindInit (inlined)
          |                     hashtableIncrementalFindInit (inlined)
          |                     |
          |                      --1.72%--hashKey (inlined)
          |                                |
          |                                 --1.33%--siphash
          |
          |--8.90%--lookupKeyRead (inlined)
          |          lookupKeyReadWithFlags (inlined)
          |          |
          |           --8.73%--lookupKey
          |                     |
          |                      --7.31%--dbFindWithDictIndex (inlined)
          |                                |
          |                                 --7.18%--kvstoreHashtableFind (inlined)
          |                                           |
          |                                            --7.09%--hashtableFind
          |                                                      |
          |                                                      |--4.73%--findBucket (inlined)
          |                                                      |          |
          |                                                      |          |--1.84%--compareKeys (inlined)
          |                                                      |          |          |
          |                                                      |          |           --1.67%--hashtableSdsKeyCompare
          |                                                      |          |                     |
          |                                                      |          |                     |--0.96%--sdscmp
          |                                                      |          |                     |
          |                                                      |          |                      --0.63%--sdslen (inl

          |                                                      |          |
          |                                                      |           --1.22%--entryGetKey (inlined)
          |                                                      |
          |                                                       --1.87%--hashKey (inlined)
          |                                                                 |
          |                                                                  --1.48%--siphash
          |

It seems hashtableFind is still causing memory stalls.

I modified the code slightly to make lookupKeyRead redundant by changing:

void prefetchKeys(client *c, int first, int step, int count, void **out)

and using out to iterate over the entries directly.

I'm not 100% sure my code is correct (it was a quick-and-dirty experiment), but I achieved this:

Summary:
  throughput: 4432.23 requests/sec
  latency (ms):
       avg     min     p50     p95     p99     max
     10.203   3.312  10.103  11.991  13.807  25.695

Compared to the original version using prefetchKeys and lookupKeyRead:

Summary:
  throughput: 3497.24 requests/sec
  latency (ms):
       avg     min     p50     p95     p99     max
     12.934   5.824  12.887  14.815  16.271  21.535

Again, no guarantees my code is correct, but it’s a direction worth exploring. If you're interested, I can try to clean it up and submit it here or in a separate branch.

@zuiderkwast (Contributor Author)

> I modified the code slightly to make lookupKeyRead redundant by changing:

Yeah, I'm aware that batch-prefetching is just a hack, not a perfect solution.

lookupKey() has side effects like eviction and key hits/misses stats. We would need a version of lookupKey that only does that part, without actually looking up the key itself.

I also considered a lookupMultipleKeys() that looks up multiple keys in parallel.

MSET doesn't use lookupKey(), but it uses setKey(). For this, we could have a setMultipleKeys() variant that does it in parallel. It's faster but also more complex; the challenge is keeping it simple and avoiding duplicated code.

@xbasel (Member) commented Apr 30, 2025

> Yeah, I'm aware that batch-prefetching is just a hack, not a perfect solution. [full comment quoted above]

I might have slightly underestimated the complexity. I think it's fine to start with the current code and optimize later.

@zuiderkwast zuiderkwast marked this pull request as draft May 16, 2025 10:42
@zuiderkwast (Contributor Author)

I believe this is a better and more generic approach:

4 participants