Prefetch keys in MSET and MGET #2013

Draft · wants to merge 2 commits into base: unstable
Conversation

@zuiderkwast (Contributor) commented Apr 27, 2025

For multikey commands, prefetching the keys from the hash table to L1 cache can speed up the commands.

This change executes MGET and MSET in batches of 16 keys. For each batch, the keys are prefetched.

With the following benchmark, I got 13% higher throughput for MSET and 17% for MGET, when MSET and MGET are called with 10 keys each time.

taskset -c 0 ./valkey-server --save ""
taskset -c 1-6 ./valkey-benchmark --csv -r 1000000 -n 10000000 --threads 6 -t mset,mget

The benchmark test for MGET is added in #2015.

@zuiderkwast zuiderkwast requested a review from uriyage April 27, 2025 09:18
codecov bot commented Apr 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.10%. Comparing base (249495a) to head (89e5298).
Report is 1 commit behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2013      +/-   ##
============================================
+ Coverage     71.00%   71.10%   +0.09%     
============================================
  Files           123      123              
  Lines         66103    66133      +30     
============================================
+ Hits          46939    47022      +83     
+ Misses        19164    19111      -53     
Files with missing lines Coverage Δ
src/memory_prefetch.c 15.64% <100.00%> (+12.52%) ⬆️
src/t_string.c 96.82% <100.00%> (+0.04%) ⬆️

... and 19 files with indirect coverage changes


@zuiderkwast zuiderkwast force-pushed the prefetch-mset-mget branch 3 times, most recently from 591e36a to 9c0905b Compare April 27, 2025 19:28
Signed-off-by: Viktor Söderqvist <[email protected]>
int i, arg;
for (i = 0, arg = first; i < n && arg < c->argc; i++, arg += step) {
sds key = c->argv[arg]->ptr;
int slot = server.cluster_enabled ? getKeySlot(key) : 0;
Member
Shouldn't the slot be -1 when cluster mode is disabled?

Member
slot must be 0; -1 will break kvstoreGetHashtable (accessing array[-1]).

I think getKVStoreIndexForKey can be used here.

@zuiderkwast (Contributor Author) commented Apr 28, 2025

getKVStoreIndexForKey is static in db.c so either we make it non-static or I can just rename slot to kv_index?

Member
Not worth changing db.c. I'd just rename slot.

@xbasel (Member) commented Apr 28, 2025

I ran a quick test, MGET with 1k keys, and observed a degradation.
Perf shows the prefetch version shifts some work from findBucket into hashtableIncrementalFindStep, but the total mgetCommand CPU cost increased slightly.

@zuiderkwast, can you please double-check the scenario below?

Hardware: ARM (Graviton 3)
Results:

Before:
  100000 requests completed in 16.41 seconds
  50 parallel clients
  23017 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no
...

Summary:
  throughput summary: 6094.59 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        7.330     2.136     7.319     8.591     9.071    10.431



After:
  100000 requests completed in 19.68 seconds
  50 parallel clients
  23017 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no
...
Summary:
  throughput summary: 5080.27 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        8.934     1.192     8.943     9.735    10.031    15.151

Perf without prefetch:

               main
               aeMain
               |
               |--95.12%--connSocketEventHandler
               |          |
               |           --95.09%--readQueryFromClient
               |                     |
               |                     |--89.20%--processInputBuffer
               |                     |          |
               |                     |          |--56.24%--processCommand
               |                     |          |          |
               |                     |          |           --56.12%--call
               |                     |          |                     |
               |                     |          |                      --55.82%--mgetCommand
               |                     |          |                                |
               |                     |          |                                |--21.55%--findBucket.lto_priv.0
               |                     |          |                                |          |
               |                     |          |                                |          |--5.67%--hashtableSdsKeyCompare
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --2.99%--sdscmp
               |                     |          |                                |          |                     |
               |                     |          |                                |          |                      --1.32%--memcmp
               |                     |          |                                |          |
               |                     |          |                                |           --3.61%--hashtableObjectGetKey
               |                     |          |                                |
               |                     |          |                                |--14.65%--addReplyBulk
               |                     |          |                                |          |
               |                     |          |                                |          |--4.31%--addReply
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --1.20%--prepareClientToWrite
               |                     |          |                                |          |
               |                     |          |                                |          |--3.95%--_addReplyToBufferOrList.part.0
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --1.44%--__GI___memcpy_simd
               |                     |          |                                |          |
               |                     |          |                                |          |--2.23%--prepareClientToWrite
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --0.51%--clientHasPendingReplies
               |                     |          |                                |          |
               |                     |          |                                |          |--1.85%--_addReplyLongLongWithPrefix.lto_priv.0
               |                     |          |                                |          |
               |                     |          |                                |          |--0.74%--stringObjectLen
               |                     |          |                                |          |
               |                     |          |                                |           --0.57%--_addReplyToBufferOrList
               |                     |          |                                |
               |                     |          |                                |--3.02%--_addReplyToBufferOrList.part.0
               |                     |          |                                |          |
               |                     |          |                                |           --1.07%--__GI___memcpy_simd
               |                     |          |                                |
               |                     |          |                                |--2.74%--siphash
               |                     |          |                                |
               |                     |          |                                |--1.38%--expireIfNeededWithDictIndex.lto_priv.0
               |                     |          |                                |
               |                     |          |                                 --0.57%--_addReplyToBufferOrList

MGET with prefetch:

               main
               aeMain
               |
               |--96.17%--connSocketEventHandler
               |          |
               |           --96.13%--readQueryFromClient
               |                     |
               |                     |--90.71%--processInputBuffer
               |                     |          |
               |                     |          |--63.56%--processCommand
               |                     |          |          |
               |                     |          |           --63.46%--call
               |                     |          |                     |
               |                     |          |                      --63.08%--mgetCommand
               |                     |          |                                |
               |                     |          |                                |--15.45%--findBucket.lto_priv.0
               |                     |          |                                |          |
               |                     |          |                                |          |--4.99%--hashtableSdsKeyCompare
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --2.87%--sdscmp
               |                     |          |                                |          |                     |
               |                     |          |                                |          |                      --1.19%--memcmp
               |                     |          |                                |          |
               |                     |          |                                |           --1.13%--hashtableObjectGetKey
               |                     |          |                                |
               |                     |          |                                |--14.61%--mgetCommand
               |                     |          |                                |
               |                     |          |                                |--12.52%--hashtableIncrementalFindStep
               |                     |          |                                |          |
               |                     |          |                                |          |--3.96%--hashtableSdsKeyCompare
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --1.94%--sdscmp
               |                     |          |                                |          |                     |
               |                     |          |                                |          |                      --0.66%--memcmp
               |                     |          |                                |          |
               |                     |          |                                |           --1.24%--hashtableObjectGetKey
               |                     |          |                                |
               |                     |          |                                |--11.34%--addReplyBulk
               |                     |          |                                |          |
               |                     |          |                                |          |--3.25%--addReply
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --0.98%--prepareClientToWrite
               |                     |          |                                |          |
               |                     |          |                                |          |--3.16%--_addReplyToBufferOrList.part.0
               |                     |          |                                |          |          |
               |                     |          |                                |          |           --0.98%--__GI___memcpy_simd
               |                     |          |                                |          |
               |                     |          |                                |          |--1.85%--prepareClientToWrite
               |                     |          |                                |          |
               |                     |          |                                |          |--1.32%--_addReplyLongLongWithPrefix.lto_priv.0
               |                     |          |                                |          |
               |                     |          |                                |           --0.56%--stringObjectLen
               |                     |          |                                |
               |                     |          |                                |--3.99%--siphash
               |                     |          |                                |
               |                     |          |                                |--2.40%--_addReplyToBufferOrList.part.0
               |                     |          |                                |          |
               |                     |          |                                |           --0.63%--__GI___memcpy_simd
               |                     |          |                                |
               |                     |          |                                |--1.20%--expireIfNeededWithDictIndex.lto_priv.0
               |                     |          |                                |
               |                     |          |                                |--0.68%--hashtableSdsHash
               |                     |          |                                |
               |                     |          |                                 --0.57%--_addReplyToBufferOrList

Command used:

src/valkey-benchmark -h <ip> -n 1000000    $(cat mget_command.txt)

mget_command.txt contains an MGET command with 1k keys.
Each value is 3 bytes.

@xbasel xbasel self-requested a review April 28, 2025 13:42
@zuiderkwast (Contributor Author)

@xbasel That's surprising. I modified valkey-benchmark MSET and MGET tests (from #2015) to run with 1000 keys. I got much better throughput with prefetching, but I'm just running on a laptop...

unstable

$ taskset -c 1-6 ./valkey-benchmark --csv -r 1000000 -n 10000 --threads 6 -t mset,mget
"test","rps","avg_latency_ms","min_latency_ms","p50_latency_ms","p95_latency_ms","p99_latency_ms","max_latency_ms"
"MSET (1000 keys)","1592.36","28.423","3.520","27.231","32.079","85.567","159.231"
"MGET (1000 keys)","2210.43","19.773","5.416","20.015","21.903","22.863","24.671"

prefetch

$ taskset -c 1-6 ./valkey-benchmark --csv -r 1000000 -n 10000 --threads 6 -t mset,mget
"test","rps","avg_latency_ms","min_latency_ms","p50_latency_ms","p95_latency_ms","p99_latency_ms","max_latency_ms"
"MSET (1000 keys)","1987.68","22.103","4.304","21.023","24.831","68.415","150.655"
"MGET (1000 keys)","3599.71","11.182","3.512","11.175","13.871","16.063","19.807"

@xbasel (Member) commented Apr 28, 2025

> @xbasel That's surprising. I modified valkey-benchmark MSET and MGET tests (from #2015) to run with 1000 keys. I got much better throughput with prefetching, but I'm just running on a laptop... [benchmark results quoted above]

Your input is random; I think this could be related. I'll test with random input.

@xbasel (Member) commented Apr 28, 2025

> Your input is random; I think this could be related. I'll test with random input.

I tested with random input and saw an improvement. When the data is already cached, prefetching is just overhead; the data set I used was too small.

@@ -606,10 +606,14 @@ void getrangeCommand(client *c) {
}

void mgetCommand(client *c) {
const int batch_size = 16;
Contributor
Should we make this global to avoid multiple declarations?

Contributor Author
I considered using the existing config prefetch-batch-max-size but according to documentation, it's specifically for IO threading.

I can change it to a define, but I don't know if we should use the same number in all multi-key commands. I guess it depends on how many other operations happen in between that can evict the prefetched keys from the L1 cache.

Member
I don't think the batch size should be global - it should be per-command/unit, possibly hardcoded, since the optimal value likely varies across commands and code paths.

Contributor
True, it should be per command, but must MGET and MSET have the same batch_size?

/* A simple way to prefetch the keys for a client command execution. This can be
* used for optimizing multi-key commands like mget, mset, etc. */
void prefetchKeys(client *c, int first, int step, int count) {
/* Skip this for IO threads. Keys are already batch-prefetched. */
Member
Not all keys are prefetched in the IO-threads context; only up to 16 keys are.

Contributor Author
That's right. What do you suggest? Move the check to the caller and skip prefetching the first 16 if IO threads are used?

Contributor Author
I wanted to optimize for ~10 keys. MGET with thousands of keys isn't a good pattern since it starves other clients, so that's less important IMO.

@xbasel (Member) Apr 30, 2025
Maybe MGET with a large number of keys isn't a good pattern; however, clients still send large key batches in practice.

I suggest either skipping the first 16 keys if IO threads are enabled, or avoiding lookupKey entirely; see my other comment.

@xbasel (Member) commented Apr 30, 2025

This change effectively does this (at least for MGET; I only looked at MGET):

for (key in keys) {
    prefetch_16_keys_every_16_iterations();
    val = lookupKeyRead(key);
}

The prefetch phase already does the heavy lifting - walking the hashtable and locating entries. That work is then repeated by lookupKeyRead.

Yes, some (likely most) data stays in L1, but there's still wasted effort and potential stalls for uncached or evicted entries, or just due to the code path.

I don't see why we can't fetch the actual values during prefetch and reuse them. The current "scan and discard" pattern makes sense in the IO-threads context (pre-processing), where the lookup happens later, but here it would be cleaner and more efficient to return the data directly.

I profiled the execution and found:

--33.62%--mgetCommand
          |
          |--16.80%--prefetchKeys
          |          |
          |          |--13.97%--hashtableIncrementalFindStep
          |          |          |
          |          |          |--6.21%--compareKeys (inlined)
          |          |          |          |
          |          |          |           --6.06%--hashtableSdsKeyCompare
          |          |          |                     |
          |          |          |                      --5.49%--sdscmp
          |          |          |                                |
          |          |          |                                 --4.79%--__memcmp_evex_movbe
          |          |          |
          |          |           --2.55%--entryGetKey (inlined)
          |          |                     |
          |          |                      --2.45%--hashtableObjectGetKey
          |          |                                |
          |          |                                 --2.44%--objectGetKey (inlined)
          |          |
          |           --1.75%--hashtableIncrementalFindInit (inlined)
          |                     hashtableIncrementalFindInit (inlined)
          |                     |
          |                      --1.72%--hashKey (inlined)
          |                                |
          |                                 --1.33%--siphash
          |
          |--8.90%--lookupKeyRead (inlined)
          |          lookupKeyReadWithFlags (inlined)
          |          |
          |           --8.73%--lookupKey
          |                     |
          |                      --7.31%--dbFindWithDictIndex (inlined)
          |                                |
          |                                 --7.18%--kvstoreHashtableFind (inlined)
          |                                           |
          |                                            --7.09%--hashtableFind
          |                                                      |
          |                                                      |--4.73%--findBucket (inlined)
          |                                                      |          |
          |                                                      |          |--1.84%--compareKeys (inlined)
          |                                                      |          |          |
          |                                                      |          |           --1.67%--hashtableSdsKeyCompare
          |                                                      |          |                     |
          |                                                      |          |                     |--0.96%--sdscmp
          |                                                      |          |                     |
          |                                                      |          |                      --0.63%--sdslen (inl

          |                                                      |          |
          |                                                      |           --1.22%--entryGetKey (inlined)
          |                                                      |
          |                                                       --1.87%--hashKey (inlined)
          |                                                                 |
          |                                                                  --1.48%--siphash
          |

It seems hashtableFind is still causing memory stalls.

I modified the code slightly to make lookupKeyRead redundant by changing:

void prefetchKeys(client *c, int first, int step, int count, void **out)

and using out to iterate over the entries directly.

I'm not 100% sure my code is correct (it was a quick-and-dirty experiment), but I achieved this:

Summary:
  throughput: 4432.23 requests/sec
  latency (ms):
       avg     min     p50     p95     p99     max
     10.203   3.312  10.103  11.991  13.807  25.695

Compared to the original version using prefetchKeys and lookupKeyRead:

Summary:
  throughput: 3497.24 requests/sec
  latency (ms):
       avg     min     p50     p95     p99     max
     12.934   5.824  12.887  14.815  16.271  21.535

Again, no guarantees my code is correct, but it’s a direction worth exploring. If you're interested, I can try to clean it up and submit it here or in a separate branch.

@zuiderkwast (Contributor Author)

> I modified the code slightly to make lookupKeyRead redundant by changing:

Yeah, I'm aware that batch-prefetching is just a hack, not a perfect solution.

lookupKey() has side effects like eviction and key hits/misses stats. We would need a version of lookupKey that only does that part, without actually looking up the key itself.

I also considered a lookupMultipleKeys() that looks up multiple keys in parallel.

MSET doesn't use lookupKey(), but it uses setKey(). For this, we could have a setMultipleKeys() variant that does it in parallel. It's faster but also more complex; the challenge is keeping it simple and avoiding duplicated code.

@xbasel (Member) commented Apr 30, 2025

> Yeah, I'm aware that batch-prefetching is just a hack, not a perfect solution. [full comment quoted above]

I might have slightly underestimated the complexity. I think it's fine to start with the current code and optimize later.

@zuiderkwast zuiderkwast marked this pull request as draft May 16, 2025 10:42
@zuiderkwast (Contributor Author)

I believe this is a better and more generic approach:

4 participants