Prefetch keys in MSET and MGET #2013
Conversation
Signed-off-by: Viktor Söderqvist <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
## unstable #2013 +/- ##
============================================
+ Coverage 71.00% 71.10% +0.09%
============================================
Files 123 123
Lines 66103 66133 +30
============================================
+ Hits 46939 47022 +83
+ Misses 19164 19111 -53
Force-pushed from 591e36a to 9c0905b.
Force-pushed from d10b2da to 89e5298.
int i, arg;
for (i = 0, arg = first; i < n && arg < c->argc; i++, arg += step) {
    sds key = c->argv[arg]->ptr;
    int slot = server.cluster_enabled ? getKeySlot(key) : 0;
Shouldn't the slot be -1 when cluster mode is disabled?
slot must be 0 here. With -1, kvstoreGetHashtable would access array[-1]. I think getKVStoreIndexForKey can be used here.
getKVStoreIndexForKey is static in db.c, so either we make it non-static or I can just rename slot to kv_index?
not worth changing db.c. I'd just rename slot.
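For readers following along, a minimal compilable sketch of the semantics settled on above. server and getKeySlot here are stand-ins for the real Valkey definitions (assumptions, not the actual code), and kv_index is the rename suggested in this thread:

#include <stdio.h>

/* Stand-ins for Valkey internals; assumptions for this sketch only. */
static struct { int cluster_enabled; } server = {0};
static int getKeySlot(const char *key) {
    (void)key;
    return 42; /* pretend this is the CRC16-based cluster slot */
}

/* In cluster mode the kvstore keeps one hashtable per slot; in
 * standalone mode there is a single table at index 0. An index of -1
 * would make kvstoreGetHashtable read array[-1], hence 0, not -1. */
static int kv_index_for_key(const char *key) {
    return server.cluster_enabled ? getKeySlot(key) : 0;
}

int main(void) {
    printf("standalone: kv_index = %d\n", kv_index_for_key("foo")); /* 0 */
    server.cluster_enabled = 1;
    printf("cluster:    kv_index = %d\n", kv_index_for_key("foo")); /* 42 */
    return 0;
}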
I performed a quick test. @zuiderkwast, can you please double-check the scenario below? Hardware: ARM (Graviton 3)
Perf without prefetch:
MGET with prefetch:
Command used:
@xbasel That's surprising. I modified the valkey-benchmark MSET and MGET tests (from #2015) to run with 1000 keys. I got much better throughput with prefetching, but I'm just running on a laptop...
Your input is random; I think this could be related. I'll test with random input.
I tested with random input and saw improvement. When the data is already cached, prefetching is just overhead. The data set I used was too small.
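A stand-alone micro-benchmark (my own sketch, not code from this PR) that reproduces this effect: batched __builtin_prefetch helps when the pointed-to cache lines are cold, and is pure overhead when the working set already fits in cache. Lower N to see the cached case. Compile with gcc -O2 on Linux:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BATCH 16
#define N (1 << 21) /* 2M pointers (~128 MB); try 1 << 12 for the all-cached case */

static long sum(long **ptrs, size_t n, int prefetch) {
    long s = 0;
    for (size_t i = 0; i < n; i += BATCH) {
        size_t end = i + BATCH < n ? i + BATCH : n;
        if (prefetch)
            for (size_t j = i; j < end; j++)
                __builtin_prefetch(ptrs[j], 0, 1); /* read, low temporal locality */
        for (size_t j = i; j < end; j++) s += *ptrs[j];
    }
    return s;
}

int main(void) {
    /* 8 longs = 64 bytes per slot, so each pointer targets its own cache line. */
    long *data = calloc(N, 8 * sizeof(long));
    long **ptrs = malloc(N * sizeof(*ptrs));
    if (!data || !ptrs) return 1;
    srand(1);
    for (size_t i = 0; i < N; i++) ptrs[i] = &data[(size_t)(rand() % N) * 8];
    for (int p = 0; p <= 1; p++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        long s = sum(ptrs, N, p);
        clock_gettime(CLOCK_MONOTONIC, &b);
        double ms = (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
        printf("prefetch=%d sum=%ld %.1f ms\n", p, s, ms);
    }
    free(ptrs);
    free(data);
    return 0;
}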
@@ -606,10 +606,14 @@ void getrangeCommand(client *c) {
}

void mgetCommand(client *c) {
    const int batch_size = 16;
should we make this global to avoid multiple declarations?
I considered using the existing config prefetch-batch-max-size, but according to the documentation it's specifically for IO threading.
I can change it to a define, but I don't know if we should use the same number in all multi-key commands. I guess it depends on how many other operations happen in between that can evict the prefetched keys from the L1 cache.
I don't think the batch size should be global - it should be per-command/unit, possibly hardcoded, since the optimal value likely varies across commands and code paths.
True, it should be per command, but should MGET and MSET then be allowed to have different batch_size values?
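If named constants are wanted, one option (just a naming sketch, not in the PR) is per-command defines so each code path can tune its own value later:

/* Hypothetical names; the PR currently uses a function-local
 * const int batch_size = 16; in each command. Separate defines keep
 * the door open for different per-command values. */
#define MGET_PREFETCH_BATCH_SIZE 16
#define MSET_PREFETCH_BATCH_SIZE 16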
/* A simple way to prefetch the keys for a client command execution. This can be
 * used for optimizing multi-key commands like mget, mset, etc. */
void prefetchKeys(client *c, int first, int step, int count) {
    /* Skip this for IO threads. Keys are already batch-prefetched. */
Not all keys are prefetched in the io-threads context; only up to 16 keys are.
That's right. What do you suggest? Move the check to the caller and skip prefetching the first 16 keys if io-threads are used?
I wanted to optimize for ~10 keys. MGET with thousands of keys isn't a good pattern since it starves other clients, so that's less important IMO.
Maybe MGET with a large number of keys isn't a good pattern; however, clients still send large key batches in practice.
I suggest either skipping the first 16 keys if IO threads are enabled, or avoiding lookupKey entirely; see my other comment.
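A rough sketch of the first option, to be placed inside prefetchKeys(). The 16-key IO-thread limit comes from this thread; server.io_threads_num is the real config field, but the plain __builtin_prefetch of the key string is a simplification (the real code would prefetch hashtable buckets):

/* Rather than returning early when IO threads are active, skip only
 * the keys they already batch-prefetched (up to 16 per command). */
int skip = (server.io_threads_num > 1) ? 16 : 0;
int i, arg;
for (i = skip, arg = first + skip * step; i < count && arg < c->argc; i++, arg += step) {
    __builtin_prefetch(c->argv[arg]->ptr);
}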
This change effectively does this (at least for ...).

The prefetch phase already does the heavy lifting: walking the hashtable and locating the entries. That work is then repeated by lookupKeyRead. Yes, some (likely most) data stays in L1, but there's still wasted effort and potential stalls for uncached or evicted entries, or just due to the code path. I don't see why we can't fetch the actual values during prefetch and reuse them. The current "scan and discard" pattern makes sense in the IO-threads context (pre-processing), where the lookup happens later, but here it would be cleaner and more efficient to return the data directly.

I profiled the execution and found: ... It seems ...

I modified the code slightly to make ... and used out to iterate over the entries directly. I'm not 100% sure my code is correct, it was a quick and dirty experiment, but I achieved this: ...

Compared to the original version using prefetchKeys and lookupKeyRead: ...

Again, no guarantees my code is correct, but it's a direction worth exploring. If you're interested, I can try to clean it up and submit it here or in a separate branch.
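To make the idea concrete, here is a toy stand-alone version of the pattern described above (a simplified open-addressing table, not Valkey's actual hashtable API): phase 1 prefetches the home buckets and hands back the located entries, so phase 2 can use them without repeating the lookup:

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 1024
typedef struct { const char *key; const char *val; } entry;
static entry table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

static void put(const char *k, const char *v) {
    unsigned i = hash(k);
    while (table[i].key) i = (i + 1) % TABLE_SIZE; /* linear probing */
    table[i].key = k;
    table[i].val = v;
}

static entry *find(const char *k) {
    unsigned i = hash(k);
    while (table[i].key) {
        if (strcmp(table[i].key, k) == 0) return &table[i];
        i = (i + 1) % TABLE_SIZE;
    }
    return NULL;
}

/* Phase 1: prefetch each key's home bucket, then locate the entries
 * once and return them via out, so phase 2 needn't walk the table. */
static void prefetch_and_find(const char **keys, int n, entry **out) {
    for (int i = 0; i < n; i++) __builtin_prefetch(&table[hash(keys[i])]);
    for (int i = 0; i < n; i++) out[i] = find(keys[i]);
}

int main(void) {
    put("a", "1"); put("b", "2"); put("c", "3");
    const char *keys[] = {"a", "b", "c", "missing"};
    entry *out[4];
    prefetch_and_find(keys, 4, out);
    for (int i = 0; i < 4; i++) /* Phase 2: consume the entries directly. */
        printf("%s -> %s\n", keys[i], out[i] ? out[i]->val : "(nil)");
    return 0;
}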
Yeah, I'm aware that batch-prefetching is just a hack, not a perfect solution.

I also considered a ... MSET doesn't use ...
I might have slightly underestimated the complexity. I think it's fine to start with the current code and optimize later.
I believe this is a better and more generic approach: ...
For multi-key commands, prefetching the keys from the hash table into the L1 cache can speed up the commands.
This change executes MGET and MSET in batches of 16 keys. For each batch, the keys are prefetched.
With the following benchmark, I got 13% higher throughput for MSET and 17% for MGET, when MSET and MGET are called with 10 keys each time.
The benchmark test for MGET is added in #2015.
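For reference, the batching shape this description implies, as a hedged sketch (not the exact diff; prefetchKeys' signature is taken from the review snippet above, and the reply helpers are the usual ones from the codebase):

void mgetCommand(client *c) {
    const int batch_size = 16;
    addReplyArrayLen(c, c->argc - 1);
    for (int first = 1; first < c->argc; first += batch_size) {
        int n = c->argc - first < batch_size ? c->argc - first : batch_size;
        /* Prefetch this batch of keys (args are consecutive, so step is 1). */
        prefetchKeys(c, first, 1, n);
        for (int j = first; j < first + n; j++) {
            robj *o = lookupKeyRead(c->db, c->argv[j]);
            if (o == NULL || o->type != OBJ_STRING) {
                addReplyNull(c);
            } else {
                addReplyBulk(c, o);
            }
        }
    }
}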