Refactor some of ASM and slot-stats functions#26
Conversation
WalkthroughThis PR refactors cluster initialization and lifecycle management by introducing Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/cluster_legacy.c`:
- Around line 5259-5284: The clusterClaimUnassignedSlots function currently
calls clusterAddSlot(myself, i) and clusterSaveConfigOrDie but does not request
the usual state/broadcast updates; modify it so that after taking a slot you
mark the need to update state and notify peers by invoking
clusterDoBeforeSleep() with CLUSTER_TODO_UPDATE_STATE and
CLUSTER_TODO_BROADCAST_PONG (or accumulate those flags into update_config and
call clusterDoBeforeSleep(update_flags) after the loop), ensuring
clusterAddSlot, clusterDoBeforeSleep, CLUSTER_TODO_UPDATE_STATE, and
CLUSTER_TODO_BROADCAST_PONG are used exactly as in the analogous code paths so
peers are informed of the new slot ownership before saving config.
In `@src/cluster.c`:
- Around line 2137-2141: Replace the deprecated malloc call in clusterCommonInit
by allocating zeroed memory with zcalloc: allocate server.cluster_slot_stats
using zcalloc(CLUSTER_SLOTS, sizeof(clusterSlotStat)) instead of malloc(...),
keeping the subsequent resetClusterStats() and asmInit() calls unchanged; update
any necessary includes/headers if zcalloc isn’t already available in this
compilation unit.
- Around line 2171-2214: The function comment for verifyClusterConfigWithData is
inaccurate about handling keys for slots owned by other nodes: instead of
marking such slots IMPORTING, the implementation calls
clusterDeleteKeysInUnownedSlots() which removes those keys (safe with atomic
slot migration), then calls clusterClaimUnassignedSlots() to take truly
unassigned slots; update the comment to reflect that behavior change by
describing deletion via clusterDeleteKeysInUnownedSlots() and claiming via
clusterClaimUnassignedSlots() rather than setting IMPORTING.
🧹 Nitpick comments (1)
src/server.c (1)
1841-1844: Consider documenting the ASM vs. cluster before-sleep order.
Line 1841 now callsclusterBeforeSleep()thenasmBeforeSleep(), while cron runs ASM first. If this asymmetry is intentional, a short comment would help prevent regressions; if not, align the order.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/cluster.csrc/cluster.hsrc/cluster_legacy.csrc/server.c
🧰 Additional context used
🧬 Code graph analysis (3)
src/cluster.h (1)
src/cluster.c (1)
clusterCommonInit(2137-2141)
src/cluster_legacy.c (1)
src/cluster.c (1)
countKeysInSlot(820-822)
src/server.c (2)
src/cluster_asm.c (2)
asmCron(2507-2576)asmBeforeSleep(2463-2505)src/cluster.c (1)
clusterCommonInit(2137-2141)
🪛 GitHub Actions: Codecov
src/cluster.c
[error] 2138-2138: malloc is deprecated. Function call triggers -Werror=deprecated-declarations; build failure during lcov step due to deprecated malloc usage in clusterCommonInit.
🪛 GitHub Actions: External Server Tests
src/cluster.h
[error] 1-1: Command failed: make -C src all (due to compilation error in cluster.o).
src/cluster_legacy.c
[error] 1-1: Command failed: make -C src all (due to compilation error in cluster.o).
src/cluster.c
[error] 2138-2138: ClusterCommonInit: malloc is deprecated (-Werror=deprecated-declarations). Use calloc or ensure proper usage to avoid deprecation warning treated as an error.
src/server.c
[error] 1-1: Command failed: make -C src all (due to compilation error in cluster.o).
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build-old-chain-jemalloc
- GitHub Check: Analyze (cpp)
- GitHub Check: build-old-chain-jemalloc
- GitHub Check: build-centos-jemalloc
🔇 Additional comments (5)
src/cluster.h (1)
89-92: LGTM!The new function declarations for
clusterCommonInit()andclusterClaimUnassignedSlots()follow the existing API pattern in the header and align with the implementations insrc/cluster.c.src/cluster.c (2)
2146-2169: LGTM!The logic correctly handles deletion of keys in slots that are no longer owned by this node, with appropriate early return for slave nodes and proper skip conditions for owned/importing slots.
1092-1095: BothclusterMigrationCommandandclusterSyncSlotsCommandare properly declared insrc/cluster_asm.h(lines 44-45) and implemented insrc/cluster_asm.c, making them correctly visible to the routing code in this file.src/server.c (2)
1657-1660: Verify ASM cron ordering against cluster state updates.
Line 1657 runsasmCron()beforeclusterCron(). If ASM cron logic depends on cluster state refreshed byclusterCron()(e.g., node health/slot ownership), this could act on stale data. Please confirm the ordering is intentional; otherwise swap or document the dependency.
7739-7741: Nice centralization of cluster init.
Line 7739 callsclusterCommonInit()beforeclusterInit(), which cleanly centralizes slot-stats/ASM setup prior to cluster startup. Looks good.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| /* This function is called after the node startup in order to check if there | ||
| * are any slots that we have keys for, but are assigned to no one. If so, | ||
| * we take ownership of them. */ | ||
| void clusterClaimUnassignedSlots(void) { | ||
| if (nodeIsSlave(myself)) return; | ||
|
|
||
| int update_config = 0; | ||
| for (int i = 0; i < CLUSTER_SLOTS; i++) { | ||
| /* Skip if: no keys, already has an owner, or we are importing it. */ | ||
| if (!countKeysInSlot(i) || | ||
| server.cluster->slots[i] != NULL || | ||
| server.cluster->importing_slots_from[i] != NULL) | ||
| { | ||
| continue; | ||
| } | ||
|
|
||
| /* If we are here data and cluster config don't agree, and we have | ||
| * slot 'i' populated even if we are not importing it, nor anyone else | ||
| * is assigned to it. Fix this condition by taking ownership. */ | ||
| update_config++; | ||
| serverLog(LL_NOTICE, "I have keys for unassigned slot %d. " | ||
| "Taking responsibility for it.", i); | ||
| clusterAddSlot(myself, i); | ||
| } | ||
| if (update_config) clusterSaveConfigOrDie(1); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
rg -n "clusterClaimUnassignedSlots" -C 5Repository: qodo-benchmark/redis-combined-coderabbit
Length of output: 1730
🏁 Script executed:
# Get more context around the call site to see which function it's in
sed -n '2190,2220p' src/cluster.c | cat -nRepository: qodo-benchmark/redis-combined-coderabbit
Length of output: 1266
🏁 Script executed:
# Find where verifyClusterConfigWithData is called
rg -n "verifyClusterConfigWithData" -C 3Repository: qodo-benchmark/redis-combined-coderabbit
Length of output: 1356
🏁 Script executed:
# Get context around line 7768 to identify the function
sed -n '7700,7780p' src/server.c | cat -nRepository: qodo-benchmark/redis-combined-coderabbit
Length of output: 3748
🏁 Script executed:
# Search for when state is broadcast after config changes in cluster mode
rg -n "clusterBeforeSleep|replicateCommandPropagate" -A 2 -B 2 | head -50Repository: qodo-benchmark/redis-combined-coderabbit
Length of output: 1344
🏁 Script executed:
# Find clusterSaveConfigOrDie implementation
rg -n "clusterSaveConfigOrDie" -A 10 | head -40Repository: qodo-benchmark/redis-combined-coderabbit
Length of output: 2648
🏁 Script executed:
# Check if clusterSaveConfigOrDie sets any broadcast flags
rg -n "todo_before_sleep" -B 2 -A 2 | grep -E "(todo_before_sleep|clusterSaveConfig)" | head -20Repository: qodo-benchmark/redis-combined-coderabbit
Length of output: 919
🏁 Script executed:
# Check clusterAddSlot and surrounding broadcast logic
rg -n "clusterAddSlot" -B 3 -A 8 | head -60Repository: qodo-benchmark/redis-combined-coderabbit
Length of output: 3994
🏁 Script executed:
# Check what happens in the caller after verifyClusterConfigWithData
sed -n '7760,7790p' src/server.c | cat -nRepository: qodo-benchmark/redis-combined-coderabbit
Length of output: 1584
🏁 Script executed:
# Check clusterBeforeSleep implementation for todo_before_sleep handling
sed -n '4942,5000p' src/cluster_legacy.c | cat -nRepository: qodo-benchmark/redis-combined-coderabbit
Length of output: 2538
🏁 Script executed:
# Search for what sets CLUSTER_TODO_SAVE_CONFIG or related flags
rg -n "CLUSTER_TODO" -B 1 | grep -E "(CLUSTER_TODO_|clusterDoBeforeSleep|clusterSaveConfig)" | head -30Repository: qodo-benchmark/redis-combined-coderabbit
Length of output: 2641
🏁 Script executed:
# Check if clusterClaimUnassignedSlots sets any todo flags
sed -n '5259,5284p' src/cluster_legacy.c | cat -nRepository: qodo-benchmark/redis-combined-coderabbit
Length of output: 1346
🏁 Script executed:
# Compare with similar slot claim/add patterns - look at line 2419-2426 and 4268-4279
sed -n '2415,2430p' src/cluster_legacy.c | cat -nRepository: qodo-benchmark/redis-combined-coderabbit
Length of output: 985
Add missing state and broadcast updates when claiming slots.
The function claims slots and saves config but lacks the state/broadcast updates that similar code performs. Compare with lines 2423–2425 and 4268–4279: after calling clusterAddSlot(), the code invokes clusterDoBeforeSleep() with CLUSTER_TODO_UPDATE_STATE and/or CLUSTER_TODO_BROADCAST_PONG flags. This ensures the cluster state is updated and peers are notified of the claim. Without these, peers remain unaware of the new slot assignments until the next periodic update.
🤖 Prompt for AI Agents
In `@src/cluster_legacy.c` around lines 5259 - 5284, The
clusterClaimUnassignedSlots function currently calls clusterAddSlot(myself, i)
and clusterSaveConfigOrDie but does not request the usual state/broadcast
updates; modify it so that after taking a slot you mark the need to update state
and notify peers by invoking clusterDoBeforeSleep() with
CLUSTER_TODO_UPDATE_STATE and CLUSTER_TODO_BROADCAST_PONG (or accumulate those
flags into update_config and call clusterDoBeforeSleep(update_flags) after the
loop), ensuring clusterAddSlot, clusterDoBeforeSleep, CLUSTER_TODO_UPDATE_STATE,
and CLUSTER_TODO_BROADCAST_PONG are used exactly as in the analogous code paths
so peers are informed of the new slot ownership before saving config.
| void clusterCommonInit(void) { | ||
| server.cluster_slot_stats = malloc(CLUSTER_SLOTS*sizeof(clusterSlotStat)); | ||
| resetClusterStats(); | ||
| asmInit(); | ||
| } |
There was a problem hiding this comment.
Use zcalloc instead of deprecated malloc — this is causing the build failure.
The pipeline fails with -Werror=deprecated-declarations because malloc is deprecated in this codebase. Redis uses zmalloc/zcalloc for memory allocation. Using zcalloc also zero-initializes the memory, which is appropriate before resetClusterStats().
🔧 Proposed fix
void clusterCommonInit(void) {
- server.cluster_slot_stats = malloc(CLUSTER_SLOTS*sizeof(clusterSlotStat));
+ server.cluster_slot_stats = zcalloc(CLUSTER_SLOTS*sizeof(clusterSlotStat));
resetClusterStats();
asmInit();
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| void clusterCommonInit(void) { | |
| server.cluster_slot_stats = malloc(CLUSTER_SLOTS*sizeof(clusterSlotStat)); | |
| resetClusterStats(); | |
| asmInit(); | |
| } | |
| void clusterCommonInit(void) { | |
| server.cluster_slot_stats = zcalloc(CLUSTER_SLOTS*sizeof(clusterSlotStat)); | |
| resetClusterStats(); | |
| asmInit(); | |
| } |
🧰 Tools
🪛 GitHub Actions: Codecov
[error] 2138-2138: malloc is deprecated. Function call triggers -Werror=deprecated-declarations; build failure during lcov step due to deprecated malloc usage in clusterCommonInit.
🪛 GitHub Actions: External Server Tests
[error] 2138-2138: ClusterCommonInit: malloc is deprecated (-Werror=deprecated-declarations). Use calloc or ensure proper usage to avoid deprecation warning treated as an error.
🤖 Prompt for AI Agents
In `@src/cluster.c` around lines 2137 - 2141, Replace the deprecated malloc call
in clusterCommonInit by allocating zeroed memory with zcalloc: allocate
server.cluster_slot_stats using zcalloc(CLUSTER_SLOTS, sizeof(clusterSlotStat))
instead of malloc(...), keeping the subsequent resetClusterStats() and asmInit()
calls unchanged; update any necessary includes/headers if zcalloc isn’t already
available in this compilation unit.
|
|
||
| /* This function is called after the node startup in order to verify that data | ||
| * loaded from disk is in agreement with the cluster configuration: | ||
| * | ||
| * 1) If we find keys about hash slots we have no responsibility for, the | ||
| * following happens: | ||
| * A) If no other node is in charge according to the current cluster | ||
| * configuration, we add these slots to our node. | ||
| * B) If according to our config other nodes are already in charge for | ||
| * this slots, we set the slots as IMPORTING from our point of view | ||
| * in order to justify we have those slots, and in order to make | ||
| * redis-cli aware of the issue, so that it can try to fix it. | ||
| * 2) If we find data in a DB different than DB0 we return C_ERR to | ||
| * signal the caller it should quit the server with an error message | ||
| * or take other actions. | ||
| * | ||
| * The function always returns C_OK even if it will try to correct | ||
| * the error described in "1". However if data is found in DB different | ||
| * from DB0, C_ERR is returned. | ||
| * | ||
| * The function also uses the logging facility in order to warn the user | ||
| * about desynchronizations between the data we have in memory and the | ||
| * cluster configuration. */ | ||
| int verifyClusterConfigWithData(void) { | ||
| /* Return ASAP if a module disabled cluster redirections. In that case | ||
| * every master can store keys about every possible hash slot. */ | ||
| if (server.cluster_module_flags & CLUSTER_MODULE_FLAG_NO_REDIRECTION) | ||
| return C_OK; | ||
|
|
||
| /* If this node is a slave, don't perform the check at all as we | ||
| * completely depend on the replication stream. */ | ||
| if (clusterNodeIsSlave(getMyClusterNode())) return C_OK; | ||
|
|
||
| /* Make sure we only have keys in DB0. */ | ||
| for (int i = 1; i < server.dbnum; i++) { | ||
| if (kvstoreSize(server.db[i].keys)) return C_ERR; | ||
| } | ||
|
|
||
| /* Delete keys in unowned slots */ | ||
| clusterDeleteKeysInUnownedSlots(); | ||
| /* Take over slots that we have keys for, but are assigned to no one. */ | ||
| clusterClaimUnassignedSlots(); | ||
| return C_OK; | ||
| } |
There was a problem hiding this comment.
Function comment at lines 2179-2182 is inconsistent with implementation.
The comment states that if other nodes are in charge of slots, "we set the slots as IMPORTING from our point of view." However, the implementation at line 2210 calls clusterDeleteKeysInUnownedSlots(), which deletes keys in those slots rather than setting IMPORTING state.
The comment in clusterDeleteKeysInUnownedSlots() explains this is safe with atomic slot migration, but the parent function's documentation should be updated to reflect this behavior change.
📝 Suggested comment update
/* This function is called after the node startup in order to verify that data
* loaded from disk is in agreement with the cluster configuration:
*
* 1) If we find keys about hash slots we have no responsibility for, the
* following happens:
* A) If no other node is in charge according to the current cluster
* configuration, we add these slots to our node.
- * B) If according to our config other nodes are already in charge for
- * this slots, we set the slots as IMPORTING from our point of view
- * in order to justify we have those slots, and in order to make
- * redis-cli aware of the issue, so that it can try to fix it.
+ * B) If according to our config other nodes are already in charge for
+ * these slots, we delete the keys in those slots. With atomic slot
+ * migration, it is safe to drop keys from slots that are not owned.
* 2) If we find data in a DB different than DB0 we return C_ERR to
* signal the caller it should quit the server with an error message
* or take other actions.🤖 Prompt for AI Agents
In `@src/cluster.c` around lines 2171 - 2214, The function comment for
verifyClusterConfigWithData is inaccurate about handling keys for slots owned by
other nodes: instead of marking such slots IMPORTING, the implementation calls
clusterDeleteKeysInUnownedSlots() which removes those keys (safe with atomic
slot migration), then calls clusterClaimUnassignedSlots() to take truly
unassigned slots; update the comment to reflect that behavior change by
describing deletion via clusterDeleteKeysInUnownedSlots() and claiming via
clusterClaimUnassignedSlots() rather than setting IMPORTING.
Benchmark PR from agentic-review-benchmarks#4
Summary by CodeRabbit
New Features
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.