Skip to content

Refactor some of ASM and slot-stats functions#26

Open
tomerqodo wants to merge 3 commits intocoderabbit_full_base_refactor_some_of_asm_and_slot-stats_functions_pr4from
coderabbit_full_head_refactor_some_of_asm_and_slot-stats_functions_pr4
Open

Refactor some of ASM and slot-stats functions#26
tomerqodo wants to merge 3 commits intocoderabbit_full_base_refactor_some_of_asm_and_slot-stats_functions_pr4from
coderabbit_full_head_refactor_some_of_asm_and_slot-stats_functions_pr4

Conversation

@tomerqodo
Copy link
Copy Markdown

@tomerqodo tomerqodo commented Jan 29, 2026

Benchmark PR from agentic-review-benchmarks#4

Summary by CodeRabbit

  • New Features

    • Added cluster migration and slot synchronization commands for improved cluster management.
    • Enhanced cluster initialization with improved data consistency verification and slot ownership handling.
  • Refactor

    • Optimized cluster startup and maintenance sequences.
    • Improved slot assignment and recovery mechanisms.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jan 29, 2026

Walkthrough

This PR refactors cluster initialization and lifecycle management by introducing clusterCommonInit() to centralize slot-state setup, relocating ASM lifecycle hooks from cluster_legacy.c to server.c's cron/before-sleep routines, and replacing verification logic with explicit unassigned slot claiming. Command routing for "migration" and "syncslots" is consolidated in cluster.c.

Changes

Cohort / File(s) Summary
Cluster Common Initialization
src/cluster.c, src/cluster.h, src/cluster_legacy.c, src/server.c
Introduced clusterCommonInit() to centralize shared cluster data and slot-state initialization. Removed legacy resetClusterStats() calls and relocated initialization flow from cluster_legacy.c to server.c, replacing direct cluster_slot_stats allocation with the new common init routine.
Unassigned Slot Management
src/cluster.c, src/cluster_legacy.c
Replaced verifyClusterConfigWithData() with clusterClaimUnassignedSlots() for scanning and claiming slots with keys but no owner. Added clusterDeleteKeysInUnownedSlots() as an internal helper to purge keys in non-owned slots during initialization.
ASM Lifecycle Reorganization
src/cluster_legacy.c, src/server.c
Removed asmInit(), asmCron(), and asmBeforeSleep() calls from cluster lifecycle functions. Reintegrated asmCron() and asmBeforeSleep() into server.c's main cron loop and before-sleep hook, changing invocation order to prioritize ASM tasks.
Cluster Command Handlers
src/cluster.c, src/cluster_legacy.c
Added "migration" and "syncslots" subcommand routing in clusterCommand(). Removed legacy command-handling blocks for these subcommands from clusterCommandSpecial() in cluster_legacy.c, consolidating command dispatch.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Slots now claim themselves with pride,
Init flows reorganized far and wide,
ASM hops to server's eager hands,
Cluster harmony across the lands!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 58.82% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main refactoring work: moving ASM and slot-stats initialization logic from legacy cluster code to centralized functions, and restructuring cron/sleep hooks across multiple files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/cluster_legacy.c`:
- Around line 5259-5284: The clusterClaimUnassignedSlots function currently
calls clusterAddSlot(myself, i) and clusterSaveConfigOrDie but does not request
the usual state/broadcast updates; modify it so that after taking a slot you
mark the need to update state and notify peers by invoking
clusterDoBeforeSleep() with CLUSTER_TODO_UPDATE_STATE and
CLUSTER_TODO_BROADCAST_PONG (or accumulate those flags into update_config and
call clusterDoBeforeSleep(update_flags) after the loop), ensuring
clusterAddSlot, clusterDoBeforeSleep, CLUSTER_TODO_UPDATE_STATE, and
CLUSTER_TODO_BROADCAST_PONG are used exactly as in the analogous code paths so
peers are informed of the new slot ownership before saving config.

In `@src/cluster.c`:
- Around line 2137-2141: Replace the deprecated malloc call in clusterCommonInit
by allocating zeroed memory with zcalloc: allocate server.cluster_slot_stats
using zcalloc(CLUSTER_SLOTS, sizeof(clusterSlotStat)) instead of malloc(...),
keeping the subsequent resetClusterStats() and asmInit() calls unchanged; update
any necessary includes/headers if zcalloc isn’t already available in this
compilation unit.
- Around line 2171-2214: The function comment for verifyClusterConfigWithData is
inaccurate about handling keys for slots owned by other nodes: instead of
marking such slots IMPORTING, the implementation calls
clusterDeleteKeysInUnownedSlots() which removes those keys (safe with atomic
slot migration), then calls clusterClaimUnassignedSlots() to take truly
unassigned slots; update the comment to reflect that behavior change by
describing deletion via clusterDeleteKeysInUnownedSlots() and claiming via
clusterClaimUnassignedSlots() rather than setting IMPORTING.
🧹 Nitpick comments (1)
src/server.c (1)

1841-1844: Consider documenting the ASM vs. cluster before-sleep order.
Line 1841 now calls clusterBeforeSleep() then asmBeforeSleep(), while cron runs ASM first. If this asymmetry is intentional, a short comment would help prevent regressions; if not, align the order.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c52fabf and 865827b.

📒 Files selected for processing (4)
  • src/cluster.c
  • src/cluster.h
  • src/cluster_legacy.c
  • src/server.c
🧰 Additional context used
🧬 Code graph analysis (3)
src/cluster.h (1)
src/cluster.c (1)
  • clusterCommonInit (2137-2141)
src/cluster_legacy.c (1)
src/cluster.c (1)
  • countKeysInSlot (820-822)
src/server.c (2)
src/cluster_asm.c (2)
  • asmCron (2507-2576)
  • asmBeforeSleep (2463-2505)
src/cluster.c (1)
  • clusterCommonInit (2137-2141)
🪛 GitHub Actions: Codecov
src/cluster.c

[error] 2138-2138: malloc is deprecated. Function call triggers -Werror=deprecated-declarations; build failure during lcov step due to deprecated malloc usage in clusterCommonInit.

🪛 GitHub Actions: External Server Tests
src/cluster.h

[error] 1-1: Command failed: make -C src all (due to compilation error in cluster.o).

src/cluster_legacy.c

[error] 1-1: Command failed: make -C src all (due to compilation error in cluster.o).

src/cluster.c

[error] 2138-2138: ClusterCommonInit: malloc is deprecated (-Werror=deprecated-declarations). Use calloc or ensure proper usage to avoid deprecation warning treated as an error.

src/server.c

[error] 1-1: Command failed: make -C src all (due to compilation error in cluster.o).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build-old-chain-jemalloc
  • GitHub Check: Analyze (cpp)
  • GitHub Check: build-old-chain-jemalloc
  • GitHub Check: build-centos-jemalloc
🔇 Additional comments (5)
src/cluster.h (1)

89-92: LGTM!

The new function declarations for clusterCommonInit() and clusterClaimUnassignedSlots() follow the existing API pattern in the header and align with the implementations in src/cluster.c.

src/cluster.c (2)

2146-2169: LGTM!

The logic correctly handles deletion of keys in slots that are no longer owned by this node, with appropriate early return for slave nodes and proper skip conditions for owned/importing slots.


1092-1095: Both clusterMigrationCommand and clusterSyncSlotsCommand are properly declared in src/cluster_asm.h (lines 44-45) and implemented in src/cluster_asm.c, making them correctly visible to the routing code in this file.

src/server.c (2)

1657-1660: Verify ASM cron ordering against cluster state updates.
Line 1657 runs asmCron() before clusterCron(). If ASM cron logic depends on cluster state refreshed by clusterCron() (e.g., node health/slot ownership), this could act on stale data. Please confirm the ordering is intentional; otherwise swap or document the dependency.


7739-7741: Nice centralization of cluster init.
Line 7739 calls clusterCommonInit() before clusterInit(), which cleanly centralizes slot-stats/ASM setup prior to cluster startup. Looks good.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment thread src/cluster_legacy.c
Comment on lines +5259 to +5284
/* This function is called after the node startup in order to check if there
* are any slots that we have keys for, but are assigned to no one. If so,
* we take ownership of them. */
void clusterClaimUnassignedSlots(void) {
if (nodeIsSlave(myself)) return;

int update_config = 0;
for (int i = 0; i < CLUSTER_SLOTS; i++) {
/* Skip if: no keys, already has an owner, or we are importing it. */
if (!countKeysInSlot(i) ||
server.cluster->slots[i] != NULL ||
server.cluster->importing_slots_from[i] != NULL)
{
continue;
}

/* If we are here data and cluster config don't agree, and we have
* slot 'i' populated even if we are not importing it, nor anyone else
* is assigned to it. Fix this condition by taking ownership. */
update_config++;
serverLog(LL_NOTICE, "I have keys for unassigned slot %d. "
"Taking responsibility for it.", i);
clusterAddSlot(myself, i);
}
if (update_config) clusterSaveConfigOrDie(1);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

rg -n "clusterClaimUnassignedSlots" -C 5

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 1730


🏁 Script executed:

# Get more context around the call site to see which function it's in
sed -n '2190,2220p' src/cluster.c | cat -n

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 1266


🏁 Script executed:

# Find where verifyClusterConfigWithData is called
rg -n "verifyClusterConfigWithData" -C 3

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 1356


🏁 Script executed:

# Get context around line 7768 to identify the function
sed -n '7700,7780p' src/server.c | cat -n

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 3748


🏁 Script executed:

# Search for when state is broadcast after config changes in cluster mode
rg -n "clusterBeforeSleep|replicateCommandPropagate" -A 2 -B 2 | head -50

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 1344


🏁 Script executed:

# Find clusterSaveConfigOrDie implementation
rg -n "clusterSaveConfigOrDie" -A 10 | head -40

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 2648


🏁 Script executed:

# Check if clusterSaveConfigOrDie sets any broadcast flags
rg -n "todo_before_sleep" -B 2 -A 2 | grep -E "(todo_before_sleep|clusterSaveConfig)" | head -20

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 919


🏁 Script executed:

# Check clusterAddSlot and surrounding broadcast logic
rg -n "clusterAddSlot" -B 3 -A 8 | head -60

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 3994


🏁 Script executed:

# Check what happens in the caller after verifyClusterConfigWithData
sed -n '7760,7790p' src/server.c | cat -n

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 1584


🏁 Script executed:

# Check clusterBeforeSleep implementation for todo_before_sleep handling
sed -n '4942,5000p' src/cluster_legacy.c | cat -n

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 2538


🏁 Script executed:

# Search for what sets CLUSTER_TODO_SAVE_CONFIG or related flags
rg -n "CLUSTER_TODO" -B 1 | grep -E "(CLUSTER_TODO_|clusterDoBeforeSleep|clusterSaveConfig)" | head -30

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 2641


🏁 Script executed:

# Check if clusterClaimUnassignedSlots sets any todo flags
sed -n '5259,5284p' src/cluster_legacy.c | cat -n

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 1346


🏁 Script executed:

# Compare with similar slot claim/add patterns - look at line 2419-2426 and 4268-4279
sed -n '2415,2430p' src/cluster_legacy.c | cat -n

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 985


Add missing state and broadcast updates when claiming slots.

The function claims slots and saves config but lacks the state/broadcast updates that similar code performs. Compare with lines 2423–2425 and 4268–4279: after calling clusterAddSlot(), the code invokes clusterDoBeforeSleep() with CLUSTER_TODO_UPDATE_STATE and/or CLUSTER_TODO_BROADCAST_PONG flags. This ensures the cluster state is updated and peers are notified of the claim. Without these, peers remain unaware of the new slot assignments until the next periodic update.

🤖 Prompt for AI Agents
In `@src/cluster_legacy.c` around lines 5259 - 5284, The
clusterClaimUnassignedSlots function currently calls clusterAddSlot(myself, i)
and clusterSaveConfigOrDie but does not request the usual state/broadcast
updates; modify it so that after taking a slot you mark the need to update state
and notify peers by invoking clusterDoBeforeSleep() with
CLUSTER_TODO_UPDATE_STATE and CLUSTER_TODO_BROADCAST_PONG (or accumulate those
flags into update_config and call clusterDoBeforeSleep(update_flags) after the
loop), ensuring clusterAddSlot, clusterDoBeforeSleep, CLUSTER_TODO_UPDATE_STATE,
and CLUSTER_TODO_BROADCAST_PONG are used exactly as in the analogous code paths
so peers are informed of the new slot ownership before saving config.

Comment thread src/cluster.c
Comment on lines +2137 to +2141
void clusterCommonInit(void) {
server.cluster_slot_stats = malloc(CLUSTER_SLOTS*sizeof(clusterSlotStat));
resetClusterStats();
asmInit();
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Use zcalloc instead of deprecated malloc — this is causing the build failure.

The pipeline fails with -Werror=deprecated-declarations because malloc is deprecated in this codebase. Redis uses zmalloc/zcalloc for memory allocation. Using zcalloc also zero-initializes the memory, which is appropriate before resetClusterStats().

🔧 Proposed fix
 void clusterCommonInit(void) {
-    server.cluster_slot_stats = malloc(CLUSTER_SLOTS*sizeof(clusterSlotStat));
+    server.cluster_slot_stats = zcalloc(CLUSTER_SLOTS*sizeof(clusterSlotStat));
     resetClusterStats();
     asmInit();
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
void clusterCommonInit(void) {
server.cluster_slot_stats = malloc(CLUSTER_SLOTS*sizeof(clusterSlotStat));
resetClusterStats();
asmInit();
}
void clusterCommonInit(void) {
server.cluster_slot_stats = zcalloc(CLUSTER_SLOTS*sizeof(clusterSlotStat));
resetClusterStats();
asmInit();
}
🧰 Tools
🪛 GitHub Actions: Codecov

[error] 2138-2138: malloc is deprecated. Function call triggers -Werror=deprecated-declarations; build failure during lcov step due to deprecated malloc usage in clusterCommonInit.

🪛 GitHub Actions: External Server Tests

[error] 2138-2138: ClusterCommonInit: malloc is deprecated (-Werror=deprecated-declarations). Use calloc or ensure proper usage to avoid deprecation warning treated as an error.

🤖 Prompt for AI Agents
In `@src/cluster.c` around lines 2137 - 2141, Replace the deprecated malloc call
in clusterCommonInit by allocating zeroed memory with zcalloc: allocate
server.cluster_slot_stats using zcalloc(CLUSTER_SLOTS, sizeof(clusterSlotStat))
instead of malloc(...), keeping the subsequent resetClusterStats() and asmInit()
calls unchanged; update any necessary includes/headers if zcalloc isn’t already
available in this compilation unit.

Comment thread src/cluster.c
Comment on lines +2171 to +2214

/* This function is called after the node startup in order to verify that data
* loaded from disk is in agreement with the cluster configuration:
*
* 1) If we find keys about hash slots we have no responsibility for, the
* following happens:
* A) If no other node is in charge according to the current cluster
* configuration, we add these slots to our node.
* B) If according to our config other nodes are already in charge for
* this slots, we set the slots as IMPORTING from our point of view
* in order to justify we have those slots, and in order to make
* redis-cli aware of the issue, so that it can try to fix it.
* 2) If we find data in a DB different than DB0 we return C_ERR to
* signal the caller it should quit the server with an error message
* or take other actions.
*
* The function always returns C_OK even if it will try to correct
* the error described in "1". However if data is found in DB different
* from DB0, C_ERR is returned.
*
* The function also uses the logging facility in order to warn the user
* about desynchronizations between the data we have in memory and the
* cluster configuration. */
int verifyClusterConfigWithData(void) {
/* Return ASAP if a module disabled cluster redirections. In that case
* every master can store keys about every possible hash slot. */
if (server.cluster_module_flags & CLUSTER_MODULE_FLAG_NO_REDIRECTION)
return C_OK;

/* If this node is a slave, don't perform the check at all as we
* completely depend on the replication stream. */
if (clusterNodeIsSlave(getMyClusterNode())) return C_OK;

/* Make sure we only have keys in DB0. */
for (int i = 1; i < server.dbnum; i++) {
if (kvstoreSize(server.db[i].keys)) return C_ERR;
}

/* Delete keys in unowned slots */
clusterDeleteKeysInUnownedSlots();
/* Take over slots that we have keys for, but are assigned to no one. */
clusterClaimUnassignedSlots();
return C_OK;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Function comment at lines 2179-2182 is inconsistent with implementation.

The comment states that if other nodes are in charge of slots, "we set the slots as IMPORTING from our point of view." However, the implementation at line 2210 calls clusterDeleteKeysInUnownedSlots(), which deletes keys in those slots rather than setting IMPORTING state.

The comment in clusterDeleteKeysInUnownedSlots() explains this is safe with atomic slot migration, but the parent function's documentation should be updated to reflect this behavior change.

📝 Suggested comment update
 /* This function is called after the node startup in order to verify that data
  * loaded from disk is in agreement with the cluster configuration:
  *
  * 1) If we find keys about hash slots we have no responsibility for, the
  *    following happens:
  *    A) If no other node is in charge according to the current cluster
  *       configuration, we add these slots to our node.
- *    B) If according to our config other nodes are already in charge for
- *       this slots, we set the slots as IMPORTING from our point of view
- *       in order to justify we have those slots, and in order to make
- *       redis-cli aware of the issue, so that it can try to fix it.
+ *    B) If according to our config other nodes are already in charge for
+ *       these slots, we delete the keys in those slots. With atomic slot
+ *       migration, it is safe to drop keys from slots that are not owned.
  * 2) If we find data in a DB different than DB0 we return C_ERR to
  *    signal the caller it should quit the server with an error message
  *    or take other actions.
🤖 Prompt for AI Agents
In `@src/cluster.c` around lines 2171 - 2214, The function comment for
verifyClusterConfigWithData is inaccurate about handling keys for slots owned by
other nodes: instead of marking such slots IMPORTING, the implementation calls
clusterDeleteKeysInUnownedSlots() which removes those keys (safe with atomic
slot migration), then calls clusterClaimUnassignedSlots() to take truly
unassigned slots; update the comment to reflect that behavior change by
describing deletion via clusterDeleteKeysInUnownedSlots() and claiming via
clusterClaimUnassignedSlots() rather than setting IMPORTING.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants