
Commit 7a9ef29

Change the same shard failover assert to if condition to avoid crash (valkey-io#2431)
The assert was added in valkey-io#2301, and we found situations that trigger it and crash the server. The assert was added because, in the code:

1. sender_claimed_primary and sender are in the same shard,
2. sender is the old primary and sender_claimed_primary is the old replica,
3. and now sender has become a replica and sender_claimed_primary has become a primary.

That means a failover happened in the shard, and sender should be the primary of sender_claimed_primary. But this assumption can be wrong: we rely on shard_id to decide whether two nodes are in the same shard, and we assume a shard can have only one primary. From valkey-io#2279 we know there is a case where two primaries can be created in the same shard due to an untimely update of shard_id. So we can write a test that triggers the assert this way:

1. Precondition: two primaries in the same shard, one owning slots and one empty.
2. The replica performs a cluster failover.
3. The empty primary performs a cluster replicate with the replica (new primary).

We change the assert to an if condition to fix it.

Closes valkey-io#2423. Note that the test written here also exposes the issue in valkey-io#2441, so these two may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
1 parent 8131c2b commit 7a9ef29

File tree

2 files changed, +195 -2 lines changed


src/cluster_legacy.c

Lines changed: 2 additions & 2 deletions

```diff
@@ -3798,8 +3798,8 @@ int clusterProcessPacket(clusterLink *link) {
             /* This packet is stale so we avoid processing it anymore. Otherwise
              * this may cause a primary-replica chain issue. */
             return 1;
-        } else if (nodeIsReplica(sender_claimed_primary)) {
-            serverAssert(sender_claimed_primary->replicaof == sender);
+        } else if (nodeIsReplica(sender_claimed_primary) &&
+                   sender_claimed_primary->replicaof == sender) {
             /* A failover occurred in the shard where `sender` belongs to and `sender` is
              * no longer a primary. Update slot assignment to `sender_claimed_config_epoch`,
              * which is the new primary in the shard. */
```

tests/unit/cluster/manual-failover.tcl

Lines changed: 193 additions & 0 deletions

```diff
@@ -583,3 +583,196 @@ start_cluster 3 2 {tags {external:skip cluster}} {
         assert_equal [dict get [cluster_get_node_by_id 4 $R4_nodeid] slaveof] $R3_nodeid
     }
 }
+
+start_cluster 3 2 {tags {external:skip cluster}} {
+    # This test consists of two phases.
+    # In the first phase, we create a scenario where two primaries end up in the same shard. See #2279 for more details.
+    # In the second phase, we test the behavior of a node when packets arrive out of order. See #2301 for more details.
+    #
+    # The first phase.
+    # In the R0/R3/R4 shard, R0 is the primary (cluster-allow-replica-migration no), R3 is the replica, and R4 will become a replica later.
+    # 1. R0 goes down, and R3 triggers a failover and becomes the new primary.
+    # 2. R0 (old primary) stays down while R4 is added as a replica of R3 (new primary).
+    # 3. R3 (new primary) goes down, and R4 triggers a failover and becomes the new primary.
+    # 4. R0 (old primary) and R3 (old primary) come back up and start learning about the new topology.
+    # 5. R0 (old primary) comes up thinking it is still the primary, but has an older config epoch than R4 (new primary).
+    # 6. R0 (old primary) learns about R4 (new primary) as a new node via gossip and assigns it a random shard_id.
+    # 7. R0 (old primary) receives a direct ping from R4 (new primary).
+    #    a. R4 (new primary) advertises the same set of slots that R0 (old primary) used to own.
+    #    b. Since R0 (old primary) assigned a random shard_id to R4 (new primary) earlier, R0 (old primary) thinks
+    #       it is still a primary and that it lost all its slots to R4 (new primary), which is in another shard.
+    #       R0 (old primary) becomes an empty primary.
+    #    c. R0 (empty primary) then updates the actual shard_id of R4 (new primary) while processing the ping extensions.
+    # 8. R0 (empty primary) and R4 (new primary) end up being primaries in the same shard while R4 continues to own slots.
+    #
+    # The second phase.
+    # In the R0/R3/R4 shard, R4 is the primary, R3 is the replica, and R0 is an empty primary.
+    # 1. We perform a failover on R3, and a replicate on R0 to make R0 a replica of R3.
+    # 2. When R3 becomes the new primary, it broadcasts a message to all nodes in the cluster.
+    # 3. When R4 receives the message, it becomes the new replica and also broadcasts a message to all nodes in the cluster.
+    # 4. When R0 becomes a replica after the replication, it broadcasts a message to all nodes in the cluster.
+    # 5. Assume that R1 and R2 receive the messages from R0 and R4 first and the message from R3 (new primary) later.
+    # 6. R1 receives the message from R0 after the replication: R0 is a replica, and its primary is R3.
+    # 7. R2 receives the message from R4 after the failover: R4 is a replica, and its primary is R3.
+    test "Combined the test cases of #2279 and #2301 to test #2431" {
+        # ============== Phase 1 start ==============
+        R 0 config set cluster-allow-replica-migration no
+
+        set CLUSTER_PACKET_TYPE_NONE -1
+        set CLUSTER_PACKET_TYPE_ALL -2
+
+        # We make R4 become a fresh new node.
+        isolate_node 4
+
+        # Set debug flags on R0 so that no packets can be exchanged when we resume it.
+        R 0 debug disable-cluster-reconnection 1
+        R 0 debug close-cluster-link-on-packet-drop 1
+        R 0 debug drop-cluster-packet-filter $CLUSTER_PACKET_TYPE_ALL
+
+        # Pause R0 and wait for R3 to become a new primary.
+        pause_process [srv 0 pid]
+        R 3 cluster failover force
+        wait_for_condition 1000 50 {
+            [s -3 role] eq {master}
+        } else {
+            fail "Failed waiting for R3 to takeover primaryship"
+        }
+
+        # Add R4 and wait for R4 to become a replica of R3.
+        R 4 cluster meet [srv -3 host] [srv -3 port]
+        wait_for_condition 50 100 {
+            [cluster_get_node_by_id 4 [R 3 cluster myid]] != {}
+        } else {
+            fail "Node R4 never learned about node R3"
+        }
+        R 4 cluster replicate [R 3 cluster myid]
+        wait_for_sync [srv -4 client]
+
+        # Pause R3 and wait for R4 to become a new primary.
+        pause_process [srv -3 pid]
+        R 4 cluster failover takeover
+        wait_for_condition 1000 50 {
+            [s -4 role] eq {master}
+        } else {
+            fail "Failed waiting for R4 to become primary"
+        }
+
+        # Resume R0 and R3.
+        resume_process [srv 0 pid]
+        resume_process [srv -3 pid]
+
+        # Make sure R0 drops all its links so that it won't get the pending packets.
+        wait_for_condition 1000 50 {
+            [R 0 cluster links] eq {}
+        } else {
+            fail "Failed waiting for R0 to drop all cluster links"
+        }
+
+        # Un-debug R0 and start exchanging packets.
+        R 0 debug disable-cluster-reconnection 0
+        R 0 debug close-cluster-link-on-packet-drop 0
+        R 0 debug drop-cluster-packet-filter $CLUSTER_PACKET_TYPE_NONE
+
+        # ============== Phase 1 end ==============
+
+        wait_for_cluster_propagation
+
+        # ============== Phase 2 start ==============
+
+        set R0_nodeid [R 0 cluster myid]
+        set R1_nodeid [R 1 cluster myid]
+        set R2_nodeid [R 2 cluster myid]
+        set R3_nodeid [R 3 cluster myid]
+        set R4_nodeid [R 4 cluster myid]
+
+        set R0_shardid [R 0 cluster myshardid]
+        set R3_shardid [R 3 cluster myshardid]
+        set R4_shardid [R 4 cluster myshardid]
+
+        # R0 is now an empty primary, R4 is the primary, and R3 is the replica.
+        # They are all in the same shard; this may change with #2279, and
+        # the asserts can be removed then.
+        assert_equal [s 0 role] "master"
+        assert_equal [s -3 role] "slave"
+        assert_equal [s -4 role] "master"
+        assert_equal $R0_shardid $R3_shardid
+        assert_equal $R0_shardid $R4_shardid
+
+        # Ensure that the related nodes do not reconnect after we kill the cluster links.
+        R 1 debug disable-cluster-reconnection 1
+        R 2 debug disable-cluster-reconnection 1
+        R 3 debug disable-cluster-reconnection 1
+        R 4 debug disable-cluster-reconnection 1
+
+        # R3 performs the failover, and R0 replicates R3.
+        # R3 becomes the new primary after the failover.
+        # R4 becomes a replica after the failover.
+        # R0 becomes a replica after the replicate.
+        # Before we do that, kill the cluster links to create the test conditions.
+        # Ensure that R1 and R2 in the other shards do not receive packets from R3 (new primary),
+        # but receive packets from R0 and R4 respectively first.
+
+        # R1 first receives the packet from R0.
+        # Kill the cluster links between R1 and R3, and between R1 and R4, to ensure that:
+        # R1 can not receive messages from R3 (new primary),
+        # R1 can not receive messages from R4 (replica),
+        # and R1 can receive messages from R0 (new replica).
+        R 1 debug clusterlink kill all $R3_nodeid
+        R 3 debug clusterlink kill all $R1_nodeid
+        R 1 debug clusterlink kill all $R4_nodeid
+        R 4 debug clusterlink kill all $R1_nodeid
+
+        # R2 first receives the packet from R4.
+        # Kill the cluster links between R2 and R3, and between R2 and R0, to ensure that:
+        # R2 can not receive messages from R3 (new primary),
+        # R2 can not receive messages from R0 (new replica),
+        # and R2 can receive messages from R4 (replica).
+        R 2 debug clusterlink kill all $R3_nodeid
+        R 3 debug clusterlink kill all $R2_nodeid
+        R 2 debug clusterlink kill all $R0_nodeid
+        R 0 debug clusterlink kill all $R2_nodeid
+
+        # R3 performs the failover, and R0 replicates R3.
+        R 3 cluster failover takeover
+        wait_for_condition 1000 10 {
+            [cluster_has_flag [cluster_get_node_by_id 0 $R3_nodeid] master] eq 1
+        } else {
+            fail "R3 does not become the primary node"
+        }
+        R 0 cluster replicate $R3_nodeid
+
+        # Check that from the perspectives of R1 and R2, when they first receive a
+        # replica's packet, they correctly fix the sender's and its primary's roles.
+
+        # Check that from the perspective of R1, when receiving the packet from R0,
+        # R0 is a replica and its primary is R3; this is due to the replicate.
+        wait_for_condition 1000 10 {
+            [cluster_has_flag [cluster_get_node_by_id 1 $R0_nodeid] slave] eq 1 &&
+            [cluster_has_flag [cluster_get_node_by_id 1 $R3_nodeid] master] eq 1
+        } else {
+            puts "R1 cluster nodes:"
+            puts [R 1 cluster nodes]
+            fail "The node is not marked with the correct flag in R1's view"
+        }
+
+        # Check that from the perspective of R2, when receiving the packet from R4,
+        # R4 is a replica and its primary is R3; this is due to the failover.
+        wait_for_condition 1000 10 {
+            [cluster_has_flag [cluster_get_node_by_id 2 $R4_nodeid] slave] eq 1 &&
+            [cluster_has_flag [cluster_get_node_by_id 2 $R3_nodeid] master] eq 1
+        } else {
+            puts "R2 cluster nodes:"
+            puts [R 2 cluster nodes]
+            fail "The node is not marked with the correct flag in R2's view"
+        }
+
+        # ============== Phase 2 end ==============
+
+        R 0 debug disable-cluster-reconnection 0
+        R 1 debug disable-cluster-reconnection 0
+        R 2 debug disable-cluster-reconnection 0
+        R 3 debug disable-cluster-reconnection 0
+        R 4 debug disable-cluster-reconnection 0
+        wait_for_cluster_propagation
+    }
+}
```
