KAFKA-19048: Minimal movement replica balancing algorithm for reassignment #19297
The new replica rebalancing strategy aims to achieve the following objectives: 1. Minimal Movement: Minimize the number of replica relocations during rebalancing. 2. Replica Balancing: Ensure that replicas are evenly distributed across brokers. 3. Anti-Affinity Support: Support rack-aware allocation when enabled. 4. Leader Balancing: Distribute leader replicas evenly across brokers. 5. ISR Order Optimization: Optimize adjacency relationships to prevent failover traffic concentration in case of broker failures. 6. Leader Stability: Keep the original partition leader unchanged as much as possible to minimize leader transitions. This objective has a lower priority than the first five.
# Conflicts: # tools/src/main/java/org/apache/kafka/tools/reassign/ReassignPartitionsCommand.java
Motivation
Kafka clusters typically require rebalancing of topic replicas after horizontal scaling to evenly distribute the load across new and existing brokers. The current rebalancing approach does not consider the existing replica distribution, often resulting in excessive and unnecessary replica movements. These unnecessary movements increase rebalance duration, consume significant bandwidth and CPU resources, and potentially disrupt ongoing production and consumption operations. Thus, a replica rebalancing strategy that minimizes movements while achieving an even distribution of replicas is necessary.
Goals
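The new replica rebalancing strategy aims to achieve the following objectives:
1. Minimal Movement: Minimize the number of replica relocations during rebalancing.
2. Replica Balancing: Ensure that replicas are evenly distributed across brokers.
3. Anti-Affinity Support: Support rack-aware allocation when enabled.
4. Leader Balancing: Distribute leader replicas evenly across brokers.
5. ISR Order Optimization: Optimize adjacency relationships to prevent failover traffic concentration in case of broker failures.
6. Leader Stability: Keep the original partition leader unchanged as much as possible to minimize leader transitions. This objective has a lower priority than the first five.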
Proposed Changes
Rack-Level Replica Distribution
The following rules ensure balanced replica allocation at the rack level:
- **rackCount = replicationFactor**: each rack receives exactly partitionCount replicas.
- **rackCount > replicationFactor**:
  - Racks whose weighted share (rackBrokers/totalBrokers × totalReplicas) is ≥ partitionCount each receive exactly partitionCount replicas.
  - Racks whose weighted share is < partitionCount receive the remaining replicas through a weighted remainder allocation.
Node-Level Replica Distribution
- **rackCount = replicationFactor**:
- **rackCount > replicationFactor**:
Anti-Affinity Support
When anti-affinity is enabled, the rebalance algorithm ensures that replicas of the same partition do not colocate on the same rack. Brokers without rack configuration are excluded from anti-affinity checks.
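Because anti-affinity allows at most one replica of a partition per rack, a rack can hold at most partitionCount replicas, which is the cap used in the rack-level rules above. The following is a minimal sketch of those rules, not the PR's implementation. The class and method names are hypothetical, and the "weighted remainder allocation" is assumed here to be a largest-remainder scheme: each rack first gets the floor of its weighted share (capped at partitionCount), and leftover replicas go one at a time to uncapped racks in order of largest fractional remainder.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RackQuotaSketch {
    /** brokersPerRack maps a rack id to the number of brokers in that rack. */
    static Map<String, Integer> rackQuotas(Map<String, Integer> brokersPerRack,
                                           int partitionCount, int replicationFactor) {
        if (brokersPerRack.size() < replicationFactor) {
            throw new IllegalArgumentException("fewer racks than replicationFactor");
        }
        int totalReplicas = partitionCount * replicationFactor;
        int totalBrokers = brokersPerRack.values().stream().mapToInt(Integer::intValue).sum();

        Map<String, Integer> quota = new TreeMap<>();
        Map<String, Double> remainder = new TreeMap<>();
        int assigned = 0;
        for (Map.Entry<String, Integer> e : brokersPerRack.entrySet()) {
            // Weighted share: rackBrokers / totalBrokers x totalReplicas.
            double share = e.getValue().doubleValue() / totalBrokers * totalReplicas;
            // A rack never holds more than partitionCount replicas.
            int q = (int) Math.min(partitionCount, Math.floor(share));
            quota.put(e.getKey(), q);
            remainder.put(e.getKey(), share - q);
            assigned += q;
        }

        // Assumed remainder rule: largest remainder first, rack id as tie-break.
        List<String> order = new ArrayList<>(quota.keySet());
        order.sort(Comparator.comparingDouble((String r) -> -remainder.get(r))
                .thenComparing(Comparator.<String>naturalOrder()));
        while (assigned < totalReplicas) {
            for (String rack : order) {
                if (assigned == totalReplicas) {
                    break;
                }
                if (quota.get(rack) < partitionCount) {
                    quota.merge(rack, 1, Integer::sum);
                    assigned++;
                }
            }
        }
        return quota;
    }
}
```

With racks A, B, C, D holding 5, 1, 1 and 1 brokers, 4 partitions and replication factor 2, rack A is capped at 4 replicas and the remaining 4 replicas are spread over B, C and D. When rackCount equals replicationFactor, every rack ends up with exactly partitionCount replicas regardless of its size.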
Replica Balancing Algorithm
Through the above steps, we can calculate the ideal replica count for each node and rack.
From the topics' initial replica distribution, we derive the current replica allocation across nodes and racks, which lets us identify the nodes that violate anti-affinity rules.
We iterate through nodes with the following priority:
For these identified nodes, we relocate their replicas to target nodes that:
- Satisfy all anti-affinity constraints
- Have a current replica count below their ideal allocation
This process continues iteratively until:
- No nodes violate anti-affinity rules
- All nodes' current replica counts match their desired replica counts
Upon satisfying these conditions, we achieve balanced replica distribution across nodes.
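The relocation step can be sketched as a single greedy pass. This is an illustration under stated assumptions rather than the PR's code: the class and method names are hypothetical, the ideal per-broker counts are taken as given input, a replica is moved only if its broker is above its ideal count or shares a rack with another replica of the same partition, and the target is the eligible broker with the largest deficit. A single pass may not reach full balance on every input; the text above describes repeating the process until both conditions hold.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class ReplicaBalanceSketch {
    /** Mutates the (mutable) replica lists in place and returns the number of moves. */
    static int rebalance(Map<Integer, List<Integer>> assignment,
                         Map<Integer, String> rackOf,
                         Map<Integer, Integer> ideal) {
        Map<Integer, Integer> count = new HashMap<>();
        for (int broker : ideal.keySet()) {
            count.put(broker, 0);
        }
        for (List<Integer> replicas : assignment.values()) {
            for (int broker : replicas) {
                count.merge(broker, 1, Integer::sum);
            }
        }
        int moves = 0;
        for (List<Integer> replicas : assignment.values()) {
            for (int i = 0; i < replicas.size(); i++) {
                int source = replicas.get(i);
                boolean overloaded = count.get(source) > ideal.getOrDefault(source, 0);
                boolean violates = rackConflict(replicas, i, source, rackOf);
                if (!overloaded && !violates) {
                    continue; // leave the replica where it is: minimal movement
                }
                Integer target = pickTarget(replicas, i, rackOf, ideal, count);
                if (target == null) {
                    continue; // no eligible target in this pass
                }
                replicas.set(i, target);
                count.merge(source, -1, Integer::sum);
                count.merge(target, 1, Integer::sum);
                moves++;
            }
        }
        return moves;
    }

    /** True if 'broker' shares a rack with another replica; rack-less brokers never conflict. */
    private static boolean rackConflict(List<Integer> replicas, int skipIndex,
                                        int broker, Map<Integer, String> rackOf) {
        String rack = rackOf.get(broker);
        if (rack == null) {
            return false;
        }
        for (int j = 0; j < replicas.size(); j++) {
            if (j != skipIndex && rack.equals(rackOf.get(replicas.get(j)))) {
                return true;
            }
        }
        return false;
    }

    /** Eligible broker with the largest (ideal - current) deficit, lowest id on ties. */
    private static Integer pickTarget(List<Integer> replicas, int index,
                                      Map<Integer, String> rackOf,
                                      Map<Integer, Integer> ideal,
                                      Map<Integer, Integer> count) {
        Integer best = null;
        int bestDeficit = 0;
        for (int candidate : new TreeSet<>(ideal.keySet())) {
            int deficit = ideal.get(candidate) - count.get(candidate);
            if (deficit <= bestDeficit || replicas.contains(candidate)
                    || rackConflict(replicas, index, candidate, rackOf)) {
                continue;
            }
            best = candidate;
            bestDeficit = deficit;
        }
        return best;
    }
}
```

For example, with four partitions all placed on brokers 1 and 2 and two new brokers 3 and 4 added to the same two racks, this pass moves four replicas and leaves the other four in place.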
Leader Balancing Algorithm
Target Leader Calculation:
1. Compute baseline average: leader_avg = total_partitions / total_nodes
2. Identify brokers where replica_count ≤ leader_avg:
   - Designate all replicas as leaders on these brokers
   - Subtract allocated leaders: remaining_partitions -= assigned_leaders
   - Exclude nodes: remaining_brokers -= processed_brokers
3. Iteratively recalculate leader_avg until the nodes with the fewest replicas satisfy replica_count ≥ leader_avg
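The target calculation above can be sketched as a short loop. This is a hedged illustration, not the PR's code: the class and method names are hypothetical, and the targets are kept as fractional values, whereas a real implementation would have to round them to whole leaders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class LeaderTargetSketch {
    /** replicaCount maps a broker id to the number of replicas it hosts. */
    static Map<Integer, Double> targetLeaders(Map<Integer, Integer> replicaCount,
                                              int totalPartitions) {
        Map<Integer, Double> target = new TreeMap<>();
        Set<Integer> remainingBrokers = new TreeSet<>(replicaCount.keySet());
        double remainingPartitions = totalPartitions;

        while (!remainingBrokers.isEmpty()) {
            double leaderAvg = remainingPartitions / remainingBrokers.size();

            // "Light" brokers cannot reach the average even if they lead every replica they host.
            List<Integer> light = new ArrayList<>();
            for (int broker : remainingBrokers) {
                if (replicaCount.get(broker) <= leaderAvg) {
                    light.add(broker);
                }
            }
            if (light.isEmpty()) {
                // Every remaining broker hosts at least leaderAvg replicas: stop here.
                for (int broker : remainingBrokers) {
                    target.put(broker, leaderAvg);
                }
                break;
            }
            for (int broker : light) {
                // All replicas on a light broker become leaders.
                target.put(broker, replicaCount.get(broker).doubleValue());
                remainingPartitions -= replicaCount.get(broker);
            }
            remainingBrokers.removeAll(light);
        }
        return target;
    }
}
```

For replica counts of 2, 10 and 12 on three brokers and 12 partitions, the first average is 4, so the broker with 2 replicas is light and gets a target of 2; the other two brokers then share the remaining 10 partitions with a target of 5 each.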
Leader Assignment Constraints:
1. Final targets:
   - Light brokers: target_leaders = replica_count
   - Normal brokers: target_leaders = leader_avg
2. For each partition, select the broker with the largest difference between its target_leaders and current leader count to become that partition's leader. Upon completing this traversal, we achieve uniform leader distribution across all brokers.
Optimizing ISR Order
The leader of each partition is fixed during leader rebalancing and does not change while the ISR order is optimized.
Tracking Node Pair Frequencies:
- Iterate through all partitions and record the first replica (which is the leader).
- Track how often each broker pair, formed by the first and second replicas of a partition, occurs.
Optimized Selection of Subsequent Replicas:
- For each partition, when selecting the second replica, choose a broker that forms the least frequent node pair with the first replica.
- Continue this process iteratively for all replicas in the partition.
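The ordering step can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and it assumes that "iteratively" means each later position is chosen by the same rule against the replica immediately before it, with frequencies counted for ordered adjacent pairs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IsrOrderSketch {
    /** Reorders the followers of each (mutable, non-empty) replica list in place; index 0 stays the leader. */
    static void optimize(Map<Integer, List<Integer>> assignment) {
        Map<String, Integer> pairCount = new HashMap<>();
        for (List<Integer> replicas : assignment.values()) {
            // Followers still waiting for a position in this partition.
            List<Integer> pool = new ArrayList<>(replicas.subList(1, replicas.size()));
            int prev = replicas.get(0); // the leader is fixed
            for (int pos = 1; pos < replicas.size(); pos++) {
                Integer best = null;
                for (Integer candidate : pool) {
                    // Prefer the follower that has followed 'prev' least often so far.
                    if (best == null
                            || freq(pairCount, prev, candidate) < freq(pairCount, prev, best)) {
                        best = candidate;
                    }
                }
                pool.remove(best); // removes by value, not by index
                replicas.set(pos, best);
                pairCount.merge(prev + "->" + best, 1, Integer::sum);
                prev = best;
            }
        }
    }

    private static int freq(Map<String, Integer> pairCount, int first, int second) {
        return pairCount.getOrDefault(first + "->" + second, 0);
    }
}
```

With four partitions that all start as [1, 2, 3], the second replica alternates between brokers 2 and 3, so a failure of broker 1 would shift leadership to both followers rather than to one of them.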