-
Notifications
You must be signed in to change notification settings - Fork 14.3k
KAFKA-1792: change behavior of --generate to produce assignment config with fair replica distribution and minimal number of reassignments #18903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
…e the amount of moves required to reach an equilibrium of replicas across brokers
A label of 'needs-attention' was automatically added to this PR in order to raise the |
A label of 'needs-attention' was automatically added to this PR in order to raise the |
This PR introduces an alternative algorithm to the one used in the
kafka-reassign-partitions
script to minimize the number of replica-movement needed to balance out brokers.It tries to improve on the following properties in order:
The algorithm guarantees that the proposed assignments will never worsen properties one and two.
Property three is considered for ranking changes, but not enforced.
It will modify the initial assignments in two stages, executed in the following order:
In the second stage replicas are moved iteratively between brokers until there is no possible move for any replica.
A move is considered possible if at least one of the properties is improved while none of the others is worsened.
In each iteration the best possible move is chosen and executed.
Moves for replicas that have already been moved in a previous iteration are preferred to other moves.
Moving already moved replicas will not increase the overall number of moves required (
1 -> 2
+2 -> 3
reduces to1 -> 3
).Moves are ranked by improvement scores.
Each move has one improvement score for each of the properties defined above.
Improvement scores are compared in the order of priority defined above.
An improvement score is computed by subtracting the score of the current state from the score of the hypothetical new state after the move was executed.$\sum_{b \in brokers} replicaCount(b)^2$ .
The score of a state is computed by a slightly modified sum of residual squares computation to avoid floating point arithmetic.
For example: The score for
1) replica balance across brokers for the cluster
isThe resulting score will get smaller the smaller the variance of the values.
During the first stage, the algorithm will assign all unassigned replicas.
An unassigned replica is a replica assigned to a broker not contained in the broker metadata.
Possible initial assignments for each replica will be ranked similarly to the ranking while moving replicas.
Instead of using the improvement score, the initial assignment solely uses the score of the new state for each property.
Therefore, moves that are considered impossible in the move stage can be executed in the initial assignment stage iff there are no preferred alternatives.
Each replica assigned is considered moved in the move stage.
A new flag arg
--sticky
is added to thegenerate
command of the existingkafka-reassign-partitions
tool.If the flag is present, the tool will use the algorithm defined above.
If the flag is not present, the tool will default to the existing algorithm.
In a rack-unaware setup, the algorithm will find the replica equilibrium.
Limitations
The proposed algorithm might not find the optimal solution, because improvement scores for properties one and two must not worsen for any move.
Therefore, if
> 1
moves would be necessary to not worsen any improvement scores, the algorithm will not find the solution.Consider the example where we have four brokers spread evenly across two racks (two per rack), one topic partition with two replicas
spread across rack one and another topic partition with two replicas spread across rack two.
The
1) replica balance across brokers for the cluster
score would be optimal in this situation.There is no move for any replica that would not worsen this property.
However, the
2) replica balance across racks for a topic partition
property would be suboptimal for both topic partitions.Swapping one replica of the first topic partition for another replica of the second topic partition would improve property
2
, whilenot worsening property
1
.This optimal chain of moves can not be executed, because the first move in the chain would require the
1
property to worsen.Even though this limitation exists, we still think that the proposed algorithm would be a valid alternative, since the existing one can not fulfill any of
the properties above.
Moreover, the existing algorithm reassigns replicas randomly which results in unnecessary replica moves in most cases.
Testing strategy
The properties of the algorithm are unit tested using
jqwik
(property based testing). Following property based tests are defined:1) replica balance across brokers for the cluster
never worsens2) replica balance across racks for a topic partition
never worsensIntegration into the
kafka-reassign-partitions
is not tested, since the algorithm is stateless and just a replacement for the existing algorithm in the tool.All other aspects of the script, like acquiring cluster metadata, parsing command opts, etc., stay the same.
Committer Checklist (excluded from commit message)
This is my first contribution to the kafka project.
If there are things I'm missing I would be more than happy to address them.
Any ideas for improvement on the algorithm, testing, or anything related to the PR are appreciated as well. Thanks a lot!