Skip to content

KAFKA-1792: change behavior of --generate to produce assignment config with fair replica distribution and minimal number of reassignments #18903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: trunk
Choose a base branch
from

Conversation

ArneSchulze
Copy link

This PR introduces an alternative algorithm to the one used in the kafka-reassign-partitions script to minimize the number of replica-movement needed to balance out brokers.
It tries to improve on the following properties in order:

  1. replica balance across brokers for the cluster
  2. replica balance across racks for a topic partition
  3. leader balance across brokers for the cluster

The algorithm guarantees that the proposed assignments will never worsen properties one and two.
Property three is considered for ranking changes, but not enforced.
It will modify the initial assignments in two stages, executed in the following order:

  1. Assign unassigned replicas
  2. Move already assigned replicas

In the second stage replicas are moved iteratively between brokers until there is no possible move for any replica.
A move is considered possible if at least one of the properties is improved while none of the others is worsened.

In each iteration the best possible move is chosen and executed.
Moves for replicas that have already been moved in a previous iteration are preferred to other moves.
Moving already moved replicas will not increase the overall number of moves required (1 -> 2 + 2 -> 3 reduces to 1 -> 3).
Moves are ranked by improvement scores.
Each move has one improvement score for each of the properties defined above.
Improvement scores are compared in the order of priority defined above.

An improvement score is computed by subtracting the score of the current state from the score of the hypothetical new state after the move was executed.
The score of a state is computed by a slightly modified sum of residual squares computation to avoid floating point arithmetic.
For example: The score for 1) replica balance across brokers for the cluster is $\sum_{b \in brokers} replicaCount(b)^2$.
The resulting score will get smaller the smaller the variance of the values.

During the first stage, the algorithm will assign all unassigned replicas.
An unassigned replica is a replica assigned to a broker not contained in the broker metadata.
Possible initial assignments for each replica will be ranked similarly to the ranking while moving replicas.
Instead of using the improvement score, the initial assignment solely uses the score of the new state for each property.
Therefore, moves that are considered impossible in the move stage can be executed in the initial assignment stage iff there are no preferred alternatives.
Each replica assigned is considered moved in the move stage.

A new flag arg --sticky is added to the generate command of the existing kafka-reassign-partitions tool.
If the flag is present, the tool will use the algorithm defined above.
If the flag is not present, the tool will default to the existing algorithm.

In a rack-unaware setup, the algorithm will find the replica equilibrium.

Limitations

The proposed algorithm might not find the optimal solution, because improvement scores for properties one and two must not worsen for any move.
Therefore, if > 1 moves would be necessary to not worsen any improvement scores, the algorithm will not find the solution.

Consider the example where we have four brokers spread evenly across two racks (two per rack), one topic partition with two replicas
spread across rack one and another topic partition with two replicas spread across rack two.
The 1) replica balance across brokers for the cluster score would be optimal in this situation.
There is no move for any replica that would not worsen this property.
However, the 2) replica balance across racks for a topic partition property would be suboptimal for both topic partitions.
Swapping one replica of the first topic partition for another replica of the second topic partition would improve property 2, while
not worsening property 1.
This optimal chain of moves can not be executed, because the first move in the chain would require the 1 property to worsen.

Even though this limitation exists, we still think that the proposed algorithm would be a valid alternative, since the existing one can not fulfill any of
the properties above.
Moreover, the existing algorithm reassigns replicas randomly which results in unnecessary replica moves in most cases.

Testing strategy

The properties of the algorithm are unit tested using jqwik (property based testing). Following property based tests are defined:

  • 1) replica balance across brokers for the cluster never worsens
  • 2) replica balance across racks for a topic partition never worsens
  • if rack unaware, the algorithm will find a replica broker equilibrium

Integration into the kafka-reassign-partitions is not tested, since the algorithm is stateless and just a replacement for the existing algorithm in the tool.
All other aspects of the script, like acquiring cluster metadata, parsing command opts, etc., stay the same.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

This is my first contribution to the kafka project.
If there are things I'm missing I would be more than happy to address them.
Any ideas for improvement on the algorithm, testing, or anything related to the PR are appreciated as well. Thanks a lot!

@github-actions github-actions bot added triage PRs from the community tools build Gradle build or GitHub Actions labels Feb 14, 2025
@ArneSchulze ArneSchulze marked this pull request as ready for review February 20, 2025 14:32
Copy link

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@ArneSchulze ArneSchulze changed the title KAFKA-1792 [WIP]: change behavior of --generate to produce assignment config with fair replica distribution and minimal number of reassignments KAFKA-1792: change behavior of --generate to produce assignment config with fair replica distribution and minimal number of reassignments Feb 24, 2025
Copy link

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Gradle build or GitHub Actions needs-attention tools triage PRs from the community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant