Fix k-means premature convergence bug #1473
Open
+59
−14

When k-means++ initialisation selects data points as initial centroids, points at those locations have `upper_bound=0` in Hamerly's algorithm, causing them to be incorrectly pruned from reassignment checks. This can cause the algorithm to declare convergence on the first iteration without ever computing true cluster centroids.

This fix updates centroids to be cluster means immediately after the initial assignment, before entering the main convergence loop. This ensures Hamerly bounds are computed against true centroids rather than the data points selected by k-means++.
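
For reviewers who want to see the effect in isolation, here is a minimal pure-Python sketch. It is not the code in this PR: the helpers `assign` and `cluster_means` and the toy data are hypothetical, and the Hamerly pruning test itself is omitted. It only shows that seeding centroids on data points gives those points an upper bound of exactly 0, and that one mean update after the initial assignment removes the degenerate bounds.

```python
import math

def assign(points, centroids):
    """Full assignment pass: returns labels plus Hamerly-style bounds.

    upper[i] is the distance to the assigned centroid,
    lower[i] is the distance to the second-closest centroid.
    Assumes at least two centroids.
    """
    labels, upper, lower = [], [], []
    for p in points:
        ranked = sorted((math.dist(p, c), j) for j, c in enumerate(centroids))
        labels.append(ranked[0][1])
        upper.append(ranked[0][0])
        lower.append(ranked[1][0])
    return labels, upper, lower

def cluster_means(points, labels, centroids):
    """Replace each centroid with the mean of its members (kept as-is if empty)."""
    updated = []
    for j, c in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            updated.append(c)
            continue
        updated.append(tuple(sum(m[k] for m in members) / len(members)
                             for k in range(len(c))))
    return updated

# Toy data (hypothetical): two well-separated pairs on a line.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]

# k-means++ picks seeds from the data, so two points coincide with centroids.
seeds = [points[0], points[2]]
labels, seed_upper, seed_lower = assign(points, seeds)
# seed_upper[0] and seed_upper[2] are exactly 0 here.

# The idea of the fix: move centroids to cluster means once, then rebuild
# the bounds against those means before the main loop starts.
centroids = cluster_means(points, labels, seeds)
labels, upper, lower = assign(points, centroids)
```

With these inputs the seeded points start with a zero upper bound, while after the mean update every point has a non-zero bound measured against a true cluster mean.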
Other improvements:
- Added docs explaining that k-means converges to local minima and may produce suboptimal results depending on initialisation
- Updated `test_kmeans_three_clusters` to use a fixed seed (42) for deterministic testing, which should avoid intermittent failures from unlucky k-means++ initialisation

- I agree to follow the project's code of conduct.
- I added an entry to `CHANGES.md` if knowledge of this change could be valuable to users.

Fixes #1469