@urschrei urschrei commented Dec 9, 2025

When k-means++ initialisation selects data points as initial centroids, points at those locations have upper_bound=0 in Hamerly's algorithm, causing them to be incorrectly pruned from reassignment checks. This can cause the algorithm to declare convergence on the first iteration without ever computing true cluster centroids.
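For illustration, here is a rough Python sketch of Hamerly's pruning condition (the names are hypothetical, not the crate's actual code):

```python
# Illustrative sketch of Hamerly's pruning test (hypothetical names, not
# the crate's real code). Each point carries an upper bound on the distance
# to its assigned centroid and a lower bound on the distance to the
# second-closest centroid.
def needs_reassignment_check(upper_bound, lower_bound, half_nearest_centroid_gap):
    # A point can only change cluster if its upper bound exceeds both its
    # lower bound and half the distance to the nearest other centroid.
    return upper_bound > max(lower_bound, half_nearest_centroid_gap)

# A point that k-means++ selected as a seed sits exactly on "its" centroid,
# so upper_bound == 0 and the check always prunes it:
print(needs_reassignment_check(0.0, 1.5, 0.8))  # False: the point is never re-examined
```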

This fix updates centroids to be cluster means immediately after the initial assignment, before entering the main convergence loop. This ensures Hamerly bounds are computed against true centroids rather than the data points selected by k-means++.
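As a minimal sketch of that extra step (assuming plain lists of 2-D points; the function name and data shapes are illustrative, not the crate's API):

```python
# Minimal sketch of the pre-loop centroid update (illustrative, not the
# crate's API). After the initial assignment, each k-means++ seed point is
# replaced with the mean of its assigned cluster, before Hamerly bounds
# are first computed.
def initial_centroid_update(points, assignments, k):
    sums = [[0.0, 0.0] for _ in range(k)]
    counts = [0] * k
    for (x, y), c in zip(points, assignments):
        sums[c][0] += x
        sums[c][1] += y
        counts[c] += 1
    # Each centroid becomes its cluster mean; bounds computed against these
    # means are no longer zero for the seed points themselves.
    return [[s[0] / n, s[1] / n] if n else s for s, n in zip(sums, counts)]

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
assignments = [0, 0, 1]
print(initial_centroid_update(points, assignments, 2))
# → [[1.0, 0.0], [10.0, 0.0]]
```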

Other improvements:

  • Added documentation explaining that k-means converges to local minima and may produce suboptimal results depending on initialisation

  • Updated test_kmeans_three_clusters to use a fixed seed (42) for deterministic testing, which should avoid intermittent failures from unlucky k-means++ initialisation

  • I agree to follow the project's code of conduct.

  • I added an entry to CHANGES.md if knowledge of this change could be valuable to users.


Fixes #1469


@michaelkirk michaelkirk left a comment


I've spent over an hour trying to understand this, and I confess I still don't fully. So here are some notes:

Ultimately, k-means/++ is known to be non-optimal and susceptible to initial conditions, so results like these are somewhat expected.

The "bad" clustering looks like this:

[Image: the "bad" clustering result]

In 30b4916 I counted how many seeds led to failure before your changes: 603 failures in 100_000 runs.

After your changes, this was reduced to 133 failures in 100_000.

So that seems good.
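(For context, a seed sweep like that could be scripted roughly as follows; `run_kmeans` and the expected labels here are stand-ins, not the actual code in 30b4916.)

```python
# Hypothetical harness for counting failure-inducing seeds: run the
# clustering once per seed and compare the resulting labels against the
# expected grouping. Not the actual script used in 30b4916.
def count_failures(run_kmeans, expected_labels, n_seeds=100_000):
    failures = 0
    for seed in range(n_seeds):
        labels = run_kmeans(seed)  # cluster labels produced with this seed
        if labels != expected_labels:
            failures += 1
    return failures
```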

It seems like you're now pulling part of the main loop out to run as initialisation. I'm not sure why the difference between the initial-point-as-centroid and the "true" initial cluster centroid matters so much; I'd have assumed it's just another loop of the iteration, but clearly something more is going on there.

At the end of the day, this yields strictly better results and the code is not unreasonable, so I'm happy to LGTM. But maybe wait a couple of days before merging in case someone smarter or more dedicated can review.

When k-means++ initialisation selects data points as initial centroids,
points at those locations have upper_bound=0 in Hamerly's algorithm,
causing them to be incorrectly pruned from reassignment checks. This
could cause the algorithm to declare convergence on the first iteration
without ever computing true cluster centroids.

This fix updates centroids to be cluster means immediately after the initial
assignment, before entering the main convergence loop. This ensures
Hamerly bounds are computed against true centroids rather than the
k-means++ selected data points.

Document k-means local minima and use fixed seed in test

- Add documentation explaining that k-means converges to local minima and
  may produce suboptimal results depending on initialisation
- Update test_kmeans_three_clusters to use a fixed seed (42) for
  deterministic testing which should avoid intermittent failures from unlucky
  k-means++ initialisation


Signed-off-by: Stephan Hügel <[email protected]>