@urschrei urschrei commented Dec 9, 2025

When k-means++ initialisation selects data points as initial centroids, points at those locations have upper_bound=0 in Hamerly's algorithm, causing them to be incorrectly pruned from reassignment checks. This can cause the algorithm to declare convergence on the first iteration without ever computing true cluster centroids.
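For illustration, here is a rough Python sketch of Hamerly's pruning condition (the names are hypothetical, not the crate's actual code):

```python
# Illustrative sketch of Hamerly's pruning test (hypothetical names, not
# the crate's real code). Each point carries an upper bound on the distance
# to its assigned centroid and a lower bound on the distance to the
# second-closest centroid.
def needs_reassignment_check(upper_bound, lower_bound, half_nearest_centroid_gap):
    # A point can only change cluster if its upper bound exceeds both its
    # lower bound and half the distance to the nearest other centroid.
    return upper_bound > max(lower_bound, half_nearest_centroid_gap)

# A point that k-means++ selected as a seed sits exactly on "its" centroid,
# so upper_bound == 0 and the check always prunes it:
print(needs_reassignment_check(0.0, 1.5, 0.8))  # False: the point is never re-examined
```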

This fix updates centroids to be cluster means immediately after the initial assignment, before entering the main convergence loop. This ensures Hamerly bounds are computed against true centroids rather than the data points selected by k-means++.
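As a minimal sketch of that extra step (assuming plain lists of 2-D points; the function name and data shapes are illustrative, not the crate's API):

```python
# Minimal sketch of the pre-loop centroid update (illustrative, not the
# crate's API). After the initial assignment, each k-means++ seed point is
# replaced with the mean of its assigned cluster, before Hamerly bounds
# are first computed.
def initial_centroid_update(points, assignments, k):
    sums = [[0.0, 0.0] for _ in range(k)]
    counts = [0] * k
    for (x, y), c in zip(points, assignments):
        sums[c][0] += x
        sums[c][1] += y
        counts[c] += 1
    # Each centroid becomes its cluster mean; bounds computed against these
    # means are no longer zero for the seed points themselves.
    return [[s[0] / n, s[1] / n] if n else s for s, n in zip(sums, counts)]

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
assignments = [0, 0, 1]
print(initial_centroid_update(points, assignments, 2))
# → [[1.0, 0.0], [10.0, 0.0]]
```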

Other improvements:

  • Added documentation explaining that k-means converges to local minima and may produce suboptimal results depending on initialisation

  • Updated test_kmeans_three_clusters to use a fixed seed (42) for deterministic testing, which should avoid intermittent failures from unlucky k-means++ initialisation

  • I agree to follow the project's code of conduct.

  • I added an entry to CHANGES.md if knowledge of this change could be valuable to users.


Fixes #1469


@michaelkirk michaelkirk left a comment


I've spent over an hour trying to understand this, and I confess I still don't fully. So here are some notes:

Ultimately, k-means/++ is known to be non-optimal and susceptible to initial conditions, so results like these are somewhat expected.

The "bad" clustering looks like this:

[Image: the "bad" clustering result]

In 30b4916 I counted how many seeds led to failure before your changes: 603 failures in 100_000 runs.

After your changes, this was reduced to 133 failures in 100_000.

So that seems good.
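(For context, a seed sweep like that could be scripted roughly as follows; `run_kmeans` and the expected labels here are stand-ins, not the actual code in 30b4916.)

```python
# Hypothetical harness for counting failure-inducing seeds: run the
# clustering once per seed and compare the resulting labels against the
# expected grouping. Not the actual script used in 30b4916.
def count_failures(run_kmeans, expected_labels, n_seeds=100_000):
    failures = 0
    for seed in range(n_seeds):
        labels = run_kmeans(seed)  # cluster labels produced with this seed
        if labels != expected_labels:
            failures += 1
    return failures
```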

It seems like you're now pulling part of the main loop out to run as initialisation. I'm not sure why the difference between the initial-point-as-centroid and the "true" initial cluster centroid matters so much; I'd have assumed it's just another loop of the iteration, but clearly something more is going on there.

At the end of the day, this yields strictly better results and the code is not unreasonable, so I'm happy to LGTM. But maybe wait a couple of days before merging in case someone smarter or more dedicated can review.

When k-means++ initialisation selects data points as initial centroids,
points at those locations have upper_bound=0 in Hamerly's algorithm,
causing them to be incorrectly pruned from reassignment checks. This
could cause the algorithm to declare convergence on the first iteration
without ever computing true cluster centroids.

This fix updates centroids to be cluster means immediately after the initial
assignment, before entering the main convergence loop. This ensures
Hamerly bounds are computed against true centroids rather than the
k-means++ selected data points.

Document k-means local minima and use fixed seed in test

- Add documentation explaining that k-means converges to local minima and
  may produce suboptimal results depending on initialisation
- Update test_kmeans_three_clusters to use a fixed seed (42) for
  deterministic testing which should avoid intermittent failures from unlucky
  k-means++ initialisation


Signed-off-by: Stephan Hügel <[email protected]>