fix(kmeans): reject k values that produce singleton clusters #119
Merged
nicobao merged 1 commit on Mar 17, 2026
Modify the silhouette scoring function in find_best_kmeans to return -1 for any k value where the smallest cluster has fewer than 2 members. This prevents singleton clusters from being selected as the optimal k.
This is the same approach used by upstream Polis: https://github.com/compdemocracy/polis/blob/edge/math/src/polismath/math/clusters.clj#L350-L354
Closes polis-community#2, closes polis-community#97
Problem
find_best_kmeans uses silhouette scores to select the optimal number of clusters (k). However, it can select a k value where one or more clusters contain only a single participant. A singleton cluster is statistically meaningless, since you can't derive group-level insights from one person, and it degrades downstream calculations like representative statements and group-aware consensus.
Related issues:
Real-world example
In production with 58 clusterable participants, the algorithm selected k=2, producing groups of [1, 57], which is effectively a single-group result. The silhouette score for this degenerate split was higher than alternatives with more balanced distributions.
How upstream Polis handles this
Polis invalidates any k value that produces a singleton cluster during silhouette scoring:
https://github.com/compdemocracy/polis/blob/edge/math/src/polismath/math/clusters.clj#L350-L354
Solution
Modify the scoring_function inside find_best_kmeans to return -1 (the worst possible silhouette score) for any k where the smallest cluster has fewer than 2 members. This prevents singleton-producing k values from being selected as optimal.
This is the same approach as Polis (Option 2 from #97). A more comprehensive solution (e.g., constrained k-means with min cluster sizes) could follow later.
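A minimal sketch of the guard described above, not the actual diff. The names guarded_score, INVALID_SCORE, score_fn and min_cluster_size are hypothetical and do not come from reddwarf; the real scoring_function and its signature are not shown in this description. The sketch assumes only that cluster labels are available before the silhouette score is computed.

```python
from collections import Counter
from typing import Callable, Sequence

# -1 is the lower bound of the silhouette range, so a k scored this way
# can never beat a k that received a real score.
INVALID_SCORE = -1.0


def guarded_score(
    labels: Sequence[int],
    score_fn: Callable[[Sequence[int]], float],
    min_cluster_size: int = 2,
) -> float:
    """Return score_fn(labels), or INVALID_SCORE if any cluster is too small.

    score_fn stands in for the silhouette computation (hypothetical hook).
    """
    sizes = Counter(labels)
    # No labels at all, or a cluster below the minimum size: reject this k.
    if not sizes or min(sizes.values()) < min_cluster_size:
        return INVALID_SCORE
    return score_fn(labels)
```

With the production split above, guarded_score([0] + [1] * 57, score_fn) returns -1.0 without calling score_fn, so the [1, 57] partition loses to any k whose clusters all have at least two members.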
Changes
reddwarf/utils/clusterer/kmeans.py: Add singleton check before silhouette_score in the scoring function
tests/utils/clusterer/test_kmeans.py: Add test with synthetic data (two dense clusters + one outlier) verifying singleton k values are rejected
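The idea behind the test can be sketched in pure Python, under stated assumptions. This is not the test added in tests/utils/clusterer/test_kmeans.py. The 1-D points, the hand-written candidate partitions (not produced by running k-means), and the helpers silhouette and guarded are all illustrative. The silhouette helper uses the usual convention that a point in a singleton cluster contributes 0.

```python
from collections import Counter


def silhouette(points, labels):
    """Mean silhouette over 1-D points; expects at least two clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton point contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(points[i] - points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(points[i] - points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


def guarded(points, labels, min_size=2):
    """Silhouette, or -1 when the smallest cluster has fewer than min_size members."""
    if min(Counter(labels).values()) < min_size:
        return -1.0
    return silhouette(points, labels)


# Two dense clusters plus one far outlier (synthetic, illustrative data).
points = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3, 50.0]
candidates = {
    "k=2, outlier alone": [0] * 8 + [1],
    "k=2, no singleton": [0] * 4 + [1] * 5,
    "k=3, outlier alone": [0] * 4 + [1] * 4 + [2],
}

raw_best = max(candidates, key=lambda name: silhouette(points, candidates[name]))
guarded_best = max(candidates, key=lambda name: guarded(points, candidates[name]))
```

On this data the unguarded selection picks the partition that isolates the outlier, mirroring the [1, 57] case, while the guarded selection falls back to the only candidate without a singleton.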