
fix(kmeans): reject k values that produce singleton clusters#119

Merged: nicobao merged 1 commit into polis-community:main from nicobao:fix/reject-singleton-clusters on Mar 17, 2026



nicobao commented Mar 17, 2026

Problem

find_best_kmeans uses silhouette scores to select the optimal number of clusters (k). However, it can select a k value where one or more clusters contain only a single participant. A singleton cluster is statistically meaningless — you can't derive group-level insights from one person — and it degrades downstream calculations like representative statements and group-aware consensus.

Related issues:

Real-world example

In production with 58 clusterable participants, the algorithm selected k=2 producing groups of [1, 57] — effectively a single-group result. The silhouette score for this degenerate split was higher than alternatives with more balanced distributions.

How upstream Polis handles this

Polis invalidates any k value that produces a singleton cluster during silhouette scoring:
https://github.com/compdemocracy/polis/blob/edge/math/src/polismath/math/clusters.clj#L350-L354

Solution

Modify the scoring_function inside find_best_kmeans to return -1 (worst possible silhouette score) for any k where the smallest cluster has fewer than 2 members. This prevents singleton-producing k values from being selected as optimal.

This is the same approach as Polis (Option 2 from #97). A more comprehensive solution (e.g., constrained k-means with min cluster sizes) could follow later.
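The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in reddwarf/utils/clusterer/kmeans.py: the helper name `guarded_silhouette`, the `min_cluster_size` parameter, and the toy data are all assumptions made for the example. Only `silhouette_score` and `np.unique` are real library calls.

```python
import numpy as np
from sklearn.metrics import silhouette_score

WORST_SILHOUETTE = -1.0  # lower bound of the silhouette range [-1, 1]


def guarded_silhouette(X, labels, min_cluster_size=2):
    """Hypothetical helper: silhouette score, or -1 if any cluster is too small."""
    _, counts = np.unique(labels, return_counts=True)
    if counts.min() < min_cluster_size:
        # A singleton (or otherwise undersized) cluster invalidates this k.
        return WORST_SILHOUETTE
    return float(silhouette_score(X, labels))


# Toy data (made up): two tight groups of three points each.
two_blobs = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
# Same data plus one far-away point that ends up in its own cluster.
with_outlier = np.vstack([two_blobs, [[50.0, -50.0]]])

balanced_score = guarded_silhouette(two_blobs, np.array([0, 0, 0, 1, 1, 1]))
singleton_score = guarded_silhouette(with_outlier, np.array([0, 0, 0, 1, 1, 1, 2]))
```

Returning the minimum of the silhouette range, rather than raising, keeps the rejected k in the comparison but ensures it cannot win against any k that passes the size check.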

Changes

  • reddwarf/utils/clusterer/kmeans.py: Add singleton check before silhouette_score in the scoring function
  • tests/utils/clusterer/test_kmeans.py: Add test with synthetic data (two dense clusters + one outlier) verifying singleton k values are rejected
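A test along these lines could be sketched as below. It is a self-contained illustration under stated assumptions, not the test added in tests/utils/clusterer/test_kmeans.py: it does not call `find_best_kmeans` (whose signature is not shown here), and the helper `score_k`, the blob positions, and the seeds are all hypothetical. It builds two dense clusters plus one outlier, scores k = 2..4 with the singleton guard, and picks the best k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def score_k(X, k, seed=0):
    """Hypothetical helper: fit KMeans for k, return (guarded score, cluster sizes)."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    counts = np.bincount(labels, minlength=k)
    if counts.min() < 2:
        # Reject any k whose smallest cluster has fewer than 2 members.
        return -1.0, counts
    return float(silhouette_score(X, labels)), counts


# Synthetic data (assumed for illustration): two dense blobs and one outlier.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=(10.0, 0.0), scale=0.3, size=(20, 2))
outlier = np.array([[5.0, 8.0]])
X = np.vstack([blob_a, blob_b, outlier])

# Score each candidate k and keep the one with the highest guarded score.
results = {k: score_k(X, k) for k in range(2, 5)}
best_k = max(results, key=lambda k: results[k][0])
```

In this setup the outlier is closer to either blob than the blobs are to each other, so k=2 splits the blobs and absorbs the outlier, while higher k values tend to isolate it and are rejected by the guard.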

Modify the silhouette scoring function in find_best_kmeans to return -1
for any k value where the smallest cluster has fewer than 2 members.
This prevents singleton clusters from being selected as the optimal k.

This is the same approach used by upstream Polis:
https://github.com/compdemocracy/polis/blob/edge/math/src/polismath/math/clusters.clj#L350-L354

Closes polis-community#2, closes polis-community#97
@nicobao nicobao merged commit f19b625 into polis-community:main Mar 17, 2026
8 checks passed
@nicobao nicobao deleted the fix/reject-singleton-clusters branch March 17, 2026 18:12
