K-Means Clustering Algorithm (With Constraints) #60

Merged
ChloeW125 merged 13 commits into main from clustering-algorithm
Nov 28, 2025

Conversation

@ChloeW125 (Contributor) commented Nov 17, 2025

JIRA ticket link

https://f4kblueprint.atlassian.net/jira/software/projects/F4KRP/boards/1?selectedIssue=F4KRP-105
Clustering

Implementation description

  • K-means clustering algorithm implementation
  • Includes handling for optional max locations per cluster and max boxes per cluster constraints
  • Note: it was determined that timeout does not need to be enforced right now

Steps to test

  1. With Chat's help on formatting, I made a test file called k_means_test.py. Lines 75-77 hold three parameters for the algorithm; feel free to change them to see how it performs in different situations! Run it with: docker-compose exec backend python -m app.services.implementations.k_means_test

What should reviewers focus on?

  • Implementation (Does it make sense? Any edge cases not yet covered?)
  • NOTE 1: Linter is currently mad because the Clustering Protocol signature does not match that of the currently-implemented algorithm (current algorithm includes max-boxes-per-cluster handling, but the protocol doesn't YET (changes pending))
  • NOTE 2: For now, we can assume that at most one of the constraints (max boxes or max locations per cluster) is applied per call; the algorithm should never be called with both

Checklist

  • My PR name is descriptive and in imperative tense
  • My commit messages are descriptive and in imperative tense. My commits are atomic and trivial commits are squashed or fixup'd into non-trivial commits
  • I have requested a review from the PL, as well as other devs who have background knowledge on this PR or who will be building on top of this PR

Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements a K-Means clustering algorithm with support for optional constraints on maximum locations and boxes per cluster. The implementation uses scikit-learn's KMeans with a greedy assignment strategy to handle constraints when specified.

Key Changes:

  • Implements K-Means clustering with constraint handling for max locations/boxes per cluster
  • Adds numpy, scikit-learn, and scikit-learn-extra dependencies with pinned versions
  • Includes a test script to verify clustering functionality with real database locations

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
backend/python/requirements.txt Pins versions for numpy (1.26.4), scikit-learn (1.3.2), and adds scikit-learn-extra (0.2.0)
backend/python/app/services/implementations/k_means_test.py Adds test script for K-Means clustering with database location queries
backend/python/app/services/implementations/k_means_clustering_algorithm.py Implements KMeans clustering with constraint handling via greedy assignment algorithm
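
For readers who want a concrete picture of "KMeans plus greedy assignment", here is a minimal sketch. It is not the code from this PR: it assumes the centroids have already been computed (for example by scikit-learn's KMeans), it works on plain (x, y) tuples rather than Location objects, and the name `assign_with_capacity` is made up for illustration.

```python
import math


def assign_with_capacity(points, centroids, max_per_cluster):
    """Greedily assign each point to its nearest centroid that still has room.

    Hypothetical helper, not the PR's implementation. `points` and
    `centroids` are (x, y) tuples; returns one list of point indices
    per centroid.
    """
    if len(points) > len(centroids) * max_per_cluster:
        raise ValueError("constraints cannot be satisfied")

    # Every (distance, point index, cluster index) triple, closest first.
    pairs = sorted(
        (math.dist(p, c), i, k)
        for i, p in enumerate(points)
        for k, c in enumerate(centroids)
    )

    clusters = [[] for _ in centroids]
    assigned = set()
    for _, i, k in pairs:
        if i in assigned or len(clusters[k]) >= max_per_cluster:
            continue  # point already placed, or this cluster is full
        clusters[k].append(i)
        assigned.add(i)
    return clusters
```

Because the closest pairs are handled first, a point only spills into a farther cluster once its nearest one is full. A max-boxes constraint could presumably follow the same pattern with a per-cluster box count in place of `len(clusters[k])`.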

@ludavidca
Collaborator

Lint errors are kinda Ruff

Collaborator

@ludavidca ludavidca left a comment

Things ran on my machine and looked fine, great work! Only a few small nits and I will approve the PR. I also added a method to plot the output with seaborn


# If no locations to cluster, return empty list
if not locations:
    return [[] for _ in range(num_clusters)]
Collaborator

Similar to above, not sure if we want to display k means if there are no locations lol

Contributor Author

Asked Colin and the consensus was to just leave it like this and let the route gen algo do the empty locations list error handling
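
For context on what the early return hands back to the route generation code, here is a small sketch. The function name `empty_clusters` is hypothetical; only the list comprehension comes from the snippet under review.

```python
def empty_clusters(num_clusters):
    """Return `num_clusters` independent empty lists (hypothetical wrapper)."""
    # The comprehension builds a fresh list per cluster, so appending to
    # one cluster later does not affect the others.
    return [[] for _ in range(num_clusters)]


clusters = empty_clusters(3)
clusters[0].append("location-a")

# By contrast, [[]] * 3 repeats one shared list three times.
shared = [[]] * 3
shared[0].append("location-a")
```

The caller gets one empty list per requested cluster, so a check such as `all(not c for c in clusters)` is one way the route generation step could detect the empty case.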

statement = (
    select(Location)
    .where(Location.latitude is not None, Location.longitude is not None)
    .limit(20)
Collaborator

Make sure we can edit the number 20 in the test file (i.e. make values like this capitalized parameter variables at the top of the file)

Collaborator

Yeah, I think constants at the top might be nice
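
A minimal sketch of what "constants at the top" could look like. All names here are hypothetical (the real k_means_test.py may use different ones), and the database query is replaced by a plain list of dicts so the example stays self-contained.

```python
# Hypothetical tunable parameters, grouped at the top of the test file.
LOCATION_LIMIT = 20               # how many locations to pull for a test run
NUM_CLUSTERS = 4                  # illustrative value
MAX_LOCATIONS_PER_CLUSTER = None  # None means "no constraint"


def pick_test_locations(rows, limit=LOCATION_LIMIT):
    """Keep rows that have both coordinates, up to `limit` of them."""
    usable = [
        r for r in rows
        if r["latitude"] is not None and r["longitude"] is not None
    ]
    return usable[:limit]
```

One caveat worth checking in the quoted query: `is not None` works on plain values as above, but on a SQLAlchemy column attribute Python evaluates `Location.latitude is not None` immediately and it is always True, so it does not filter anything. The query-side equivalent would be `Location.latitude.is_not(None)`.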

Collaborator

@ColinToft ColinToft left a comment

Good stuff, LGTM!!

max_locations_per_cluster: Maximum number of locations
per cluster. Validates that the clustering is
possible and raises an error if violated.
timeout_seconds: Not enforced in this
Collaborator

We should change this to say that we raise an error. Sorry I realize I probably wasn't super clear in the convo before, I def agree that we don't need to worry that much about timeout which is why we should just raise an error instead of trying to have a quicker fallback approach (this is pretty quick anyway)
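
One possible reading of "just raise an error" is sketched below. This is an assumption, not the PR's code: `run_with_deadline` and `step` are hypothetical names, with `step` standing in for a single clustering iteration that returns True once it has converged.

```python
import time


def run_with_deadline(step, max_iterations, timeout_seconds=None):
    """Call `step()` up to `max_iterations` times; raise if the deadline passes.

    Hypothetical helper. No fallback strategy is attempted: once the
    time budget is exceeded, the caller gets a TimeoutError.
    """
    start = time.monotonic()
    for iteration in range(max_iterations):
        elapsed = time.monotonic() - start
        if timeout_seconds is not None and elapsed > timeout_seconds:
            raise TimeoutError(
                f"clustering exceeded {timeout_seconds}s after {iteration} iterations"
            )
        if step():
            return iteration + 1  # number of iterations actually run
    return max_iterations
```

The deadline is only checked between iterations, so a single long call (such as one scikit-learn `fit`) would not be interrupted partway through; in that setting the check could only happen before or after the call.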


if total_locations > max_possible:
    raise ValueError(
        "Max locations per cluster + number of clusters clustering parameters cannot be simultaneously satisfied"
Collaborator

W edge case
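
To make the edge case concrete, here is a small sketch of the feasibility check. The snippet above does not show how `max_possible` is computed, so the product `num_clusters * max_locations_per_cluster` is an assumption, and `check_capacity` is a hypothetical name.

```python
def check_capacity(total_locations, num_clusters, max_locations_per_cluster=None):
    """Raise ValueError if the locations cannot fit into the clusters."""
    if max_locations_per_cluster is None:
        return  # no constraint requested, nothing to validate

    # Assumed definition: every cluster filled to its limit.
    max_possible = num_clusters * max_locations_per_cluster
    if total_locations > max_possible:
        raise ValueError(
            "Max locations per cluster + number of clusters clustering "
            "parameters cannot be simultaneously satisfied"
        )
```

For example, 20 locations with 4 clusters and at most 4 locations per cluster gives a capacity of 16, so the check raises; with at most 5 per cluster the capacity is exactly 20 and the check passes.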

@ludavidca ludavidca self-requested a review November 28, 2025 00:50
@ChloeW125 ChloeW125 merged commit 542b5b1 into main Nov 28, 2025
2 checks passed
Collaborator

@ludavidca ludavidca left a comment

LGTM


3 participants