K-Means Clustering Algorithm (With Constraints)#60
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR implements a K-Means clustering algorithm with support for optional constraints on maximum locations and boxes per cluster. The implementation uses scikit-learn's KMeans with a greedy assignment strategy to handle constraints when specified.
Key Changes:
- Implements K-Means clustering with constraint handling for max locations/boxes per cluster
- Adds numpy, scikit-learn, and scikit-learn-extra dependencies with pinned versions
- Includes a test script to verify clustering functionality with real database locations
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| backend/python/requirements.txt | Pins versions for numpy (1.26.4), scikit-learn (1.3.2), and adds scikit-learn-extra (0.2.0) |
| backend/python/app/services/implementations/k_means_test.py | Adds test script for K-Means clustering with database location queries |
| backend/python/app/services/implementations/k_means_clustering_algorithm.py | Implements KMeans clustering with constraint handling via greedy assignment algorithm |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
backend/python/app/services/implementations/k_means_clustering_algorithm.py
Outdated
Show resolved
Hide resolved
backend/python/app/services/implementations/k_means_clustering_algorithm.py
Outdated
Show resolved
Hide resolved
backend/python/app/services/implementations/k_means_clustering_algorithm.py
Outdated
Show resolved
Hide resolved
backend/python/app/services/implementations/k_means_clustering_algorithm.py
Outdated
Show resolved
Hide resolved
|
Lint errors are kinda Ruff |
ludavidca
left a comment
There was a problem hiding this comment.
Things ran on my machine and looked fine, great work! Only a few small nits and I will approve the PR. I also added a method to plot the output on seaborn
backend/python/app/services/implementations/k_means_clustering_algorithm.py
Outdated
Show resolved
Hide resolved
backend/python/app/services/implementations/k_means_clustering_algorithm.py
Show resolved
Hide resolved
|
|
||
| # If no locations to cluster, return empty list | ||
| if not locations: | ||
| return [[] for _ in range(num_clusters)] |
There was a problem hiding this comment.
Similar to above, not sure if we want to display k means if there are no locations lol
There was a problem hiding this comment.
Asked Colin and the consensus was to just leave it like this and let the route gen algo do the empty locations list error handling
| statement = ( | ||
| select(Location) | ||
| .where(Location.latitude is not None, Location.longitude is not None) | ||
| .limit(20) |
There was a problem hiding this comment.
Make sure we can edit the number 20 in test file (ie have them be capitalized parameter variables at the top of the file)
There was a problem hiding this comment.
Yeah i think constants at the top might be nice
backend/python/app/services/implementations/k_means_clustering_algorithm.py
Show resolved
Hide resolved
backend/python/app/services/implementations/k_means_clustering_algorithm.py
Outdated
Show resolved
Hide resolved
backend/python/app/services/implementations/k_means_clustering_algorithm.py
Show resolved
Hide resolved
| max_locations_per_cluster: Maximum number of locations | ||
| per cluster. Validates that the clustering is | ||
| possible and raises an error if violated. | ||
| timeout_seconds: Not enforced in this |
There was a problem hiding this comment.
We should change this to say that we raise an error. Sorry I realize I probably wasn't super clear in the convo before, I def agree that we don't need to worry that much about timeout which is why we should just raise an error instead of trying to have a quicker fallback approach (this is pretty quick anyway)
|
|
||
| if total_locations > max_possible: | ||
| raise ValueError( | ||
| "Max locations per cluster + number of clusters clustering parameters cannot be simultaneously satisfied" |
| statement = ( | ||
| select(Location) | ||
| .where(Location.latitude is not None, Location.longitude is not None) | ||
| .limit(20) |
There was a problem hiding this comment.
Yeah i think constants at the top might be nice
JIRA ticket link
https://f4kblueprint.atlassian.net/jira/software/projects/F4KRP/boards/1?selectedIssue=F4KRP-105
Clustering
Implementation description
Steps to test
What should reviewers focus on?
Checklist