Skip to content

#5: Add improved outlier detection #6

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

pierrepebay
Copy link

Fixes: #5

@pierrepebay pierrepebay self-assigned this Mar 6, 2025
@pierrepebay pierrepebay force-pushed the 5-add-improved-outlier-detection branch 2 times, most recently from 63c7091 to fef20cd Compare March 18, 2025 14:50
@pierrepebay pierrepebay force-pushed the 5-add-improved-outlier-detection branch from 8f84394 to b48c308 Compare March 18, 2025 17:47
@pierrepebay pierrepebay marked this pull request as ready for review March 18, 2025 17:48
@pierrepebay pierrepebay force-pushed the 5-add-improved-outlier-detection branch from 7d178d2 to f1aa013 Compare March 21, 2025 17:49
Comment on lines +66 to +73
self.expected_slow_ranks = set([
1702,1462,1902,1222,1182,1262,1862,1342,1382,1422,1742,1502,1102,1582,
1142,1782,1662,1022,1062,1542,1622,982,1302,1822,1381,1501,1021,1341,
1541,1301,1701,1821,1261,1621,1901,1012,1461,1861,1181,1221,981,1661,
1061,1781,1421,1741,1101,1581,1141,902,1812,1252,1492,1532,1772,1292,
1332,1852,302,382,182,702,1652,1212,582,1172,1452,1692,972,1572,1732,
62,1412
])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to brainstorm a different way to test this so we don't have to hard-code the expected values (that's likely outside the scope of this PR though).

threshold = representative_center + 3 * np.std(cluster_to_times[representative_cluster])

problematic_clusters = [cluster_id for cluster_id, center in cluster_centers.items() if center > threshold]
return data,clusters,cluster_to_times,cluster_to_ranks,cluster_centers,representative_cluster,representative_center,threshold,problematic_clusters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of returning a huge tuple like this we should use a class since this is very error prone.


representative_cluster = max(cluster_to_times.items(), key=lambda v: len(v[1]))[0]
representative_center = cluster_centers[representative_cluster]
threshold = representative_center + 3 * np.std(cluster_to_times[representative_cluster])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be make this 3 a parameter instead of a constant used directly in the code?

if representative_cluster_is_slowest:
if representative_center - 3 * np.std(cluster_to_times[representative_cluster]) > slowest_non_representative_center:
print()
print(f" WARNING: Clustering results found most times to be slower than others. No outliers will be detected.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that instead of a warning here, we should stop the job from running by not giving the node list with a bunch of slow nodes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add improved outlier detection
3 participants