#5: Add improved outlier detection #6

pierrepebay · 2025-03-06T13:11:10Z

Fixes: #5

detection/detect_slow_nodes.py

tests/unit/detection/test_slow_node_detector.py

cwschilly · 2025-03-24T20:07:24Z

tests/unit/detection/test_slow_node_detector.py

+        self.expected_slow_ranks = set([
+            1702,1462,1902,1222,1182,1262,1862,1342,1382,1422,1742,1502,1102,1582,
+            1142,1782,1662,1022,1062,1542,1622,982,1302,1822,1381,1501,1021,1341,
+            1541,1301,1701,1821,1261,1621,1901,1012,1461,1861,1181,1221,981,1661,
+            1061,1781,1421,1741,1101,1581,1141,902,1812,1252,1492,1532,1772,1292,
+            1332,1852,302,382,182,702,1652,1212,582,1172,1452,1692,972,1572,1732,
+            62,1412
+        ])


We may want to brainstorm a different way to test this so we don't have to hard-code the expected values (that's likely outside the scope of this PR though).
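One possible direction, sketched here under assumptions: build synthetic timing data in which the slow ranks are chosen up front, so the expected set comes from how the data was built rather than from a hard-coded list. The names `make_synthetic_times` and `detect_slow_ranks` are hypothetical, and the median-based detector below is only a stand-in for the project's clustering detector, whose interface is not shown in this thread.

```python
import random
import statistics


def make_synthetic_times(n_ranks, slow_ranks, base=1.0, slowdown=2.0,
                         noise=0.01, seed=0):
    """Return {rank: time}; ranks in slow_ranks are deliberately slower."""
    rng = random.Random(seed)  # fixed seed keeps the test deterministic
    return {
        rank: (base * slowdown if rank in slow_ranks else base)
        + rng.gauss(0.0, noise)
        for rank in range(n_ranks)
    }


def detect_slow_ranks(times, factor=1.5):
    """Stand-in detector (assumption): flag ranks slower than factor * median."""
    median = statistics.median(times.values())
    return {rank for rank, t in times.items() if t > factor * median}


# The expected set is whatever was injected, so nothing is hard-coded twice.
injected = {3, 17, 42}
times = make_synthetic_times(64, injected)
detected = detect_slow_ranks(times)
```

In a real unit test the stand-in would be replaced by a call into the detector under test, and the assertion would compare its output against `injected`.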

tests/unit/detection/test_slow_node_detector.py

detection/detect_slow_nodes.py

lifflander · 2025-04-08T19:20:15Z

detection/detect_slow_nodes.py

+        threshold = representative_center + 3 * np.std(cluster_to_times[representative_cluster])
+
+        problematic_clusters = [cluster_id for cluster_id, center in cluster_centers.items() if center > threshold]
+        return data,clusters,cluster_to_times,cluster_to_ranks,cluster_centers,representative_cluster,representative_center,threshold,problematic_clusters


I think that instead of returning a huge tuple like this, we should use a class, since this is very error-prone.
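A minimal sketch of what such a class could look like, using a dataclass whose fields mirror the names in the returned tuple. The class name `ClusteringResults` and all field types are assumptions; the actual change landed in a later commit and may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusteringResults:
    """Hypothetical container replacing the nine-element tuple."""
    data: List[float]
    clusters: List[int]
    cluster_to_times: Dict[int, List[float]]
    cluster_to_ranks: Dict[int, List[int]]
    cluster_centers: Dict[int, float]
    representative_cluster: int
    representative_center: float
    threshold: float
    problematic_clusters: List[int] = field(default_factory=list)


# Callers read fields by name instead of relying on tuple position.
results = ClusteringResults(
    data=[1.0, 1.1, 2.5],
    clusters=[0, 0, 1],
    cluster_to_times={0: [1.0, 1.1], 1: [2.5]},
    cluster_to_ranks={0: [0, 1], 1: [2]},
    cluster_centers={0: 1.05, 1: 2.5},
    representative_cluster=0,
    representative_center=1.05,
    threshold=1.2,
    problematic_clusters=[1],
)
```

Named access such as `results.threshold` means that adding or reordering fields no longer silently breaks callers that unpack by position.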

lifflander · 2025-04-08T19:28:15Z

detection/detect_slow_nodes.py

+
+        representative_cluster = max(cluster_to_times.items(), key=lambda v: len(v[1]))[0]
+        representative_center = cluster_centers[representative_cluster]
+        threshold = representative_center + 3 * np.std(cluster_to_times[representative_cluster])


Should we make this 3 a parameter instead of a constant used directly in the code?
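As an illustration of the suggestion, the multiplier could be passed in with a default of 3. The function names and the `n_sigma` parameter are hypothetical; `statistics.pstdev` is used here because it computes the population standard deviation, which is also what `np.std` computes by default.

```python
import statistics


def compute_threshold(representative_times, representative_center, n_sigma=3.0):
    """Center of the representative cluster plus n_sigma standard deviations."""
    return representative_center + n_sigma * statistics.pstdev(representative_times)


def find_problematic_clusters(cluster_centers, threshold):
    """Return the ids of clusters whose center lies above the threshold."""
    return [cid for cid, center in cluster_centers.items() if center > threshold]


centers = {0: 1.0, 1: 1.25}
rep_times = [0.9, 1.1]  # population standard deviation is about 0.1

default_threshold = compute_threshold(rep_times, centers[0])             # about 1.3
strict_threshold = compute_threshold(rep_times, centers[0], n_sigma=2.0)  # about 1.2
```

With the default multiplier, cluster 1 stays below the threshold; lowering `n_sigma` to 2 flags it, which is the kind of tuning a hard-coded constant makes awkward.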

lifflander · 2025-04-08T19:30:25Z

detection/detect_slow_nodes.py

+            if representative_cluster_is_slowest:
+                if representative_center - 3 * np.std(cluster_to_times[representative_cluster]) > slowest_non_representative_center:
+                    print()
+                    print(f"     WARNING: Clustering results found most times to be slower than others. No outliers will be detected.")


I'm thinking that instead of a warning here, we should stop the job from running by not giving the node list with a bunch of slow nodes.
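One way to act on this, sketched under assumptions: raise an exception when the representative cluster is well separated above every other cluster, so that no node list is produced. The exception name, the function name, and the definition of "slowest" (center greater than all other centers) are hypothetical; only the inner comparison is taken from the diff above.

```python
import statistics


class UnreliableClusteringError(RuntimeError):
    """Hypothetical error: most ranks look slow, so no node list should be emitted."""


def check_representative_cluster(cluster_to_times, cluster_centers,
                                 representative_cluster, n_sigma=3.0):
    """Raise if the representative cluster is clearly slower than all others."""
    others = [center for cid, center in cluster_centers.items()
              if cid != representative_cluster]
    if not others:
        return  # single cluster: nothing to compare against
    rep_center = cluster_centers[representative_cluster]
    slowest_other = max(others)
    if rep_center <= slowest_other:
        return  # the representative cluster is not the slowest one
    spread = statistics.pstdev(cluster_to_times[representative_cluster])
    if rep_center - n_sigma * spread > slowest_other:
        raise UnreliableClusteringError(
            "Clustering found most times to be slower than others; "
            "refusing to produce a node list."
        )


# Example: the largest cluster (id 0) is also clearly the slowest.
try:
    check_representative_cluster(
        cluster_to_times={0: [2.0, 2.0, 2.1, 1.9], 1: [1.0]},
        cluster_centers={0: 2.0, 1: 1.0},
        representative_cluster=0,
    )
    raised = False
except UnreliableClusteringError:
    raised = True
```

A caller that writes the node list could let this exception propagate and exit with a non-zero status, which would stop the downstream job instead of printing a warning.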

pierrepebay self-assigned this Mar 6, 2025

cwschilly reviewed Mar 10, 2025

detection/detect_slow_nodes.py (outdated, resolved)

pierrepebay force-pushed the 5-add-improved-outlier-detection branch 2 times, most recently from 63c7091 to fef20cd Compare March 18, 2025 14:50

pierrepebay requested review from lifflander and nlslatt March 18, 2025 17:46

pierrepebay added 8 commits March 18, 2025 10:47

#5: progress checkpoint: Add clustering approach and plotting

89631c6

#5: Complete clustering outlier detection pipeline

8661284

#5: Add warning if representative cluster is slowest

8bd90ef

#5: Fix after rebase

704299e

#5: Fix f string and requirements

1354bb7

#5: Break long lines and fix more f strings

5ec3e30

#5: Fix rank to cluster association and add tests

c1dd202

#5: Remove trailing whitespace

b48c308

pierrepebay force-pushed the 5-add-improved-outlier-detection branch from 8f84394 to b48c308 Compare March 18, 2025 17:47

pierrepebay marked this pull request as ready for review March 18, 2025 17:48

pierrepebay added 2 commits March 21, 2025 10:48

#5: clustering: handle single cluster case

aaaf619

#5: clustering: add number of ranks in cluster per node in print and file write

f1aa013

pierrepebay force-pushed the 5-add-improved-outlier-detection branch from 7d178d2 to f1aa013 Compare March 21, 2025 17:49

pierrepebay added 2 commits March 21, 2025 11:50

#5: add output directory command line argument

b94e277

#5: clustering: fix single cluster handling

96cf6ad

lifflander approved these changes Mar 24, 2025

cwschilly reviewed Mar 24, 2025

cwschilly and others added 2 commits March 25, 2025 10:38

#5: small changes/fixes

6b06acf

#5: Add uniformity based second step node filtering

2764e2a

lifflander requested changes Apr 8, 2025

pierrepebay added 3 commits April 10, 2025 09:12

#5: Turn off parallel clustering for tests

7a520ab

#5: Change tuple return to class

bcf3fc1

#5: Define outlier threshold as class attribute

9b5fd77
