#7: Fix iteration breakdown, add frequency analysis, and plot dropped nodes #9

cwschilly · 2025-03-26T19:53:01Z

Fixes #7
Fixes #8

Also refactors the Python code to be more modular and adds both function timers and a new plot for dropped nodes.

To plot the dropped nodes for previous runs, run the following from the project directory:

python detection/plot_dropped_nodes.py -s slownode.log -a slownodeanalysis.log -o /path/to/output_dir

Where:

-s <slownode.log>         - The driver output (from slow_node.cc)
-a <slownodeanalysis.log> - The analysis output (from detect_slow_nodes.py)

cwschilly · 2025-03-26T19:57:55Z

@nlslatt @lifflander Should we ignore the first iteration during analysis?

nlslatt · 2025-03-28T18:06:02Z

> @nlslatt @lifflander Should we ignore the first iteration during analysis?

We should probably do one more than the requested number of iterations and completely throw out the first one. I think the first iteration should not be printed or added to the total time.
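A minimal sketch of that suggestion, for illustration only. The benchmark itself lives in src/slow_node.cc, so this Python version only shows the bookkeeping; `time_iterations` and `run_iteration` are hypothetical names, not part of the project.

```python
import time
from typing import Callable, List, Tuple


def time_iterations(run_iteration: Callable[[int], None],
                    requested_iters: int) -> Tuple[List[float], float]:
    """Run requested_iters + 1 iterations and throw out the first one.

    `run_iteration` is a hypothetical callable standing in for one
    benchmark iteration; it receives the raw loop index.
    """
    times: List[float] = []
    for i in range(requested_iters + 1):
        start = time.perf_counter()
        run_iteration(i)
        elapsed = time.perf_counter() - start
        if i == 0:
            # Warm-up iteration: not recorded, not printed, not in the total.
            continue
        times.append(elapsed)
    # The total covers only the requested iterations.
    return times, sum(times)
```

With `requested_iters=3` the callable runs four times, and the returned list holds three timings.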

nlslatt · 2025-03-28T18:08:44Z

@cwschilly Are you planning to fix the sensors node name problem as a separate PR or part of this? There are a lot of things that need to be addressed before I can start collecting new data and I'd really like to start collecting it now.

nlslatt · 2025-03-28T18:28:25Z

It would be helpful if the list of slowest iterations included the iteration number and not just the time.
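One way this could look on the analysis side, as a sketch rather than the actual code in detect_slow_nodes.py. `slowest_iterations` is a hypothetical helper, and it assumes iteration times are stored in run order with 0-based indices.

```python
from typing import List, Tuple


def slowest_iterations(iter_times: List[float],
                       n: int = 5) -> List[Tuple[int, float]]:
    """Return the n slowest iterations as (iteration_index, seconds) pairs.

    Assumes iter_times[i] is the time of iteration i (0-based), which is
    an assumption about the data layout, not something the PR states.
    """
    # Pair each time with its index before sorting so the index survives.
    indexed = list(enumerate(iter_times))
    indexed.sort(key=lambda pair: pair[1], reverse=True)
    return indexed[:n]
```

For example, `slowest_iterations([0.5, 2.0, 1.0, 3.0], n=2)` returns `[(3, 3.0), (1, 2.0)]`.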

cwschilly · 2025-03-28T18:42:35Z

> @cwschilly Are you planning to fix the sensors node name problem as a separate PR or part of this? There are a lot of things that need to be addressed before I can start collecting new data and I'd really like to start collecting it now.

@nlslatt I can put everything in this PR. I'm going to try to get everything fixed up by the end of the day; I'll ping you when it's ready.

nlslatt · 2025-03-28T18:43:52Z

@cwschilly Jonathan had asked me for the barrier after the random initialization (i.e., before the for loop within runBenchmark), which is where I've been using one

src/slow_node.cc

nlslatt · 2025-03-31T21:05:23Z

@cwschilly The node names in the sensors files are still incorrect. All nodes have been given the name of the first node in the allocation.

src/freq.cc

nlslatt · 2025-04-15T16:13:32Z

detection/plot_slow_nodes.py

+    parser.add_argument('-a', '--analysis', help='Absolute or relative path to the output from detect_slow_nodes', required=True)
+    parser.add_argument('-o', '--output', help='Absolute or relative path to the output from detect_slow_nodes', default=None)


The help needs to distinguish between the original output (to be used as input) and the new output.

Is the new output a directory for the plot? That's not clear above.

Fixed, sorry about that. -o is the path to where you want to save the plot.
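A sketch of help strings that separate the inputs from the output, based on the option descriptions in the PR summary. The long option name `--slownode` and the function `build_parser` are assumptions made for illustration, and the wording of each help string is illustrative rather than the committed text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose help text separates input logs from the plot output."""
    parser = argparse.ArgumentParser(
        description="Plot dropped nodes from a previous run.")
    # Input: driver log. The long name '--slownode' is hypothetical.
    parser.add_argument(
        '-s', '--slownode', required=True,
        help='Input: path to the driver output (from slow_node.cc)')
    # Input: analysis log produced by detect_slow_nodes.
    parser.add_argument(
        '-a', '--analysis', required=True,
        help='Input: path to the output from detect_slow_nodes')
    # Output: where the plot is saved.
    parser.add_argument(
        '-o', '--output', default=None,
        help='Output: directory in which to save the plot')
    return parser
```

Parsing `['-s', 'slownode.log', '-a', 'slownodeanalysis.log']` leaves `output` as `None`, so the script would still need to choose a default save location.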

#7: fix iteration breakdown and analysis; add timers to all functions

932e28a

cwschilly linked an issue Mar 26, 2025 that may be closed by this pull request

Fix iteration breakdown and analysis #7

Open

cwschilly requested review from nlslatt, lifflander and pierrepebay March 26, 2025 19:53

#8: (wip) initial commit with cpu frequencies

b61e56b

nlslatt reviewed Mar 28, 2025

src/slow_node.cc (outdated, resolved)

cwschilly added 5 commits March 28, 2025 16:58

#8: (wip) write out cpu frequency with sensors

cd5ce80

#7: do not count first iteration in total time

1da87d5

#7: print iteration id with time in final write out

107ace5

#7: fix MPI issues and update tests/sensors log

7c988b8

#7: add units to iteration breakdown output

0f1c8e6

cwschilly requested a review from nlslatt March 31, 2025 15:29

cwschilly added 3 commits April 1, 2025 07:02

#7: fix node names in sensors output

8420444

#7: use fopen to read cpu frequencies instead of popen + cat

6939503

#7: add newline

cea0444

nlslatt reviewed Apr 4, 2025

src/freq.cc (resolved)

#7: add plot of dropped nodes; refactor code to be more modular

80b88a8

cwschilly changed the title ~~#7: Fix iteration breakdown and analysis~~ #7: Fix iteration breakdown, add frequency analysis, and plot dropped nodes Apr 9, 2025

cwschilly requested a review from nlslatt April 9, 2025 17:11

cwschilly added 2 commits April 9, 2025 10:12

#7: rename plotting script

c51ef21

#7: (wip) debug linking error when using Trilinos

a93282e

nlslatt reviewed Apr 15, 2025

#7: fix description of output dir

b56bde1

cwschilly self-assigned this Apr 15, 2025

cwschilly added 2 commits April 15, 2025 10:33

#7: add cli options for controlling error bars

b235444

#7: use direct call to mkl gemm instead of KokkosBlas

cc518be

cwschilly requested a review from nlslatt April 15, 2025 18:51

lifflander approved these changes Apr 15, 2025
