SpaceTimeStack: Account for MPI imbalance when applying printing threshold by vbrunini · Pull Request #284 · kokkos/kokkos-tools

vbrunini · 2025-04-29T14:01:19Z

With this change, entries with a low average time but high imbalance are reported, since they are often the cause of significant performance problems due to load imbalance.

Also fix missing deep_copy hook exposure.

vlkale · 2025-04-29T14:26:00Z

@vbrunini

Thanks for this. Could you provide example code that shows the problem here in the PR description? Even better would be a set of tests, say with Google Test, that check that entries with low average time but high imbalance are indeed reported.

As a side note, the fix for the missing deep_copy hook should ideally be a separate PR, as it looks conceptually unrelated to the MPI load imbalance change.

Thanks!

vbrunini · 2025-05-01T19:22:36Z

Added a test (and split off the deep copy parts into #285)

masterleinad · 2025-05-01T19:51:36Z

+    const double comm_size = total_runtime / avg_runtime;
+    auto threshold_percent = ((max_runtime * comm_size) / tree_time) * 100.0;
+    auto percent           = (total_runtime / tree_time) * 100.0;


Can you elaborate on why this is sensible? I understand the previous code but now we essentially do

(max_runtime / avg_runtime) * percent < output_threshold

In other words, why does multiplying by max_runtime / avg_runtime do what you want? Wouldn't you want to do something like

if (percent > first_threshold || (high_imbalance && percent > second_threshold)) report

? It's not quite clear to me that this second threshold should be the same as the original one divided by the imbalance. Maybe it's sensible to report all kernels that have an imbalance greater than a certain factor?

vbrunini · 2025-05-01

The thinking is to identify any kernels/regions that would be above the 0.1% threshold if we only looked at the runtimes on that rank. Assuming that total_runtime is evenly distributed across ranks (which I think is basically always true), then rank_wall_time = tree_time / comm_size and therefore max_runtime / rank_wall_time == (max_runtime * comm_size) / tree_time.

masterleinad · 2025-05-01

OK. You basically want to use max_runtime instead of avg_runtime for comparing with the threshold value. I think that's sensible. Not sure if we need a configuration option.
Note that there also is print_json_recursive that needs consistent logic.
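
The reasoning in this thread can be sketched as a small standalone C++ example. This is only an illustration under stated assumptions, not the code from the PR: the `EntryTimes` struct and the three function names are hypothetical, while `total_runtime`, `avg_runtime`, `max_runtime`, `tree_time`, and `comm_size` follow the names used in the diff above. It assumes `tree_time` is split evenly across ranks.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-entry timings discussed in this thread.
struct EntryTimes {
  double total_runtime;  // entry time summed over all ranks
  double avg_runtime;    // total_runtime / comm_size
  double max_runtime;    // entry time on the slowest rank
};

// Previous criterion: share of the summed tree time, in percent.
double average_percent(const EntryTimes& e, double tree_time) {
  assert(tree_time > 0.0);
  return (e.total_runtime / tree_time) * 100.0;
}

// Criterion from the diff: share of one rank's wall time on the slowest rank.
// Assumes tree_time is split evenly, so rank_wall_time = tree_time / comm_size.
double max_rank_percent(const EntryTimes& e, double tree_time) {
  assert(tree_time > 0.0 && e.avg_runtime > 0.0);
  const double comm_size = e.total_runtime / e.avg_runtime;
  return ((e.max_runtime * comm_size) / tree_time) * 100.0;
}

// True when the entry should be printed for a threshold given in percent.
bool is_reported(const EntryTimes& e, double tree_time, double threshold) {
  return max_rank_percent(e, tree_time) >= threshold;
}
```

As a worked case, take 4 ranks with 100 s of wall time each (tree_time = 400) and an entry that runs for 0.25 s on a single rank, so total_runtime = 0.25, avg_runtime = 0.0625, and max_runtime = 0.25. The average-based value is about 0.0625%, below a 0.1% threshold, while the max-based value is about 0.25%, so only the second criterion reports the entry.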

masterleinad · 2025-05-01T21:08:10Z

                       std::string const& child_indent,
                       double tree_time) const {
-    auto percent = (total_runtime / tree_time) * 100.0;
+    const double comm_size = total_runtime / avg_runtime;


You might as well compute this properly once and store it in a member variable.
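
One possible shape for this suggestion, combined with the earlier note about keeping print_json_recursive consistent, is sketched below. The class `ImbalanceThreshold` and its members are hypothetical and do not reflect the actual kokkos-tools code. The sketch assumes the communicator size is obtained once (for example from a single `MPI_Comm_size` call, or 1 when MPI is not initialized) and passed in, so the example itself does not depend on MPI.

```cpp
#include <cassert>

// Hypothetical sketch; the real class in kokkos-tools differs.
class ImbalanceThreshold {
 public:
  // comm_size is determined once by the caller and stored here.
  ImbalanceThreshold(int comm_size, double threshold_percent)
      : comm_size_(comm_size), threshold_percent_(threshold_percent) {
    assert(comm_size_ > 0);
  }

  // Shared by every printer so text and JSON output filter identically.
  bool below_threshold(double max_runtime, double tree_time) const {
    const double percent = ((max_runtime * comm_size_) / tree_time) * 100.0;
    return percent < threshold_percent_;
  }

 private:
  int comm_size_;             // number of ranks, stored once
  double threshold_percent_;  // printing threshold, in percent
};
```

Storing the size avoids recomputing it from total_runtime / avg_runtime in every recursive call, and routing both printers through one helper keeps their filtering logic from drifting apart.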

vbrunini · 2025-05-06T01:04:32Z

Any suggestions for what to do about the CI failure from the tests running as root?

--------------------------------------------------------------------------
mpirun has detected an attempt to run as root.

Running as root is *strongly* discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.

We strongly suggest that you run mpirun as a non-root user.

You can override this protection by adding the --allow-run-as-root option
to the cmd line or by setting two environment variables in the following way:
the variable OMPI_ALLOW_RUN_AS_ROOT=1 to indicate the desire to override this
protection, and OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 to confirm the choice and
add one more layer of certainty that you want to do so.
We reiterate our advice against doing so - please proceed at your own risk.
--------------------------------------------------------------------------

vbrunini mentioned this pull request Apr 29, 2025

SpaceTimeStack fix deep copy hooks #285

Open

vbrunini force-pushed the vbrunini/ststack_imbalance_threshold branch from 15203a5 to 6f44a6d Compare May 1, 2025 19:17

vbrunini added 4 commits May 1, 2025 13:19

SpaceTimeStack: Account for MPI imbalance when applying printing thre…

3f05b4c

…shold. So that entries with a low average time but high imbalance are reported since they are often the cause of significant performance problems due to load imbalance.

SpaceTimeStack: Fix format.

1e16db8

SpaceTimeStack: Fix failure when MPI not initialized.

3988d6e

SpaceTimeStack: Add test for thresholding including imbalance.

7c17a08

vbrunini force-pushed the vbrunini/ststack_imbalance_threshold branch from 6f44a6d to 7c17a08 Compare May 1, 2025 19:21

SpaceTimeStack: Format.

5cef33b

masterleinad reviewed May 1, 2025

Comment thread tests/space-time-stack/test_deep_copy.cpp.orig Outdated

SpaceTimeStack: Fix test compile & delete extra file.

986f6ed

masterleinad changed the title ~~SpaceTimeStack: Account for MPI imbalance when applying printing thre…~~ SpaceTimeStack: Account for MPI imbalance when applying printing threshold May 1, 2025

masterleinad reviewed May 1, 2025

Merge branch 'develop' into vbrunini/ststack_imbalance_threshold

8442565

masterleinad reviewed May 1, 2025

vbrunini added 2 commits May 2, 2025 08:15

SpaceTimeStack: Address review comments.

dcb58cd

SpaceTimeStack: Format.

2425310

Allow MPI to run as root

f3350c9

masterleinad reviewed May 6, 2025

Comment thread tests/CMakeLists.txt Outdated

Update tests/CMakeLists.txt

04b560f

Remove debugging message. Co-authored-by: Daniel Arndt <arndtd@ornl.gov>

vbrunini wants to merge 11 commits into kokkos:develop from vbrunini:vbrunini/ststack_imbalance_threshold

3 participants
