SpaceTimeStack: Account for MPI imbalance when applying printing threshold #284
vbrunini wants to merge 11 commits into kokkos:develop
Thanks for this. Could you add example code that reproduces the problem to the PR description? Even better would be a set of tests (using, say, Google Test) that check that entries with a low average time but high imbalance are indeed reported. As a side note, the fix for the missing deep_copy hook should ideally be a separate PR, since it looks conceptually unrelated to the MPI load imbalance change. Thanks!
15203a5 to 6f44a6d
…shold. So that entries with a low average time but high imbalance are reported since they are often the cause of significant performance problems due to load imbalance.
6f44a6d to 7c17a08
Added a test (and split off the deep copy parts into #285)
const double comm_size = total_runtime / avg_runtime;
auto threshold_percent = ((max_runtime * comm_size) / tree_time) * 100.0;
auto percent = (total_runtime / tree_time) * 100.0;
Can you elaborate on why this is sensible? I understand the previous code, but now we essentially check
(max_runtime / avg_runtime) * percent < output_threshold
In other words, why does multiplying by max_runtime / avg_runtime do what you want? Wouldn't you want to do something like the following?
if (percent > first_threshold || (high_imbalance && percent > second_threshold))
  report
It's not quite clear to me that this second threshold should be the same as the original one divided by the imbalance. Maybe it's sensible to report all kernels that have an imbalance greater than a certain factor?
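For comparison, the two-threshold alternative suggested in this comment could be sketched roughly as below. This is only an illustration, not code from the PR: the function name `should_report_two_threshold` and the `imbalance_factor` parameter are hypothetical, and "high imbalance" is assumed to mean that max_runtime / avg_runtime exceeds that factor.

```cpp
#include <cassert>

// Hypothetical sketch of the reviewer's alternative: report an entry if its
// average share of the run is large enough on its own, or if it is strongly
// imbalanced and clears a second (lower) threshold.
// All names here are illustrative assumptions, not part of the tool.
inline bool should_report_two_threshold(double percent, double max_runtime,
                                        double avg_runtime,
                                        double first_threshold,
                                        double second_threshold,
                                        double imbalance_factor) {
  assert(imbalance_factor >= 1.0);
  // Guard against entries that never ran on average (avoids dividing by 0).
  const bool high_imbalance =
      avg_runtime > 0.0 && (max_runtime / avg_runtime) > imbalance_factor;
  return percent > first_threshold ||
         (high_imbalance && percent > second_threshold);
}
```

Unlike the single scaled threshold in the diff, this form needs two extra tuning values (the second threshold and the imbalance factor), which is the trade-off discussed in the replies.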
The thinking is to identify any kernels/regions that would be above the 0.1% threshold if we only looked at the runtimes on the rank where they take longest. Assuming that the overall runtime (tree_time) is evenly distributed across ranks (which I think is basically always true), rank_wall_time = tree_time / comm_size, and therefore max_runtime / rank_wall_time == (max_runtime * comm_size) / tree_time.
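The arithmetic in this reply can be checked with a small standalone sketch. The `EntryTimes` struct and both function names are hypothetical stand-ins for illustration; only the formulas come from the diff and the explanation above.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-entry timing data discussed above.
struct EntryTimes {
  double total_runtime;  // summed over all ranks
  double avg_runtime;    // total_runtime / comm_size
  double max_runtime;    // largest value on any single rank
};

// Share of the whole run attributed to the entry, averaged over ranks
// (the quantity the previous code compared against the threshold).
inline double average_percent(const EntryTimes& e, double tree_time) {
  assert(tree_time > 0.0);
  return (e.total_runtime / tree_time) * 100.0;
}

// Share of one rank's wall time that the entry takes on its slowest rank,
// assuming tree_time is split evenly across ranks.
inline double worst_rank_percent(const EntryTimes& e, double tree_time) {
  assert(e.avg_runtime > 0.0 && tree_time > 0.0);
  const double comm_size = e.total_runtime / e.avg_runtime;
  const double rank_wall_time = tree_time / comm_size;
  return (e.max_runtime / rank_wall_time) * 100.0;
}
```

As an example, take 4 ranks with 100 s of wall time each (tree_time = 400) and an entry that runs for 0.25 s on one rank and not at all on the others. Its average share is about 0.0625%, below a 0.1% threshold, while its share on the slow rank is about 0.25%, above it.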
OK. You basically want to use max_runtime instead of avg_runtime for comparing with the threshold value. I think that's sensible. Not sure if we need a configuration option.
Note that there is also print_json_recursive, which needs consistent logic.
std::string const& child_indent,
double tree_time) const {
auto percent = (total_runtime / tree_time) * 100.0;
const double comm_size = total_runtime / avg_runtime;
You might as well compute this properly once and store it in a member variable.
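One way to read this suggestion is sketched below, assuming the communicator size is obtained once (for example from MPI_Comm_size) and passed in. The `ReportFilter` class and its members are hypothetical and do not reflect the tool's actual class layout. Storing the size also avoids recomputing it as total_runtime / avg_runtime, which is undefined when avg_runtime is zero.

```cpp
#include <cassert>

// Hypothetical helper, not the tool's real class: it stores the communicator
// size once so that every printing path applies the same threshold check.
class ReportFilter {
 public:
  ReportFilter(int comm_size, double threshold_percent)
      : comm_size_(comm_size), threshold_percent_(threshold_percent) {
    assert(comm_size > 0);
  }

  // True if the entry's share of wall time on its slowest rank reaches the
  // threshold, assuming tree_time is evenly distributed across ranks.
  bool should_print(double max_runtime, double tree_time) const {
    if (tree_time <= 0.0) return false;  // nothing meaningful to compare
    const double worst_rank_percent =
        ((max_runtime * comm_size_) / tree_time) * 100.0;
    return worst_rank_percent >= threshold_percent_;
  }

 private:
  double comm_size_;          // stored as double to avoid repeated casts
  double threshold_percent_;  // e.g. 0.1 for the 0.1% cutoff
};
```

A single check like `should_print` could then be called from both the text and the JSON printing routines, which keeps their filtering consistent.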
Any suggestions for what to do about the CI failure from the tests running as root?
Remove debugging message. Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
So that entries with a low average time but high imbalance are reported since they are often the cause of significant performance problems due to load imbalance.
Also fix missing deep_copy hook exposure.