More generic check for CUDA-aware MPI #1793

Merged
merged 13 commits into main on May 26, 2025

Conversation

mrfh92
Collaborator

@mrfh92 mrfh92 commented Feb 14, 2025

fixes #1787

  • CUDA-awareness of OpenMPI is now checked within a try/except construction, so the check is always performed and cannot fail at import time (see the sketch below)
  • a warning is issued if PyTorch supports CUDA but no CUDA-aware MPI is detected
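
A minimal sketch of what such a guarded check could look like. The helper name, the ompi_info query, and the warning text are assumptions for illustration, not the actual heat code:

import subprocess
import warnings

import torch


def check_cuda_aware_openmpi() -> bool:
    """Query ompi_info for CUDA support; never raise if the query fails."""
    try:
        info = subprocess.check_output(["ompi_info", "--parsable", "--all"]).decode()
        return "mpi_built_with_cuda_support:value:true" in info
    except Exception:
        # ompi_info missing, a different MPI implementation, or any other failure:
        # fall back to "not CUDA-aware" instead of raising at import time.
        return False


CUDA_AWARE_MPI = check_cuda_aware_openmpi()
if torch.cuda.is_available() and not CUDA_AWARE_MPI:
    warnings.warn("PyTorch can use CUDA, but no CUDA-aware MPI was detected.")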

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • [ ] benchmarks: created for new functionality
    • [ ] benchmarks: performance improved or maintained
    • documentation updated where needed

Does this change modify the behaviour of other functions? If so, which?

no

…; warning is issued if PyTorch supports GPUs but no cuda-aware MPI is found.
@mrfh92 mrfh92 self-assigned this Feb 14, 2025
@github-actions github-actions bot added the backport stable, bug, and core labels Feb 14, 2025
@mrfh92 mrfh92 added the MPI and communication labels Feb 14, 2025

codecov bot commented Feb 14, 2025

Codecov Report

Attention: Patch coverage is 94.73684% with 2 lines in your changes missing coverage. Please review.

Project coverage is 92.07%. Comparing base (cf33a11) to head (3e76ab8).
Report is 14 commits behind head on main.

Files with missing lines | Patch % | Lines
heat/core/_config.py | 94.28% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1793      +/-   ##
==========================================
+ Coverage   92.06%   92.07%   +0.01%     
==========================================
  Files          85       86       +1     
  Lines       13111    13140      +29     
==========================================
+ Hits        12070    12099      +29     
  Misses       1041     1041              
Flag | Coverage Δ
unit | 92.07% <94.73%> (+0.01%) ⬆️

@ClaudiaComito ClaudiaComito added this to the 1.5.1 milestone Feb 17, 2025
@ClaudiaComito ClaudiaComito changed the title Fix check for CUDA-aware MPI More generic check for CUDA-aware MPI Feb 17, 2025
ClaudiaComito
ClaudiaComito previously approved these changes Feb 17, 2025
Contributor

@ClaudiaComito ClaudiaComito left a comment

As far as I'm concerned this can be merged, thanks for looking into this!

@ClaudiaComito ClaudiaComito modified the milestones: 1.5.1, 1.6 Feb 17, 2025
@JuanPedroGHM
Member

JuanPedroGHM commented Feb 17, 2025

Benchmarks results - Sponsored by perun

function | mpi_ranks | device | metric | value | ref_value | std | % change | type | alert | lower_quantile | upper_quantile
concatenate | 4 | GPU | RUNTIME | 0.0765067 | 0.0915707 | 0.00987731 | -16.4507 | jump-detection | True | nan | nan
apply_inplace_normalizer | 4 | GPU | RUNTIME | 0.0037091 | 0.00180075 | 0.0057156 | 105.975 | jump-detection | True | nan | nan

Grafana Dashboard
Last updated: 2025-04-07T16:17:46Z

@ClaudiaComito ClaudiaComito modified the milestones: 1.6, 1.5.2 Feb 17, 2025
@mrfh92
Collaborator Author

mrfh92 commented Feb 24, 2025

idea (a rough sketch follows the list):

  • check the PyTorch version string for "cu" or "rocm" to determine which backend is installed
  • check torch.cuda.is_available() to determine whether CUDA or ROCm is really usable
  • check the MPI version
  • issue a warning at initialization if something doesn't fit
  • update the docs
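
A minimal sketch of that detection logic. The function name and the warning heuristic are hypothetical; only the torch and mpi4py calls are real APIs:

import warnings

import torch
from mpi4py import MPI


def detect_gpu_backend() -> str:
    """Return "rocm", "cuda", or "none" depending on the installed PyTorch build."""
    if torch.version.hip is not None:
        return "rocm"
    if torch.version.cuda is not None:
        return "cuda"
    return "none"


backend = detect_gpu_backend()
gpus_usable = torch.cuda.is_available()        # True for both CUDA and ROCm builds
mpi_version = MPI.Get_library_version()        # e.g. "Open MPI v4.1.x, ..."

if backend != "none" and gpus_usable and "Open MPI" not in mpi_version:
    # placeholder heuristic: warn at initialization if the combination looks untested
    warnings.warn(f"PyTorch {backend} build detected, but MPI library is: {mpi_version.splitlines()[0]}")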

@JuanPedroGHM
Member

Some more things:

  • GPU tests are not using CUDA_AWARE_MPI, link to the logs
  • There is a new module configuration to use CUDA AWARE MPI in horeka, it can be found here
  • When running the tests with CUDA-aware MPI, the tests fail for async functions like Iallreduce. Obviously those are not supported by OMPI+CUDA (link to the docs here), but it looks like the try/except blocks that we have around those functions are not properly catching the errors. We might need to find a proper way to catch that, or alternatively change the behaviour of those functions when using CUDA buffers (one possible fallback is sketched below).
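
One possible direction, sketched under the assumption that non-blocking collectives simply stage GPU buffers through host memory whenever MPI is not GPU-aware. This is an illustration, not heat's current behaviour, and the helper name is made up:

import torch
from mpi4py import MPI


def iallreduce_safe(comm: MPI.Comm, tensor: torch.Tensor, cuda_aware: bool, op=MPI.SUM):
    """Start a non-blocking allreduce; stage through a host copy if MPI is not GPU-aware."""
    if tensor.is_cuda and not cuda_aware:
        host = tensor.cpu()                            # explicit device-to-host copy
        req = comm.Iallreduce(MPI.IN_PLACE, host.numpy(), op)
        return req, host                               # caller copies back after req.Wait()
    # CPU tensors go through NumPy; CUDA tensors rely on mpi4py's CUDA array interface support
    buf = tensor if tensor.is_cuda else tensor.numpy()
    req = comm.Iallreduce(MPI.IN_PLACE, buf, op)
    return req, tensor

The caller would call req.Wait() and, on the fallback path, copy host back into the original GPU tensor.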

@mrfh92
Collaborator Author

mrfh92 commented Apr 7, 2025

@JuanPedroGHM @ClaudiaComito I have added a module _config in core that handles the MPI and CUDA/ROCm versioning as discussed this morning. I would suggest adding a new variable GPU_AWARE_MPI that combines both CUDA- and ROCm-awareness, provided that PyTorch supports the respective backend (a rough sketch of the combination is below).
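
A rough sketch of that combination. The two awareness flags are assumed to be detected elsewhere in _config; this is not the actual implementation:

import torch


def combine_gpu_awareness(cuda_aware_mpi: bool, rocm_aware_mpi: bool) -> bool:
    """GPU_AWARE_MPI: True only if MPI is aware of the GPU backend this PyTorch build provides."""
    cuda_build = torch.version.cuda is not None
    rocm_build = torch.version.hip is not None
    gpus_present = torch.cuda.is_available()   # True for both CUDA and ROCm builds
    return gpus_present and ((cuda_aware_mpi and cuda_build) or (rocm_aware_mpi and rocm_build))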

@JuanPedroGHM
Member

Examples of possible global variables, and where to use them

  • PLATFORM
    • Options: “linux/unix”, “win”, “osx/mps”
    • Affects:
      • Available datatypes and their promotion rules
      • Available functions
      • Tests to run
  • MPI_VERSION
    • Options: “ompi”, “mvapich”, “intel”, “parastation”, “ucx”
    • Affects:
      • Available collective ops, special rules
      • heat/core/communication.py
  • ACCELERATOR_BACKEND / DEVICE
    • Options: “None”, “CUDA”, “ROCM”, “Intel GPU” (eventually), “MPS”
    • Affects:
      • heat/core/devices.py
  • GPU_AWARE_MPI
    • Options: True, False
    • Depends: MPI_VERSION
    • Affects:
      • heat/core/communication.py
  • HEAT_DEFAULT_DEVICE
    • Preferred user device
    • Depends:
      • Env var os.environ["HEAT_DEFAULT_DEVICE"]
      • Available devices
    • Affects:
      • heat/core/devices.py
  • HEAT_TEST_USE_DEVICE
    • Default device for tests
    • Depends:
      • Env var
      • Available devices
    • Affects:
      • Tests
      • Overrides HEAT_DEFAULT_DEVICE
  • HEAT_PRINT
    • Options: LOCAL, GLOBAL
    • Affects:
      • heat/core/printing.py

Things we should do with those variables

  • Selection of backup functions/routines when not available in hardware
  • Test filtering based purely on global variables
    • Limiting the parameters
  • Info command (see the sketch after this list)
    • Print the configuration/env
  • Topology command
    • Print the ranks and assigned devices
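
A minimal sketch of how a few of these globals and the info command might be derived. The names follow the list above, but nothing here is the actual heat implementation:

import os
import sys

# Platform detection, following the options listed above.
if sys.platform == "darwin":
    PLATFORM = "osx/mps"
elif os.name == "nt":
    PLATFORM = "win"
else:
    PLATFORM = "linux/unix"

# User-facing environment variables; the test device falls back to the default device.
HEAT_DEFAULT_DEVICE = os.environ.get("HEAT_DEFAULT_DEVICE", "cpu")
HEAT_TEST_USE_DEVICE = os.environ.get("HEAT_TEST_USE_DEVICE", HEAT_DEFAULT_DEVICE)


def info() -> None:
    """The proposed info command: print the detected configuration."""
    for name in ("PLATFORM", "HEAT_DEFAULT_DEVICE", "HEAT_TEST_USE_DEVICE"):
        print(f"{name} = {globals()[name]}")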

@mrfh92
Collaborator Author

mrfh92 commented May 19, 2025

@JuanPedroGHM @ClaudiaComito idea: could we merge this PR as it is (as a quick bugfix) and open a new issue for the extensive refactoring suggested by @JuanPedroGHM?

@mrfh92 mrfh92 dismissed ClaudiaComito’s stale review May 26, 2025 11:17

changes have been made in the meantime

@mrfh92 mrfh92 merged commit a8efd82 into main May 26, 2025
90 of 114 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Roadmap May 26, 2025
Contributor

Backport failed for stable, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin stable
git worktree add -d .worktree/backport-1793-to-stable origin/stable
cd .worktree/backport-1793-to-stable
git switch --create backport-1793-to-stable
git cherry-pick -x 8a3ae51adcb4a997346e8f379e48a9a72b9f9517 c96a4e427c124d94df5bb6d1bd64908b997cc4f4 a423f485678ede58f194b1d3f1692c143342487e 70766976c20b7720c090b06f28b178556e3283f6 e44c43b2b51773c9781d37ee27496da03f33ce72

@mrfh92 mrfh92 mentioned this pull request May 26, 2025
4 tasks
JuanPedroGHM added a commit that referenced this pull request Jun 12, 2025
* cuda-awareness of openmpi is now checked in a try-except-construction; warning is issued if PyTorch supports GPUs but no cuda-aware MPI is found.

(cherry picked from commit 8a3ae51)

* Update communication.py

(cherry picked from commit c96a4e4)

* added module _config in core which is intended to handle MPI, CUDA, and ROCm versioning

(cherry picked from commit a423f48)

* added variable GPU_AWARE_MPI

(cherry picked from commit 7076697)

* added MPICH

(cherry picked from commit e44c43b)

---------

Co-authored-by: Hoppe <[email protected]>
Co-authored-by: Fabian Hoppe <[email protected]>
Labels
backport stable, bug, communication, core, MPI, PR talk
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug]: Check for CUDA-aware MPI might fail
5 participants