More generic check for CUDA-aware MPI #1793

Merged
merged 13 commits into main on May 26, 2025

Conversation

mrfh92
Collaborator

@mrfh92 mrfh92 commented Feb 14, 2025

fixes #1787

  • CUDA-awareness of OpenMPI is now checked within a try/except construction, so the check is always performed and cannot fail at import time (see the sketch below)
  • a warning is issued if PyTorch supports CUDA but no CUDA-aware MPI is detected
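
A minimal sketch of what such a guarded check could look like. The helper name, the ompi_info query, and the warning text are assumptions for illustration, not the actual heat code:

import subprocess
import warnings

import torch


def check_cuda_aware_openmpi() -> bool:
    """Query ompi_info for CUDA support; never raise if the query fails."""
    try:
        info = subprocess.check_output(["ompi_info", "--parsable", "--all"]).decode()
        return "mpi_built_with_cuda_support:value:true" in info
    except Exception:
        # ompi_info missing, a different MPI implementation, or any other failure:
        # fall back to "not CUDA-aware" instead of raising at import time.
        return False


CUDA_AWARE_MPI = check_cuda_aware_openmpi()
if torch.cuda.is_available() and not CUDA_AWARE_MPI:
    warnings.warn("PyTorch can use CUDA, but no CUDA-aware MPI was detected.")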

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • [ ] benchmarks: created for new functionality
    • [ ] benchmarks: performance improved or maintained
    • documentation updated where needed

Does this change modify the behaviour of other functions? If so, which?

no

…; warning is issued if PyTorch supports GPUs but no cuda-aware MPI is found.
@mrfh92 mrfh92 self-assigned this Feb 14, 2025
@github-actions github-actions bot added the backport stable, bug, and core labels Feb 14, 2025
@mrfh92 mrfh92 added the MPI and communication labels Feb 14, 2025

codecov bot commented Feb 14, 2025

Codecov Report

Attention: Patch coverage is 94.73684% with 2 lines in your changes missing coverage. Please review.

Project coverage is 92.07%. Comparing base (cf33a11) to head (3e76ab8).
Report is 14 commits behind head on main.

Files with missing lines | Patch % | Lines
heat/core/_config.py | 94.28% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1793      +/-   ##
==========================================
+ Coverage   92.06%   92.07%   +0.01%     
==========================================
  Files          85       86       +1     
  Lines       13111    13140      +29     
==========================================
+ Hits        12070    12099      +29     
  Misses       1041     1041              
Flag | Coverage Δ
unit | 92.07% <94.73%> (+0.01%) ⬆️

@ClaudiaComito ClaudiaComito added this to the 1.5.1 milestone Feb 17, 2025
@ClaudiaComito ClaudiaComito changed the title Fix check for CUDA-aware MPI More generic check for CUDA-aware MPI Feb 17, 2025
ClaudiaComito
ClaudiaComito previously approved these changes Feb 17, 2025
Contributor

@ClaudiaComito ClaudiaComito left a comment

As far as I'm concerned this can be merged, thanks for looking into this!

@ClaudiaComito ClaudiaComito modified the milestones: 1.5.1, 1.6 Feb 17, 2025
@JuanPedroGHM
Member

JuanPedroGHM commented Feb 17, 2025

Benchmarks results - Sponsored by perun

function | mpi_ranks | device | metric | value | ref_value | std | % change | type | alert | lower_quantile | upper_quantile
concatenate | 4 | GPU | RUNTIME | 0.0765067 | 0.0915707 | 0.00987731 | -16.4507 | jump-detection | True | nan | nan
apply_inplace_normalizer | 4 | GPU | RUNTIME | 0.0037091 | 0.00180075 | 0.0057156 | 105.975 | jump-detection | True | nan | nan

Grafana Dashboard
Last updated: 2025-04-07T16:17:46Z

@ClaudiaComito ClaudiaComito modified the milestones: 1.6, 1.5.2 Feb 17, 2025
@mrfh92
Collaborator Author

mrfh92 commented Feb 24, 2025

idea (a rough sketch follows the list):

  • check the PyTorch version string for "cu" or "rocm" to determine which backend is installed
  • check torch.cuda.is_available() to determine whether CUDA or ROCm is really usable
  • check the MPI version
  • issue a warning at initialization if something doesn't fit
  • update the docs
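
A minimal sketch of that detection logic. The function name and the warning heuristic are hypothetical; only the torch and mpi4py calls are real APIs:

import warnings

import torch
from mpi4py import MPI


def detect_gpu_backend() -> str:
    """Return "rocm", "cuda", or "none" depending on the installed PyTorch build."""
    if torch.version.hip is not None:
        return "rocm"
    if torch.version.cuda is not None:
        return "cuda"
    return "none"


backend = detect_gpu_backend()
gpus_usable = torch.cuda.is_available()        # True for both CUDA and ROCm builds
mpi_version = MPI.Get_library_version()        # e.g. "Open MPI v4.1.x, ..."

if backend != "none" and gpus_usable and "Open MPI" not in mpi_version:
    # placeholder heuristic: warn at initialization if the combination looks untested
    warnings.warn(f"PyTorch {backend} build detected, but MPI library is: {mpi_version.splitlines()[0]}")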

@JuanPedroGHM
Member

Some more things:

  • GPU tests are not using CUDA_AWARE_MPI, link to the logs
  • There is a new module configuration to use CUDA AWARE MPI in horeka, it can be found here
  • When running the tests with CUDA-aware MPI, the tests fail for async functions like Iallreduce. Obviously those are not supported by OMPI+CUDA (link to the docs here), but it looks like the try/except blocks that we have around those functions are not properly catching the errors. We might need to find a proper way to catch that, or alternatively change the behaviour of those functions when using CUDA buffers (one possible fallback is sketched below).
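
One possible direction, sketched under the assumption that non-blocking collectives simply stage GPU buffers through host memory whenever MPI is not GPU-aware. This is an illustration, not heat's current behaviour, and the helper name is made up:

import torch
from mpi4py import MPI


def iallreduce_safe(comm: MPI.Comm, tensor: torch.Tensor, cuda_aware: bool, op=MPI.SUM):
    """Start a non-blocking allreduce; stage through a host copy if MPI is not GPU-aware."""
    if tensor.is_cuda and not cuda_aware:
        host = tensor.cpu()                            # explicit device-to-host copy
        req = comm.Iallreduce(MPI.IN_PLACE, host.numpy(), op)
        return req, host                               # caller copies back after req.Wait()
    # CPU tensors go through NumPy; CUDA tensors rely on mpi4py's CUDA array interface support
    buf = tensor if tensor.is_cuda else tensor.numpy()
    req = comm.Iallreduce(MPI.IN_PLACE, buf, op)
    return req, tensor

The caller would call req.Wait() and, on the fallback path, copy host back into the original GPU tensor.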

@mrfh92
Collaborator Author

mrfh92 commented Apr 7, 2025

@JuanPedroGHM @ClaudiaComito I have added a module _config in core that handles the MPI and CUDA/ROCm versioning as discussed this morning. I would suggest adding a new variable GPU_AWARE_MPI that combines both CUDA- and ROCm-awareness, provided that PyTorch supports the respective backend (a rough sketch of the combination is below).
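
A rough sketch of that combination. The two awareness flags are assumed to be detected elsewhere in _config; this is not the actual implementation:

import torch


def combine_gpu_awareness(cuda_aware_mpi: bool, rocm_aware_mpi: bool) -> bool:
    """GPU_AWARE_MPI: True only if MPI is aware of the GPU backend this PyTorch build provides."""
    cuda_build = torch.version.cuda is not None
    rocm_build = torch.version.hip is not None
    gpus_present = torch.cuda.is_available()   # True for both CUDA and ROCm builds
    return gpus_present and ((cuda_aware_mpi and cuda_build) or (rocm_aware_mpi and rocm_build))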

@JuanPedroGHM
Member

Examples of possible global variables, and where to use them

  • PLATFORM
    • Options: “linux/unix”, “win”, “osx/mps”
    • Affects:
      • Available datatypes and their promotion rules
      • Available functions
      • Tests to run
  • MPI_VERSION
    • Options: “ompi”, “mvapich”, “intel”, “parastation”, “ucx”
    • Affects:
      • Available collective ops, special rules
      • heat/core/communication.py
  • ACCELERATOR_BACKEND / DEVICE
    • Options: “None”, “CUDA”, “ROCM”, “Intel GPU” (eventually), “MPS”
    • Affects:
      • heat/core/devices.py
  • GPU_AWARE_MPI
    • Options: True, False
    • Depends: MPI_VERSION
    • Affects:
      • heat/core/communication.py
  • HEAT_DEFAULT_DEVICE
    • Preferred user device
    • Depends:
      • Env var os.environ["HEAT_DEFAULT_DEVICE"]
      • Available devices
    • Affects:
      • heat/core/devices.py
  • HEAT_TEST_USE_DEVICE
    • Default device for tests
    • Depends:
      • Env var
      • Available devices
    • Affects:
      • Tests
      • Overrides HEAT_DEFAULT_DEVICE
  • HEAT_PRINT
    • Options: LOCAL, GLOBAL
    • Affects:
      • heat/core/printing.py

Things we should do with those variables

  • Selection of backup functions/routines when not available in hardware
  • Test filtering based purely on global variables
    • Limiting the parameters
  • Info command (see the sketch after this list)
    • Print the configuration/env
  • Topology command
    • Print the ranks and assigned devices
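
A minimal sketch of how a few of these globals and the info command might be derived. The names follow the list above, but nothing here is the actual heat implementation:

import os
import sys

# Platform detection, following the options listed above.
if sys.platform == "darwin":
    PLATFORM = "osx/mps"
elif os.name == "nt":
    PLATFORM = "win"
else:
    PLATFORM = "linux/unix"

# User-facing environment variables; the test device falls back to the default device.
HEAT_DEFAULT_DEVICE = os.environ.get("HEAT_DEFAULT_DEVICE", "cpu")
HEAT_TEST_USE_DEVICE = os.environ.get("HEAT_TEST_USE_DEVICE", HEAT_DEFAULT_DEVICE)


def info() -> None:
    """The proposed info command: print the detected configuration."""
    for name in ("PLATFORM", "HEAT_DEFAULT_DEVICE", "HEAT_TEST_USE_DEVICE"):
        print(f"{name} = {globals()[name]}")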

@mrfh92
Collaborator Author

mrfh92 commented May 19, 2025

@JuanPedroGHM @ClaudiaComito idea: could we merge this PR as it is (as a quick bugfix) and open a new issue for the extensive refactoring suggested by @JuanPedroGHM?

@mrfh92 mrfh92 dismissed ClaudiaComito’s stale review May 26, 2025 11:17

changes have been made in the meantime

@mrfh92 mrfh92 merged commit a8efd82 into main May 26, 2025
90 of 114 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Roadmap May 26, 2025
Contributor

Backport failed for stable, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin stable
git worktree add -d .worktree/backport-1793-to-stable origin/stable
cd .worktree/backport-1793-to-stable
git switch --create backport-1793-to-stable
git cherry-pick -x 8a3ae51adcb4a997346e8f379e48a9a72b9f9517 c96a4e427c124d94df5bb6d1bd64908b997cc4f4 a423f485678ede58f194b1d3f1692c143342487e 70766976c20b7720c090b06f28b178556e3283f6 e44c43b2b51773c9781d37ee27496da03f33ce72

@mrfh92 mrfh92 mentioned this pull request May 26, 2025
4 tasks
JuanPedroGHM added a commit that referenced this pull request Jun 12, 2025
* cuda-awareness of openmpi is now checked in a try-except-construction; warning is issued if PyTorch supports GPUs but no cuda-aware MPI is found.

(cherry picked from commit 8a3ae51)

* Update communication.py

(cherry picked from commit c96a4e4)

* added module _config in core which is intended to handle MPI, CUDA, and ROCm versioning

(cherry picked from commit a423f48)

* added variable GPU_AWARE_MPI

(cherry picked from commit 7076697)

* added MPICH

(cherry picked from commit e44c43b)

---------

Co-authored-by: Hoppe <[email protected]>
Co-authored-by: Fabian Hoppe <[email protected]>
Labels
backport stable, bug, communication, core, MPI, PR talk
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug]: Check for CUDA-aware MPI might fail
5 participants