
Large data counts support for MPI Communication #1765

Open
wants to merge 21 commits into base: main

Conversation

JuanPedroGHM
Member

@JuanPedroGHM JuanPedroGHM commented Jan 22, 2025

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • benchmarks: created for new functionality
    • benchmarks: performance improved or maintained
    • documentation updated where needed

Description

Some MPI implementations are limited to sending only 2^31-1 elements at once. As far as I have tested, this also applies to OpenMPI 4.1 and 5.0, because large-count support has not been added to mpi4py (at least in my tests it failed).

This small change uses the trick described here to pack contiguous data into an MPI vector datatype, extending the limit on the number of elements that can be sent at once.

This applies to contiguous data only, as non-contiguous data is already packed into recursive vector datatypes, reducing the need for this trick there.
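
For illustration, a minimal mpi4py sketch of the idea (the helper name big_buffer, the float32 base type, the block length, and the tail handling are assumptions for this sketch, not the code in this PR): a contiguous buffer whose element count exceeds 2^31-1 is described by a single committed vector datatype, so the count actually passed to MPI stays small.

# Minimal sketch (not the PR's actual code): describe a large contiguous
# float32 buffer with a vector datatype so the MPI count fits in a C int.
from mpi4py import MPI
import numpy as np

INT_MAX = 2 ** 31 - 1

def big_buffer(array):
    count = array.size
    basetype = MPI.FLOAT  # assuming float32 data for this sketch
    if count <= INT_MAX:
        return [array, count, basetype]
    blocklength = 2 ** 20  # arbitrary block size for the sketch
    blocks, tail = divmod(count, blocklength)
    assert tail == 0, "sketch only: a remainder would need separate handling"
    # One vector datatype covering blocks * blocklength base elements,
    # so the count handed to MPI is just 1 (real code would also free the
    # committed datatype after the communication call).
    vectype = basetype.Create_vector(blocks, blocklength, blocklength)
    vectype.Commit()
    return [array, 1, vectype]

The resulting buffer specification can then be passed to the usual calls, e.g. MPI.COMM_WORLD.Bcast(big_buffer(arr), root=0).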

Issue/s resolved: #

Changes proposed:

  • MPI Vector to send more than 2^31-1 elements at once.
  • __allreduce_like refactored to use custom reduction operators for derived data types.
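
To illustrate the second point, here is a hedged sketch of a user-defined reduction operator that works on buffers described by a derived datatype (the callback name, the fixed float32 base dtype, and the usage line are assumptions; the actual __reduce_like refactor may differ in detail):

# Sketch only: built-in ops such as MPI.SUM may be rejected for derived
# datatypes, so a user-defined operator that knows the base dtype is used.
from mpi4py import MPI
import numpy as np

def _sum_float32(inbuf, inoutbuf, datatype):
    # MPI passes the raw bytes of both vectors; reinterpret them with the
    # known base dtype and accumulate into the in/out buffer in place.
    a = np.frombuffer(inbuf, dtype=np.float32)
    b = np.frombuffer(inoutbuf, dtype=np.float32)
    b += a

# commute=True lets MPI reorder operands, like the built-in MPI.SUM.
big_sum = MPI.Op.Create(_sum_float32, commute=True)
# e.g. comm.Allreduce(MPI.IN_PLACE, [buf, 1, vectype], op=big_sum)
# big_sum.Free() once it is no longer needed.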

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Does this change modify the behaviour of other functions? If so, which?

Yes, probably a lot of them.

Contributor

Thank you for the PR!


codecov bot commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 89.06250% with 7 lines in your changes missing coverage. Please review.

Project coverage is 92.25%. Comparing base (9c8eaf5) to head (fd5599f).

Files with missing lines Patch % Lines
heat/core/communication.py 89.06% 7 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1765   +/-   ##
=======================================
  Coverage   92.24%   92.25%           
=======================================
  Files          84       84           
  Lines       12460    12504   +44     
=======================================
+ Hits        11494    11535   +41     
- Misses        966      969    +3     
Flag Coverage Δ
unit 92.25% <89.06%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@github-actions github-actions bot added the testing Implementation of tests, or test-related issues label Jan 27, 2025

@mrfh92
Collaborator

mrfh92 commented Jan 27, 2025

I have encountered the following problem:

import heat as ht 
import torch

shape = (2 ** 10, 2 ** 10, 2 ** 11)

data = torch.ones(shape, dtype=torch.float32) * ht.MPI_WORLD.rank
ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)

results in the following error:

  File "/heat/heat/core/communication.py", line 915, in Allreduce
    ret, sbuf, rbuf, buf = self.__reduce_like(self.handle.Allreduce, sendbuf, recvbuf, op)
  File "/heat/heat/core/communication.py", line 895, in __reduce_like
    return func(sendbuf, recvbuf, *args, **kwargs), sbuf, rbuf, buf
  File "src/mpi4py/MPI.src/Comm.pyx", line 1115, in mpi4py.MPI.Comm.Allreduce
mpi4py.MPI.Exception: MPI_ERR_OP: invalid reduce operation

With 2 ** 10 in the last entry of shape there is no problem, so it seems to be related to large counts.
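
(For reference: 2 ** 10 * 2 ** 10 * 2 ** 11 = 2 ** 31 = 2,147,483,648 elements, one more than the 2 ** 31 - 1 maximum of a 32-bit count, whereas 2 ** 10 * 2 ** 10 * 2 ** 10 = 2 ** 30 stays well below it — consistent with the large-count path being the trigger.)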

@JuanPedroGHM
Member Author

JuanPedroGHM commented Jan 27, 2025

Benchmark results - Sponsored by perun

function mpi_ranks device metric value ref_value std % change type alert lower_quantile upper_quantile
heat_benchmarks 4 CPU RUNTIME 41.336 52.39 0.128327 -21.0995 jump-detection True nan nan
lanczos 4 CPU RUNTIME 0.243601 0.370699 0.000463902 -34.2861 jump-detection True nan nan
hierachical_svd_rank 4 CPU RUNTIME 0.0461049 0.062739 0.000964415 -26.5132 jump-detection True nan nan
hierachical_svd_tol 4 CPU RUNTIME 0.0528192 0.0734396 0.00218671 -28.078 jump-detection True nan nan
kmeans 4 CPU RUNTIME 0.321285 0.516782 0.00191397 -37.8297 jump-detection True nan nan
kmedians 4 CPU RUNTIME 0.435981 0.690609 0.00343605 -36.8701 jump-detection True nan nan
kmedoids 4 CPU RUNTIME 0.783135 1.23289 0.0214898 -36.4798 jump-detection True nan nan
reshape 4 CPU RUNTIME 0.0670712 0.0762245 0.00207416 -12.0085 jump-detection True nan nan
apply_inplace_standard_scaler_and_inverse 4 CPU RUNTIME 0.00741215 0.0106696 0.000721194 -30.5305 jump-detection True nan nan
apply_inplace_min_max_scaler_and_inverse 4 CPU RUNTIME 0.00101047 0.0013495 2.54152e-05 -25.1228 jump-detection True nan nan
apply_inplace_max_abs_scaler_and_inverse 4 CPU RUNTIME 0.000494969 0.000851977 1.92152e-05 -41.9035 jump-detection True nan nan
apply_inplace_robust_scaler_and_inverse 4 CPU RUNTIME 2.52344 3.90841 0.0524039 -35.4358 jump-detection True nan nan
apply_inplace_normalizer 4 CPU RUNTIME 0.00729311 0.00153844 0.00963938 374.057 jump-detection True nan nan
incremental_pca_split0 4 CPU RUNTIME 34.3503 42.6731 0.12998 -19.5038 jump-detection True nan nan
heat_benchmarks 4 GPU RUNTIME 16.934 21.333 0.0696059 -20.6207 jump-detection True nan nan
lanczos 4 GPU RUNTIME 0.601262 0.724528 0.000842133 -17.0134 jump-detection True nan nan
hierachical_svd_rank 4 GPU RUNTIME 0.0756396 0.095702 0.000126246 -20.9634 jump-detection True nan nan
hierachical_svd_tol 4 GPU RUNTIME 0.104457 0.124643 0.000958304 -16.1952 jump-detection True nan nan
kmeans 4 GPU RUNTIME 0.685847 0.96253 0.00165595 -28.7453 jump-detection True nan nan
kmedians 4 GPU RUNTIME 1.16977 1.636 0.00373927 -28.498 jump-detection True nan nan
kmedoids 4 GPU RUNTIME 1.30779 1.80371 0.00969395 -27.4948 jump-detection True nan nan
reshape 4 GPU RUNTIME 0.138261 0.154857 0.004976 -10.7165 jump-detection True nan nan
concatenate 4 GPU RUNTIME 0.0753749 0.0870384 0.00469164 -13.4004 jump-detection True nan nan
apply_inplace_standard_scaler_and_inverse 4 GPU RUNTIME 0.0121447 0.0181149 0.000520375 -32.9576 jump-detection True nan nan
apply_inplace_min_max_scaler_and_inverse 4 GPU RUNTIME 0.00146581 0.00175316 7.70685e-05 -16.3906 jump-detection True nan nan
apply_inplace_max_abs_scaler_and_inverse 4 GPU RUNTIME 0.0006405 0.000979006 5.84221e-05 -34.5766 jump-detection True nan nan
apply_inplace_robust_scaler_and_inverse 4 GPU RUNTIME 6.23772 8.41783 0.0363718 -25.8987 jump-detection True nan nan
apply_inplace_normalizer 4 GPU RUNTIME 0.00181336 0.00239644 7.24866e-05 -24.3309 jump-detection True nan nan
incremental_pca_split0 4 GPU RUNTIME 4.27703 4.9077 0.0671633 -12.8507 jump-detection True nan nan
heat_benchmarks 4 CPU RUNTIME 41.336 49.1195 0.128327 -15.846 trend-deviation True 44.8761 54.2423
matmul_split_0 4 CPU RUNTIME 0.0978848 0.159913 0.00565681 -38.7888 trend-deviation True 0.121471 0.207108
matmul_split_1 4 CPU RUNTIME 0.0988781 0.138353 0.00119551 -28.5319 trend-deviation True 0.108312 0.179826
qr_split_0 4 CPU RUNTIME 0.216181 0.277881 0.00375546 -22.2038 trend-deviation True 0.227327 0.346153
qr_split_1 4 CPU RUNTIME 0.154111 0.169796 0.00364359 -9.23781 trend-deviation True 0.163472 0.174831
hierachical_svd_rank 4 CPU RUNTIME 0.0461049 0.056287 0.000964415 -18.0896 trend-deviation True 0.0467363 0.0682467
reshape 4 CPU RUNTIME 0.0670712 0.19281 0.00207416 -65.2138 trend-deviation True 0.149999 0.213561
concatenate 4 CPU RUNTIME 0.122652 0.194566 0.00620859 -36.9613 trend-deviation True 0.147756 0.250637
apply_inplace_min_max_scaler_and_inverse 4 CPU RUNTIME 0.00101047 0.00120594 2.54152e-05 -16.2093 trend-deviation True 0.00102689 0.00144081
apply_inplace_max_abs_scaler_and_inverse 4 CPU RUNTIME 0.000494969 0.000763574 1.92152e-05 -35.1773 trend-deviation True 0.000512126 0.00108052
apply_inplace_normalizer 4 CPU RUNTIME 0.00729311 0.00199725 0.00963938 265.157 trend-deviation True 0.000993087 0.00460139
incremental_pca_split0 4 CPU RUNTIME 34.3503 39.9462 0.12998 -14.0087 trend-deviation True 37.1267 43.2585
heat_benchmarks 4 GPU RUNTIME 16.934 21.3689 0.0696059 -20.754 trend-deviation True 20.8217 22.577
matmul_split_0 4 GPU RUNTIME 0.0163408 0.0443321 0.000266767 -63.1399 trend-deviation True 0.0230956 0.0587207
matmul_split_1 4 GPU RUNTIME 0.0160743 0.0272088 0.00011804 -40.9223 trend-deviation True 0.0195909 0.0319684
qr_split_0 4 GPU RUNTIME 0.0394196 0.0537206 0.000230386 -26.6211 trend-deviation True 0.0509402 0.058744
qr_split_1 4 GPU RUNTIME 0.0285229 0.044845 7.8921e-05 -36.3966 trend-deviation True 0.0347649 0.0532777
lanczos 4 GPU RUNTIME 0.601262 0.713858 0.000842133 -15.7729 trend-deviation True 0.606565 0.866203
hierachical_svd_rank 4 GPU RUNTIME 0.0756396 0.0974697 0.000126246 -22.3968 trend-deviation True 0.0953985 0.099392
hierachical_svd_tol 4 GPU RUNTIME 0.104457 0.124553 0.000958304 -16.1345 trend-deviation True 0.120811 0.128828
reshape 4 GPU RUNTIME 0.138261 0.235989 0.004976 -41.412 trend-deviation True 0.174847 0.280703
concatenate 4 GPU RUNTIME 0.0753749 0.0895676 0.00469164 -15.8458 trend-deviation True 0.0755032 0.102385
resplit 4 GPU RUNTIME 2.04335 2.12104 0.00591307 -3.66249 trend-deviation True 2.04964 2.19401
incremental_pca_split0 4 GPU RUNTIME 4.27703 6.41271 0.0671633 -33.3039 trend-deviation True 4.84684 7.58545

Grafana Dashboard
Last updated: 2025-02-17T14:16:21Z

@mrfh92
Collaborator

mrfh92 commented Jan 27, 2025

Could the problem be that, for all communication involving MPI operations like MPI.SUM etc., such an operation is not well-defined on the MPI-Vector construction chosen for the buffers?

@JuanPedroGHM
Member Author

Could the problem be that, for all communication involving MPI operations like MPI.SUM etc., such an operation is not well-defined on the MPI-Vector construction chosen for the buffers?

Have you found a bug? I don't think it should be an issue, as the vector datatype just describes where the data is, where it needs to go, and in what order. As long as both send and recv buffers are well-defined by the datatype, there should not be an issue with MPI operations.

@mrfh92
Collaborator

mrfh92 commented Jan 28, 2025

The example with Allreduce I posted above caused an error for me.

@ClaudiaComito ClaudiaComito added enhancement New feature or request and removed bug Something isn't working backport stable backport release labels Feb 10, 2025
@github-actions github-actions bot added backport release bug Something isn't working labels Feb 10, 2025

@JuanPedroGHM JuanPedroGHM linked an issue Feb 12, 2025 that may be closed by this pull request

Collaborator

@mrfh92 mrfh92 left a comment


As far as I can judge, these changes look fine. As testing this approach requires quite large messages, it won't be possible to test every functionality and/or exception properly within the CI; thus I'd suggest ignoring the non-100% patch coverage for this PR (in particular because your benchmarks above have tested the core functionality on a real HPC system).

Thanks for the work! :)

@mrfh92
Collaborator

mrfh92 commented Feb 14, 2025

I am not sure what exactly happens, but there seem to be problems on the AMD runner (it crashes in test_communication) and also with PyTorch 2.0.1 (across all Python versions).

@ClaudiaComito
Contributor

The bot adds back the wrong labels (bug, backport) after every update. Is there a way to switch off the autolabeling? Otherwise we should remove those labels just before merging, @JuanPedroGHM.

@JuanPedroGHM
Member Author

The bot adds back wrong labels (bug, backport) after every update. Is there a way to switch off the autolabeling? Otherwise we should remove those labels just before merging @JuanPedroGHM

Part of it will be fixed with #1753, but it needs to be merged first. About turning it off, I don't know if there is an easy way to disable actions temporarily.

@JuanPedroGHM JuanPedroGHM removed backport stable backport release testing Implementation of tests, or test-related issues labels Feb 17, 2025
Labels
benchmark (PR benchmarking), bug (Something isn't working), core, enhancement (New feature or request), PR talk
Projects
Status: Merge queue
Development

Successfully merging this pull request may close these issues.

Datatype tiling for large communication
3 participants