
Large data counts support for MPI Communication #1765


Merged

JuanPedroGHM merged 24 commits into main from fix/mpi_int_limit_trick on Feb 24, 2025

Conversation

@JuanPedroGHM (Member) commented on Jan 22, 2025

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • benchmarks: created for new functionality
    • benchmarks: performance improved or maintained
    • documentation updated where needed

Description

Some MPI implementations are limited to sending only 2^31 - 1 elements at once. As far as I have tested, this also applies to OpenMPI 4.1 and 5.0, because large-count support has not been added to mpi4py (at least in my tests it failed).

This small change uses the trick described here to pack contiguous data into an MPI vector datatype, extending the limit on the number of elements that can be sent in a single call.

This applies to contiguous data only; non-contiguous data is already packed into recursive vector datatypes, so there is little need to apply the trick there.
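
For illustration, here is a minimal mpi4py sketch of the vector-datatype trick for a contiguous buffer. The helper name as_big_buffer, the block length, and the float32 base type are assumptions for this example, not Heat's actual implementation.

from mpi4py import MPI
import numpy as np

INT_MAX = 2**31 - 1

def as_big_buffer(array):
    # Describe a large contiguous float32 array as (buffer, count, datatype)
    # so that the count fits into MPI's 32-bit count argument.
    # Hypothetical helper, not Heat's code.
    basetype = MPI.FLOAT                      # assumes float32 data
    n = array.size
    if n <= INT_MAX:
        return [array, n, basetype]
    blocklength = 2**20                       # elements per block
    count, remainder = divmod(n, blocklength)
    # a vector of `count` contiguous blocks of `blocklength` elements each
    vector = basetype.Create_vector(count, blocklength, blocklength)
    # append the leftover elements right after the vector part
    offset = count * blocklength * basetype.Get_size()
    bigtype = MPI.Datatype.Create_struct(
        [1, remainder], [0, offset], [vector, basetype]
    )
    bigtype.Commit()
    vector.Free()
    # one element of `bigtype` now covers all n values
    return [array, 1, bigtype]

# usage sketch: comm.Send(as_big_buffer(data), dest=1)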

Issue/s resolved: #

Changes proposed:

  • MPI vector datatype to send more than 2^31 - 1 elements at once.
  • __allreduce_like refactored to use custom reduction operators for derived datatypes (see the sketch below).
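
As a hedged illustration of the second point: built-in reduction operators such as MPI.SUM are only defined for predefined datatypes, so a reduction over a derived datatype needs a user-defined operator. The dtype and the operator name below are assumptions for this sketch, not the exact code of __allreduce_like.

from mpi4py import MPI
import numpy as np

def _sum_f32(inbuf, inoutbuf, datatype):
    # interpret the raw bytes of both buffers as float32 and accumulate
    # into the in/out buffer; assumes the derived type packs float32 data
    a = np.frombuffer(inbuf, dtype=np.float32)
    b = np.frombuffer(inoutbuf, dtype=np.float32)
    b += a

# commutative user-defined operator that also works with derived datatypes
custom_sum = MPI.Op.Create(_sum_f32, commute=True)

# usage sketch with hypothetical buffers described by the derived type:
# comm.Allreduce(sendbuf, recvbuf, op=custom_sum)
# custom_sum.Free()  # release the operator when no longer needed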

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Does this change modify the behaviour of other functions? If so, which?

Yes, probably a lot of them.

Contributor

Thank you for the PR!

codecov bot commented on Jan 22, 2025

Codecov Report

Attention: Patch coverage is 89.06250% with 7 lines in your changes missing coverage. Please review.

Project coverage is 92.24%. Comparing base (ba0b7e1) to head (e7a9fd0).
Report is 14 commits behind head on main.

Files with missing lines: heat/core/communication.py (patch coverage 89.06%, 7 lines missing ⚠️)
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1765      +/-   ##
==========================================
- Coverage   92.24%   92.24%   -0.01%     
==========================================
  Files          84       84              
  Lines       12460    12504      +44     
==========================================
+ Hits        11494    11534      +40     
- Misses        966      970       +4     
Flag Coverage Δ
unit 92.24% <89.06%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@github-actions bot added the testing (Implementation of tests, or test-related issues) label on Jan 27, 2025

@mrfh92 (Collaborator) commented on Jan 27, 2025

I have encountered the following problem:

import heat as ht 
import torch

shape = (2 ** 10, 2 ** 10, 2 ** 11)
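# this is 2**10 * 2**10 * 2**11 = 2**31 elements in total, one more than the 2**31 - 1 count limit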

data = torch.ones(shape, dtype=torch.float32) * ht.MPI_WORLD.rank
ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)

results in the following error:

  File "/heat/heat/core/communication.py", line 915, in Allreduce
    ret, sbuf, rbuf, buf = self.__reduce_like(self.handle.Allreduce, sendbuf, recvbuf, op)
  File "/heat/heat/core/communication.py", line 895, in __reduce_like
    return func(sendbuf, recvbuf, *args, **kwargs), sbuf, rbuf, buf
  File "src/mpi4py/MPI.src/Comm.pyx", line 1115, in mpi4py.MPI.Comm.Allreduce
mpi4py.MPI.Exception: MPI_ERR_OP: invalid reduce operation

With 2 ** 10 in the last entry of shape there is no problem, so it seems to be related to large counts.

@JuanPedroGHM (Member, Author) commented on Jan 27, 2025

Benchmark results - Sponsored by perun

function mpi_ranks device metric value ref_value std % change type alert lower_quantile upper_quantile
heat_benchmarks 4 CPU RUNTIME 41.3293 51.8208 0.13663 -20.2458 jump-detection True nan nan
lanczos 4 CPU RUNTIME 0.245472 0.368403 0.0013412 -33.3685 jump-detection True nan nan
hierachical_svd_rank 4 CPU RUNTIME 0.0464498 0.0639418 0.00221168 -27.3561 jump-detection True nan nan
hierachical_svd_tol 4 CPU RUNTIME 0.0527641 0.0744805 0.00242268 -29.1572 jump-detection True nan nan
kmeans 4 CPU RUNTIME 0.318249 0.510466 0.00115396 -37.6551 jump-detection True nan nan
kmedians 4 CPU RUNTIME 0.435884 0.68515 0.00453745 -36.3812 jump-detection True nan nan
kmedoids 4 CPU RUNTIME 0.779213 1.21754 0.00374886 -36.0009 jump-detection True nan nan
reshape 4 CPU RUNTIME 0.0660566 0.0773497 0.000687786 -14.6001 jump-detection True nan nan
apply_inplace_standard_scaler_and_inverse 4 CPU RUNTIME 0.00699725 0.00974845 0.000726862 -28.2219 jump-detection True nan nan
apply_inplace_min_max_scaler_and_inverse 4 CPU RUNTIME 0.00100462 0.00137954 1.44658e-05 -27.1767 jump-detection True nan nan
apply_inplace_max_abs_scaler_and_inverse 4 CPU RUNTIME 0.000496447 0.000884557 2.12511e-05 -43.8762 jump-detection True nan nan
apply_inplace_robust_scaler_and_inverse 4 CPU RUNTIME 2.50269 3.86577 0.0481586 -35.2603 jump-detection True nan nan
apply_inplace_normalizer 4 CPU RUNTIME 0.000991201 0.00156467 0.000117324 -36.6511 jump-detection True nan nan
incremental_pca_split0 4 CPU RUNTIME 34.3897 42.315 0.140765 -18.7293 jump-detection True nan nan
heat_benchmarks 4 GPU RUNTIME 17.0395 21.3331 0.10293 -20.1267 jump-detection True nan nan
lanczos 4 GPU RUNTIME 0.606968 0.722394 0.00196506 -15.9783 jump-detection True nan nan
hierachical_svd_rank 4 GPU RUNTIME 0.0750733 0.0950145 5.01931e-05 -20.9876 jump-detection True nan nan
hierachical_svd_tol 4 GPU RUNTIME 0.103358 0.125885 0.000109619 -17.8949 jump-detection True nan nan
kmeans 4 GPU RUNTIME 0.687634 0.945599 0.00166606 -27.2806 jump-detection True nan nan
kmedians 4 GPU RUNTIME 1.17235 1.62268 0.00447032 -27.7523 jump-detection True nan nan
kmedoids 4 GPU RUNTIME 1.31362 1.77978 0.00358418 -26.1921 jump-detection True nan nan
apply_inplace_standard_scaler_and_inverse 4 GPU RUNTIME 0.011572 0.019365 0.00139041 -40.2426 jump-detection True nan nan
apply_inplace_min_max_scaler_and_inverse 4 GPU RUNTIME 0.00145524 0.00173657 1.7101e-05 -16.2006 jump-detection True nan nan
apply_inplace_max_abs_scaler_and_inverse 4 GPU RUNTIME 0.000604403 0.00100199 1.4756e-05 -39.6797 jump-detection True nan nan
apply_inplace_robust_scaler_and_inverse 4 GPU RUNTIME 6.21919 8.30611 0.0231803 -25.1251 jump-detection True nan nan
apply_inplace_normalizer 4 GPU RUNTIME 0.00826368 0.00252771 0.0139664 226.923 jump-detection True nan nan
incremental_pca_split0 4 GPU RUNTIME 4.33722 4.86028 0.0737785 -10.7619 jump-detection True nan nan
heat_benchmarks 4 CPU RUNTIME 41.3293 49.2883 0.13663 -16.1479 trend-deviation True 44.8767 54.2372
matmul_split_0 4 CPU RUNTIME 0.098484 0.156183 0.00608012 -36.9431 trend-deviation True 0.099259 0.207059
matmul_split_1 4 CPU RUNTIME 0.100005 0.136353 0.00272885 -26.6576 trend-deviation True 0.104529 0.179407
qr_split_0 4 CPU RUNTIME 0.216765 0.274786 0.00524295 -21.1153 trend-deviation True 0.225948 0.346127
qr_split_1 4 CPU RUNTIME 0.153472 0.169503 0.00290488 -9.4576 trend-deviation True 0.163512 0.174618
hierachical_svd_rank 4 CPU RUNTIME 0.0464498 0.0567654 0.00221168 -18.1722 trend-deviation True 0.0467594 0.0682228
reshape 4 CPU RUNTIME 0.0660566 0.185593 0.000687786 -64.4079 trend-deviation True 0.0770684 0.213552
concatenate 4 CPU RUNTIME 0.120446 0.190176 0.00322879 -36.6658 trend-deviation True 0.128103 0.250247
apply_inplace_min_max_scaler_and_inverse 4 CPU RUNTIME 0.00100462 0.00121679 1.44658e-05 -17.4365 trend-deviation True 0.00102721 0.00143947
apply_inplace_max_abs_scaler_and_inverse 4 CPU RUNTIME 0.000496447 0.000771135 2.12511e-05 -35.6213 trend-deviation True 0.000512384 0.00107936
apply_inplace_normalizer 4 CPU RUNTIME 0.000991201 0.00197022 0.000117324 -49.6908 trend-deviation True 0.000993604 0.00444019
incremental_pca_split0 4 CPU RUNTIME 34.3897 40.0943 0.140765 -14.2279 trend-deviation True 37.1301 43.2449
heat_benchmarks 4 GPU RUNTIME 17.0395 21.3673 0.10293 -20.2542 trend-deviation True 20.8222 22.5718
qr_split_0 4 GPU RUNTIME 0.039751 0.0531461 0.000211481 -25.2042 trend-deviation True 0.0415731 0.0586575
qr_split_1 4 GPU RUNTIME 0.0289725 0.0441642 0.000705078 -34.3983 trend-deviation True 0.0306853 0.0532602
hierachical_svd_rank 4 GPU RUNTIME 0.0750733 0.0973581 5.01931e-05 -22.8896 trend-deviation True 0.0951896 0.0993752
hierachical_svd_tol 4 GPU RUNTIME 0.103358 0.124613 0.000109619 -17.0569 trend-deviation True 0.120825 0.128792
reshape 4 GPU RUNTIME 0.142109 0.23206 0.0084698 -38.7618 trend-deviation True 0.155856 0.280106
concatenate 4 GPU RUNTIME 0.0750148 0.0891306 0.00398519 -15.8371 trend-deviation True 0.0756304 0.102369
apply_inplace_normalizer 4 GPU RUNTIME 0.00826368 0.00301212 0.0139664 174.347 trend-deviation True 0.00177077 0.00476618
incremental_pca_split0 4 GPU RUNTIME 4.33722 6.34214 0.0737785 -31.6127 trend-deviation True 4.84751 7.58446

Grafana Dashboard
Last updated: 2025-02-24T15:45:48Z

@mrfh92 (Collaborator) commented on Jan 27, 2025

Could the problem be that, for all communication involving MPI operations like MPI.SUM etc., such an operation is not well-defined on the MPI vector construction chosen for the buffers?


@JuanPedroGHM (Member, Author) replied:

> Could the problem be that, for all communication involving MPI operations like MPI.SUM etc., such an operation is not well-defined on the MPI vector construction chosen for the buffers?

Have you found a bug? I don't think it should be an issue, as the vector datatype just describes where the data is, where it needs to go, and in what order. As long as both the send and recv buffers are well-defined by the datatype, there should not be an issue with MPI operations.
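
To illustrate the layout-only point, here is a minimal, self-contained sketch with toy sizes (not Heat code and not tied to this PR): a plain Send/Recv pair with a vector datatype simply moves the bytes it describes. The failing case reported above is specifically a reduction with a built-in operator.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# A derived (vector) datatype only describes the memory layout, so plain
# point-to-point communication with it behaves as expected. Run with 2 ranks.
data = np.arange(8, dtype=np.float32) * rank
vec = MPI.FLOAT.Create_vector(4, 2, 2)   # 4 blocks of 2 contiguous floats
vec.Commit()

if rank == 0:
    comm.Send([data, 1, vec], dest=1, tag=7)
elif rank == 1:
    comm.Recv([data, 1, vec], source=0, tag=7)

vec.Free()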

@mrfh92 (Collaborator) commented on Jan 28, 2025

The example with Allreduce I posted above caused an error for me.



@mrfh92 (Collaborator) left a review comment:


As far as I can judge, these changes look fine. Since testing this approach requires quite large messages, it won't be possible to test every function and/or exception properly within the CI; thus I'd suggest ignoring the non-100% patch coverage for this PR (in particular because your benchmarks above have exercised the core functionality on a real HPC system).

Thanks for the work! :)


@mrfh92 (Collaborator) commented on Feb 14, 2025

I am not sure what exactly is happening, but there seem to be problems on the AMD runner (it crashes in test_communication) and also with PyTorch 2.0.1 (for all Python versions).

@ClaudiaComito (Contributor) commented:

The bot adds back the wrong labels (bug, backport) after every update. Is there a way to switch off the autolabeling? Otherwise we should remove those labels just before merging, @JuanPedroGHM.


@JuanPedroGHM (Member, Author) replied:

> The bot adds back the wrong labels (bug, backport) after every update. Is there a way to switch off the autolabeling? Otherwise we should remove those labels just before merging, @JuanPedroGHM.

Part of it will be fixed with #1753, but that needs to be merged first. As for turning it off, I don't know of an easy way to disable actions temporarily.

@JuanPedroGHM removed the backport stable, backport release, and testing (Implementation of tests, or test-related issues) labels on Feb 17, 2025
@github-actions bot added the backport stable and testing (Implementation of tests, or test-related issues) labels on Feb 24, 2025

@JuanPedroGHM merged commit d2afadf into main on Feb 24, 2025 (7 of 8 checks passed)

@JuanPedroGHM deleted the fix/mpi_int_limit_trick branch on April 1, 2025
@JuanPedroGHM mentioned this pull request on Apr 14, 2025
Labels

benchmark PR (benchmarking), bug (Something isn't working), core, enhancement (New feature or request), PR talk, testing (Implementation of tests, or test-related issues)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Datatype tiling for large communication
3 participants