
Large data counts support for MPI Communication #1765

Open
wants to merge 21 commits into base: main

Conversation

JuanPedroGHM
Member

@JuanPedroGHM JuanPedroGHM commented Jan 22, 2025

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • benchmarks: created for new functionality
    • benchmarks: performance improved or maintained
    • documentation updated where needed

Description

Some MPI implementations are limited to sending only 2^31-1 elements at once. As far as I have tested, this also applies to OpenMPI 4.1 and 5.0, because large-count support has not been added to mpi4py (at least in my tests it failed).

This small change uses the trick described here to pack contiguous data into an MPI vector datatype, extending the limit on the number of elements that can be sent at once.

This applies to contiguous data only, as non-contiguous data is already packed into recursive vector datatypes, reducing the need for this trick there.
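
For illustration, a minimal mpi4py sketch of the idea (the helper name big_buffer, the float32 base type, the block length, and the tail handling are assumptions for this sketch, not the code in this PR): a contiguous buffer whose element count exceeds 2^31-1 is described by a single committed vector datatype, so the count actually passed to MPI stays small.

# Minimal sketch (not the PR's actual code): describe a large contiguous
# float32 buffer with a vector datatype so the MPI count fits in a C int.
from mpi4py import MPI
import numpy as np

INT_MAX = 2 ** 31 - 1

def big_buffer(array):
    count = array.size
    basetype = MPI.FLOAT  # assuming float32 data for this sketch
    if count <= INT_MAX:
        return [array, count, basetype]
    blocklength = 2 ** 20  # arbitrary block size for the sketch
    blocks, tail = divmod(count, blocklength)
    assert tail == 0, "sketch only: a remainder would need separate handling"
    # One vector datatype covering blocks * blocklength base elements,
    # so the count handed to MPI is just 1 (real code would also free the
    # committed datatype after the communication call).
    vectype = basetype.Create_vector(blocks, blocklength, blocklength)
    vectype.Commit()
    return [array, 1, vectype]

The resulting buffer specification can then be passed to the usual calls, e.g. MPI.COMM_WORLD.Bcast(big_buffer(arr), root=0).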

Issue/s resolved: #

Changes proposed:

  • MPI Vector to send more than 2^31-1 elements at once.
  • __allreduce_like refactored to use custom reduction operators for derived data types.
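
To illustrate the second point, here is a hedged sketch of a user-defined reduction operator that works on buffers described by a derived datatype (the callback name, the fixed float32 base dtype, and the usage line are assumptions; the actual __reduce_like refactor may differ in detail):

# Sketch only: built-in ops such as MPI.SUM may be rejected for derived
# datatypes, so a user-defined operator that knows the base dtype is used.
from mpi4py import MPI
import numpy as np

def _sum_float32(inbuf, inoutbuf, datatype):
    # MPI passes the raw bytes of both vectors; reinterpret them with the
    # known base dtype and accumulate into the in/out buffer in place.
    a = np.frombuffer(inbuf, dtype=np.float32)
    b = np.frombuffer(inoutbuf, dtype=np.float32)
    b += a

# commute=True lets MPI reorder operands, like the built-in MPI.SUM.
big_sum = MPI.Op.Create(_sum_float32, commute=True)
# e.g. comm.Allreduce(MPI.IN_PLACE, [buf, 1, vectype], op=big_sum)
# big_sum.Free() once it is no longer needed.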

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Does this change modify the behaviour of other functions? If so, which?

Yes, probably a lot of them.

Contributor

Thank you for the PR!


codecov bot commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 89.06250% with 7 lines in your changes missing coverage. Please review.

Project coverage is 92.25%. Comparing base (9c8eaf5) to head (fd5599f).

Files with missing lines Patch % Lines
heat/core/communication.py 89.06% 7 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1765   +/-   ##
=======================================
  Coverage   92.24%   92.25%           
=======================================
  Files          84       84           
  Lines       12460    12504   +44     
=======================================
+ Hits        11494    11535   +41     
- Misses        966      969    +3     
Flag Coverage Δ
unit 92.25% <89.06%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@github-actions github-actions bot added the testing Implementation of tests, or test-related issues label Jan 27, 2025

@mrfh92
Collaborator

mrfh92 commented Jan 27, 2025

I have encountered the following problem:

import heat as ht 
import torch

shape = (2 ** 10, 2 ** 10, 2 ** 11)

data = torch.ones(shape, dtype=torch.float32) * ht.MPI_WORLD.rank
ht.MPI_WORLD.Allreduce(ht.MPI.IN_PLACE, data, ht.MPI.SUM)

results in the following error:

  File "/heat/heat/core/communication.py", line 915, in Allreduce
    ret, sbuf, rbuf, buf = self.__reduce_like(self.handle.Allreduce, sendbuf, recvbuf, op)
  File "/heat/heat/core/communication.py", line 895, in __reduce_like
    return func(sendbuf, recvbuf, *args, **kwargs), sbuf, rbuf, buf
  File "src/mpi4py/MPI.src/Comm.pyx", line 1115, in mpi4py.MPI.Comm.Allreduce
mpi4py.MPI.Exception: MPI_ERR_OP: invalid reduce operation

With 2 ** 10 in the last entry of shape there is no problem, so it seems to be related to large counts.
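
(For reference: 2 ** 10 * 2 ** 10 * 2 ** 11 = 2 ** 31 = 2,147,483,648 elements, one more than the 2 ** 31 - 1 maximum of a 32-bit count, whereas 2 ** 10 * 2 ** 10 * 2 ** 10 = 2 ** 30 stays well below it — consistent with the large-count path being the trigger.)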

@JuanPedroGHM
Member Author

JuanPedroGHM commented Jan 27, 2025

Benchmark results - Sponsored by perun

function mpi_ranks device metric value ref_value std % change type alert lower_quantile upper_quantile
heat_benchmarks 4 CPU RUNTIME 41.336 52.39 0.128327 -21.0995 jump-detection True nan nan
lanczos 4 CPU RUNTIME 0.243601 0.370699 0.000463902 -34.2861 jump-detection True nan nan
hierachical_svd_rank 4 CPU RUNTIME 0.0461049 0.062739 0.000964415 -26.5132 jump-detection True nan nan
hierachical_svd_tol 4 CPU RUNTIME 0.0528192 0.0734396 0.00218671 -28.078 jump-detection True nan nan
kmeans 4 CPU RUNTIME 0.321285 0.516782 0.00191397 -37.8297 jump-detection True nan nan
kmedians 4 CPU RUNTIME 0.435981 0.690609 0.00343605 -36.8701 jump-detection True nan nan
kmedoids 4 CPU RUNTIME 0.783135 1.23289 0.0214898 -36.4798 jump-detection True nan nan
reshape 4 CPU RUNTIME 0.0670712 0.0762245 0.00207416 -12.0085 jump-detection True nan nan
apply_inplace_standard_scaler_and_inverse 4 CPU RUNTIME 0.00741215 0.0106696 0.000721194 -30.5305 jump-detection True nan nan
apply_inplace_min_max_scaler_and_inverse 4 CPU RUNTIME 0.00101047 0.0013495 2.54152e-05 -25.1228 jump-detection True nan nan
apply_inplace_max_abs_scaler_and_inverse 4 CPU RUNTIME 0.000494969 0.000851977 1.92152e-05 -41.9035 jump-detection True nan nan
apply_inplace_robust_scaler_and_inverse 4 CPU RUNTIME 2.52344 3.90841 0.0524039 -35.4358 jump-detection True nan nan
apply_inplace_normalizer 4 CPU RUNTIME 0.00729311 0.00153844 0.00963938 374.057 jump-detection True nan nan
incremental_pca_split0 4 CPU RUNTIME 34.3503 42.6731 0.12998 -19.5038 jump-detection True nan nan
heat_benchmarks 4 GPU RUNTIME 16.934 21.333 0.0696059 -20.6207 jump-detection True nan nan
lanczos 4 GPU RUNTIME 0.601262 0.724528 0.000842133 -17.0134 jump-detection True nan nan
hierachical_svd_rank 4 GPU RUNTIME 0.0756396 0.095702 0.000126246 -20.9634 jump-detection True nan nan
hierachical_svd_tol 4 GPU RUNTIME 0.104457 0.124643 0.000958304 -16.1952 jump-detection True nan nan
kmeans 4 GPU RUNTIME 0.685847 0.96253 0.00165595 -28.7453 jump-detection True nan nan
kmedians 4 GPU RUNTIME 1.16977 1.636 0.00373927 -28.498 jump-detection True nan nan
kmedoids 4 GPU RUNTIME 1.30779 1.80371 0.00969395 -27.4948 jump-detection True nan nan
reshape 4 GPU RUNTIME 0.138261 0.154857 0.004976 -10.7165 jump-detection True nan nan
concatenate 4 GPU RUNTIME 0.0753749 0.0870384 0.00469164 -13.4004 jump-detection True nan nan
apply_inplace_standard_scaler_and_inverse 4 GPU RUNTIME 0.0121447 0.0181149 0.000520375 -32.9576 jump-detection True nan nan
apply_inplace_min_max_scaler_and_inverse 4 GPU RUNTIME 0.00146581 0.00175316 7.70685e-05 -16.3906 jump-detection True nan nan
apply_inplace_max_abs_scaler_and_inverse 4 GPU RUNTIME 0.0006405 0.000979006 5.84221e-05 -34.5766 jump-detection True nan nan
apply_inplace_robust_scaler_and_inverse 4 GPU RUNTIME 6.23772 8.41783 0.0363718 -25.8987 jump-detection True nan nan
apply_inplace_normalizer 4 GPU RUNTIME 0.00181336 0.00239644 7.24866e-05 -24.3309 jump-detection True nan nan
incremental_pca_split0 4 GPU RUNTIME 4.27703 4.9077 0.0671633 -12.8507 jump-detection True nan nan
heat_benchmarks 4 CPU RUNTIME 41.336 49.1195 0.128327 -15.846 trend-deviation True 44.8761 54.2423
matmul_split_0 4 CPU RUNTIME 0.0978848 0.159913 0.00565681 -38.7888 trend-deviation True 0.121471 0.207108
matmul_split_1 4 CPU RUNTIME 0.0988781 0.138353 0.00119551 -28.5319 trend-deviation True 0.108312 0.179826
qr_split_0 4 CPU RUNTIME 0.216181 0.277881 0.00375546 -22.2038 trend-deviation True 0.227327 0.346153
qr_split_1 4 CPU RUNTIME 0.154111 0.169796 0.00364359 -9.23781 trend-deviation True 0.163472 0.174831
hierachical_svd_rank 4 CPU RUNTIME 0.0461049 0.056287 0.000964415 -18.0896 trend-deviation True 0.0467363 0.0682467
reshape 4 CPU RUNTIME 0.0670712 0.19281 0.00207416 -65.2138 trend-deviation True 0.149999 0.213561
concatenate 4 CPU RUNTIME 0.122652 0.194566 0.00620859 -36.9613 trend-deviation True 0.147756 0.250637
apply_inplace_min_max_scaler_and_inverse 4 CPU RUNTIME 0.00101047 0.00120594 2.54152e-05 -16.2093 trend-deviation True 0.00102689 0.00144081
apply_inplace_max_abs_scaler_and_inverse 4 CPU RUNTIME 0.000494969 0.000763574 1.92152e-05 -35.1773 trend-deviation True 0.000512126 0.00108052
apply_inplace_normalizer 4 CPU RUNTIME 0.00729311 0.00199725 0.00963938 265.157 trend-deviation True 0.000993087 0.00460139
incremental_pca_split0 4 CPU RUNTIME 34.3503 39.9462 0.12998 -14.0087 trend-deviation True 37.1267 43.2585
heat_benchmarks 4 GPU RUNTIME 16.934 21.3689 0.0696059 -20.754 trend-deviation True 20.8217 22.577
matmul_split_0 4 GPU RUNTIME 0.0163408 0.0443321 0.000266767 -63.1399 trend-deviation True 0.0230956 0.0587207
matmul_split_1 4 GPU RUNTIME 0.0160743 0.0272088 0.00011804 -40.9223 trend-deviation True 0.0195909 0.0319684
qr_split_0 4 GPU RUNTIME 0.0394196 0.0537206 0.000230386 -26.6211 trend-deviation True 0.0509402 0.058744
qr_split_1 4 GPU RUNTIME 0.0285229 0.044845 7.8921e-05 -36.3966 trend-deviation True 0.0347649 0.0532777
lanczos 4 GPU RUNTIME 0.601262 0.713858 0.000842133 -15.7729 trend-deviation True 0.606565 0.866203
hierachical_svd_rank 4 GPU RUNTIME 0.0756396 0.0974697 0.000126246 -22.3968 trend-deviation True 0.0953985 0.099392
hierachical_svd_tol 4 GPU RUNTIME 0.104457 0.124553 0.000958304 -16.1345 trend-deviation True 0.120811 0.128828
reshape 4 GPU RUNTIME 0.138261 0.235989 0.004976 -41.412 trend-deviation True 0.174847 0.280703
concatenate 4 GPU RUNTIME 0.0753749 0.0895676 0.00469164 -15.8458 trend-deviation True 0.0755032 0.102385
resplit 4 GPU RUNTIME 2.04335 2.12104 0.00591307 -3.66249 trend-deviation True 2.04964 2.19401
incremental_pca_split0 4 GPU RUNTIME 4.27703 6.41271 0.0671633 -33.3039 trend-deviation True 4.84684 7.58545

Grafana Dashboard
Last updated: 2025-02-17T14:16:21Z

@mrfh92
Collaborator

mrfh92 commented Jan 27, 2025

Could the problem be that, for all communication involving MPI operations like MPI.SUM etc., such an operation is not well-defined on the MPI-Vector construction chosen for the buffers?

@JuanPedroGHM
Member Author

Could the problem be that, for all communication involving MPI operations like MPI.SUM etc., such an operation is not well-defined on the MPI-Vector construction chosen for the buffers?

Have you found a bug? I don't think it should be an issue, as the vector datatype just describes where the data is, where it needs to go, and in what order. As long as both send and recv buffers are well-defined by the datatype, there should not be an issue with MPI operations.

@mrfh92
Collaborator

mrfh92 commented Jan 28, 2025

The example with Allreduce I posted above caused an error for me.

@ClaudiaComito ClaudiaComito added enhancement New feature or request and removed bug Something isn't working backport stable backport release labels Feb 10, 2025
@github-actions github-actions bot added backport release bug Something isn't working labels Feb 10, 2025

@JuanPedroGHM JuanPedroGHM linked an issue Feb 12, 2025 that may be closed by this pull request

Collaborator

@mrfh92 mrfh92 left a comment


As far as I can judge, these changes look fine. As testing this approach requires quite large messages, it won't be possible to test every functionality and/or exception properly within the CI; thus I'd suggest ignoring the non-100% patch coverage for this PR (in particular because your benchmarks above have tested the core functionality on a real HPC system).

Thanks for the work! :)

@mrfh92
Collaborator

mrfh92 commented Feb 14, 2025

I am not sure what exactly happens, but there seem to be problems on the AMD runner (it crashes in test_communication) and also with PyTorch 2.0.1 (across all Python versions).

@ClaudiaComito
Contributor

The bot adds back the wrong labels (bug, backport) after every update. Is there a way to switch off the autolabeling? Otherwise we should remove those labels just before merging, @JuanPedroGHM.

@JuanPedroGHM
Member Author

The bot adds back wrong labels (bug, backport) after every update. Is there a way to switch off the autolabeling? Otherwise we should remove those labels just before merging @JuanPedroGHM

Part of it will be fixed with #1753, but it needs to be merged first. About turning it off, I don't know if there is an easy way to disable actions temporarily.

@JuanPedroGHM JuanPedroGHM removed backport stable backport release testing Implementation of tests, or test-related issues labels Feb 17, 2025
Labels
benchmark (PR benchmarking), bug (Something isn't working), core, enhancement (New feature or request), PR talk
Projects
Status: Merge queue
Development

Successfully merging this pull request may close these issues.

Datatype tiling for large communication
3 participants