Han gatherv noncontiguous datatype fix #12439


Merged

Conversation

wenduwan
Contributor

This is a bugfix for derived datatypes with gaps. The reorder buffer should match the type signature and be large enough to hold gaps in addition to the actual entries.

Fixes #12438
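
In other words (a minimal sketch of the sizing logic, not the PR's actual diff; the helper name is hypothetical): the buffer must be sized from the datatype's extents, which include gaps, rather than from MPI_Type_size, which counts only the significant bytes.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a reorder buffer large enough for
 * `count` elements of `dtype`, gaps included. */
static void *alloc_reorder_buf(MPI_Datatype dtype, int count)
{
    MPI_Aint lb, extent, true_lb, true_extent;
    MPI_Type_get_extent(dtype, &lb, &extent);
    MPI_Type_get_true_extent(dtype, &true_lb, &true_extent);
    /* (count - 1) strides of `extent` plus the true extent of the
     * last element covers every gap the packed size would miss. */
    size_t span = (count > 0) ? (size_t)((count - 1) * extent + true_extent) : 0;
    return malloc(span);
}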

@wenduwan wenduwan force-pushed the han_gatherv_noncontiguous_datatype_fix branch from bd44095 to 108011c Compare March 27, 2024 20:49
@wenduwan
Contributor Author

Running AWS CI

Member

@bosilca bosilca left a comment


It is technically illegal in MPI to send one type and receive it into another type, with the exception of MPI_PACKED. I mention it only for the sake of argument; I'm not saying we need to follow that rule. However, if we could document that this algorithm does not work in a heterogeneous environment, and make it disqualify itself there, it would be great.

@@ -144,8 +143,6 @@ int mca_coll_han_scatterv_intra(const void *sbuf, const int *scounts, const int
low_scounts[low_peer] = scounts[w_peer];
}

ompi_datatype_type_size(sdtype, &sdsize);
Member


Same comment as for gather except that you should use unpack to go from a packed buffer into the local type.

Contributor Author


Currently inside ompi we need an extra data-size check in order to use MPI_PACKED - the total byte size (datatype size x count) can currently exceed INT_MAX.

I think we should switch to explicit pack/unpack once large-count support lands.
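
A hedged sketch of that explicit pack/unpack path with standard MPI calls (names are illustrative); note that MPI_Pack_size reports the packed size through an int, which is exactly where the INT_MAX limit bites:

#include <mpi.h>
#include <stdlib.h>

/* Sketch: pack with the sender's type, unpack with the receiver's.
 * packed_size is an int, so datatype size x count > INT_MAX cannot be
 * expressed without the large-count (_c) interfaces. */
static void pack_unpack(const void *sbuf, int scount, MPI_Datatype sdtype,
                        void *rbuf, int rcount, MPI_Datatype rdtype,
                        MPI_Comm comm)
{
    int packed_size, pos = 0;
    MPI_Pack_size(scount, sdtype, comm, &packed_size);
    char *packed = malloc((size_t)packed_size);
    MPI_Pack(sbuf, scount, sdtype, packed, packed_size, &pos, comm);
    /* ... ship `packed` as packed_size bytes of MPI_PACKED ... */
    pos = 0;
    MPI_Unpack(packed, packed_size, &pos, rbuf, rcount, rdtype, comm);
    free(packed);
}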

Contributor Author


@bosilca I took some time to come up with possible optimizations following our discussion on Monday. I tried to take advantage of the type map, but soon realized that it is not generally useful between the node leader and its followers.

For scatterv, the invariant actually holds between the root and the other processes:

The type signature implied by sendcount[i], sendtype at the root must be equal to the type signature implied by recvcount, recvtype at MPI process i (however, the type maps may be different). This implies that the amount of data sent must be equal to the amount of data received, pairwise between each MPI process and the root. Distinct type maps between sender and receiver are still allowed.

When we focus on the node leader and its local neighbors, this information is not helpful, since they do not need to match in their respective recvcount and recvtype.

There is a simplification (rather than optimization) opportunity, though: if and only if the node leader's recvtype has the same type-map size as its local followers', we don't need MPI_BYTE - we can safely use recvtype instead. This wouldn't bring better performance, though.
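
To make the signature-vs-type-map distinction concrete, here is a small hypothetical example (not from the PR): the sender transmits four contiguous ints while the receiver lands them with a stride of two via a vector type. The type signatures match (4 x MPI_INT) even though the type maps differ, which MPI permits.

#include <mpi.h>

/* Hypothetical example: same type signature, different type maps.
 * Assumes `rank` was obtained via MPI_Comm_rank(MPI_COMM_WORLD, &rank). */
MPI_Datatype strided;
MPI_Type_vector(4, 1, 2, MPI_INT, &strided); /* 4 ints, stride 2 */
MPI_Type_commit(&strided);
if (0 == rank) {
    int sdata[4] = {1, 2, 3, 4};
    MPI_Send(sdata, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (1 == rank) {
    int rdata[8] = {0}; /* room for the gaps of the strided layout */
    MPI_Recv(rdata, 1, strided, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_Type_free(&strided);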

Member


Because the typemap provided by the different processes must match once complemented with the count, we could use MPI_Get_elements to build any datatype used in the communication, as long as we know the size in bytes we are supposed to handle.
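
For reference, MPI_Get_elements reports how many basic-type elements a receive actually delivered, which is the byte-level accounting this suggestion relies on (a fragment; buffer and peer names are illustrative):

MPI_Status status;
MPI_Recv(rbuf, rcount, rdtype, src, tag, comm, &status);
int nbasic;
/* number of basic-type elements actually received, from which a byte
 * count can be derived to construct a matching datatype */
MPI_Get_elements(&status, rdtype, &nbasic);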

@wenduwan wenduwan requested a review from bosilca March 28, 2024 14:21
@wenduwan wenduwan force-pushed the han_gatherv_noncontiguous_datatype_fix branch from 0bf7cb0 to b8545f7 Compare April 1, 2024 23:37
@wenduwan
Contributor Author

wenduwan commented Apr 1, 2024

@bosilca Thanks for your explanation w.r.t. homogeneity. I have incorporated your feedback in the update.

Specifically, I added a flag in HAN to indicate homogeneity based on the opal convertor's remote arch. I figured that only the scatterv algorithm is affected by this issue, so I added fallback logic; the gatherv algorithm shouldn't need it.
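
A rough sketch of such a homogeneity check (function and field names reflect my reading of OMPI internals and are illustrative; this is not the PR's exact code):

/* Illustrative: the communicator is homogeneous iff every peer's
 * remote architecture tag equals the local one (opal_local_arch). */
static bool is_homogeneous(struct ompi_communicator_t *comm)
{
    for (int i = 0; i < ompi_comm_size(comm); ++i) {
        ompi_proc_t *proc = ompi_comm_peer_lookup(comm, i);
        if (proc->super.proc_convertor->remoteArch != opal_local_arch) {
            return false;
        }
    }
    return true;
}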

@wenduwan wenduwan requested a review from devreal April 2, 2024 18:16
@wenduwan wenduwan force-pushed the han_gatherv_noncontiguous_datatype_fix branch from b8545f7 to 35ae459 Compare April 2, 2024 19:57
@jsquyres
Member

jsquyres commented Apr 4, 2024

I should point out that in the OMPI docs, we clearly label heterogeneous support as broken:

[Screenshot of the Open MPI docs flagging heterogeneous support as broken]

This is in https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/configure-cli-options/misc.html.

@bosilca
Member

bosilca commented Apr 4, 2024

@ggouaillardet put a lot of effort into making it work at some point. I haven't seen anything go into OB1 that would break it, so I think that statement is a little too harsh and too generic.

@jsquyres
Member

jsquyres commented Apr 4, 2024

Should the statement be removed, or softened?

@ggouaillardet Can you comment with some specifics, and/or post a PR to adjust/remove/whatever that statement in the docs?

@wenduwan
Contributor Author

wenduwan commented Apr 8, 2024

@bosilca Is there anything else you would like to address? Unfortunately I do not have a heterogeneous system to test the change on, so at best I can make the algorithm disqualify itself. Please let me know.

@wenduwan wenduwan force-pushed the han_gatherv_noncontiguous_datatype_fix branch from 35ae459 to b8c94fd Compare April 15, 2024 17:13
@wenduwan wenduwan requested a review from bosilca April 15, 2024 17:20
Member

@bosilca bosilca left a comment


Please squash before merge.

@wenduwan
Contributor Author

@bosilca Thank you!

How would you like the commits to look after the squash?

@bosilca
Member

bosilca commented Apr 15, 2024

I understand there are two topics:

  • allocate reorder buffer with gaps in gatherv and scatterv
  • determine communicator homogeneity and disqualify hierarchical scatterv for heterogeneous communicators

This is a bugfix for derived datatypes with gaps. The reorder buffer
should match the type signature and be large enough to hold gaps in
addition to the actual entries.

Signed-off-by: Wenduo Wang <[email protected]>
…ators

Hierarchical *v collective algorithms may not work for heterogeneous
communicators with different endianness, integer representation, etc.,
and thus require knowledge of the global communicator's homogeneity to
disqualify the module.

The hierarchical scatterv algorithm requires that every process have the
same architecture as the Root due to the use of MPI_BYTE on node
leaders. Heterogeneous communicators need additional logic to correctly
pack and unpack the data, at the cost of memory usage and performance.

Signed-off-by: Wenduo Wang <[email protected]>
@wenduwan wenduwan force-pushed the han_gatherv_noncontiguous_datatype_fix branch from b8c94fd to 8bb3e3d Compare April 15, 2024 17:52
@wenduwan wenduwan merged commit 85fd1bc into open-mpi:main Apr 15, 2024
14 checks passed
@wenduwan
Contributor Author

Commit squashed and CI passed. Merging.
