-
Notifications
You must be signed in to change notification settings - Fork 902
Han gatherv noncontiguous datatype fix #12439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Han gatherv noncontiguous datatype fix #12439
Conversation
bd44095
to
108011c
Compare
Running AWS CI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is technically illegal in MPI to send one type and receive it into another type, with the exception of MPI_PACKED. I just mentioned it for the sake of the argument, I'm not saying we need to follow that rule. However, if we could document that this algorithm is not working in a heterogeneous environment, and make it disqualify itself it would be great.
@@ -144,8 +143,6 @@ int mca_coll_han_scatterv_intra(const void *sbuf, const int *scounts, const int | |||
low_scounts[low_peer] = scounts[w_peer]; | |||
} | |||
|
|||
ompi_datatype_type_size(sdtype, &sdsize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as for gather except that you should use unpack to go from a packed buffer into the local type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently inside ompi we need extra data size check in order to use MPI_PACKED - it is currently possible that the total byte size of datatype size
x count
can exceeded INT_MAX.
I think we should switch to explicit pack/unpack after large count support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bosilca I took time to come up with possible optimizations following our discussion on Monday. I was trying to take advantage of the type map, but soon realized that this is not generally useful between node leader and follower.
For scatterv, the invariant is actually between Root and other processes:
The type signature implied by sendcount[i], sendtype at the root must be equal to the type signature implied by recvcount, recvtype at MPI process i (however, the type maps may be different). This implies that the amount of data sent must be equal to the amount of data received, pairwise between each MPI process and the root. Distinct type maps between sender and receiver are still allowed.
When we focus on the node leader and its local neighbors, this information is not helpful, since they do not need to match in their respective recvcount, recvtype
.
There is a simplification(rather than optimization) opportunity though, if and only if the node leader's recvtype
's type map size is the same than its local followers. In that case we don't need MPI_BYTE
- we can safely use recvtype
instead. This wouldn't bring better performance though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the typemap provided by the different processes must match when complemented with the count, we could use MPI_Get_elements
to build any datatype used in the communication for as long as we know the size in bytes we are supposed to handle.
0bf7cb0
to
b8545f7
Compare
@bosilca Thanks for your explanation wrt homogeneity. I have incorporated your feedback in the update. Specifically, I added a flag in HAN to indicate homogeneity based on opal convertor's remote arch. I figured that only the scatterv algorithm is affected by this issue so I added a fallback logic. The gatherv algorithm shouldn't need this. |
b8545f7
to
35ae459
Compare
I should point out that in the OMPI docs, we clearly label the heterogeneous support in OMPI as broken: This is in https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/configure-cli-options/misc.html. |
@ggouaillardet put a lot of efforts to make it work at some point. I haven't see anything going in OB1 that would break it, so I think that statement is a little to harsh and too generic. |
Should the statement be removed, or softened? @ggouaillardet Can you comment with some specifics, and/or post a PR to adjust/remove/whatever that statement in the docs? |
@bosilca Is there anything else that you would like to address? Unfortunately I do not have a hetero system to test the change so I can only disqualify the algorithm at best. Please let me know. |
35ae459
to
b8c94fd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please squash before merge.
@bosilca Thank you! How would you like the commits to look like after squash? |
I understand there are two topics:
|
This is a bugfix for derived datatypes with gaps. The reorder buffer should match the type signature and be large enough to hold gaps in addition to the actual entries. Signed-off-by: Wenduo Wang <[email protected]>
…ators Hierarchical *v collective algorithms may not work for heterogeneous communicators with different endianness, interger representation, etc., and thus require knowledge of the global communicator's homogeneity to disqualify the module. The hierarchical scatterv algorithm requires that every process have the same architecture as the Root due to the use of MPI_BYTE on node leaders. Heterogeneous communicators need additional logic to correctly pack and unpack the data at the cost of memory usage and performance. Signed-off-by: Wenduo Wang <[email protected]>
b8c94fd
to
8bb3e3d
Compare
Commit squashed and CI passed. Merging. |
This is a bugfix for derived datatypes with gaps. The reorder buffer should match the type signature and be large enough to hold gaps in addition to the actual entries.
Fixes #12438