-
Notifications
You must be signed in to change notification settings - Fork 417
prov/shm: new shm architecture #10817
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
41aa69a
to
9f309aa
Compare
baa5747
to
5bdc740
Compare
aq = (struct name *) aligned_alloc( \ | ||
OFI_CACHE_LINE_SIZE, sizeof(*aq) + \ | ||
sizeof(struct name ## _entry) * \ | ||
(roundup_power_of_two(size))); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the allocation is uninitialized. The init
macro would initialize most of the fields but leave entry.noop
and entry.buf
uninitialized. Would that be an issue? especially the noop
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The buf doesn't need to be initialized but you're right, the noop should get initialized to false. Good catch, thanks!
5bdc740
to
1c8fe2f
Compare
@sunkuamzn Could you share the AWS failure? You definitely test more shm than we do so I'm sure I'm just missing a case |
@aingerson at the fabtests level, I see the fi_unexpected_msg test timing out
At the MPI level, I see some collectives timing out on a single node. Which would make sense if unexpected messages are not being processed correctly. |
1c8fe2f
to
6f512ab
Compare
6f512ab
to
5daf2b1
Compare
@sunkuamzn Could you share the AWS failure please? I think I resolved the MPI races (there was one). The unexpected test failure I think is an incorrect use case that just happens to be working upstream. The new implementation doesn't buffer unexpected inject message data anymore and just holds onto the command which means when we have unexpected inject messages, we are still held to the tx size (the maximum number of sends we can have pending at once). Trying to have more than tx size (1024) number of pending sends at the same time is an incorrect use of the API. If you really want to force more messages, you can set FI_SHM_TX_SIZE=2048 which allows the test to pass but I think this test case isn't valid. |
788ee3f
to
cd8d5d9
Compare
main_new_shm_compare_c7gn.txt I re-benchmarked this PR on AMD (hpc6a), Intel (c6i) and ARM (c7gn) CPUs. For the inline sized protocol performance, new shm is almost identical to the main branch on Intel and ARM. It still has a ~ 20 ns shift on AMD |
bot:aws:retest |
For the inject protocol (256B - 4KB), Intel and ARM both show improvement on the bandwidth (tagged_bw). AMD is mixed
|
bot:aws:retest |
ZE IPC code protocol was updated to remove dependency on Unix socket code - can be removed Signed-off-by: Alexia Ingerson <[email protected]>
Remove dates from Intel copyright (no longer recommended) Remove unneeded headers from .c and .h files Fix ifdef name for headers Put headers in "" or <> depending on location Organize headers in the following order: - corresponding .h file - other shm headers - ofi headers - system headers - Within each group, organize alphabetically Signed-off-by: Alexia Ingerson <[email protected]>
Add helper functions to freestack implementation: - smr_freestack_avail: return the number of available elements - smr_freestack_get_index: return the index number of the given element Signed-off-by: Alexia Ingerson <[email protected]>
Allow use of mr copy function using direction Signed-off-by: Alexia Ingerson <[email protected]>
Add function to return minimum of 3 values Signed-off-by: Alexia Ingerson <[email protected]>
xpmem capability can only have 2 settings - on or off. Turn into bool for simplicity Signed-off-by: Alexia Ingerson <[email protected]>
create function needs to align the allocation with the cache line size Signed-off-by: Alexia Ingerson <[email protected]>
Add function definition to be able to initialize fields in in the queue. This function is eager so the entry is already initialized when it gets assigned to the caller and gets pre-emptively re-initialized on release back into the queue. This can help with caching if initialization is more effective done by the owner instead of peers Signed-off-by: Alexia Ingerson <[email protected]>
Replacement of shm protocols with new architecture. Significant changes: - Turn response queue into return queue for local commands. Inline commands are still receive side. All commands have an inline option but a common ptr to the command being used for remote commands. These commands have to be returned to the sender but the receive side can hold onto them as long as needed for the lifetime of the message - shm has self and peer caps for each p2p interface (right now just CMA and xpmem). The support for each of these interfaces is saved in separate fields which causes a lot of wasted memory and is confusing. This merges these into two fields (one for self and one for peer) which holds the information for all p2p interfaces and is accessed by the P2P type enums. CMA also needs a flag to know wether CMA support has been queried yet or not. - Move some shm fields around for alignment - Simplifies access to the map to remove need for container - There is a 1:1 relationship with the av and map so just reuse the util av lock for access to the map as well. This requires some reorganizing of the locking semantics - There is nothing in smr_fabric. Remove and just use the util_fabric directly - Just like on the send side, make the progress functions be an array of function pointers accessible by the command proto. This cleans up the parameters of the progress calls and streamlines the calls - Merge tx and pend entries for simple management of pending operations - Redefinition of cmd and header for simplicty and easier reading. Also removes and adds fields for new architecture - Refactor async ipc list and turn it into a generic async list to track asynchronous copies which can be used for any accelerator (GPU or DSA) that copies locally asynchronously. - Cleanup naming and organization for readibility. Shorten some names to help with line length and organization - Remove unused and non-performant mmap protocol (and sar_threshold environment variable which was only used for that protocol) - Fix weird header dependency smr_util.c->smr.h->smr_util.h so that smr_util.c is only dependent on smr_util.h and is isolated to solely shm region and protocol definitions. This separates the shm utilities from being dependent on the provider leaving the door open for reuse of the shm utilities if needed Signed-off-by: Alexia Ingerson <[email protected]>
In order to support unlimited unexpected messaging, add a flag SMR_BUFFER_RECV for the sender to let the receiver know that resources are limited and the whole message should get buffered on the target. This allows the command to be immediately returned to the sender so that the sender is never blocked due to unexpected messages at the target. Buffering unexpected messages hurts performance so the default is to wait until only a single command is left before requesting buffering, but an environment variable is also added to toggle this for either debugging purposes or workarounds. Signed-off-by: Alexia Ingerson <[email protected]>
|
||
static inline uintptr_t smr_local_to_peer(struct smr_ep *ep, | ||
struct smr_region *peer_smr, | ||
int64_t id, int64_t peer_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
id
seems unused, similiar for smr_peer_to_peer
and smr_peer_to_owner
return smr_peer_data(peer_smr)[peer_id].local_region + offset; | ||
} | ||
|
||
static inline uintptr_t smr_peer_to_peer(struct smr_ep *ep, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ep
unused, same for smr_peer_to_owner
if (op_flags & FI_DELIVERY_COMPLETE) | ||
return smr_src_sar; | ||
return smr_proto_inject; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SAR or inject? New inject protocol support DC ?
*err = -FI_ETRUNC; | ||
"Incomplete rma read/fetch " | ||
"buffer copied\n"); | ||
ret = -FI_ETRUNC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know shm has been doing this, but such error codes don't always get populated to error cq or eq. This may cause hang or unexpected error that application cannot tell without running FI_LOG_LEVEL=warn
I get undeterministic hang / slowness when running IMB-EXT Accumulate https://github.com/intel/mpi-benchmarks with this PR on 2 nodes with 96 ranks per node
With main branch, the test can complete within 1 min consistently
If I ran with this PR + disabling shm, the test can also complete fast without hitting hangs |
Paste my offline discussion with @aingerson here. We narrow down the issue to be something in new shm's atomics calls (fi_atomic, fi_fetch_atomic, fi_compare_atomic). If I make efa skip shm in these calls , there is no hang. Need to further narrow down which exact op is causing problem |
Reopening to pickup new CI changes. This PR should be close and hopefully pass all tests. Ready for review!