prov/shm: new shm architecture v2 #10907

aingerson · 2025-03-26T17:52:19Z

Slightly different version of new shm architecture which allows for claiming a future spot in the queue. Opening for testing purposes, compared against #10817

sunkuamzn · 2025-03-27T17:00:34Z

@aingerson the unexpected message fabtest is still timing out

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.84.23 'timeout 1800 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10907-debug/install/fabtests/bin/fi_unexpected_msg -e rdm -M 2048 -I 5 -v -S 512 -p shm -E=9234'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.84.23 'timeout 1800 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10907-debug/install/fabtests/bin/fi_unexpected_msg -e rdm -M 2048 -I 5 -v -S 512 -p shm -E=9234 172.31.84.23'"'"''
client_stdout:

client returncode: 124
server_stdout:

server returncode: 124

aingerson · 2025-03-27T17:15:16Z

@sunkuamzn Thanks! Is that the only failure? That error is expected, see my comment in #10817

shijin-aws · 2025-05-06T15:13:11Z

The timing out is because the new shm removed the buffering for inject protocol, so it cannot handle unexpected messages of number larger than the tx size. I need to discuss with my team to decide whether it is acceptable for our customers.

aingerson · 2025-05-06T16:52:10Z

@shijin-aws Yeah, it would be good to figure out why OMPI/your customers are having this issue. Sending over tx_size unexpected messages is a misuse of the API. Other providers don't support it as well and supporting it here would have a performance impact for other use cases so I'm hesitant about adding that buffering. There's likely a weird use of the API in OMPI that we need to investigate. If we can have setting FI_SHM_TX_SIZE as a workaround for any apps that see this issue until we can dig more into OMPI to root cause, I think that would be better. Let me know what you decide with your team (and feel free to message me directly to discuss)

shijin-aws · 2025-05-13T00:29:51Z

main_new_shm_compare_c7gn_u2204.txt
main_new_shm_compare_hpc6a_u2204.txt
main_new_shm_compare_hpc6id_u2204.txt

Ran fabtests with shm provider against main branch and this PR on 3 instance types of different CPU types: hpc6a (AMD), hpc6id (intel), c7gn (graviton) on ubuntu 2204. The comparisons are recorded in the attachments.

In the inline data size protocol ( <=256 bytes)
For AMD there is a systematic 20ns regression for the new shm on ping pong (tagged, non-tagged, rma)
For Graviton, the regression is almost negligible.
For Intel, the regression is worse as 50 ns.

Alexia said there should be improvement on the inject size (256 to 4K) bandwidth. I do see such improvement (20% - 50%) for all message sizes from 256 Bytes to 4KB on hpc6id and c7gn. But for hpc6a the bandwidth get worse for new shm between 1536 bytes to 4096 bytes.

ZE IPC code protocol was updated to remove dependency on Unix socket code - can be removed Signed-off-by: Alexia Ingerson <[email protected]>

Remove dates from Intel copyright (no longer recommended) Remove unneeded headers from .c and .h files Fix ifdef name for headers Put headers in "" or <> depending on location Organize headers in the following order: - corresponding .h file - other shm headers - ofi headers - system headers - Within each group, organize alphabetically Signed-off-by: Alexia Ingerson <[email protected]>

Add helper functions to freestack implementation: - smr_freestack_avail: return the number of available elements - smr_freestack_get_index: return the index number of the given element Signed-off-by: Alexia Ingerson <[email protected]>

Allow use of mr copy function using direction Signed-off-by: Alexia Ingerson <[email protected]>

Add function to return minimum of 3 values Signed-off-by: Alexia Ingerson <[email protected]>

xpmem capability can only have 2 settings - on or off. Turn into bool for simplicity Signed-off-by: Alexia Ingerson <[email protected]>

create function needs to align the allocation with the cache line size Signed-off-by: Alexia Ingerson <[email protected]>

Add function definition to be able to initialize fields in in the queue. This function is eager so the entry is already initialized when it gets assigned to the caller and gets pre-emptively re-initialized on release back into the queue. This can help with caching if initialization is more effective done by the owner instead of peers Signed-off-by: Alexia Ingerson <[email protected]>

Replacement of shm protocols with new architecture. Significant changes: - Turn response queue into return queue for local commands. Inline commands are still receive side. All commands have an inline option but a common ptr to the command being used for remote commands. These commands have to be returned to the sender but the receive side can hold onto them as long as needed for the lifetime of the message - shm has self and peer caps for each p2p interface (right now just CMA and xpmem). The support for each of these interfaces is saved in separate fields which causes a lot of wasted memory and is confusing. This merges these into two fields (one for self and one for peer) which holds the information for all p2p interfaces and is accessed by the P2P type enums. CMA also needs a flag to know wether CMA support has been queried yet or not. - Move some shm fields around for alignment - Simplifies access to the map to remove need for container - There is a 1:1 relationship with the av and map so just reuse the util av lock for access to the map as well. This requires some reorganizing of the locking semantics - There is nothing in smr_fabric. Remove and just use the util_fabric directly - Just like on the send side, make the progress functions be an array of function pointers accessible by the command proto. This cleans up the parameters of the progress calls and streamlines the calls - Merge tx and pend entries for simple management of pending operations - Redefinition of cmd and header for simplicty and easier reading. Also removes and adds fields for new architecture - Refactor async ipc list and turn it into a generic async list to track asynchronous copies which can be used for any accelerator (GPU or DSA) that copies locally asynchronously. - Cleanup naming and organization for readibility. Shorten some names to help with line length and organization - Remove unused and non-performant mmap protocol (and sar_threshold environment variable which was only used for that protocol) - Fix weird header dependency smr_util.c->smr.h->smr_util.h so that smr_util.c is only dependent on smr_util.h and is isolated to solely shm region and protocol definitions. This separates the shm utilities from being dependent on the provider leaving the door open for reuse of the shm utilities if needed Signed-off-by: Alexia Ingerson <[email protected]>

In order to support unlimited unexpected messaging, add a flag SMR_BUFFER_RECV for the sender to let the receiver know that resources are limited and the whole message should get buffered on the target. This allows the command to be immediately returned to the sender so that the sender is never blocked due to unexpected messages at the target. Buffering unexpected messages hurts performance so the default is to wait until only a single command is left before requesting buffering, but an environment variable is also added to toggle this for either debugging purposes or workarounds. Signed-off-by: Alexia Ingerson <[email protected]>

Signed-off-by: Alexia Ingerson <[email protected]>

aingerson force-pushed the test branch 2 times, most recently from 76bd871 to 17a2128 Compare March 26, 2025 20:42

aingerson added 12 commits May 19, 2025 19:03

prov/shm: remove socket code, no longer needed

de59573

ZE IPC code protocol was updated to remove dependency on Unix socket code - can be removed Signed-off-by: Alexia Ingerson <[email protected]>

include/ofi_hmem: make ofi_copy_mr_iov non-static

92b7a65

Allow use of mr copy function using direction Signed-off-by: Alexia Ingerson <[email protected]>

include/ofi.h: add MIN3 function

440f1ba

Add function to return minimum of 3 values Signed-off-by: Alexia Ingerson <[email protected]>

include/ofi_xpmem: change cap into bool

95abdbe

xpmem capability can only have 2 settings - on or off. Turn into bool for simplicity Signed-off-by: Alexia Ingerson <[email protected]>

include/ofi_atomic_queue: fix create function

2b31811

create function needs to align the allocation with the cache line size Signed-off-by: Alexia Ingerson <[email protected]>

include/ofi_atom.g: add a few missing atomic ops for use in queue

3761e31

Signed-off-by: Alexia Ingerson <[email protected]>

prov/shm: modify queue for claim

d8f0e88

Signed-off-by: Alexia Ingerson <[email protected]>

aingerson force-pushed the test branch 2 times, most recently from 07d9f9f to d8f0e88 Compare May 22, 2025 03:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

prov/shm: new shm architecture v2 #10907

prov/shm: new shm architecture v2 #10907

aingerson commented Mar 26, 2025 •

edited

Loading

Uh oh!

sunkuamzn commented Mar 27, 2025

Uh oh!

aingerson commented Mar 27, 2025

Uh oh!

shijin-aws commented May 6, 2025

Uh oh!

aingerson commented May 6, 2025

Uh oh!

shijin-aws commented May 13, 2025 •

edited

Loading

Uh oh!

Uh oh!

prov/shm: new shm architecture v2 #10907

Are you sure you want to change the base?

prov/shm: new shm architecture v2 #10907

Conversation

aingerson commented Mar 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sunkuamzn commented Mar 27, 2025

Uh oh!

aingerson commented Mar 27, 2025

Uh oh!

shijin-aws commented May 6, 2025

Uh oh!

aingerson commented May 6, 2025

Uh oh!

shijin-aws commented May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

aingerson commented Mar 26, 2025 •

edited

Loading

shijin-aws commented May 13, 2025 •

edited

Loading