prov/shm: new shm architecture #10817

aingerson · 2025-02-21T20:51:08Z

Reopening to pick up new CI changes. This PR should be close and will hopefully pass all tests. Ready for review!

j-xiong · 2025-03-17T18:21:33Z

include/ofi_atomic_queue.h

+	aq = (struct name *) aligned_alloc(			\
+			OFI_CACHE_LINE_SIZE, sizeof(*aq) +	\
+			sizeof(struct name ## _entry) *		\
+			(roundup_power_of_two(size)));		\


Now the allocation is uninitialized. The init macro initializes most of the fields but leaves entry.noop and entry.buf uninitialized. Would that be an issue, especially for the noop field?

aingerson · 2025-03-18

The buf doesn't need to be initialized, but you're right, noop should be initialized to false. Good catch, thanks!
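To make the concern concrete: aligned_alloc returns uninitialized memory, so any flag the init path skips has to be set explicitly. Below is a minimal sketch under stated assumptions. `demo_queue`, `demo_entry`, `DEMO_CACHE_LINE`, and `demo_queue_create` are hypothetical stand-ins, not the actual definitions in include/ofi_atomic_queue.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_CACHE_LINE 64	/* assumed cache line size */

/* Hypothetical entry layout: a sequence number, a noop flag, a payload. */
struct demo_entry {
	int64_t	seq;
	bool	noop;
	char	buf[48];	/* payload, may safely stay uninitialized */
};

struct demo_queue {
	size_t			size;	/* number of entries, a power of two */
	struct demo_entry	entry[];
};

/* Round n up to the next power of two (n >= 1). */
static size_t demo_roundup_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Allocate a cache-line-aligned queue and clear every noop flag. */
static struct demo_queue *demo_queue_create(size_t size)
{
	struct demo_queue *aq;
	size_t n, bytes, i;

	assert(size > 0);
	n = demo_roundup_pow2(size);
	bytes = sizeof(*aq) + n * sizeof(struct demo_entry);
	/* C11 aligned_alloc expects a size that is a multiple of the alignment */
	bytes = (bytes + DEMO_CACHE_LINE - 1) / DEMO_CACHE_LINE * DEMO_CACHE_LINE;

	aq = aligned_alloc(DEMO_CACHE_LINE, bytes);
	if (!aq)
		return NULL;

	aq->size = n;
	for (i = 0; i < n; i++) {
		aq->entry[i].seq = (int64_t) i;
		aq->entry[i].noop = false;	/* aligned_alloc does not zero memory */
	}
	return aq;
}
```

The payload is deliberately left untouched, which matches the reply above: only the flag that is read before being written needs a defined value.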

aingerson · 2025-03-19T15:51:14Z

@sunkuamzn Could you share the AWS failure? You definitely test more shm than we do so I'm sure I'm just missing a case

sunkuamzn · 2025-03-19T19:52:10Z

@aingerson At the fabtests level, I see the fi_unexpected_msg test timing out:

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.3 'timeout 1800 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10817/install/fabtests/bin/fi_unexpected_msg -e rdm -M 2048 -I 5 -v -S 512 -p shm -E=9235'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.3 'timeout 1800 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10817/install/fabtests/bin/fi_unexpected_msg -e rdm -M 2048 -I 5 -v -S 512 -p shm -E=9235 172.31.45.3'"'"''
client_stdout:

client returncode: 124
server_stdout:

server returncode: 124

At the MPI level, I see some collectives timing out on a single node, which would make sense if unexpected messages are not being processed correctly.

aingerson · 2025-03-27T16:20:02Z

@sunkuamzn Could you share the AWS failure please? I think I resolved the MPI races (there was one). I think the unexpected test failure is an incorrect use case that just happens to work upstream. The new implementation no longer buffers unexpected inject message data; it just holds onto the command. This means that unexpected inject messages are still bound by the tx size (the maximum number of sends that can be pending at once). Having more than tx size (1024) sends pending at the same time is an incorrect use of the API. If you really want to force more messages, you can set FI_SHM_TX_SIZE=2048, which allows the test to pass, but I don't think this test case is valid.
I chatted with @shijin-aws about it and he said an application was hitting this issue, but we weren't sure whether it came from the application or OMPI, so he was going to investigate. We can buffer the inject data, but this will hurt performance, likely will not find the root of the problem, and will just enable the app to have far too many unexpected messages.
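The limit described here can be pictured as a credit counter: each send takes a command, and the command is only usable again once the receiver returns it. The following is a toy model, not shm code. `demo_tx`, `demo_send`, and `demo_complete` are hypothetical names, and returning -EAGAIN when no command is free is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical sender state: commands stay in use until the receiver returns them. */
struct demo_tx {
	size_t	tx_size;	/* maximum number of sends pending at once */
	size_t	pending;	/* commands currently held by the receiver */
};

static void demo_tx_init(struct demo_tx *tx, size_t tx_size)
{
	assert(tx_size > 0);
	tx->tx_size = tx_size;
	tx->pending = 0;
}

/* Returns 0 on success, -EAGAIN when every command is still pending. */
static int demo_send(struct demo_tx *tx)
{
	if (tx->pending == tx->tx_size)
		return -EAGAIN;
	tx->pending++;
	return 0;
}

/* The receiver matched a message and returned its command to the sender. */
static void demo_complete(struct demo_tx *tx)
{
	assert(tx->pending > 0);
	tx->pending--;
}
```

In this model, a test that posts more than tx_size sends before the peer posts any receives can never make progress, which is the behavior discussed above.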

shijin-aws · 2025-05-21T17:40:30Z

main_new_shm_compare_c7gn.txt
main_new_shm_compare_c6i.txt
main_new_shm_compare_hpc6a.txt

I re-benchmarked this PR on AMD (hpc6a), Intel (c6i), and ARM (c7gn) CPUs. For the inline-sized protocol, the new shm's performance is almost identical to the main branch on Intel and ARM. It still shows a ~20 ns shift on AMD.

shijin-aws · 2025-05-21T17:40:48Z

bot:aws:retest

shijin-aws · 2025-05-21T17:44:25Z

For the inject protocol (256B - 4KB), Intel and ARM both show improved bandwidth (tagged_bw). AMD is mixed:

======================
Compare test name fabtests_tagged_bw with memory type host_to_host
======================
    Message Size (Byte)  main avg Bandwidth (MB/s)  new avg Bandwidth (MB/s)  new - main avg diff Bandwidth (MB/s)  new - main avg diff (%)  main var (%)  new var (%)
0                     1                  10.329291                 10.596087                              0.266796                 2.582907      0.790079     0.291396
1                     2                  21.078772                 21.767421                              0.688649                 3.267025      0.424895     0.683769
2                     3                  31.645734                 32.665361                              1.019627                 3.222006      0.279468     0.441724
3                     4                  42.842411                 43.999760                              1.157349                 2.701410      0.446774     1.072297
4                     6                  63.764733                 66.025101                              2.260368                 3.544856      0.807056     1.204991
5                     8                  85.638743                 88.342607                              2.703865                 3.157291      0.400795     0.544057
6                    12                 129.159945                132.212169                              3.052224                 2.363135      0.208826     0.748948
7                    16                 171.827468                175.520851                              3.693383                 2.149472      0.030996     0.412307
8                    24                 257.510927                262.468145                              4.957218                 1.925051      0.107296     0.561540
9                    32                 345.638625                349.649313                              4.010688                 1.160370      0.420209     0.874681
10                   48                 516.036983                509.612527                             -6.424457                -1.244960      0.108593     0.623902
11                   64                 665.000856                675.345810                             10.344954                 1.555630      0.429969     0.250768
12                   96                 990.123490               1015.212138                             25.088648                 2.533891      1.182172     0.898683
13                  128                1321.753500               1373.530452                             51.776952                 3.917293      1.173757     0.421813
14                  192                1992.366238               2071.829158                             79.462920                 3.988369      1.053523     0.728354
15                  256                1302.980445               2774.116416                           1471.135971               112.905453      0.389102     0.508273
16                  384                1975.831559               2184.256267                            208.424708                10.548708      0.334450     0.302226
17                  512                2647.785359               2815.213744                            167.428386                 6.323337      1.049448     0.704887
18                  768                3933.024732               4569.192320                            636.167589                16.175021      0.645372     0.314787
19                 1024                5290.717703               5508.609283                            217.891580                 4.118375      0.520046     1.846573
20                 1536                8267.891931               7545.785170                           -722.106761                -8.733868      1.547265     2.353081
21                 2048               10108.174565               9447.355196                           -660.819369                -6.537475      0.788226     2.049158
22                 3072               13394.672777              12228.229398                          -1166.443379                -8.708263      0.577448     1.466927
23                 4096               15594.015821              13690.708081                          -1903.307740               -12.205373      1.027649     1.615210
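The "new - main avg diff (%)" column follows directly from the two bandwidth columns. A small sketch of that arithmetic, with a hypothetical helper name:

```c
#include <assert.h>

/* Relative change of new_bw against main_bw, in percent. */
static double demo_diff_pct(double main_bw, double new_bw)
{
	assert(main_bw > 0.0);
	return (new_bw - main_bw) / main_bw * 100.0;
}
```

For the 256-byte row, (2774.116416 - 1302.980445) / 1302.980445 * 100 gives roughly 112.9 %, and for the 4096-byte row the same formula gives roughly -12.2 %, matching the table.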

shijin-aws · 2025-05-22T18:07:54Z

bot:aws:retest

ZE IPC code protocol was updated to remove dependency on Unix socket code - can be removed Signed-off-by: Alexia Ingerson <[email protected]>

Remove dates from Intel copyright (no longer recommended)
Remove unneeded headers from .c and .h files
Fix ifdef name for headers
Put headers in "" or <> depending on location
Organize headers in the following order:
- corresponding .h file
- other shm headers
- ofi headers
- system headers
Within each group, organize alphabetically

Signed-off-by: Alexia Ingerson <[email protected]>

Add helper functions to freestack implementation:
- smr_freestack_avail: return the number of available elements
- smr_freestack_get_index: return the index number of the given element

Signed-off-by: Alexia Ingerson <[email protected]>
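To show what the two helpers provide, here is a toy index-based free stack. It is only a sketch of the idea: the `demo_fs_*` names, the fixed size, and the layout are all hypothetical and do not reflect the real smr_freestack implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FS_SIZE 4	/* hypothetical fixed capacity */

struct demo_elem {
	uint64_t data;
};

struct demo_freestack {
	size_t			free_count;		/* number of free elements */
	size_t			free_idx[DEMO_FS_SIZE];	/* stack of free indices */
	struct demo_elem	elems[DEMO_FS_SIZE];
};

static void demo_fs_init(struct demo_freestack *fs)
{
	size_t i;

	/* Push indices in reverse so that element 0 is popped first. */
	for (i = 0; i < DEMO_FS_SIZE; i++)
		fs->free_idx[i] = DEMO_FS_SIZE - 1 - i;
	fs->free_count = DEMO_FS_SIZE;
}

/* Analogue of an "avail" helper: how many elements can still be popped. */
static size_t demo_fs_avail(const struct demo_freestack *fs)
{
	return fs->free_count;
}

/* Analogue of a "get_index" helper: position of an element in the array. */
static size_t demo_fs_get_index(const struct demo_freestack *fs,
				const struct demo_elem *e)
{
	assert(e >= fs->elems && e < fs->elems + DEMO_FS_SIZE);
	return (size_t) (e - fs->elems);
}

static struct demo_elem *demo_fs_pop(struct demo_freestack *fs)
{
	if (!fs->free_count)
		return NULL;
	return &fs->elems[fs->free_idx[--fs->free_count]];
}

static void demo_fs_push(struct demo_freestack *fs, struct demo_elem *e)
{
	assert(fs->free_count < DEMO_FS_SIZE);
	fs->free_idx[fs->free_count++] = demo_fs_get_index(fs, e);
}
```

An index is handy in shared memory because, unlike a pointer, it means the same thing in every process that maps the region.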

Allow use of mr copy function using direction Signed-off-by: Alexia Ingerson <[email protected]>

Add function to return minimum of 3 values Signed-off-by: Alexia Ingerson <[email protected]>
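A minimum-of-three helper is typically a thin wrapper over a two-argument minimum. The macros below are a sketch with hypothetical `DEMO_` names; the definition added to include/ofi.h may differ.

```c
#include <assert.h>

/* Note: macro arguments are evaluated more than once, so avoid side effects. */
#define DEMO_MIN(a, b)		((a) < (b) ? (a) : (b))
#define DEMO_MIN3(a, b, c)	DEMO_MIN(DEMO_MIN((a), (b)), (c))

/* Compile-time sanity check (C11). */
static_assert(DEMO_MIN3(3, 1, 2) == 1, "DEMO_MIN3 picks the smallest value");
```

A common use is clamping a copy length to the smallest of several limits, for example the remaining source bytes, the remaining destination bytes, and a per-chunk maximum.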

xpmem capability can only have 2 settings - on or off. Turn into bool for simplicity Signed-off-by: Alexia Ingerson <[email protected]>

create function needs to align the allocation with the cache line size Signed-off-by: Alexia Ingerson <[email protected]>

Add function definition to be able to initialize fields in the queue. This function is eager, so the entry is already initialized when it gets assigned to the caller and gets pre-emptively re-initialized on release back into the queue. This can help with caching if initialization is more effectively done by the owner instead of peers Signed-off-by: Alexia Ingerson <[email protected]>

Replacement of shm protocols with new architecture. Significant changes:
- Turn response queue into return queue for local commands. Inline commands are still receive side. All commands have an inline option but a common ptr to the command being used for remote commands. These commands have to be returned to the sender but the receive side can hold onto them as long as needed for the lifetime of the message
- shm has self and peer caps for each p2p interface (right now just CMA and xpmem). The support for each of these interfaces is saved in separate fields which causes a lot of wasted memory and is confusing. This merges these into two fields (one for self and one for peer) which hold the information for all p2p interfaces and are accessed by the P2P type enums. CMA also needs a flag to know whether CMA support has been queried yet or not.
- Move some shm fields around for alignment
- Simplifies access to the map to remove need for container
- There is a 1:1 relationship with the av and map so just reuse the util av lock for access to the map as well. This requires some reorganizing of the locking semantics
- There is nothing in smr_fabric. Remove and just use the util_fabric directly
- Just like on the send side, make the progress functions be an array of function pointers accessible by the command proto. This cleans up the parameters of the progress calls and streamlines the calls
- Merge tx and pend entries for simple management of pending operations
- Redefinition of cmd and header for simplicity and easier reading. Also removes and adds fields for new architecture
- Refactor async ipc list and turn it into a generic async list to track asynchronous copies which can be used for any accelerator (GPU or DSA) that copies locally asynchronously.
- Cleanup naming and organization for readability. Shorten some names to help with line length and organization
- Remove unused and non-performant mmap protocol (and sar_threshold environment variable which was only used for that protocol)
- Fix weird header dependency smr_util.c->smr.h->smr_util.h so that smr_util.c is only dependent on smr_util.h and is isolated to solely shm region and protocol definitions. This separates the shm utilities from being dependent on the provider leaving the door open for reuse of the shm utilities if needed

Signed-off-by: Alexia Ingerson <[email protected]>
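The commit message above mentions making the progress functions an array of function pointers indexed by the command's protocol. The sketch below shows that dispatch pattern in isolation. The enum values, struct, and handler names are hypothetical and far simpler than the provider's real types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical protocol identifiers carried in each command. */
enum demo_proto {
	DEMO_PROTO_INLINE,
	DEMO_PROTO_INJECT,
	DEMO_PROTO_IOV,
	DEMO_PROTO_MAX,
};

struct demo_cmd {
	enum demo_proto	proto;
	size_t		size;
};

typedef int (*demo_progress_func)(const struct demo_cmd *cmd);

/* Each handler returns its protocol id so a caller can see which one ran. */
static int demo_progress_inline(const struct demo_cmd *cmd)
{
	(void) cmd;
	return DEMO_PROTO_INLINE;
}

static int demo_progress_inject(const struct demo_cmd *cmd)
{
	(void) cmd;
	return DEMO_PROTO_INJECT;
}

static int demo_progress_iov(const struct demo_cmd *cmd)
{
	(void) cmd;
	return DEMO_PROTO_IOV;
}

/* One uniform signature per protocol, selected by table lookup. */
static const demo_progress_func demo_progress[DEMO_PROTO_MAX] = {
	[DEMO_PROTO_INLINE] = demo_progress_inline,
	[DEMO_PROTO_INJECT] = demo_progress_inject,
	[DEMO_PROTO_IOV] = demo_progress_iov,
};

static int demo_process_cmd(const struct demo_cmd *cmd)
{
	assert((int) cmd->proto >= 0 && cmd->proto < DEMO_PROTO_MAX);
	return demo_progress[cmd->proto](cmd);
}
```

Compared with a switch statement, the table forces every handler to share one signature, which is the parameter cleanup the commit message refers to.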

In order to support unlimited unexpected messaging, add a flag SMR_BUFFER_RECV for the sender to let the receiver know that resources are limited and the whole message should get buffered on the target. This allows the command to be immediately returned to the sender so that the sender is never blocked due to unexpected messages at the target. Buffering unexpected messages hurts performance so the default is to wait until only a single command is left before requesting buffering, but an environment variable is also added to toggle this for either debugging purposes or workarounds. Signed-off-by: Alexia Ingerson <[email protected]>
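The default policy described above (request buffering only once a single command is left) can be sketched as a small decision function. Everything here is an assumption for illustration: `DEMO_BUFFER_RECV` stands in for SMR_BUFFER_RECV with an invented bit value, and `always_buffer` is a hypothetical stand-in for the environment variable, whose name and exact semantics are not given in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BUFFER_RECV (1u << 0)	/* stand-in for SMR_BUFFER_RECV, value assumed */

/*
 * Decide which flags the sender puts on the next command.
 * free_cmds counts the sender's free commands, including the one about
 * to be used. always_buffer models a debugging/workaround override.
 */
static uint32_t demo_cmd_flags(size_t free_cmds, bool always_buffer)
{
	assert(free_cmds > 0);	/* the caller needs a command to send at all */

	if (always_buffer || free_cmds == 1)
		return DEMO_BUFFER_RECV;
	return 0;
}
```

With this policy the common case pays no buffering cost, and only the last available command asks the target to copy the message so that the command can be returned right away.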

shijin-aws · 2025-05-25T04:13:14Z

prov/shm/src/smr.h

+
+static inline uintptr_t smr_local_to_peer(struct smr_ep *ep,
+					  struct smr_region *peer_smr,
+					  int64_t id, int64_t peer_id,


id seems unused; similar for smr_peer_to_peer and smr_peer_to_owner

shijin-aws · 2025-05-25T04:13:52Z

prov/shm/src/smr.h

+	return smr_peer_data(peer_smr)[peer_id].local_region + offset;
+}
+
+static inline uintptr_t smr_peer_to_peer(struct smr_ep *ep,


ep unused, same for smr_peer_to_owner

shijin-aws · 2025-05-25T04:28:55Z

prov/shm/src/smr_ep.c

 		if (op_flags & FI_DELIVERY_COMPLETE)
-			return smr_src_sar;
+			return smr_proto_inject;


SAR or inject? Does the new inject protocol support delivery complete (DC)?

shijin-aws · 2025-05-26T04:30:40Z

prov/shm/src/smr_progress.c

-					*err = -FI_ETRUNC;
+						"Incomplete rma read/fetch "
+						"buffer copied\n");
+					ret = -FI_ETRUNC;


I know shm has been doing this, but such error codes don't always get propagated to the error CQ or EQ. This may cause a hang or an unexpected error that the application cannot detect without running with FI_LOG_LEVEL=warn.

shijin-aws · 2025-05-27T21:10:49Z

I get nondeterministic hangs / slowness when running IMB-EXT Accumulate (https://github.com/intel/mpi-benchmarks) with this PR on 2 nodes with 96 ranks per node:


$export PATH=/opt/amazon/openmpi5/bin:$PATH;export FI_EFA_USE_DEVICE_RDMA=1; export LD_LIBRARY_PATH=/home/ubuntu/PortaFiducia/build/libraries/libfabric/shm_new_draft/install/libfabric/lib:/home/ubuntu/PortaFiducia/build/libraries/rdma_core/master/rdma-core/build/lib:/home/ubuntu/PortaFiducia/build/libraries/rdma_core/master/rdma-core/build/lib:;/opt/amazon/openmpi5/bin/mpirun --wdir . -n 192 --hostfile /home/ubuntu/PortaFiducia/hostfile --map-by ppr:96:node --timeout 1800 -x OMPI_MCA_accelerator=null -x FI_EFA_USE_DEVICE_RDMA=1 -x LD_LIBRARY_PATH=/home/ubuntu/PortaFiducia/build/libraries/libfabric/shm_new_draft/install/libfabric/lib:/home/ubuntu/PortaFiducia/build/libraries/rdma_core/master/rdma-core/build/lib:/home/ubuntu/PortaFiducia/build/libraries/rdma_core/master/rdma-core/build/lib: -x PATH -x FI_LOG_LEVEL=warn /home/ubuntu/PortaFiducia/build/workloads/imb/openmpi-v5.0.6-installer/source/mpi-benchmarks-IMB-v2021.7/IMB-EXT Accumulate -npmin 192 -iter 200 -time 20 -mem 1 2>&1 | tee node2-ppn96.txt


# Benchmarking Accumulate 
# #processes = 192 
#-----------------------------------------------------------------------------
#
#    MODE: AGGREGATE 
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]      defects
            0          200         0.16         0.31         0.25         0.00
            4          200       132.15       132.28       132.22         0.00
            8          200       140.38       140.47       140.43         0.00
           16          200       137.29       137.35       137.33         0.00
           32           28      1044.05      1044.78      1044.59         0.00
           64           28      1037.59      1039.55      1039.25         0.00
          128           28      1011.87      1012.05      1011.97         0.00
          256           28      1015.35      1015.59      1015.46         0.00
          512           28      1032.48      1034.15      1033.40         0.00
         1024           28      1018.43      1020.68      1020.34         0.00
         2048           10      2841.07      2842.80      2841.87         0.00
         4096           10      5315.17      5336.67      5325.79         0.00
         8192           10      5840.22      5842.77      5841.42         0.00
        16384            8     14103.96     15353.83     14496.42         0.00
        32768            8     17868.92     17872.19     17870.70         0.00
        65536            8     29602.09     29674.22     29671.18         0.00
       131072            8     50515.54     50517.33     50516.08         0.00
       262144            8     91849.49     91854.67     91851.76         0.00
       524288            8    180536.80    180540.43    180538.50         0.00
(hang)

With the main branch, the test consistently completes within 1 min:

# Accumulate

#-----------------------------------------------------------------------------
# Benchmarking Accumulate 
# #processes = 192 
#-----------------------------------------------------------------------------
#
#    MODE: AGGREGATE 
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]      defects
            0          200         0.23         0.62         0.29         0.00
            4          200       182.97       183.16       183.05         0.00
            8          200       194.45       196.88       194.54         0.00
           16          200       179.63       179.68       179.65         0.00
           32          200       180.95       181.12       181.01         0.00
           64          200       178.87       178.91       178.89         0.00
          128          200       181.25       181.37       181.31         0.00
          256          200       187.19       188.01       187.94         0.00
          512          200       172.24       172.39       172.34         0.00
         1024          200       179.26       179.35       179.31         0.00
         2048          200       183.66       183.69       183.68         0.00
         4096          200       213.62       213.73       213.69         0.00
         8192          200       314.90       314.98       314.94         0.00
        16384          200       608.01       608.31       608.19         0.00
        32768          200      1034.52      1034.59      1034.57         0.00
        65536          185      1646.58      1647.03      1646.77         0.00
       131072           30     18878.94     18879.61     18879.25         0.00
       262144           19     62272.74     62274.36     62273.81         0.00
       524288            9    241836.46    241838.02    241837.22         0.00
      1048576            5    891690.16    891708.16    891699.50         0.00
      2097152            3   2880126.29   2880193.12   2880160.24         0.00
      4194304            2   8650767.34   8650819.30   8650780.42         0.00

#-----------------------------------------------------------------------------
# Benchmarking Accumulate 
# #processes = 192 
#-----------------------------------------------------------------------------
#
#    MODE: NON-AGGREGATE 
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]      defects
            0          100        70.00        70.11        70.05         0.00
            4          100     35790.28     35790.42     35790.32         0.00
            8          100     35908.56     35908.72     35908.67         0.00
           16          100     36159.63     36159.77     36159.69         0.00
           32          100     36095.02     36095.17     36095.09         0.00
           64          100     36580.34     36580.52     36580.44         0.00
          128          100     38076.66     38079.97     38079.67         0.00
          256          100     36712.06     36712.32     36712.17         0.00
          512          100     36891.78     36892.01     36891.90         0.00
         1024          100     36606.25     36606.50     36606.37         0.00
         2048          100     36822.14     36822.25     36822.20         0.00
         4096          100     37819.61     37819.78     37819.69         0.00
         8192          100     55959.80     55959.97     55959.88         0.00
        16384          100     92124.97     92125.15     92125.07         0.00
        32768          100    167187.37    167187.45    167187.41         0.00
        65536           62    310688.10    310688.29    310688.15         0.00
       131072           32    594687.66    594688.91    594688.27         0.00
       262144           17   1184168.32   1184172.35   1184170.25         0.00
       524288            9   2342360.93   2342363.44   2342361.91         0.00
      1048576            5   4579259.49   4579268.31   4579263.29         0.00
      2097152            3   9071980.11   9072002.83   9071990.74         0.00
      4194304            2  17919749.43  17919781.21  17919755.18         0.00

If I run with this PR and shm disabled, the test also completes quickly without hitting hangs.

shijin-aws · 2025-05-30T17:56:00Z

Pasting my offline discussion with @aingerson here: we narrowed the issue down to something in the new shm's atomic calls (fi_atomic, fi_fetch_atomic, fi_compare_atomic). If I make efa skip shm in these calls, there is no hang. We need to narrow down further which exact op is causing the problem.

aingerson force-pushed the shm_new_draft branch from 41aa69a to 9f309aa Compare February 21, 2025 20:58

aingerson force-pushed the shm_new_draft branch from 9f309aa to baa5747 Compare March 5, 2025 21:23

aingerson force-pushed the shm_new_draft branch from baa5747 to 5bdc740 Compare March 14, 2025 23:28

j-xiong reviewed Mar 17, 2025

aingerson force-pushed the shm_new_draft branch from 5bdc740 to 1c8fe2f Compare March 18, 2025 18:43

aingerson mentioned this pull request Mar 21, 2025

prov/shm: In-progress send via CMA(iov protocol) blocks following sends #9853

Open

aingerson force-pushed the shm_new_draft branch from 1c8fe2f to 6f512ab Compare March 26, 2025 17:46

aingerson mentioned this pull request Mar 26, 2025

prov/shm: new shm architecture v2 #10907

Open

aingerson force-pushed the shm_new_draft branch from 6f512ab to 5daf2b1 Compare March 26, 2025 20:42

aingerson force-pushed the shm_new_draft branch 8 times, most recently from 788ee3f to cd8d5d9 Compare May 20, 2025 22:28

aingerson force-pushed the shm_new_draft branch from cd8d5d9 to cbe14e0 Compare May 22, 2025 14:02

aingerson added 5 commits May 23, 2025 11:24

prov/shm: remove socket code, no longer needed

c1792fb

ZE IPC code protocol was updated to remove dependency on Unix socket code - can be removed Signed-off-by: Alexia Ingerson <[email protected]>

include/ofi_hmem: make ofi_copy_mr_iov non-static

d5b3b50

Allow use of mr copy function using direction Signed-off-by: Alexia Ingerson <[email protected]>

include/ofi.h: add MIN3 function

0c5a60e

Add function to return minimum of 3 values Signed-off-by: Alexia Ingerson <[email protected]>

include/ofi_xpmem: change cap into bool

f89ff7c

xpmem capability can only have 2 settings - on or off. Turn into bool for simplicity Signed-off-by: Alexia Ingerson <[email protected]>

include/ofi_atomic_queue: fix create function

cb11fa5

create function needs to align the allocation with the cache line size Signed-off-by: Alexia Ingerson <[email protected]>

aingerson force-pushed the shm_new_draft branch from cbe14e0 to 784fd01 Compare May 23, 2025 18:28

shijin-aws reviewed May 25, 2025

shijin-aws reviewed May 26, 2025
