
patch ofa_kernel to prepare for build-time dependency elimination for downstream GPU kmod #17512

Open
arsdragonfly wants to merge 3 commits into microsoft:3.0 from arsdragonfly:arsdragonfly/ofa_export_symbol_gpl


@arsdragonfly

Merge Checklist

All boxes should be checked before merging the PR (just tick any boxes which don't apply to this PR)

  • The toolchain has been rebuilt successfully (or no changes were made to it)
  • The toolchain/worker package manifests are up-to-date
  • Any updated packages successfully build (or no packages were changed)
  • Packages depending on static components modified in this PR (Golang, *-static subpackages, etc.) have had their Release tag incremented.
  • Package tests (%check section) have been verified with RUN_CHECK=y for existing SPEC files, or added to new SPEC files
  • All package sources are available
  • cgmanifest files are up-to-date and sorted (./cgmanifest.json, ./toolkit/scripts/toolchain/cgmanifest.json, .github/workflows/cgmanifest.json)
  • LICENSE-MAP files are up-to-date (./LICENSES-AND-NOTICES/SPECS/data/licenses.json, ./LICENSES-AND-NOTICES/SPECS/LICENSES-MAP.md, ./LICENSES-AND-NOTICES/SPECS/LICENSE-EXCEPTIONS.PHOTON)
  • All source files have up-to-date hashes in the *.signatures.json files
  • sudo make go-tidy-all and sudo make go-test-coverage pass
  • Documentation has been updated to match any changes to the build system
  • Ready to merge

Summary

The IB peer-memory API in drivers/infiniband/core/peer_mem.c was
historically exported with EXPORT_SYMBOL. That dates back to a time when
the consumer was the NVIDIA closed-source GPU driver, which could not
link against EXPORT_SYMBOL_GPL symbols.

Today the consumers are GPL kernel modules (AMDGPU, NVIDIA open-source
driver). Exposing the peer-memory client registration API as
EXPORT_SYMBOL_GPL more accurately reflects the GPL licensing of
mlnx-ofa_kernel and matches the kernel's modern expectations around
in-kernel API surfaces.

In addition, switching to EXPORT_SYMBOL_GPL allows out-of-tree (OOT)
drivers (AMDGPU, NVIDIA open-source GPU driver) to discover the
peer-memory client API via module symbol resolution at runtime instead
of having to hard-link against the mlnx-ofa_kernel build. This is
required to break the diamond build-time dependency between an OOT GPU
driver and multiple OOT NIC drivers that all consume the same API.

That is the situation we are facing with AMD GPUs: MI300X works with
Mellanox NICs, while MI455X comes with the Vulcano NIC, and we would
like the same kmod to work with both.
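
A minimal user-space sketch of the runtime-discovery pattern described above, not the actual kernel code. A toy export table stands in for the kernel's module symbol resolution, and the consumer degrades gracefully when no provider has published the symbol. `publish_symbol`, `lookup_symbol`, `fake_nic_register`, `peerdirect_init`, and the one-argument registration signature are all hypothetical; the symbol name `ib_register_peer_memory_client` is assumed from the mlnx-ofa_kernel peer-memory API and is not quoted in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Generic function-pointer type used to store any exported function. */
typedef void (*generic_fn)(void);

/* Toy stand-in for the kernel's exported-symbol table (hypothetical). */
struct export_entry {
    const char *name;
    generic_fn fn;
};

static struct export_entry export_table[4];
static size_t export_count;

/* A provider (e.g. a NIC stack) publishes a symbol when it "loads". */
int publish_symbol(const char *name, generic_fn fn)
{
    assert(name != NULL && fn != NULL);
    if (export_count == sizeof(export_table) / sizeof(export_table[0]))
        return -1; /* table full */
    export_table[export_count].name = name;
    export_table[export_count].fn = fn;
    export_count++;
    return 0;
}

/* Runtime lookup by name; returns NULL when no provider is loaded. */
generic_fn lookup_symbol(const char *name)
{
    for (size_t i = 0; i < export_count; i++)
        if (strcmp(export_table[i].name, name) == 0)
            return export_table[i].fn;
    return NULL;
}

/* Simplified registration signature (assumption, not the real prototype). */
typedef void *(*register_client_fn)(const char *client_name);

/* Hypothetical provider-side implementation returning an opaque handle. */
void *fake_nic_register(const char *client_name)
{
    static int handle;
    (void)client_name;
    return &handle;
}

/* Consumer side: returns 0 if registered, -1 if no provider is present. */
int peerdirect_init(void)
{
    register_client_fn reg = (register_client_fn)
        lookup_symbol("ib_register_peer_memory_client");
    if (reg == NULL)
        return -1; /* no NIC stack loaded: continue without peer memory */
    return reg("gpu") != NULL ? 0 : -1;
}
```

Because the consumer names the symbol only as a string, it builds without any NIC package present, and whichever provider is loaded at runtime supplies the implementation. In the kernel, the lookup would presumably go through a facility such as `symbol_get()` or `symbol_request()`; the PR does not say which.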

Change Log
  • patch mlnx-ofa_kernel and mlnx-ofa_kernel-hwe to export the peer-memory symbols with EXPORT_SYMBOL_GPL
Does this affect the toolchain?

NO

Associated issues
Test Methodology

Tested locally with the patched ofa_kernel and an amdgpu build that has the build-time dependency eliminated:

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong  
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)          
           8             2     float     sum      -1   225.21    0.00    0.00       0   210.65    0.00    0.00       0
          16             4     float     sum      -1   211.26    0.00    0.00       0   210.56    0.00    0.00       0
          32             8     float     sum      -1   209.27    0.00    0.00       0   209.02    0.00    0.00       0
          64            16     float     sum      -1   211.89    0.00    0.00       0   207.27    0.00    0.00       0
         128            32     float     sum      -1   211.60    0.00    0.00       0   213.49    0.00    0.00       0
         256            64     float     sum      -1   217.10    0.00    0.00       0   209.60    0.00    0.00       0
         512           128     float     sum      -1   209.81    0.00    0.00       0   213.69    0.00    0.00       0
        1024           256     float     sum      -1    46.25    0.02    0.04       0    46.74    0.02    0.04       0
        2048           512     float     sum      -1    46.06    0.04    0.08       0    45.87    0.04    0.08       0
        4096          1024     float     sum      -1    46.98    0.09    0.15       0    47.50    0.09    0.15       0
        8192          2048     float     sum      -1    45.81    0.18    0.31       0    46.92    0.17    0.31       0
       16384          4096     float     sum      -1    46.63    0.35    0.61       0    47.08    0.35    0.61       0
       32768          8192     float     sum      -1    47.27    0.69    1.21       0    48.38    0.68    1.19       0
       65536         16384     float     sum      -1    48.55    1.35    2.36       0    45.69    1.43    2.51       0
      131072         32768     float     sum      -1    57.07    2.30    4.02       0    59.30    2.21    3.87       0
      262144         65536     float     sum      -1    58.84    4.46    7.80       0    58.82    4.46    7.80       0
      524288        131072     float     sum      -1    59.95    8.75   15.31       0    58.81    8.91   15.60       0
     1048576        262144     float     sum      -1    56.75   18.48   32.33       0    58.52   17.92   31.36       0
     2097152        524288     float     sum      -1    60.47   34.68   60.69       0    61.16   34.29   60.01       0
     4194304       1048576     float     sum      -1    59.93   69.99  122.48       0    60.92   68.85  120.49       0
     8388608       2097152     float     sum      -1    85.52   98.08  171.65       0    88.54   94.75  165.81       0
    16777216       4194304     float     sum      -1   150.49  111.48  195.09       0   157.18  106.74  186.79       0
    33554432       8388608     float     sum      -1   427.23   78.54  137.44       0   429.57   78.11  136.69       0
    67108864      16777216     float     sum      -1   586.83  114.36  200.13       0   592.84  113.20  198.10       0
   134217728      33554432     float     sum      -1   953.67  140.74  246.29       0   951.04  141.13  246.97       0
   268435456      67108864     float     sum      -1  1686.36  159.18  278.57       0  1692.94  158.56  277.48       0
   536870912     134217728     float     sum      -1  3158.38  169.98  297.47       0  3176.69  169.00  295.76       0
  1073741824     268435456     float     sum      -1  6078.79  176.64  309.12       0  6078.00  176.66  309.16       0
  2147483648     536870912     float     sum      -1  11948.3  179.73  314.53       0  11954.7  179.63  314.36       0
  4294967296    1073741824     float     sum      -1  23719.2  181.08  316.88       0  23749.6  180.84  316.48       0
  8589934592    2147483648     float     sum      -1  47344.4  181.43  317.51       0  47400.8  181.22  317.13       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 97.4325 
#
# Collective test concluded: all_reduce_perf
#
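
For readers checking the table, a small sketch of how the algbw and busbw columns relate. It assumes the nccl-tests convention (busbw = algbw × 2(n−1)/n for all-reduce) and 8 ranks. The rank count is not stated in this PR and is inferred only from the busbw/algbw ratio of 1.75 in the large-message rows. Both function names are hypothetical.

```c
#include <assert.h>

/* All-reduce bus-bandwidth factor under the nccl-tests convention: 2(n-1)/n. */
double allreduce_busbw_factor(int nranks)
{
    assert(nranks > 0);
    return 2.0 * (nranks - 1) / nranks;
}

/* Algorithm bandwidth in GB/s (1e9 bytes/s) from message size and time in us. */
double algbw_gbps(double size_bytes, double time_us)
{
    assert(time_us > 0.0);
    return size_bytes / (time_us * 1e-6) / 1e9;
}
```

Applied to the last out-of-place row (8589934592 B in 47344.4 us), this gives roughly 181.4 GB/s algbw and, with the assumed 8 ranks, roughly 317.5 GB/s busbw, in line with the reported values.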

hpcuser@compu13f0000001 [ ~/azurelinux ]$ sudo dmesg | grep PeerDirect
[   15.958353] amdgpu: PeerDirect support was initialized successfully
hpcuser@compu13f0000001 [ ~/azurelinux ]$ 

@arsdragonfly arsdragonfly requested a review from a team as a code owner May 27, 2026 23:29
@microsoft-github-policy-service microsoft-github-policy-service Bot added Packaging 3.0 Issues and PRs for Azure Linux 3.0 labels May 27, 2026