1. Original approach: memory allocated with cudaHostAlloc
=================================================================
GDR Copy Benchmark — GPU 4 NIC mlx5_4
=================================================================
GPU: NVIDIA H20 (PCIe gen152 x0)
GPUDirect RDMA path: ACTIVE
--- Host→Device (H2D) ---
Size | GDR (median / p99 / BW) | CUDA (median / p99 / BW)
-------------+------------------------------+-----------------------------
4KiB | 3.75 µs / 4.25 µs / 1.06 GB/s | 8.75 µs / 9.50 µs / 0.47 GB/s
16KiB | 7.00 µs / 7.25 µs / 2.38 GB/s | 11.50 µs / 14.75 µs / 1.40 GB/s
64KiB | 12.25 µs / 19.50 µs / 5.22 GB/s | 9.25 µs / 10.25 µs / 6.98 GB/s
256KiB | 32.75 µs / 36.25 µs / 8.03 GB/s | 13.00 µs / 18.75 µs / 20.08 GB/s
1MiB | 120.25 µs / 133.00 µs / 8.70 GB/s | 27.25 µs / 30.50 µs / 38.36 GB/s
4MiB | 338.50 µs / 351.75 µs / 12.34 GB/s | 83.75 µs / 290.75 µs / 48.67 GB/s
16MiB | 1357.00 µs / 1398.25 µs / 12.35 GB/s | 310.50 µs / 315.75 µs / 53.99 GB/s
64MiB | 6066.00 µs / 8246.75 µs / 10.97 GB/s | 1217.75 µs / 1237.25 µs / 55.07 GB/s
--- Device→Host (D2H) ---
Size | GDR (median / p99 / BW) | CUDA (median / p99 / BW)
-------------+------------------------------+-----------------------------
4KiB | 4.00 µs / 4.50 µs / 1.05 GB/s | 7.75 µs / 14.50 µs / 0.52 GB/s
16KiB | 5.75 µs / 6.00 µs / 2.88 GB/s | 7.75 µs / 16.50 µs / 2.05 GB/s
64KiB | 9.25 µs / 9.50 µs / 7.21 GB/s | 8.75 µs / 19.25 µs / 7.35 GB/s
256KiB | 21.00 µs / 23.75 µs / 12.42 GB/s | 12.50 µs / 24.00 µs / 20.81 GB/s
1MiB | 86.25 µs / 114.50 µs / 11.87 GB/s | 27.50 µs / 39.00 µs / 37.93 GB/s
4MiB | 394.50 µs / 467.50 µs / 10.60 GB/s | 85.75 µs / 100.50 µs / 48.64 GB/s
16MiB | 1460.25 µs / 1496.50 µs / 11.48 GB/s | 320.00 µs / 336.25 µs / 52.34 GB/s
64MiB | 5838.25 µs / 5976.25 µs / 11.50 GB/s | 1260.25 µs / 1265.75 µs / 53.26 GB/s
=================================================================
Total ops: 110 (RDMA: 110 Fallback: 0)
Total bytes: 6.88 GiB
=================================================================
=================================================================
GDR Copy Benchmark — GPU 4 NIC mlx5_4
=================================================================
GPU: NVIDIA H20 (PCIe gen152 x0)
GPUDirect RDMA path: ACTIVE
--- Host→Device (H2D) ---
Size | GDR (median / p99 / BW) | CUDA (median / p99 / BW)
-------------+------------------------------+-----------------------------
4KiB | 3.75 µs / 4.50 µs / 1.06 GB/s | 7.25 µs / 7.75 µs / 0.57 GB/s
16KiB | 6.25 µs / 6.75 µs / 2.60 GB/s | 10.00 µs / 11.25 µs / 1.60 GB/s
64KiB | 10.50 µs / 14.00 µs / 6.25 GB/s | 17.25 µs / 23.25 µs / 3.64 GB/s
256KiB | 29.25 µs / 33.25 µs / 9.04 GB/s | 28.50 µs / 34.00 µs / 9.18 GB/s
1MiB | 114.75 µs / 124.50 µs / 9.12 GB/s | 85.75 µs / 90.50 µs / 12.20 GB/s
4MiB | 363.00 µs / 1038.00 µs / 11.17 GB/s | 300.00 µs / 312.75 µs / 13.93 GB/s
16MiB | 1451.00 µs / 1471.00 µs / 11.57 GB/s | 1099.00 µs / 1124.00 µs / 15.25 GB/s
64MiB | 6191.00 µs / 17711.00 µs / 10.02 GB/s | 5138.50 µs / 9132.00 µs / 12.56 GB/s
--- Device→Host (D2H) ---
Size | GDR (median / p99 / BW) | CUDA (median / p99 / BW)
-------------+------------------------------+-----------------------------
4KiB | 4.25 µs / 7.75 µs / 0.96 GB/s | 12.75 µs / 52.75 µs / 0.30 GB/s
16KiB | 6.25 µs / 7.50 µs / 2.58 GB/s | 14.25 µs / 44.25 µs / 1.06 GB/s
64KiB | 9.50 µs / 10.25 µs / 6.88 GB/s | 20.75 µs / 43.25 µs / 3.11 GB/s
256KiB | 21.50 µs / 39.25 µs / 12.04 GB/s | 43.50 µs / 62.00 µs / 5.96 GB/s
1MiB | 91.75 µs / 104.00 µs / 11.38 GB/s | 132.75 µs / 156.50 µs / 8.26 GB/s
4MiB | 350.00 µs / 367.00 µs / 11.98 GB/s | 291.50 µs / 305.25 µs / 14.32 GB/s
16MiB | 1362.25 µs / 6507.50 µs / 10.32 GB/s | 1276.50 µs / 1325.00 µs / 13.13 GB/s
64MiB | 7238.75 µs / 15770.00 µs / 8.40 GB/s | 5622.50 µs / 48008.25 µs / 9.08 GB/s
=================================================================
Total ops: 110 (RDMA: 110 Fallback: 0)
Total bytes: 6.88 GiB
=================================================================
3. CPU allocates directly in memory already MR-registered with the RDMA device
=================================================================
GDR Copy Benchmark — GPU 4 NIC mlx5_4
=================================================================
GPU: NVIDIA H20 (PCIe gen152 x0)
GPUDirect RDMA path: ACTIVE
--- Host→Device (H2D) ---
Size | GDR (median / p99 / BW) | CUDA (median / p99 / BW)
-------------+------------------------------+-----------------------------
4KiB | 3.00 µs / 3.50 µs / 1.33 GB/s | 8.25 µs / 13.75 µs / 0.49 GB/s
16KiB | 4.50 µs / 4.75 µs / 3.66 GB/s | 11.00 µs / 12.50 µs / 1.46 GB/s
64KiB | 7.00 µs / 7.25 µs / 9.47 GB/s | 9.00 µs / 14.50 µs / 7.29 GB/s
256KiB | 12.25 µs / 17.50 µs / 21.32 GB/s | 12.25 µs / 13.50 µs / 21.11 GB/s
1MiB | 29.25 µs / 31.50 µs / 35.81 GB/s | 26.75 µs / 31.75 µs / 39.09 GB/s
4MiB | 96.50 µs / 97.00 µs / 43.51 GB/s | 83.25 µs / 84.00 µs / 50.37 GB/s
16MiB | 384.00 µs / 387.00 µs / 43.68 GB/s | 310.00 µs / 313.50 µs / 54.11 GB/s
64MiB | 1589.00 µs / 2543.00 µs / 39.25 GB/s | 1281.00 µs / 1302.25 µs / 52.43 GB/s
--- Device→Host (D2H) ---
Size | GDR (median / p99 / BW) | CUDA (median / p99 / BW)
-------------+------------------------------+-----------------------------
4KiB | 3.75 µs / 4.25 µs / 1.10 GB/s | 7.75 µs / 10.75 µs / 0.53 GB/s
16KiB | 5.00 µs / 44.75 µs / 3.04 GB/s | 7.75 µs / 9.25 µs / 2.09 GB/s
64KiB | 6.25 µs / 6.75 µs / 10.43 GB/s | 9.50 µs / 16.50 µs / 6.80 GB/s
256KiB | 11.50 µs / 17.00 µs / 22.65 GB/s | 13.25 µs / 26.25 µs / 19.39 GB/s
1MiB | 29.75 µs / 70.50 µs / 34.27 GB/s | 28.75 µs / 38.00 µs / 35.64 GB/s
4MiB | 102.25 µs / 106.00 µs / 40.96 GB/s | 88.00 µs / 107.25 µs / 47.15 GB/s
16MiB | 403.50 µs / 431.75 µs / 41.50 GB/s | 320.50 µs / 338.00 µs / 52.12 GB/s
64MiB | 1770.50 µs / 1774.50 µs / 37.90 GB/s | 1255.25 µs / 1261.25 µs / 53.44 GB/s
=================================================================
Total ops: 110 (RDMA: 110 Fallback: 0)
Total bytes: 6.88 GiB
=================================================================
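To compare the GDR and CUDA columns across runs without reading the tables by eye, the rows can be parsed mechanically. The following is a minimal sketch, assuming each data row keeps the exact `Size | median / p99 / BW | median / p99 / BW` layout shown in the logs; the names `parse_row` and `bandwidth_ratio` are hypothetical helpers introduced for illustration and are not part of the benchmark tool.

```python
import re

# Matches one data row of the tables above, for example:
# "1MiB | 29.25 µs / 31.50 µs / 35.81 GB/s | 26.75 µs / 31.75 µs / 39.09 GB/s"
# Both the micro sign (U+00B5) and Greek mu (U+03BC) are accepted.
_TRIPLE = r"([\d.]+) [µμ]s / ([\d.]+) [µμ]s / ([\d.]+) GB/s"
ROW_RE = re.compile(r"^\s*(\d+(?:KiB|MiB))\s*\|\s*" + _TRIPLE + r"\s*\|\s*" + _TRIPLE + r"\s*$")

FIELDS = (
    "gdr_median_us", "gdr_p99_us", "gdr_bw_gbs",
    "cuda_median_us", "cuda_p99_us", "cuda_bw_gbs",
)


def parse_row(line):
    """Return a dict for a data row, or None for headers and separators."""
    match = ROW_RE.match(line)
    if match is None:
        return None
    row = {"size": match.group(1)}
    for index, name in enumerate(FIELDS, start=2):
        row[name] = float(match.group(index))
    return row


def bandwidth_ratio(row):
    """GDR bandwidth divided by CUDA bandwidth (above 1.0 means GDR is faster)."""
    return row["gdr_bw_gbs"] / row["cuda_bw_gbs"]


if __name__ == "__main__":
    sample = "1MiB | 29.25 µs / 31.50 µs / 35.81 GB/s | 26.75 µs / 31.75 µs / 39.09 GB/s"
    parsed = parse_row(sample)
    print(parsed["size"], round(bandwidth_ratio(parsed), 3))
```

Feeding every line of a log through `parse_row` and skipping the `None` results gives one record per size and direction, which makes it straightforward to line up the three configurations side by side.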