
[Question] How does nixl use MNNVL communication in the GB200 series environment? #1240

@kebe7jun

When I use NIXL as the KV-cache connector for Prefill/Decode (P/D) disaggregation in vLLM and enable UCX_PROTO_INFO=y, I observe that the traffic never negotiates the MNNVL (multi-node NVLink) path.
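For context, each instance is started roughly like this (a sketch rather than my exact command line; <model> and the port are placeholders, and the kv-transfer-config follows the standard vLLM NixlConnector examples):

    # one prefill/decode instance, with UCX protocol tracing enabled
    UCX_PROTO_INFO=y vllm serve <model> --port 8000 \
        --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'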

The resulting UCX protocol selection tables only ever pick the RDMA transports (rc_mlx5/dc_mlx5), never cuda_ipc:
[1769083945.891317] [local-cluster-10:1768445:0]   +--------------------------------+--------------------------------------------------------------------+
[1769083945.891317] [local-cluster-10:1768445:0]   | ucp_context_0 intra-node cfg#1 | remote memory write by ucp_put* from host memory to cuda/dev[1]    |
[1769083945.891317] [local-cluster-10:1768445:0]   +--------------------------------+--------------+-----------------------------------------------------+
[1769083945.891318] [local-cluster-10:1768445:0]   |                         0..220 | short        | rc_mlx5/mlx5_2:1                                    |
[1769083945.891318] [local-cluster-10:1768445:0]   |                221..3371787874 | copy-in      | rc_mlx5/mlx5_2:1                                    |
[1769083945.891318] [local-cluster-10:1768445:0]   |                3371787875..inf | zero-copy    | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083945.891318] [local-cluster-10:1768445:0]   +--------------------------------+--------------+-----------------------------------------------------+
[1769083945.891379] [local-cluster-10:1768445:0]   +--------------------------------+-------------------------------------------------------------------------------------+
[1769083945.891379] [local-cluster-10:1768445:0]   | ucp_context_0 intra-node cfg#1 | remote memory write by ucp_put*(fast-completion) from host memory to cuda/dev[1]    |
[1769083945.891379] [local-cluster-10:1768445:0]   +--------------------------------+-------------------------------+-----------------------------------------------------+
[1769083945.891380] [local-cluster-10:1768445:0]   |                         0..220 | short                         | rc_mlx5/mlx5_2:1                                    |
[1769083945.891380] [local-cluster-10:1768445:0]   |                221..3371891700 | copy-in                       | rc_mlx5/mlx5_2:1                                    |
[1769083945.891380] [local-cluster-10:1768445:0]   |                3371891701..inf | zero-copy                     | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083945.891380] [local-cluster-10:1768445:0]   +--------------------------------+-------------------------------+-----------------------------------------------------+
[1769083945.891443] [local-cluster-10:1768445:0]   +--------------------------------+---------------------------------------------------------------------------+
[1769083945.891443] [local-cluster-10:1768445:0]   | ucp_context_0 intra-node cfg#1 | remote memory write by ucp_put*(multi) from host memory to cuda/dev[1]    |
[1769083945.891443] [local-cluster-10:1768445:0]   +--------------------------------+---------------------+-----------------------------------------------------+
[1769083945.891443] [local-cluster-10:1768445:0]   |                         0..220 | short               | rc_mlx5/mlx5_2:1                                    |
[1769083945.891443] [local-cluster-10:1768445:0]   |                   221..1157826 | copy-in             | rc_mlx5/mlx5_2:1                                    |
[1769083945.891444] [local-cluster-10:1768445:0]   |                   1157827..inf | zero-copy           | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083945.891444] [local-cluster-10:1768445:0]   +--------------------------------+---------------------+-----------------------------------------------------+
[1769084049.338751] [local-cluster-10:1768445:a]   +--------------------------------+-------------------------------------------------------------+
[1769084049.338755] [local-cluster-10:1768445:a]   | ucp_context_0 inter-node cfg#2 | remote memory write by ucp_put* from host memory to host    |
[1769084049.338756] [local-cluster-10:1768445:a]   +--------------------------------+------------------------------------------+------------------+
[1769084049.338756] [local-cluster-10:1768445:a]   |                         0..220 | short                                    | rc_mlx5/mlx5_3:1 |
[1769084049.338757] [local-cluster-10:1768445:a]   |                       221..inf | copy-in                                  | rc_mlx5/mlx5_3:1 |
[1769084049.338757] [local-cluster-10:1768445:a]   +--------------------------------+------------------------------------------+------------------+
[1769083942.580488] [local-cluster-10:1768444:0]   |                         0..220 | short               | rc_mlx5/mlx5_2:1                                    |
[1769083942.580488] [local-cluster-10:1768444:0]   |                   221..1157826 | copy-in             | rc_mlx5/mlx5_2:1                                    |
[1769083942.580489] [local-cluster-10:1768444:0]   |                   1157827..inf | zero-copy           | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083942.580489] [local-cluster-10:1768444:0]   +--------------------------------+---------------------+-----------------------------------------------------+
[1769083942.580573] [local-cluster-10:1768444:0]   +--------------------------------+--------------------------------------------------------------------+
[1769083942.580573] [local-cluster-10:1768444:0]   | ucp_context_0 intra-node cfg#1 | remote memory write by ucp_put* from host memory to cuda/dev[0]    |
[1769083942.580573] [local-cluster-10:1768444:0]   +--------------------------------+--------------+-----------------------------------------------------+
[1769083942.580573] [local-cluster-10:1768444:0]   |                         0..220 | short        | rc_mlx5/mlx5_2:1                                    |
[1769083942.580574] [local-cluster-10:1768444:0]   |                221..3371787874 | copy-in      | rc_mlx5/mlx5_2:1                                    |
[1769083942.580574] [local-cluster-10:1768444:0]   |                3371787875..inf | zero-copy    | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083942.580574] [local-cluster-10:1768444:0]   +--------------------------------+--------------+-----------------------------------------------------+
[1769083942.580632] [local-cluster-10:1768444:0]   +--------------------------------+-------------------------------------------------------------------------------------+
[1769083942.580632] [local-cluster-10:1768444:0]   | ucp_context_0 intra-node cfg#1 | remote memory write by ucp_put*(fast-completion) from host memory to cuda/dev[0]    |
[1769083942.580633] [local-cluster-10:1768444:0]   +--------------------------------+-------------------------------+-----------------------------------------------------+
[1769083942.580633] [local-cluster-10:1768444:0]   |                         0..220 | short                         | rc_mlx5/mlx5_2:1                                    |
[1769083942.580633] [local-cluster-10:1768444:0]   |                221..3371891700 | copy-in                       | rc_mlx5/mlx5_2:1                                    |
[1769083942.580633] [local-cluster-10:1768444:0]   |                3371891701..inf | zero-copy                     | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083942.580634] [local-cluster-10:1768444:0]   +--------------------------------+-------------------------------+-----------------------------------------------------+

The KV transfer efficiency also seems suboptimal: ~4.78 MB per transfer at an average of 1.49 ms works out to roughly 3.2 GB/s, nowhere near NVLink-class bandwidth:

KV Transfer metrics: Num successful transfers=2, Avg xfer time (ms)=1.49, P90 xfer time (ms)=1.627, Avg post time (ms)=1.15, P90 post time (ms)=1.28, Avg MB per transfer=4.781, Throughput (MB/s)=3209.478.

This behavior is consistent across both nixl[cu13]==0.9.0 and nixl[cu13]==0.8.0.

To isolate the issue, I tested with ucx_perftest (both a self-built binary and the system-provided one; neither enables MNNVL by default in my environment). My test setup is as follows:

Server: UCX_PROTO_INFO=y UCX_NET_DEVICES=mlx5_2:1 UCX_TLS=ib,sm,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_ENABLE_MNNVL=y ucx_perftest -t tag_bw -m cuda -n 1000 -s 4194304

Client: UCX_NET_DEVICES=mlx5_2:1 UCX_TLS=ib,sm,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_ENABLE_MNNVL=y ucx_perftest -t tag_bw -m cuda -n 1000 -s 41943040 <serverip>

In this case, MNNVL is negotiated successfully:

[1769084848.088832] [local-cluster-10:1776775:0]   +---------------------------+----------------------------------------------------------+
[1769084848.088841] [local-cluster-10:1776775:0]   | perftest inter-node cfg#1 | rendezvous data fetch(multi) into cuda/GPU0 from cuda    |
[1769084848.088843] [local-cluster-10:1776775:0]   +---------------------------+------------------------------------------+---------------+
[1769084848.088844] [local-cluster-10:1776775:0]   |                         0 | no data fetch                            |               |
[1769084848.088848] [local-cluster-10:1776775:0]   |                    1..inf | zero-copy read from remote               | cuda_ipc/cuda |
[1769084848.088850] [local-cluster-10:1776775:0]   +---------------------------+------------------------------------------+---------------+
[1769084848.091072] [local-cluster-10:1776775:0]   +---------------------------+--------------------------------------------------------------+
[1769084848.091082] [local-cluster-10:1776775:0]   | perftest inter-node cfg#1 | tagged message by ucp_tag_send*(multi) from cuda/GPU0        |
[1769084848.091083] [local-cluster-10:1776775:0]   +---------------------------+-------------------------------------------+------------------+
[1769084848.091085] [local-cluster-10:1776775:0]   |                         0 | eager short                               | rc_mlx5/mlx5_2:1 |
[1769084848.091088] [local-cluster-10:1776775:0]   |                    1..inf | (?) rendezvous zero-copy read from remote | cuda_ipc/cuda    |
[1769084848.091090] [local-cluster-10:1776775:0]   +---------------------------+-------------------------------------------+------------------+

And the performance is excellent, with bandwidth around 720 GB/s:

+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     55.456    55.555    55.555   720005.84  720005.84       18000       18000

I have reviewed many documents and references, and most of them indicate that setting UCX_CUDA_IPC_ENABLE_MNNVL=y should be sufficient. However, in my case, that does not appear to enable MNNVL for vLLM/NIXL.
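Concretely, this is the environment I would expect to be sufficient for the vLLM workers, mirroring the working ucx_perftest runs above (a sketch: in my actual runs I set at least UCX_PROTO_INFO and UCX_CUDA_IPC_ENABLE_MNNVL; the remaining variables simply copy the perftest commands):

    # the same UCX knobs that make ucx_perftest take the MNNVL path
    export UCX_CUDA_IPC_ENABLE_MNNVL=y
    export UCX_TLS=ib,sm,cuda_copy,cuda_ipc,self
    export UCX_NET_DEVICES=mlx5_2:1
    export UCX_PROTO_INFO=y
    # <model> is a placeholder; connector config as in the launch sketch above
    vllm serve <model> \
        --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'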

I would like to understand what additional configuration or constraints are required to make vLLM/NIXL use the MNNVL path, and why it works with ucx_perftest but not in the vLLM + NIXL connector workflow.
