Description
When I use NIXL as the connector for Prefill/Decode (P/D) disaggregation in vLLM, with UCX_PROTO_INFO=y enabled, I observe that the traffic never negotiates the MNNVL path; every selected protocol stays on the RDMA transports (rc_mlx5/dc_mlx5).
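For context, the workers are launched roughly as follows (a minimal sketch only; the model name, port, and kv_role are placeholders, and the MNNVL toggle is exported the same way as in the ucx_perftest runs further down):
UCX_PROTO_INFO=y UCX_CUDA_IPC_ENABLE_MNNVL=y vllm serve <model> --port 8100 --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'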
Details
[1769083945.891234] [local-cluster-10:1768445:0] +--------------------------------+---------------------+-----------------------------------------------------+
[1769083945.891317] [local-cluster-10:1768445:0] +--------------------------------+--------------------------------------------------------------------+
[1769083945.891317] [local-cluster-10:1768445:0] | ucp_context_0 intra-node cfg#1 | remote memory write by ucp_put* from host memory to cuda/dev[1] |
[1769083945.891317] [local-cluster-10:1768445:0] +--------------------------------+--------------+-----------------------------------------------------+
[1769083945.891318] [local-cluster-10:1768445:0] | 0..220 | short | rc_mlx5/mlx5_2:1 |
[1769083945.891318] [local-cluster-10:1768445:0] | 221..3371787874 | copy-in | rc_mlx5/mlx5_2:1 |
[1769083945.891318] [local-cluster-10:1768445:0] | 3371787875..inf | zero-copy | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083945.891318] [local-cluster-10:1768445:0] +--------------------------------+--------------+-----------------------------------------------------+
[1769083945.891379] [local-cluster-10:1768445:0] +--------------------------------+-------------------------------------------------------------------------------------+
[1769083945.891379] [local-cluster-10:1768445:0] | ucp_context_0 intra-node cfg#1 | remote memory write by ucp_put*(fast-completion) from host memory to cuda/dev[1] |
[1769083945.891379] [local-cluster-10:1768445:0] +--------------------------------+-------------------------------+-----------------------------------------------------+
[1769083945.891380] [local-cluster-10:1768445:0] | 0..220 | short | rc_mlx5/mlx5_2:1 |
[1769083945.891380] [local-cluster-10:1768445:0] | 221..3371891700 | copy-in | rc_mlx5/mlx5_2:1 |
[1769083945.891380] [local-cluster-10:1768445:0] | 3371891701..inf | zero-copy | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083945.891380] [local-cluster-10:1768445:0] +--------------------------------+-------------------------------+-----------------------------------------------------+
[1769083945.891443] [local-cluster-10:1768445:0] +--------------------------------+---------------------------------------------------------------------------+
[1769083945.891443] [local-cluster-10:1768445:0] | ucp_context_0 intra-node cfg#1 | remote memory write by ucp_put*(multi) from host memory to cuda/dev[1] |
[1769083945.891443] [local-cluster-10:1768445:0] +--------------------------------+---------------------+-----------------------------------------------------+
[1769083945.891443] [local-cluster-10:1768445:0] | 0..220 | short | rc_mlx5/mlx5_2:1 |
[1769083945.891443] [local-cluster-10:1768445:0] | 221..1157826 | copy-in | rc_mlx5/mlx5_2:1 |
[1769083945.891444] [local-cluster-10:1768445:0] | 1157827..inf | zero-copy | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083945.891444] [local-cluster-10:1768445:0] +--------------------------------+---------------------+-----------------------------------------------------+
[1769084049.338751] [local-cluster-10:1768445:a] +--------------------------------+-------------------------------------------------------------+
[1769084049.338755] [local-cluster-10:1768445:a] | ucp_context_0 inter-node cfg#2 | remote memory write by ucp_put* from host memory to host |
[1769084049.338756] [local-cluster-10:1768445:a] +--------------------------------+------------------------------------------+------------------+
[1769084049.338756] [local-cluster-10:1768445:a] | 0..220 | short | rc_mlx5/mlx5_3:1 |
[1769084049.338757] [local-cluster-10:1768445:a] | 221..inf | copy-in | rc_mlx5/mlx5_3:1 |
[1769084049.338757] [local-cluster-10:1768445:a] +--------------------------------+------------------------------------------+------------------+
[1769083942.580488] [local-cluster-10:1768444:0] | 0..220 | short | rc_mlx5/mlx5_2:1 |
[1769083942.580488] [local-cluster-10:1768444:0] | 221..1157826 | copy-in | rc_mlx5/mlx5_2:1 |
[1769083942.580489] [local-cluster-10:1768444:0] | 1157827..inf | zero-copy | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083942.580489] [local-cluster-10:1768444:0] +--------------------------------+---------------------+-----------------------------------------------------+
[1769083942.580573] [local-cluster-10:1768444:0] +--------------------------------+--------------------------------------------------------------------+
[1769083942.580573] [local-cluster-10:1768444:0] | ucp_context_0 intra-node cfg#1 | remote memory write by ucp_put* from host memory to cuda/dev[0] |
[1769083942.580573] [local-cluster-10:1768444:0] +--------------------------------+--------------+-----------------------------------------------------+
[1769083942.580573] [local-cluster-10:1768444:0] | 0..220 | short | rc_mlx5/mlx5_2:1 |
[1769083942.580574] [local-cluster-10:1768444:0] | 221..3371787874 | copy-in | rc_mlx5/mlx5_2:1 |
[1769083942.580574] [local-cluster-10:1768444:0] | 3371787875..inf | zero-copy | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083942.580574] [local-cluster-10:1768444:0] +--------------------------------+--------------+-----------------------------------------------------+
[1769083942.580632] [local-cluster-10:1768444:0] +--------------------------------+-------------------------------------------------------------------------------------+
[1769083942.580632] [local-cluster-10:1768444:0] | ucp_context_0 intra-node cfg#1 | remote memory write by ucp_put*(fast-completion) from host memory to cuda/dev[0] |
[1769083942.580633] [local-cluster-10:1768444:0] +--------------------------------+-------------------------------+-----------------------------------------------------+
[1769083942.580633] [local-cluster-10:1768444:0] | 0..220 | short | rc_mlx5/mlx5_2:1 |
[1769083942.580633] [local-cluster-10:1768444:0] | 221..3371891700 | copy-in | rc_mlx5/mlx5_2:1 |
[1769083942.580633] [local-cluster-10:1768444:0] | 3371891701..inf | zero-copy | 72% on rc_mlx5/mlx5_2:1 and 28% on dc_mlx5/mlx5_3:1 |
[1769083942.580634] [local-cluster-10:1768444:0] +--------------------------------+-------------------------------+-----------------------------------------------------+
The KV transfer efficiency also seems suboptimal:
KV Transfer metrics: Num successful transfers=2, Avg xfer time (ms)=1.49, P90 xfer time (ms)=1.627, Avg post time (ms)=1.15, P90 post time (ms)=1.28, Avg MB per transfer=4.781, Throughput (MB/s)=3209.478.
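(As a sanity check: 4.781 MB / 1.49 ms ≈ 3209 MB/s, which matches the reported throughput and is in line with the single-NIC RDMA path selected in the tables above, not NVLink.)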
This behavior is consistent across both nixl[cu13]==0.9.0 and nixl[cu13]==0.8.0.
To isolate the issue, I tested with ucx_perftest (both my self-built binary and the system-provided one; neither enables MNNVL by default in my environment). My test setup is as follows:
Server: UCX_PROTO_INFO=y UCX_NET_DEVICES=mlx5_2:1 UCX_TLS=ib,sm,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_ENABLE_MNNVL=y ucx_perftest -t tag_bw -m cuda -n 1000 -s 4194304
Client: UCX_NET_DEVICES=mlx5_2:1 UCX_TLS=ib,sm,cuda_copy,cuda_ipc,self UCX_CUDA_IPC_ENABLE_MNNVL=y ucx_perftest -t tag_bw -m cuda -n 1000 -s 41943040 <serverip>
In this case, MNNVL is successfully used:
[1769084848.088832] [local-cluster-10:1776775:0] +---------------------------+----------------------------------------------------------+
[1769084848.088841] [local-cluster-10:1776775:0] | perftest inter-node cfg#1 | rendezvous data fetch(multi) into cuda/GPU0 from cuda |
[1769084848.088843] [local-cluster-10:1776775:0] +---------------------------+------------------------------------------+---------------+
[1769084848.088844] [local-cluster-10:1776775:0] | 0 | no data fetch | |
[1769084848.088848] [local-cluster-10:1776775:0] | 1..inf | zero-copy read from remote | cuda_ipc/cuda |
[1769084848.088850] [local-cluster-10:1776775:0] +---------------------------+------------------------------------------+---------------+
[1769084848.091072] [local-cluster-10:1776775:0] +---------------------------+--------------------------------------------------------------+
[1769084848.091082] [local-cluster-10:1776775:0] | perftest inter-node cfg#1 | tagged message by ucp_tag_send*(multi) from cuda/GPU0 |
[1769084848.091083] [local-cluster-10:1776775:0] +---------------------------+-------------------------------------------+------------------+
[1769084848.091085] [local-cluster-10:1776775:0] | 0 | eager short | rc_mlx5/mlx5_2:1 |
[1769084848.091088] [local-cluster-10:1776775:0] | 1..inf | (?) rendezvous zero-copy read from remote | cuda_ipc/cuda |
[1769084848.091090] [local-cluster-10:1776775:0] +---------------------------+-------------------------------------------+------------------+
And the performance is excellent:
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
Final: 1000 55.456 55.555 55.555 720005.84 720005.84 18000 18000
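For scale, that is roughly 720 GB/s, versus the ~3.2 GB/s observed on the NIXL transfer path above.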
I have reviewed many documents and references, and most of them indicate that setting UCX_CUDA_IPC_ENABLE_MNNVL=y should be sufficient. However, in my case, that does not appear to enable MNNVL for vLLM/NIXL.
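One thing worth ruling out is whether the variable even reaches the vLLM worker processes; a quick check (where <worker-pid> is a placeholder for the worker's PID):
tr '\0' '\n' < /proc/<worker-pid>/environ | grep UCX_CUDA_IPC_ENABLE_MNNVL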
I would like to understand what additional configuration or constraints are required to make vLLM/NIXL use the MNNVL path, and why it works with ucx_perftest but not in the vLLM + NIXL connector workflow.