The testing env:
2 pods, nixl-client and nixl-server, created from the deployment manifests: https://github.com/llm-d/llm-d-pd-utils/blob/main/deployment/nixl_client.yaml and https://github.com/llm-d/llm-d-pd-utils/blob/main/deployment/nixl_server.yaml
The two pods are running on different GPU nodes.
Run the following command in the nixl-server pod: python3 /workspace/benchmark.py --role peer --operation READ --host 0.0.0.0 --device cpu
Run the following command in the nixl-client pod: python3 benchmark.py --role creator --operation READ --host 10.216.188.96 --device cpu
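For reference, the two steps above can be driven from outside the cluster roughly like this (a sketch only, assuming kubectl access to the cluster and that the pod names match the manifests; 10.216.188.96 is the nixl-server pod IP):

```shell
# Hypothetical repro sketch: start the peer in the server pod first (background it),
# then start the creator in the client pod, pointing --host at the server pod IP.
kubectl exec nixl-server -- python3 /workspace/benchmark.py \
    --role peer --operation READ --host 0.0.0.0 --device cpu &
kubectl exec nixl-client -- python3 benchmark.py \
    --role creator --operation READ --host 10.216.188.96 --device cpu
```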
The transfer itself completes and passes the correctness check, but nixl-server then crashes with the logs below. Do you have any idea about this? @aslom @clubanderson
2025-06-23 10:25:40,668 - AgentPair-0 - INFO - Peer added creator: creator-4wt4sdge-0
2025-06-23 10:25:40,668 - AgentPair-0 - INFO - Initializing transfer metadata for peer-xnvv698w-0 with operation READ
2025-06-23 10:25:40,672 - AgentPair-0 - INFO - peer-xnvv698w-0 received START message
2025-06-23 10:25:40,672 - AgentPair-0 - INFO - peer-xnvv698w-0 sending xfer descs to creator-4wt4sdge-0
2025-06-23 10:25:40,672 - AgentPair-0 - INFO - Time to initialize transfer metadata: 0.00 seconds
2025-06-23 10:25:40,672 - AgentPair-0 - INFO - Starting transfer for peer-xnvv698w-0
[1750674340.673323] [nixl-server:82 :1] topo.c:895 UCX DEBUG /sys/class/net/eth0: sysfs path undetected
[1750674340.673329] [nixl-server:82 :1] topo.c:848 UCX DEBUG eth0: pci bandwidth undetected, using maximal value
[1750674340.673389] [nixl-server:82 :1] topo.c:895 UCX DEBUG /sys/class/net/eth0: sysfs path undetected
[1750674340.673393] [nixl-server:82 :1] topo.c:848 UCX DEBUG eth0: pci bandwidth undetected, using maximal value
[1750674340.673412] [nixl-server:82 :1] mpool.c:281 UCX DEBUG mpool uct_tcp_iface_tx_buf_mp: allocated chunk 0x7fbfe8035590 of 66136 bytes with 8 elements
[1750674340.673556] [nixl-server:82 :1] mpool.c:281 UCX DEBUG mpool ucp_requests: allocated chunk 0x7fbfe8000c54 of 41044 bytes with 128 elements
2025-06-23 10:25:41,796 - AgentPair-0 - INFO - Transfer finished in peer
2025-06-23 10:25:41,796 - AgentPair-0 - INFO - Round 0: Transfer speed: 2.99 GB/s
2025-06-23 10:25:42,220 - AgentPair-0 - INFO - Passed correctness check!
2025-06-23 10:25:42,220 - AgentPair-0 - INFO - Agent pair peer-xnvv698w-0 completed
[nixl-server:82 :1:110] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fc01ecb4770)
==== backtrace (tid: 110) ====
 0 /usr/lib/libucs.so.0(ucs_handle_error+0x304) [0x7fc1e55bf074]
 1 /usr/lib/libucs.so.0(+0x3c274) [0x7fc1e55bf274]
 2 /usr/lib/libucs.so.0(+0x3c4b8) [0x7fc1e55bf4b8]
 3 /lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7fc202c45330]
 4 /usr/lib/libucp.so.0(+0x9993c) [0x7fc1085e793c]
 5 /usr/lib/libuct.so.0(uct_tcp_ep_am_bcopy+0x8d) [0x7fc1fa8dd57d]
 6 /usr/lib/libucp.so.0(+0x9937e) [0x7fc1085e737e]
 7 /usr/lib/libuct.so.0(uct_tcp_ep_pending_queue_dispatch+0x41) [0x7fc1fa8d9ed1]
 8 /usr/lib/libuct.so.0(+0x28d88) [0x7fc1fa8dad88]
 9 /usr/lib/libuct.so.0(+0x2d8b3) [0x7fc1fa8df8b3]
10 /usr/lib/libucs.so.0(ucs_event_set_wait+0x139) [0x7fc1e55cc5a9]
11 /usr/lib/libuct.so.0(uct_tcp_iface_progress+0x93) [0x7fc1fa8de823]
12 /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7fc1085b6b7a]
13 /usr/local/nixl/lib/x86_64-linux-gnu/libucx_utils.so(_ZN13nixlUcxWorker8progressEv+0x20) [0x7fc201d657bc]
14 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZN13nixlUcxEngine12progressFuncEv+0x40) [0x7fc1087625be]
15 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZSt13__invoke_implIvM13nixlUcxEngineFvvEPS0_JEET_St21__invoke_memfun_derefOT0_OT1_DpOT2_+0x6a) [0x7fc10879d6e2]
16 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZSt8__invokeIM13nixlUcxEngineFvvEJPS0_EENSt15__invoke_resultIT_JDpT0_EE4typeEOS5_DpOS6_+0x3b) [0x7fc10879d635]
17 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZNSt6thread8_InvokerISt5tupleIJM13nixlUcxEngineFvvEPS2_EEE9_M_invokeIJLm0ELm1EEEEvSt12_Index_tupleIJXspT_EEE+0x47) [0x7fc10879d595]
18 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZNSt6thread8_InvokerISt5tupleIJM13nixlUcxEngineFvvEPS2_EEEclEv+0x1c) [0x7fc10879d4da]
19 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJM13nixlUcxEngineFvvEPS3_EEEEE6_M_runEv+0x20) [0x7fc10879d49c]
20 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fc19c596db4]
21 /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fc202c9caa4]
22 /lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fc202d29c3c]
=================================
Segmentation fault (core dumped)