The testing env:
2 pods, nixl-client and nixl-server, created from the deployment manifests: https://github.com/llm-d/llm-d-pd-utils/blob/main/deployment/nixl_client.yaml and https://github.com/llm-d/llm-d-pd-utils/blob/main/deployment/nixl_server.yaml
The two pods are running on different GPU nodes.
Run the following command in the nixl-server pod: python3 /workspace/benchmark.py --role peer --operation READ --host 0.0.0.0 --device cpu
Run the following command in the nixl-client pod: python3 benchmark.py --role creator --operation READ --host 10.216.188.96 --device cpu
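For reference, the two steps above can be driven from outside the cluster roughly like this (a sketch only, assuming kubectl access to the cluster and that the pod names match the manifests; 10.216.188.96 is the nixl-server pod IP):

```shell
# Hypothetical repro sketch: start the peer in the server pod first (background it),
# then start the creator in the client pod, pointing --host at the server pod IP.
kubectl exec nixl-server -- python3 /workspace/benchmark.py \
    --role peer --operation READ --host 0.0.0.0 --device cpu &
kubectl exec nixl-client -- python3 benchmark.py \
    --role creator --operation READ --host 10.216.188.96 --device cpu
```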
The transfer itself completes and passes the correctness check, but nixl-server then crashes with the logs below. Do you have any idea about this? @aslom @clubanderson
2025-06-23 10:25:40,668 - AgentPair-0 - INFO - Peer added creator: creator-4wt4sdge-0
2025-06-23 10:25:40,668 - AgentPair-0 - INFO - Initializing transfer metadata for peer-xnvv698w-0 with operation READ
2025-06-23 10:25:40,672 - AgentPair-0 - INFO - peer-xnvv698w-0 received START message
2025-06-23 10:25:40,672 - AgentPair-0 - INFO - peer-xnvv698w-0 sending xfer descs to creator-4wt4sdge-0
2025-06-23 10:25:40,672 - AgentPair-0 - INFO - Time to initialize transfer metadata: 0.00 seconds
2025-06-23 10:25:40,672 - AgentPair-0 - INFO - Starting transfer for peer-xnvv698w-0
[1750674340.673323] [nixl-server:82 :1] topo.c:895 UCX DEBUG /sys/class/net/eth0: sysfs path undetected
[1750674340.673329] [nixl-server:82 :1] topo.c:848 UCX DEBUG eth0: pci bandwidth undetected, using maximal value
[1750674340.673389] [nixl-server:82 :1] topo.c:895 UCX DEBUG /sys/class/net/eth0: sysfs path undetected
[1750674340.673393] [nixl-server:82 :1] topo.c:848 UCX DEBUG eth0: pci bandwidth undetected, using maximal value
[1750674340.673412] [nixl-server:82 :1] mpool.c:281 UCX DEBUG mpool uct_tcp_iface_tx_buf_mp: allocated chunk 0x7fbfe8035590 of 66136 bytes with 8 elements
[1750674340.673556] [nixl-server:82 :1] mpool.c:281 UCX DEBUG mpool ucp_requests: allocated chunk 0x7fbfe8000c54 of 41044 bytes with 128 elements
2025-06-23 10:25:41,796 - AgentPair-0 - INFO - Transfer finished in peer
2025-06-23 10:25:41,796 - AgentPair-0 - INFO - Round 0: Transfer speed: 2.99 GB/s
2025-06-23 10:25:42,220 - AgentPair-0 - INFO - Passed correctness check!
2025-06-23 10:25:42,220 - AgentPair-0 - INFO - Agent pair peer-xnvv698w-0 completed
[nixl-server:82 :1:110] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fc01ecb4770)
==== backtrace (tid: 110) ====
 0 /usr/lib/libucs.so.0(ucs_handle_error+0x304) [0x7fc1e55bf074]
 1 /usr/lib/libucs.so.0(+0x3c274) [0x7fc1e55bf274]
 2 /usr/lib/libucs.so.0(+0x3c4b8) [0x7fc1e55bf4b8]
 3 /lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7fc202c45330]
 4 /usr/lib/libucp.so.0(+0x9993c) [0x7fc1085e793c]
 5 /usr/lib/libuct.so.0(uct_tcp_ep_am_bcopy+0x8d) [0x7fc1fa8dd57d]
 6 /usr/lib/libucp.so.0(+0x9937e) [0x7fc1085e737e]
 7 /usr/lib/libuct.so.0(uct_tcp_ep_pending_queue_dispatch+0x41) [0x7fc1fa8d9ed1]
 8 /usr/lib/libuct.so.0(+0x28d88) [0x7fc1fa8dad88]
 9 /usr/lib/libuct.so.0(+0x2d8b3) [0x7fc1fa8df8b3]
10 /usr/lib/libucs.so.0(ucs_event_set_wait+0x139) [0x7fc1e55cc5a9]
11 /usr/lib/libuct.so.0(uct_tcp_iface_progress+0x93) [0x7fc1fa8de823]
12 /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7fc1085b6b7a]
13 /usr/local/nixl/lib/x86_64-linux-gnu/libucx_utils.so(_ZN13nixlUcxWorker8progressEv+0x20) [0x7fc201d657bc]
14 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZN13nixlUcxEngine12progressFuncEv+0x40) [0x7fc1087625be]
15 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZSt13__invoke_implIvM13nixlUcxEngineFvvEPS0_JEET_St21__invoke_memfun_derefOT0_OT1_DpOT2_+0x6a) [0x7fc10879d6e2]
16 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZSt8__invokeIM13nixlUcxEngineFvvEJPS0_EENSt15__invoke_resultIT_JDpT0_EE4typeEOS5_DpOS6_+0x3b) [0x7fc10879d635]
17 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZNSt6thread8_InvokerISt5tupleIJM13nixlUcxEngineFvvEPS2_EEE9_M_invokeIJLm0ELm1EEEEvSt12_Index_tupleIJXspT_EEE+0x47) [0x7fc10879d595]
18 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZNSt6thread8_InvokerISt5tupleIJM13nixlUcxEngineFvvEPS2_EEEclEv+0x1c) [0x7fc10879d4da]
19 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJM13nixlUcxEngineFvvEPS3_EEEEE6_M_runEv+0x20) [0x7fc10879d49c]
20 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fc19c596db4]
21 /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fc202c9caa4]
22 /lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fc202d29c3c]
=================================
Segmentation fault (core dumped)