nixl server - Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fc01ecb4770) #6

@osswangxining

Test environment:
Two pods, nixl-client and nixl-server, deployed with these manifests: https://github.com/llm-d/llm-d-pd-utils/blob/main/deployment/nixl_client.yaml and https://github.com/llm-d/llm-d-pd-utils/blob/main/deployment/nixl_server.yaml

The two pods run on different GPU nodes.
Run the following command in the nixl-server pod: python3 /workspace/benchmark.py --role peer --operation READ --host 0.0.0.0 --device cpu
Run the following command in the nixl-client pod: python3 benchmark.py --role creator --operation READ --host 10.216.188.96 --device cpu

The nixl-server pod then printed the error log below: the transfer finishes and passes the correctness check, and the process segfaults right afterwards. Do you have any idea what causes this? @aslom @clubanderson

  • 2025-06-23 10:25:40,668 - AgentPair-0 - INFO - Peer added creator: creator-4wt4sdge-0

  • 2025-06-23 10:25:40,668 - AgentPair-0 - INFO - Initializing transfer metadata for peer-xnvv698w-0 with operation READ

  • 2025-06-23 10:25:40,672 - AgentPair-0 - INFO - peer-xnvv698w-0 received START message

  • 2025-06-23 10:25:40,672 - AgentPair-0 - INFO - peer-xnvv698w-0 sending xfer descs to creator-4wt4sdge-0

  • 2025-06-23 10:25:40,672 - AgentPair-0 - INFO - Time to initialize transfer metadata: 0.00 seconds

  • 2025-06-23 10:25:40,672 - AgentPair-0 - INFO - Starting transfer for peer-xnvv698w-0

  • [1750674340.673323] [nixl-server:82 :1] topo.c:895 UCX DEBUG /sys/class/net/eth0: sysfs path undetected

  • [1750674340.673329] [nixl-server:82 :1] topo.c:848 UCX DEBUG eth0: pci bandwidth undetected, using maximal value

  • [1750674340.673389] [nixl-server:82 :1] topo.c:895 UCX DEBUG /sys/class/net/eth0: sysfs path undetected

  • [1750674340.673393] [nixl-server:82 :1] topo.c:848 UCX DEBUG eth0: pci bandwidth undetected, using maximal value

  • [1750674340.673412] [nixl-server:82 :1] mpool.c:281 UCX DEBUG mpool uct_tcp_iface_tx_buf_mp: allocated chunk 0x7fbfe8035590 of 66136 bytes with 8 elements

  • [1750674340.673556] [nixl-server:82 :1] mpool.c:281 UCX DEBUG mpool ucp_requests: allocated chunk 0x7fbfe8000c54 of 41044 bytes with 128 elements

  • 2025-06-23 10:25:41,796 - AgentPair-0 - INFO - Transfer finished in peer

  • 2025-06-23 10:25:41,796 - AgentPair-0 - INFO - Round 0: Transfer speed: 2.99 GB/s

  • 2025-06-23 10:25:42,220 - AgentPair-0 - INFO - Passed correctness check!

  • 2025-06-23 10:25:42,220 - AgentPair-0 - INFO - Agent pair peer-xnvv698w-0 completed

  • [nixl-server:82 :1:110] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fc01ecb4770)

  • ==== backtrace (tid: 110) ====

  • 0 /usr/lib/libucs.so.0(ucs_handle_error+0x304) [0x7fc1e55bf074]

  • 1 /usr/lib/libucs.so.0(+0x3c274) [0x7fc1e55bf274]

  • 2 /usr/lib/libucs.so.0(+0x3c4b8) [0x7fc1e55bf4b8]

  • 3 /lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7fc202c45330]

  • 4 /usr/lib/libucp.so.0(+0x9993c) [0x7fc1085e793c]

  • 5 /usr/lib/libuct.so.0(uct_tcp_ep_am_bcopy+0x8d) [0x7fc1fa8dd57d]

  • 6 /usr/lib/libucp.so.0(+0x9937e) [0x7fc1085e737e]

  • 7 /usr/lib/libuct.so.0(uct_tcp_ep_pending_queue_dispatch+0x41) [0x7fc1fa8d9ed1]

  • 8 /usr/lib/libuct.so.0(+0x28d88) [0x7fc1fa8dad88]

  • 9 /usr/lib/libuct.so.0(+0x2d8b3) [0x7fc1fa8df8b3]

  • 10 /usr/lib/libucs.so.0(ucs_event_set_wait+0x139) [0x7fc1e55cc5a9]

  • 11 /usr/lib/libuct.so.0(uct_tcp_iface_progress+0x93) [0x7fc1fa8de823]

  • 12 /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7fc1085b6b7a]

  • 13 /usr/local/nixl/lib/x86_64-linux-gnu/libucx_utils.so(_ZN13nixlUcxWorker8progressEv+0x20) [0x7fc201d657bc]

  • 14 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZN13nixlUcxEngine12progressFuncEv+0x40) [0x7fc1087625be]

  • 15 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZSt13__invoke_implIvM13nixlUcxEngineFvvEPS0_JEET_St21__invoke_memfun_derefOT0_OT1_DpOT2_+0x6a) [0x7fc10879d6e2]

  • 16 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZSt8__invokeIM13nixlUcxEngineFvvEJPS0_EENSt15__invoke_resultIT_JDpT0_EE4typeEOS5_DpOS6_+0x3b) [0x7fc10879d635]

  • 17 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZNSt6thread8_InvokerISt5tupleIJM13nixlUcxEngineFvvEPS2_EEE9_M_invokeIJLm0ELm1EEEEvSt12_Index_tupleIJXspT_EEE+0x47) [0x7fc10879d595]

  • 18 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZNSt6thread8_InvokerISt5tupleIJM13nixlUcxEngineFvvEPS2_EEEclEv+0x1c) [0x7fc10879d4da]

  • 19 /opt/nixl/build/src/plugins/ucx/libplugin_UCX.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJM13nixlUcxEngineFvvEPS3_EEEEE6_M_runEv+0x20) [0x7fc10879d49c]

  • 20 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fc19c596db4]

  • 21 /lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fc202c9caa4]

  • 22 /lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fc202d29c3c]

  • =================================

  • Segmentation fault (core dumped)
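
For triage, the backtrace above can be grouped by shared library to see which layer the crash sits in. This is only a minimal sketch, assuming each frame follows the `index path(symbol+0xoffset) [0xaddress]` layout seen in the log. The names `FRAME_RE`, `parse_frames` and `SAMPLE` are illustrative and not part of nixl or UCX, and `SAMPLE` holds a few frames copied from the log.

```python
import re
from collections import Counter

# Assumed frame layout, taken from the backtrace in this report:
#   "<index> <library path>(<symbol>+0x<offset>) [0x<address>]"
# The symbol may be empty when the library carries no symbol for that address.
FRAME_RE = re.compile(
    r"^\W*(?P<index>\d+)\s+(?P<lib>\S+?)"
    r"\((?P<symbol>[^+()]*)\+0x(?P<offset>[0-9a-f]+)\)"
    r"\s+\[0x(?P<addr>[0-9a-f]+)\]\s*$"
)


def parse_frames(text):
    """Return one dict per backtrace frame; non-frame lines are skipped."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append({
                "index": int(match["index"]),
                # Keep only the file name of the shared library.
                "lib": match["lib"].rsplit("/", 1)[-1],
                # Empty symbol means the frame is unnamed.
                "symbol": match["symbol"] or None,
            })
    return frames


# A few frames copied from the log above.
SAMPLE = """\
0 /usr/lib/libucs.so.0(ucs_handle_error+0x304) [0x7fc1e55bf074]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x45330) [0x7fc202c45330]
4 /usr/lib/libucp.so.0(+0x9993c) [0x7fc1085e793c]
5 /usr/lib/libuct.so.0(uct_tcp_ep_am_bcopy+0x8d) [0x7fc1fa8dd57d]
12 /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7fc1085b6b7a]
"""

frames = parse_frames(SAMPLE)
per_lib = Counter(frame["lib"] for frame in frames)
named = [frame["symbol"] for frame in frames if frame["symbol"]]

print(per_lib)  # number of frames per shared library
print(named)    # symbols that the backtrace resolved
```

Run against the full backtrace, the named frames show `nixlUcxEngine::progressFunc` calling `ucp_worker_progress`, which reaches `uct_tcp_ep_am_bcopy` in the UCX TCP transport. The faulting frame itself (frame 4, in libucp) is unnamed.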
