Describe the bug
Compute Sanitizer reports CUDA_ERROR_INVALID_CONTEXT (error 201) when calling MPI_Init. For all subsequent CUDA API calls it reports CUDA_ERROR_INVALID_VALUE (error 1).
Relevant compute-sanitizer output for MPI_Init.
========= COMPUTE-SANITIZER
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxSetFlags.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: [0x9026] in libuct_cuda.so.0
========= Host Frame: uct_md_open [0x1c74a] in libuct.so.0
========= Host Frame: [0x1afc4] in libucp.so.0
========= Host Frame: [0x1d058] in libucp.so.0
========= Host Frame: ucp_init_version [0x27058] in libucp.so.0
========= Host Frame: mca_pml_ucx_open [0x28e25a] in libmpi.so.40
========= Host Frame: mca_base_framework_components_open [0x37cc1] in libopen-pal.so.80
========= Host Frame: [0x284417] in libmpi.so.40
========= Host Frame: mca_base_framework_open [0x38e90] in libopen-pal.so.80
========= Host Frame: [0x8cf8c] in libmpi.so.40
========= Host Frame: ompi_mpi_instance_init [0x8e3e5] in libmpi.so.40
========= Host Frame: ompi_mpi_init [0x8e5f2] in libmpi.so.40
========= Host Frame: MPI_Init [0xce48a] in libmpi.so.40
========= Host Frame: main in k-medoids.cpp:139 [0x145fd] in k-medoids
=========
Relevant compute-sanitizer output for other MPI functions.
========= Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuMemRetainAllocationHandle.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: [0x2a19] in mca_accelerator_cuda.so
========= Host Frame: mca_coll_cuda_allreduce [0x145fea] in libmpi.so.40
========= Host Frame: ompi_comm_split_type [0x5b0f6] in libmpi.so.40
========= Host Frame: MPI_Comm_split_type [0xabf2a] in libmpi.so.40
========= Host Frame: KMedoidsCudaMPI<7, DBSignatureListMapped<7, float>, float>::KMedoidsCudaMPI(BinaryDistanceFunctorSQFD<7, float, float, 2000>&, float, unsigned long, unsigned long, unsigned long) in kmedoids_cuda_mpi.hpp:349 [0x263d8] in k-medoids
========= Host Frame: void run<7, float>(bpp::ProgramArguments const&) in k-medoids.cpp:83 [0x20050] in k-medoids
========= Host Frame: main in k-medoids.cpp:160 [0x146da] in k-medoids
=========
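To see at a glance which CUDA API calls fail and with which error across a longer run, the "Program hit" lines can be tallied. The following is only a minimal sketch, assuming the line format shown in the output above; `SanitizerHit`, `parse_hit`, and `tally_hits` are hypothetical helper names and are not part of compute-sanitizer, UCX, or Open MPI.

```cpp
#include <istream>
#include <map>
#include <optional>
#include <regex>
#include <sstream>
#include <string>

// One parsed "Program hit ..." line from compute-sanitizer output (hypothetical helper type).
struct SanitizerHit {
    std::string error_name;  // e.g. CUDA_ERROR_INVALID_CONTEXT
    int error_code;          // e.g. 201
    std::string api_call;    // e.g. cuCtxSetFlags
};

// Parses a single log line. Returns std::nullopt for lines that are not
// "Program hit" lines (host frames, separators, and so on).
std::optional<SanitizerHit> parse_hit(const std::string& line) {
    // Assumed format, copied from the report above:
    //   Program hit NAME (error N) due to "..." on CUDA API call to FUNC.
    static const std::regex pattern(
        R"(Program hit (\w+) \(error (\d+)\).* on CUDA API call to (\w+))");
    std::smatch m;
    if (!std::regex_search(line, m, pattern)) {
        return std::nullopt;
    }
    return SanitizerHit{m[1].str(), std::stoi(m[2].str()), m[3].str()};
}

// Counts hits per "ERROR_NAME in api_call" key over a whole log stream.
std::map<std::string, int> tally_hits(std::istream& log) {
    std::map<std::string, int> counts;
    std::string line;
    while (std::getline(log, line)) {
        if (auto hit = parse_hit(line)) {
            ++counts[hit->error_name + " in " + hit->api_call];
        }
    }
    return counts;
}
```

Feeding the saved sanitizer log through `tally_hits` (for example from a `std::ifstream`) gives one counter per error and API call pair, which makes it easier to confirm that only cuCtxSetFlags fails with error 201 while the later failures are all error 1.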
Steps to Reproduce
- mpirun -n 1 compute-sanitizer ./app
- app needs to call MPI_Init
- OpenMPI implementation - release 1.19.0
- UCX configure flags:
# Configured with: --prefix=/usr --sysconfdir=/etc --with-cuda=/opt/cuda --with-rocm=/opt/rocm --with-verbs --with-rc --with-ud --with-dc --with-mlx5-dv --enable-mt
Setup and versions
- CachyOS 6.18.3-2 + x86_64
- For GPU related issues:
- GPU type - NVIDIA GeForce RTX 3050 Ti Mobile / Max-Q
- Cuda:
- CUDA 13.1.0
- Peer-direct is not loaded
Additional information (depending on the issue)
- OpenMPI v5.0.9
ucx_info -d
```
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# rkey_ptr is supported
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: self
# Device: memory
# Type: loopback
# System device:
#
# capabilities:
# bandwidth: 0.00/ppn + 19360.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: tcp
# Component: tcp
# memory types:
#
# Transport: tcp
# Device: lo
# Type: network
# System device:
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: wlan0
# Type: network
# System device: wlan0 (0)
#
# capabilities:
# bandwidth: 11.32/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: sysv
# Device: memory
# Type: intra-node
# System device:
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: posix
# Component: posix
# allocate: <= 7831040K
# remote key: 24 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: posix
# Device: memory
# Type: intra-node
# System device:
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: cuda_ipc
# Component: cuda_ipc
# register: unlimited, cost: 0 nsec
# remote key: 192 bytes
# memory invalidation is supported
# memory types: cuda (access,reg,cache)
#
# Transport: cuda_ipc
# Device: cuda
# Type: intra-node
# System device:
#
# capabilities:
# bandwidth: 300000.00/ppn + 0.00 MB/sec
# latency: 1000 nsec
# overhead: 7000 nsec
# put_zcopy: unlimited, up to 1 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: <= 0, up to 1 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure
#
#
# Memory domain: cuda_cpy
# Component: cuda_cpy
# allocate: unlimited
# register: unlimited, cost: 0 nsec
# memory types: host (access,reg), cuda (access,alloc,reg,detect), cuda-managed (access,alloc,reg,cache,detect)
#
# Transport: cuda_copy
# Device: cuda
# Type: accelerator
# System device:
#
# capabilities:
# bandwidth: 10000.00/ppn + 0.00 MB/sec
# latency: 8000 nsec
# overhead: 0 nsec
# put_short: <= 4294967295
# put_zcopy: unlimited, up to 1 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_short: <= 4294967295
# get_zcopy: unlimited, up to 1 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: none
#
#
# Memory domain: cma
# Component: cma
# memory types:
#
# Transport: cma
# Device: memory
# Type: intra-node
# System device:
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
```
I have found a similar-sounding issue: #9493