The same setup works fine on A100x8, but on H100x8 it fails with the errors below.
Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 30) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000001677b uct_iface_mp_chunk_alloc_inner() /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/base/uct_mem.c:467
2 0x000000000001677b uct_iface_mp_chunk_alloc() /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/base/uct_mem.c:443
3 0x0000000000052c4b ucs_mpool_grow() /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/mpool.c:266
4 0x0000000000052ec9 ucs_mpool_get_grow() /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/mpool.c:316
5 0x000000000001b418 uct_mm_iface_t_init() /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/sm/mm/base/mm_iface.c:821
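
The backtrace ends in uct_mm_iface_t_init(), the UCX shared-memory transport, while it is growing a memory pool. One possible cause, which nothing in this report confirms, is that the shared-memory filesystem ran out of space: touching a mapped page that the backing tmpfs cannot provide raises SIGBUS. A minimal sketch for checking the free space is below. It assumes the mapping is backed by /dev/shm, and the helper names `shm_free_bytes` and `report` are hypothetical, not part of UCX or HPC-X.

```python
import os
import shutil


def shm_free_bytes(path="/dev/shm"):
    """Return free bytes on the filesystem backing `path`, or None if `path` is not a directory."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def report(path="/dev/shm"):
    """Format a one-line summary of the free space under `path`."""
    free = shm_free_bytes(path)
    if free is None:
        return f"{path} not found"
    return f"{path}: {free / 2**20:.0f} MiB free"


if __name__ == "__main__":
    # Assumption: /dev/shm is the shared-memory mount used on this node.
    print(report())
```

Running this on both the A100 and H100 nodes, inside the same environment as the failing job, would show whether the two differ. If the job runs in a Docker container, the size of /dev/shm is set by `--shm-size`, and a small value there would be worth ruling out.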