Failed to run on H100 GPU with tensor para=8 #166

@sfc-gh-zhwang

The same setup works fine on A100x8, but on H100x8 it fails with the error below.

Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:     30) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000001677b uct_iface_mp_chunk_alloc_inner()  /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/base/uct_mem.c:467
 2 0x000000000001677b uct_iface_mp_chunk_alloc()  /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/base/uct_mem.c:443
 3 0x0000000000052c4b ucs_mpool_grow()  /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/mpool.c:266
 4 0x0000000000052ec9 ucs_mpool_get_grow()  /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/mpool.c:316
 5 0x000000000001b418 uct_mm_iface_t_init()  /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/sm/mm/base/mm_iface.c:821
