Open MPI: 6.1.0a1
Open MPI repo revision: v2.x-dev-12210-g448fa674c9
and UCX:
+ ucx_info -v
# Library version: 1.20.0
# Library path: /data/ashterenli/spe-tst5/install/u_intel/2025.1.0.666/ucx-1.20.0/lib/libucs.so.0
# API headers version: 1.20.0
# Git branch '', revision 4b7a6ca
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check CC=cc CFLAGS=-Wno-error=unused-command-line-argument -Wno-error=unused-but-set-variable -Wno-error=tautological-constant-compare -qopenmp CXX=CC CXXFLAGS=-Wno-error --without-cuda --without-gdrcopy --without-fuse3 --without-go --without-java --without-rocm --disable-doxygen-doc --with-dm --with-dc --with-rc --with-ud --with-mlx5-dv --with-ib-hw-tm --with-mcpu=native --with-march=native --enable-builtin-memcpy --enable-cma --enable-logging --enable-mt --enable-optimizations --prefix=/data/ashterenli/spe-tst5/install/u_intel/2025.1.0.666/ucx-1.20.0
I am submitting a job to 11 nodes with 176 cores each (Azure HBv4 instances) with:
mpiexec --display-map --bind-to core \
-n 704 --map-by ppr:22:numa:pe=2 /usr/bin/env OMP_NUM_THREADS=2 "$exe" \
: -n 24 --map-by ppr:1:l3cache /usr/bin/env OMP_NUM_THREADS=4 "$exe" \
: -n 176 --map-by ppr:22:numa:pe=2 /usr/bin/env OMP_NUM_THREADS=2 "$exe"
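For reference, here is the per-node accounting I expect from these three app contexts. This is my own sanity-check sketch, not anything PRRTE produces; the value of 4 NUMA domains per node is an assumption inferred from `ppr:22:numa:pe=2` filling all 176 cores with 88 ranks, not queried from hwloc:

```python
# Sanity-check the per-node accounting implied by the three app contexts.
# Assumed topology: 176 cores per node, 4 NUMA domains per node
# (inferred, since ppr:22:numa:pe=2 yields 88 ranks x 2 cores = 176).

CORES_PER_NODE = 176
NUMA_PER_NODE = 4  # assumption, not queried from hwloc

def ppr_numa(procs_per_numa, pe):
    """Ranks and cores consumed per node by --map-by ppr:N:numa:pe=K."""
    ranks = procs_per_numa * NUMA_PER_NODE
    return ranks, ranks * pe

# App 0: -n 704 --map-by ppr:22:numa:pe=2
ranks_per_node, cores_used = ppr_numa(22, 2)
assert ranks_per_node == 88 and cores_used == CORES_PER_NODE
assert 704 % ranks_per_node == 0
print("app 0:", 704 // ranks_per_node, "fully packed nodes")  # 8 nodes

# App 2: -n 176 --map-by ppr:22:numa:pe=2 -> 2 more fully packed nodes
assert 176 // ranks_per_node == 2

# App 1: -n 24 --map-by ppr:1:l3cache -> should fit on the remaining node
print("total nodes:", 704 // ranks_per_node + 1 + 176 // ranks_per_node)  # 11
```

So the request should exactly cover the 11-node allocation, which matches the `Num procs: 88` shown per node below.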
and getting:
PMIX ERROR: PMIX_ERR_UNPACK_INADEQUATE_SPACE in file runtime/data_type_support/prte_dt_unpacking_fns.c at line 222
PMIX ERROR: PMIX_ERR_UNPACK_INADEQUATE_SPACE in file base/odls_base_default_fns.c at line 558
https://github.com/openpmix/prrte/blob/master/src/mca/odls/base/odls_base_default_fns.c#L558
Is there something wrong with my mpiexec line?
Also, --display-map shows some odd results, which I hope are just printing glitches.
For the first node in the allocation everything is correct:
Data for node: tst5-hbv4-1 Num slots: 176 Max slots: 0 Num procs: 88
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 0 Bound: package[0][core:L0-1]
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 1 Bound: package[0][core:L2-3]
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 2 Bound: package[0][core:L4-5]
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 3 Bound: package[0][core:L6-7]
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 4 Bound: package[0][core:L8-9]
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 5 Bound: package[0][core:L10-11]
...
but for the second and all subsequent nodes, the core numbers appear corrupted:
Data for node: tst5-hbv4-2 Num slots: 176 Max slots: 0 Num procs: 88
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 88 Bound: package[0][core:L0-1-175]
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 89 Bound: package[0][core:L2-3-175]
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 90 Bound: package[0][core:L4-5-175]
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 91 Bound: package[0][core:L6-7-175]
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 92 Bound: package[0][core:L8-9-175]
Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 93 Bound: package[0][core:L10-1175]
...
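To separate a pure display glitch from a genuinely corrupted map, I checked the `Bound:` strings mechanically. This is my own sketch (the format rules are inferred from the output above, not from the prte source): a well-formed binding like `package[0][core:L0-1]` should have at most two core ids, both below the node's 176 cores.

```python
import re

CORES_PER_NODE = 176

def check(bound):
    """Classify a Bound: string as 'ok' or 'MALFORMED'.

    Rules inferred from the --display-map output: the core range should
    contain at most two logical-core ids, each < CORES_PER_NODE.
    """
    m = re.search(r"core:L([\d-]+)\]", bound)
    if not m:
        return "MALFORMED"
    parts = m.group(1).split("-")
    if len(parts) > 2:
        return "MALFORMED"          # e.g. L0-1-175: extra trailing segment
    if any(int(p) >= CORES_PER_NODE for p in parts):
        return "MALFORMED"          # e.g. L10-1175: fused digits
    return "ok"

samples = [
    "package[0][core:L0-1]",        # rank 0 on the first node: fine
    "package[0][core:L0-1-175]",    # rank 88: extra "-175" segment
    "package[0][core:L10-1175]",    # rank 93: "L10-11" fused with "75"?
]
for s in samples:
    print(check(s), s)
```

Every line for nodes 2 through 11 fails one of the two rules, so the glitch is at least consistent across the whole map.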
Thank you
Anton