
PMIX_ERR_UNPACK_INADEQUATE_SPACE in file runtime/data_type_support/prte_dt_unpacking_fns.c at line 222 #2332

@ashterenli

                Open MPI: 6.1.0a1
  Open MPI repo revision: v2.x-dev-12210-g448fa674c9

and

+ ucx_info -v
# Library version: 1.20.0
# Library path: /data/ashterenli/spe-tst5/install/u_intel/2025.1.0.666/ucx-1.20.0/lib/libucs.so.0
# API headers version: 1.20.0
# Git branch '', revision 4b7a6ca
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check CC=cc CFLAGS=-Wno-error=unused-command-line-argument -Wno-error=unused-but-set-variable -Wno-error=tautological-constant-compare -qopenmp CXX=CC CXXFLAGS=-Wno-error --without-cuda --without-gdrcopy --without-fuse3 --without-go --without-java --without-rocm --disable-doxygen-doc --with-dm --with-dc --with-rc --with-ud --with-mlx5-dv --with-ib-hw-tm --with-mcpu=native --with-march=native --enable-builtin-memcpy --enable-cma --enable-logging --enable-mt --enable-optimizations --prefix=/data/ashterenli/spe-tst5/install/u_intel/2025.1.0.666/ucx-1.20.0

Submitting a job to 11 nodes with 176 cores each (Azure HBv4 instances) with:

mpiexec --display-map --bind-to core \
        -n 704 --map-by ppr:22:numa:pe=2 /usr/bin/env OMP_NUM_THREADS=2 "$exe" \
      : -n  24 --map-by ppr:1:l3cache    /usr/bin/env OMP_NUM_THREADS=4 "$exe" \
      : -n 176 --map-by ppr:22:numa:pe=2 /usr/bin/env OMP_NUM_THREADS=2 "$exe"
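
As a rough sanity check, the cores implied by the three app contexts can be tallied against the allocation. This is only a sketch: it assumes each process occupies as many cores as its OMP_NUM_THREADS value, which for the l3cache context is a guess because that context sets no pe= value. The variable names are illustrative and not part of any Open MPI tooling.

```python
# Tally of the three app contexts from the mpiexec line above.
# cores_per_proc is ASSUMED equal to OMP_NUM_THREADS; for the l3cache
# context this is a guess, because that context does not set pe=.
NODES = 11
CORES_PER_NODE = 176

app_contexts = [
    {"nprocs": 704, "cores_per_proc": 2},  # ppr:22:numa:pe=2
    {"nprocs": 24, "cores_per_proc": 4},   # ppr:1:l3cache (assumed 4 cores)
    {"nprocs": 176, "cores_per_proc": 2},  # ppr:22:numa:pe=2
]

total_procs = sum(a["nprocs"] for a in app_contexts)
cores_requested = sum(a["nprocs"] * a["cores_per_proc"] for a in app_contexts)
cores_available = NODES * CORES_PER_NODE

print(total_procs, cores_requested, cores_available)  # 904 1856 1936
```

Under those assumptions the request (1856 cores) fits within the allocation (1936 cores), so oversubscription alone would not obviously explain the error.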

and getting

PMIX ERROR: PMIX_ERR_UNPACK_INADEQUATE_SPACE in file runtime/data_type_support/prte_dt_unpacking_fns.c at line 222
PMIX ERROR: PMIX_ERR_UNPACK_INADEQUATE_SPACE in file base/odls_base_default_fns.c at line 558

https://github.com/openpmix/prrte/blob/master/src/runtime/data_type_support/prte_dt_unpacking_fns.c#L222

https://github.com/openpmix/prrte/blob/master/src/mca/odls/base/odls_base_default_fns.c#L558

Is there something wrong with my mpiexec line?

Also, --display-map shows some odd results, which I hope are just printing glitches.
For the first node in the allocation all is correct:

Data for node: tst5-hbv4-1      Num slots: 176  Max slots: 0    Num procs: 88
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 0 Bound: package[0][core:L0-1]
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 1 Bound: package[0][core:L2-3]
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 2 Bound: package[0][core:L4-5]
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 3 Bound: package[0][core:L6-7]
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 4 Bound: package[0][core:L8-9]
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 5 Bound: package[0][core:L10-11]
...

but for the second, and all other nodes, the core numbers seem corrupted:

Data for node: tst5-hbv4-2      Num slots: 176  Max slots: 0    Num procs: 88
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 88 Bound: package[0][core:L0-1-175]
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 89 Bound: package[0][core:L2-3-175]
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 90 Bound: package[0][core:L4-5-175]
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 91 Bound: package[0][core:L6-7-175]
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 92 Bound: package[0][core:L8-9-175]
        Process jobid: prterun-tst5-hbv4-1-3896068@1 App: 0 Process rank: 93 Bound: package[0][core:L10-1175]
...
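
To see how many lines of a saved --display-map output are affected, a small filter along these lines might help. It is a sketch that assumes the well-formed shape is `core:L<first>-<last>` as printed for the first node, and that logical core indices never exceed 175 on a 176-core node. The function name and the `max_core` parameter are hypothetical.

```python
import re

# Expected shape, taken from the first node's output: core:L<first>-<last>]
WELL_FORMED = re.compile(r"core:L(\d+)-(\d+)\]")

def malformed_bindings(lines, max_core=175):
    """Return Bound: lines whose core range is not a valid <first>-<last> pair."""
    bad = []
    for line in lines:
        if "Bound:" not in line:
            continue  # not a process line
        m = WELL_FORMED.search(line)
        if m is None:
            bad.append(line.strip())  # e.g. an extra "-175" suffix
            continue
        first, last = int(m.group(1)), int(m.group(2))
        if first > last or last > max_core:
            bad.append(line.strip())  # e.g. "L10-1175"
    return bad

sample = [
    "Process rank: 0 Bound: package[0][core:L0-1]",
    "Process rank: 88 Bound: package[0][core:L0-1-175]",
    "Process rank: 93 Bound: package[0][core:L10-1175]",
]
flagged = malformed_bindings(sample)
print(len(flagged))  # 2
```

The upper bound check is there because a string like `L10-1175` still matches the two-number pattern and would otherwise pass unnoticed.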

Thank you

Anton
