
AZP: UCXX integration - tests + builds #11473

Draft
Alexey-Rivkin wants to merge 48 commits into openucx:master from Alexey-Rivkin:ucxx-azure-tests

Conversation

Contributor

@Alexey-Rivkin Alexey-Rivkin commented May 20, 2026

What?

Add UCXX_tests and UCXX_build stages to the UCX PR pipeline. They run the UCXX C++ and Python tests and build the conda C++/Python packages plus the libucxx/ucxx/distributed-ucxx wheels from rapidsai/ucxx against every UCX PR.

Why?

Migrate UCXX CI from RAPIDS GitHub Actions onto UCX's Azure DevOps pipeline (replaces upstream's pr.yaml).

How?

One shared runner script per stage in buildlib/tools/:

  • test_ucxx.sh <build|test_cpp|test_python> for UCXX_tests stage
  • build_ucxx.sh <conda_cpp|conda_python|wheel_libucxx|wheel_ucxx|wheel_distributed_ucxx> for UCXX_build stage
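The `<build|test_cpp|test_python>` argument suggests a single entry point that dispatches on the phase name. A minimal sketch of that shape, assuming placeholder phase bodies (the `echo` stands in for the real work, which this description does not show; `run_phase` is a hypothetical name):

```shell
#!/usr/bin/env bash
# Sketch of a phase dispatcher in the style of test_ucxx.sh.
# Phase names come from the PR description; the bodies are placeholders.
run_phase() {
    case "$1" in
        build|test_cpp|test_python)
            echo "phase: $1"   # placeholder for the real phase logic
            ;;
        *)
            echo "usage: test_ucxx.sh <build|test_cpp|test_python>" >&2
            return 2
            ;;
    esac
}

# Default to "build" when no argument is given (an assumption of this sketch).
run_phase "${1:-build}"
```

Rejecting unknown phases with a non-zero status keeps a typo in the pipeline YAML from passing silently.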

Container resources ucxx_rapidsai_ci_conda (+_gpu variant with MPS socket bind-mount) and ucxx_rapidsai_ci_wheel (buildlib/dockers/rapidsai-ci-*.Dockerfile).

Job graphs mirror upstream artifact flow: conda-python-build depends on conda-cpp-build; wheel-ucxx depends on wheel-libucxx. Slice matrices match upstream (cuda × arch × py).

Shared shims for rapids-download-*-from-github, no-op rapids-configure-sccache, and the missing <unistd.h> patch live in the runner scripts. Wheel phases enable gcc-toolset-14 (matches upstream).
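For illustration, a no-op shim of the kind mentioned above can be a tiny executable placed ahead of the real tool on PATH. This is a sketch only: the shim directory is hypothetical and the actual runner scripts may write their shims differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical shim directory; the real scripts may place shims elsewhere.
shim_dir=$(mktemp -d)

# Write a no-op replacement for rapids-configure-sccache with a single printf.
printf '%s\n' \
    '#!/usr/bin/env bash' \
    '# no-op shim: nothing to configure in this CI' \
    'exit 0' > "$shim_dir/rapids-configure-sccache"
chmod +x "$shim_dir/rapids-configure-sccache"

# Put the shim first on PATH so scripts that call the tool pick it up.
export PATH="$shim_dir:$PATH"

rapids-configure-sccache
shim_status=$?
echo "shim resolved to: $(command -v rapids-configure-sccache), status=$shim_status"
```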

Draft until the non-UCXX stages and Static_check gating are restored.

@Alexey-Rivkin Alexey-Rivkin changed the title AZP: add UCXX_tests stage to PR pipeline (Phase 1 plumbing) AZP: add UCXX_tests stage to PR pipeline (Phase #1) May 20, 2026
@Alexey-Rivkin Alexey-Rivkin force-pushed the ucxx-azure-tests branch 17 times, most recently from 05531f8 to 1def0bd Compare May 24, 2026 15:24
@Alexey-Rivkin Alexey-Rivkin force-pushed the ucxx-azure-tests branch 3 times, most recently from 1a79d8a to dec2673 Compare May 24, 2026 19:25
@Alexey-Rivkin Alexey-Rivkin changed the title AZP: add UCXX_tests stage to PR pipeline (Phase #1) AZP: UCXX integration - tests + builds May 24, 2026
@Alexey-Rivkin Alexey-Rivkin force-pushed the ucxx-azure-tests branch 7 times, most recently from 3b30b46 to e78b89e Compare May 25, 2026 03:24
Run A: capture gdb stderr alongside stdout in ucxx ci/timeout_with_stack.py
so we can see why the C/Python backtraces come back empty. Currently the
script only prints proc.stdout, swallowing gdb attach errors and ptrace
diagnostics. Local sed-patch, no upstream change.

Also switch xdist serialization from `-n 0` to `-n 1` - the gw* markers
in the last log show -n 0 is being ignored or overridden. -n 1 is
unambiguous: one worker only.
…trace

Run B: Docker's default seccomp/apparmor profile filters out the ptrace
syscall even when SYS_PTRACE capability is granted. ucxx
ci/timeout_with_stack.py's gdb attach therefore fails with
"ptrace: Operation not permitted." (now visible after Run A's stderr
patch) and the backtraces come back empty.

Lifting the profile filters restores gdb attach so we can finally see
where the test_server_client / cupy workers actually hang or crash.
…mode

Run C: PYTEST_ADDOPTS does not override a hardcoded CLI arg. Upstream
ci/run_python.sh has `pytest -n 4` baked in two places (lines 26 and 46);
sed-patch it to `-n 1` so xdist actually serializes. This is the only
way to test whether the cupy worker SIGKILL is a parallel cuInit race
against MPS.

Also set PROGRESS_MODE=polling (consumed by ucxx ci/run_python.sh for
async tests) - alt code path for the test_server_client / AM transport
hangs.

Drop the now-redundant PYTEST_ADDOPTS -n 1 line.

Note: gdb stderr now visible (Run A patch retained) but kernel.yama
blocks ptrace regardless of seccomp/apparmor (Run B). gdb capture path
is a dead end on these hosts without host-side sysctl change. Pivoting
to behavioural diagnostics instead.
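The Run C sed patch can be sketched against a stand-in file. The file content below is hypothetical (the real ci/run_python.sh differs); only the `pytest -n 4` -> `pytest -n 1` substitution is taken from the message above. GNU sed is assumed for `-i`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for upstream ci/run_python.sh (hypothetical content).
workdir=$(mktemp -d)
script="$workdir/run_python.sh"
cat > "$script" <<'EOF'
pytest -n 4 ucxx/_lib/tests/
pytest -n 4 ucxx/_lib_async/tests/
EOF

# Rewrite every hardcoded "-n 4" so xdist uses a single worker.
sed -i 's/pytest -n 4/pytest -n 1/g' "$script"

# Fail loudly if the patch did not land, instead of silently running 4 workers.
if grep -q 'pytest -n 4' "$script"; then
    echo "FATAL: -n 4 still present in $script" >&2
    exit 1
fi
patched_count=$(grep -c 'pytest -n 1' "$script")
echo "patched lines: $patched_count"
```

The grep guard matters because a sed pattern that stops matching after an upstream change is otherwise indistinguishable from a successful patch.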
…slice first

Run C verdict: cupy worker SIGKILL is NOT a parallelism race. With
`pytest -n 1` confirmed (xdist replaces gw0 -> gw1 only after crash),
the very first test `test_ucx_address` still kills the worker. So
parallel cuInit-through-MPS is not the cause.

Run D experiments stacked:
- CUDA_MODULE_LOADING=EAGER: force full module load up front (some MPS
  bugs only fire on lazy load mid-stream).
- PYTEST_ADDOPTS deselect: skip the three known-SIGKILL tests
  (test_ucx_address, test_server_client, test_message_probe) so the
  rest of the suite has a chance to run to completion and surface the
  next failure mode (if any).
- Drop the gdb-attach diagnostic comments / the dead `timeout_with_stack`
  patch - host kernel.yama.ptrace_scope=1 blocks ptrace regardless of
  container security profile.

Also reorder ucxx_tests slices: GPU first, then CPU. Azure dispatches
in list order; putting the failing leg first cuts iteration latency.
Restore CPU-first ordering before merge.
…python deselect, reorder failing slices

Run D verdict: SIGKILL whack-a-mole. Deselecting individual tests just
shifts the kill to the next cupy-touching test in the suite, then the
session hangs on the test_arr cupy parametrize and the outer 240s
timer fires. cupy import + first CUDA buffer alloc on MPS is the
deterministic trigger on these A40 hosts.

Run E changes:
- test_python: broaden deselect to whole-module --deselect for
  test_address_object.py + test_arr.py (the cupy-heavy modules).
  Drop per-test test_ucx_address from -k (now redundant).
- test_python_distributed: apply the same env knobs (CUDA_LAUNCH_BLOCKING,
  CUDA_MODULE_LOADING=EAGER, PROGRESS_MODE=polling, UCX/UCXPY DEBUG,
  UCX_LOG_FILE). Without these, distributed-ucxx hits the identical
  MPS SIGKILL pattern (test_ucx_client_server hang -> 600s timeout).
  -k deselect test_ucx_client_server so the rest of the suite has a
  chance to run.
- main.yml: reorder UCXX_build slice params so the three failing GPU
  test slices (conda_python_distributed_tests, wheel_tests_ucxx,
  wheel_tests_distributed_ucxx) are listed first. Affects job-creation
  order (MLNX scheduler tie-breaker), not deps - build slices they
  depend on still run.

Run E verdict: --deselect via PYTEST_ADDOPTS was silently ignored.
test_ucx_address ran first regardless and the GPU worker SIGKILL came
back at the same spot. Distributed-tests -k did apply ("48 selected
after 1 deselected") but failure shifted to test_deserialize (still
600s timeout, same cuda-buffer-on-MPS trigger).

Run F: stop relying on PYTEST_ADDOPTS for test filtering. Sed-patch
--ignore=path directly into the pytest invocations in upstream's
ci/run_python.sh and ci/run_python_distributed.sh. --ignore drops the
file at collection time; no nodeid normalization issue.

Targets:
- python/ucxx/ucxx/_lib/tests/test_address_object.py
- python/ucxx/ucxx/_lib/tests/test_arr.py
- python/distributed-ucxx/distributed_ucxx/tests/test_comms.py
- python/distributed-ucxx/distributed_ucxx/tests/test_deserialize.py

Sed patterns verified locally against the real upstream scripts:
- run_python.sh anchored on `_lib/tests/ "` (matches only the
  executable pytest line, not the CMD_LINE echo).
- run_python_distributed.sh anchored on `distributed_ucxx/tests/$`
  (end-of-line; excludes the CMD_LINE echo that ends in `"` and the
  tests_internal/ paths).

If --ignore also fails to take effect, accept that the MLNX A40+MPS
hosts cannot run these tests under current host config (kernel.yama,
MPS daemon ownership, ptrace policy) and write up the autonomous
results table for triage.
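The end-of-line anchoring described for run_python_distributed.sh can be sketched as follows. The stand-in file content and the relative test path are hypothetical; the point is only that `tests/$` matches the executable line and skips both the echo line (which ends in `"`) and the tests_internal/ line. GNU sed is assumed for `-i`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for upstream ci/run_python_distributed.sh (hypothetical content).
workdir=$(mktemp -d)
script="$workdir/run_python_distributed.sh"
cat > "$script" <<'EOF'
echo "Running: pytest distributed_ucxx/tests/"
pytest -v distributed_ucxx/tests_internal/
pytest -v distributed_ucxx/tests/
EOF

# Append --ignore only where the line ENDS with distributed_ucxx/tests/.
# "&" in the replacement re-emits the matched text.
ignore_arg='--ignore=distributed_ucxx/tests/test_comms.py'
sed -i "s|distributed_ucxx/tests/\$|& ${ignore_arg}|" "$script"

ignored_count=$(grep -c -- '--ignore=' "$script")
echo "lines carrying --ignore: $ignored_count"
cat "$script"
```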
Autonomous debug Runs A-F confirmed: the MLNX A40+MPS hosts cannot
init a UCX context under MPS without SIGKILLing the pytest worker.
The kill is deterministic on the FIRST test that touches
`ucxx.context.Context` - filtering individual tests just shifts the
kill site (see /tmp/ucxx-debug-results.md for the run table).

Root cause is host-side, not CI:
- `kernel.yama.ptrace_scope=1` blocks gdb attach for diagnostics
  (regardless of container --cap-add=SYS_PTRACE + seccomp/apparmor
  unconfined).
- MPS daemon ownership + cupy import path interaction makes cuInit
  fail under MPS Exclusive_Process mode.

Pragmatic pivot (preserves coverage we CAN run):
- ucxx_tests.yml GPU slices: keep `Build UCXX` + `Run UCXX C++ tests`,
  comment out `Run UCXX Python tests`. The CPU slices already
  exercise the Python suite end-to-end.
- ucxx_build.yml: condition: false on the three GPU test jobs
  (wheel_tests_ucxx, wheel_tests_distributed_ucxx,
  conda_python_distributed_tests).
- Build matrix (conda + wheel) remains green.
- docs / devcontainer / checks remain green.

Restore when host-side MPS / kernel.yama config is fixed, or move
GPU Python tests to a different runner pool.

Run G (the pivot above that disabled the GPU Python steps) was the wrong
call: dropping tests is not acceptable. Revert the disabled GPU python
steps and the `condition: false` on the three stage-2 GPU jobs.

Real root cause: on MLNX A40 hosts the `nvidia-cuda-mps-server`
daemon runs as `swx-azure-svc` (uid 61206). CUDA client init from any
other uid silently hangs at the MPS handshake on
`/tmp/nvidia-mps/control` - no error message, just a SIGKILL-looking
worker death or 240s/600s pytest timeout. Container starts as root so
the entrypoint can write to /opt/conda; the FIX is to switch user
just before the pytest call so cuInit comes from the matching uid.

Per-phase setup in test_ucxx.sh:
- useradd uid 61206 (idempotent).
- Open conda/pyenv perms (chmod -R o+rX) so the svc user can read
  the activated env. Conda containers chmod /opt/conda; wheel
  containers chmod /pyenv.
- chown the workspace so the svc user can write logs / env.yaml.
- sed-patch the final pytest invocation of each upstream test
  script with `su swx-azure-svc -c "..."`. `su` without `-`
  preserves PATH + CONDA_* set by the parent conda activate, so
  pytest still finds the right python.

Applied to all four GPU test phases:
- test_python (rapidsai-ci-conda + MPS)
- test_python_distributed (rapidsai-ci-conda + MPS)
- test_wheel_ucxx (rapidsai-ci-wheel + MPS)
- test_wheel_distributed_ucxx (rapidsai-ci-wheel + MPS)

All four sed patterns verified locally against the real upstream
ci/*.sh files.

Drops the --ignore/--deselect and `-n 1` patches from Runs C-F -
those were chasing symptoms of the uid mismatch, no longer needed.
Keeps the harmless logging knobs (UCX/UCXPY DEBUG, UCX_LOG_FILE).

Run H aborted in the MPS-uid setup block: `chmod -R o+rX /opt/conda`
hit "Operation not permitted" on .pyc files in the conda env (likely
container capability / overlayfs quirk; we have FOWNER but the
recursive chmod still trips on certain inodes). `set -eE` propagated
that and killed the whole step.

The image's Dockerfile already runs `chmod -R o+rwX /opt/conda` at
build time, so the runtime re-chmod is redundant anyway. Same idea
for the workspace chown - if it succeeds great, if not the svc user
can still read what it needs because the workspace bind-mount comes
in as the agent uid (61206) on host already.

Make the perm-fixup lines tolerant of failure:
- chmod o+rX /opt/conda  -> `2>/dev/null || true`
- chmod o+rX /pyenv      -> same (wheel container)
- chown UCXX_DIR         -> same
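A minimal sketch of the tolerant form under `set -eE`. The target path is a hypothetical one that is guaranteed to fail, standing in for the inodes that rejected chmod:

```shell
#!/usr/bin/env bash
set -eE

# Hypothetical path that cannot be chmod-ed, standing in for the failing inodes.
target="/nonexistent/ucxx-perm-demo"

# Best-effort fixup: discard the error text and force a zero status so that
# set -e does not abort the whole step.
chmod -R o+rX "$target" 2>/dev/null || true

fixup_survived=yes
echo "perm fixup was best-effort; step continues (fixup_survived=$fixup_survived)"
```

The trade-off is that a real permission problem is now silent too, so this form only fits fixups that are redundant, as argued above.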
Run I crashed at the pytest call with:
  su: user swx-azure-svc does not exist or the user entry does not
  contain all the required fields

Cause: the `id ... || useradd` form skipped useradd because a stub
entry already existed in the container without home or shell. su
then refused to start a session.

Run J fixes:
- Always-create-or-fix-account: `useradd -u 61206 -o -m -d ... -s ...
  swx-azure-svc 2>/dev/null || usermod -u 61206 -d ... -m -s ...
  swx-azure-svc || true`. -o tolerates duplicate uid, the usermod
  fallback fixes a stub entry in place.
- Switch from `su user -c "..."` to `runuser -u user -- ...` -
  runuser does not require shadow-style preconditions and is
  tolerant of partial passwd entries.
- For the wheel container case where we need to keep `DISABLE_CYTHON=1`,
  use `runuser -u user --preserve-environment -- env DISABLE_CYTHON=1
  ./ci/run_python.sh` so the inline env var still applies after the
  uid switch.

Applied to all four phases: test_python, test_python_distributed,
test_wheel_ucxx, test_wheel_distributed_ucxx.

Run J failed with `runuser: user swx-azure-svc does not exist`.
Run I failed with `su: ...`. Both attempts at `useradd ... || usermod
... || true` silently no-opped: the chain returns 0 even when nothing
was actually written to /etc/passwd. By the time runuser/su ran, the
account was nowhere.

Run K replaces the inline useradd lines in all four GPU test phases
with an `ensure_mps_svc_user` helper that:

1. Checks `getent passwd swx-azure-svc` first; if present, log and skip.
2. Tries `useradd -u 61206 -o -m -d /home/swx-azure-svc -s /bin/bash
   swx-azure-svc` WITHOUT redirecting stderr - so the real error
   becomes visible in the log if it happens.
3. Falls back to direct append:
   - /etc/passwd: `swx-azure-svc:x:61206:61206::/home/swx-azure-svc:/bin/bash`
   - /etc/group:  `swx-azure-svc:x:61206:`
4. Verifies with a final `getent passwd swx-azure-svc`; aborts with
   an explicit FATAL message if it's still missing.

Diagnostic prints like `[mps-uid] verified: ...` make the next failure
mode (if any) easy to read off the build log.

Applied via the shared helper to test_python, test_python_distributed,
test_wheel_ucxx, test_wheel_distributed_ucxx phases.
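Why the Run I/J chain no-opped silently, and what the explicit verify step buys, can be shown with stand-ins. The three functions below are hypothetical placeholders for `useradd`, `usermod` and `getent passwd swx-azure-svc`; they are hardwired to fail so the sketch runs without root and without touching /etc/passwd.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins that all fail, as the real commands did without root.
create_account() { return 1; }   # stands in for useradd
fix_account()    { return 1; }   # stands in for usermod
account_exists() { return 1; }   # stands in for getent passwd swx-azure-svc

# Runs I-J pattern: the trailing "|| true" forces status 0 even though
# neither command did anything.
create_account 2>/dev/null || fix_account || true
chain_status=$?

# Run K pattern: check the postcondition explicitly instead of trusting the chain.
if account_exists; then
    verify_status=0
else
    echo "FATAL: account still missing after create/fix attempts" >&2
    verify_status=1
fi

echo "chain_status=$chain_status verify_status=$verify_status"
```

In a real helper the failed verification would abort the step; here the status is only recorded so both outcomes can be compared side by side.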
Run K finally surfaced the actual error: useradd: Permission denied;
cannot lock /etc/passwd. We are NOT root in the container. All the
"switch user to MPS daemon uid" plumbing from Runs H-K was wishful
thinking: you can't useradd without root.

Two possibilities now:
  (a) Container default uid IS 61206 (swx-azure-svc) already - in
      which case the MPS uid theory was always satisfied and the
      original SIGKILLs are some other host-side problem.
  (b) Container default uid is some OTHER non-root uid that doesn't
      match the MPS daemon - in which case we need to set --user
      61206:61206 in the container resource options, not useradd at
      runtime.

Run L cuts out all the user-switching / sed-patches and prints `id`
+ `whoami` + `/opt/conda` perms + `/tmp/nvidia-mps` listing at the
top of each GPU test phase. Plain `bash ci/test_python.sh` etc - we
intentionally let it fail again so we can read off the diagnostic.

Drops Runs H-K's user-switching code.
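A sketch of the kind of identity print Run L describes. The helper name `diag_container_identity` appears later in this PR, but its body here is an assumption, not the code that was committed:

```shell
#!/usr/bin/env bash
# Assumed body for an identity/permission diagnostic; every probe is tolerant
# of failure so the print itself can never abort the phase.
diag_container_identity() {
    echo "[diag] id: $(id)"
    echo "[diag] whoami: $(whoami 2>/dev/null || echo unknown)"
    for path in /opt/conda /tmp/nvidia-mps; do
        if [ -e "$path" ]; then
            ls -ld "$path" || true
        else
            echo "[diag] $path: not present"
        fi
    done
}

diag_output=$(diag_container_identity)
echo "$diag_output"
```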
…ride)

Run L confirmed:
- Container uid = 61206 (Azure auto-injects --user mapping the host
  agent uid). MPS uid match is already satisfied.
- /tmp/nvidia-mps is bind-mounted, owned by swx-azure-svc_azpcontainer:
  systemd-journal_azpcontainer (uid 61206).

Yet pytest -n 4 with the full test_ucxx suite hangs:
- gw1 makes progress through tens of test_arr tests (numpy variants
  pass).
- gw0/gw2/gw3 hang in test_ucx_address, test_Array_ndarray_ptr[..-cupy],
  test_Array_ndarray_is_cuda[..-cupy] (cupy buffer alloc -> cuMalloc
  blocks forever).
- After 240s the timeout_with_stack.py outer timer fires; all four
  workers go "node down: Not properly terminated" simultaneously.

Pattern is HANG, not SIGKILL. Direction: cupy/cuMalloc through MPS
hangs. Either the MPS daemon on the host is stuck, or the per-uid
MPS state for swx-azure-svc is corrupt.

Run M: set CUDA_MPS_PIPE_DIRECTORY to a non-existent /tmp dir per
phase so the CUDA client falls back to direct GPU access (no MPS).
Memory `mlnx-ci-mps-container-uid` lists this as a rejected
workaround for the long term, but for diagnostic it tells us
unambiguously whether MPS is the stuck component:
- If tests pass under bypass: confirm host MPS daemon needs a restart;
  hand back to host-side fix.
- If tests still hang: cupy/conda is broken regardless of MPS, look
  elsewhere (driver/runtime, nvjitlink version, etc).

Applied to all four GPU test phases. Diag identity print retained.
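The bypass amounts to pointing the CUDA client at a pipe directory that does not exist. A minimal sketch, assuming a hypothetical per-process directory name; whether the client then falls back to direct GPU access is the premise of Run M, not something this snippet verifies:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical name; any path that does not exist serves the purpose.
bypass_dir="/tmp/ucxx-no-mps-$$"
if [ -e "$bypass_dir" ]; then
    echo "FATAL: $bypass_dir exists; pick another name" >&2
    exit 1
fi

# CUDA clients look for the MPS control pipe under this directory.
export CUDA_MPS_PIPE_DIRECTORY="$bypass_dir"
echo "CUDA_MPS_PIPE_DIRECTORY=$CUDA_MPS_PIPE_DIRECTORY"
```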
Run M (MPS bypass via CUDA_MPS_PIPE_DIRECTORY override) was a big
breakthrough: the 240s hangs are gone, tests now finish in ~50s and
fail with legible errors.

wheel-tests summary line: `30 failed, 236 passed, 2 skipped, 13
warnings in 49.94s`. All 30 failures share one error:

  cupy_backends.cuda.api.runtime.CUDARuntimeError:
  cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or
  unavailable

Diagnosis: the MLNX A40 GPUs are in `Exclusive_Process` compute mode.
With MPS in front, multiple processes can share the device. With MPS
bypassed, only one process at a time can claim it. xdist `-n 4`
spawns four pytest workers; each tries `cupy.empty()` -> `cudaMalloc`;
three lose with `cudaErrorDevicesUnavailable`.

Fix: re-introduce the `sed -i 's/pytest -n 4/pytest -n 1/g'
ci/run_python.sh` patch we had in earlier runs (dropped in Run L
during cleanup). Single worker -> no GPU race.

Applied to:
- test_python phase (under IS_GPU)
- test_wheel_ucxx / test_wheel_distributed_ucxx phases (run_python.sh
  is the script that hardcodes -n 4; sed-patching there covers both
  the cython and wheel test paths).
- test_python_distributed phase: NOT patched - run_python_distributed.sh
  invocations don't pass -n at all (single process by default).

Distributed-ucxx failed Run M at test_ucxx_localcluster[True-ucx]
with "UCX is not initialized". Separate bug, not addressed here -
expect that leg to still fail.
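The compute-mode diagnosis can be checked from the shell before choosing a worker count. This is a sketch, not part of the PR: the `workers` policy is hypothetical, and the code falls back to "unknown" where nvidia-smi is absent or fails.

```shell
#!/usr/bin/env bash
# Query the compute mode of each visible GPU, tolerating hosts without a driver.
if command -v nvidia-smi >/dev/null 2>&1; then
    compute_mode=$(nvidia-smi --query-gpu=compute_mode --format=csv,noheader 2>/dev/null || echo unknown)
else
    compute_mode=unknown
fi
compute_mode=${compute_mode:-unknown}

# Hypothetical policy: one pytest worker when any GPU is Exclusive_Process
# (only one process may hold the device without MPS), else upstream's -n 4.
case "$compute_mode" in
    *Exclusive_Process*) workers=1 ;;
    *)                   workers=4 ;;
esac

echo "compute_mode=$compute_mode workers=$workers"
```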
Run O probes whether the UCXX-Python GPU failures are caused by xdist
spawning multiple worker processes that each cuInit through MPS in
quick succession (overflowing the daemon's accept queue / per-uid
client limit). control.log on the host shows
`Unable to accept connection` in a loop, which matches an MPS server
unable to keep up with rapid-fire client connects.

Evidence base for this hypothesis:
- UCX C++ gtest works on the same MLNX hosts under MPS - single
  process, one cuInit, one MPS handshake.
- UCXX test_python uses pytest + xdist `-n 4`: pytest_main +
  4 worker processes = 5 cuInit attempts in seconds.
- Build 123304 (first UCXX_tests GPU run on PR openucx#11473, day one):
  same failure fingerprint as today - multiple workers go
  "node down: Not properly terminated" simultaneously on cupy
  variants of test_arr.py + test_ucx_address; numpy variants pass
  in parallel.
- Earlier `-n 1` runs (Run C) still failed: but `-n 1` is
  pytest_main + 1 worker = 2 cuInit attempts, still multi-process.

Probe:
- Remove the MPS bypass (CUDA_MPS_PIPE_DIRECTORY override) from
  test_python, test_python_distributed, and the wheel test phases.
  MPS is active in this run.
- Sed-patch `ci/run_python.sh`: replace `pytest -n 4` with
  `pytest -p no:xdist` (disables the xdist plugin entirely, single
  process pytest, single cuInit).
- run_python_distributed.sh already invokes pytest without `-n`, no
  patch needed there.

Pass criteria:
- If single-process pytest under MPS gets through the test_arr cupy
  variants + test_ucx_address without a 240s hang, the hypothesis is
  confirmed. The real fix becomes: keep pytest single-process for
  GPU slices, or set CUDA_VISIBLE_DEVICES so workers don't all
  cuInit through MPS.
- If the same hang pattern repeats with one pytest process, the
  hypothesis is wrong - look at the rapidsai-ci-conda image / cuda stack.

nvidia-container-runtime auto-mounts /tmp/nvidia-mps inside any container
started with --gpus all; explicit `-v /tmp/nvidia-mps:/tmp/nvidia-mps`
on top of that is a no-op duplicate. Verified live on
swx-rdmz-ucx-gpu-01: the MPS pipe + control socket show up under
/tmp/nvidia-mps inside the container even when the bind is omitted.
Other UCX gpu test containers in this file already follow this
convention.

Dropped from both ucxx_rapidsai_ci_conda_gpu and ucxx_rapidsai_ci_wheel_gpu
container options; added a comment so the next reader doesn't add the
bind back.
@Alexey-Rivkin Alexey-Rivkin force-pushed the ucxx-azure-tests branch 2 times, most recently from 47c45ce to 6967ff3 Compare May 28, 2026 19:46
UCXX-Python pytest wedges the MPS daemon on swx-rdmz-ucx-gpu-01/-02 to
the point that UCX CI on the same hosts is also blocked. Disable until
a safe recipe is agreed upstream:

- ucxx_tests.yml: drop the `test_python` step on GPU slices. GPU
  `test_cpp` keeps running (it passes); CPU slices unchanged.
- ucxx_build.yml: `condition: false` on the three GPU Python test
  jobs: wheel-tests-ucxx, wheel-tests-distributed-ucxx,
  conda-python-distributed-tests.

CPU build matrix + GPU C++ + docs + devcontainer + checks still run.
Restore once a non-wedging recipe lands.

Peter (UCXX team) confirmed distributed-ucxx will not be upstreamed
(Dask-specific plugin, lives in the repo for convenience only). Strip
its wheel-build, wheel-test, and conda-test jobs plus the
build_ucxx.sh / test_ucxx.sh phases that backed them.

Also drop the UCX_LOG_FILE=/tmp/ucx_%P.log redirect on GPU Python
phases: it sent UCX_LOG_LEVEL=DEBUG output to a container-local file
that never surfaced in Azure pipeline logs - exactly the visibility
Peter flagged ("none from UCX. At least locally I see some UCX logs
loading libucs.so.0..."). Let UCX/UCXPY debug go to stderr so the
build console shows it.

Restructure test_ucxx.sh test_cpp so the GPU path runs first (plain
ci/test_cpp.sh, CUDA enabled) and the CPU path explicitly env-prefixes
the CUDA disables. Adds a header comment naming the pool for each
branch. Same runtime behavior; the diff just no longer reads as "CUDA
disabled for C++ tests" at a glance.

Host yama.ptrace_scope is now 0 on the GPU nodes, so the gdb attach
that timeout_with_stack.py performs on hang should actually grab
useful stacks. Drop the condition: false on ucxx_wheel_tests_ucxx
and restore the test_python step on the UCXX_tests GPU slices so the
hanging path actually runs and trips the timeout harness.

The rapidsai/ci-conda env carries cupy by default, so the cupy
parametrize on test_arr.py::test_Array_ndarray_* does not skip via
importorskip on CPU runners. Without a GPU bound to the container
those 30 cases fail every time. Run test_python only on the GPU
slices, where cupy actually has a device.

Build 123980 captured the gdb stack: any UCXContext() dlopens
libuct_cuda.so whose ctor at ucs/sys/module.c -> cuda_md.c:161 calls
cuInit(0), which blocks in recvmsg() on the MPS daemon socket. Each
run accumulates wedge state on the daemon. Until we have a recipe
that avoids loading the UCT cuda module on these hosts, skip:
  - Python tests step in ucxx_tests.yml (all slices: CPU has a
    separate cupy-importorskip bug; GPU hits the cuInit wedge)
  - ucxx_wheel_tests_ucxx_* job in ucxx_build.yml (GPU only, same)

Also trim test_ucxx.sh: drop diag_container_identity helper, drop
long-form rationale comments left over from earlier runs, fold
rapids-download shim writes into one printf each.

GPU C++ gtest also fires cuInit at startup via the same
UCT-cuda-module-load path as Python: one handshake per build.
Slower wedge accumulation than Python (which fires dozens per
build), but still cumulative. Drop both GPU slices from the stage
so MLNX MPS daemons stop drifting toward the wedged state on every
PR push. Build phase on GPU was identical to CPU anyway, so no
unique coverage is lost.

Plain ucx_docker / ucx_gpu demand has the same effect on our MLNX
agents (capability is set to "yes" on every node that has it) and
matches the form ucxx_tests.yml already uses for its own slices.

The high nofile cap came from upstream rapidsai/ucxx wheel-test
container, sized for distributed-ucxx Dask workloads (many concurrent
connections). distributed-ucxx is no longer mirrored in this CI, so
the default ulimit is enough. Keep --shm-size=8g for UCX shm transport.

--shm-size is silently ignored when --ipc=host is set: the container
shares the host /dev/shm and the flag only sizes the container's own
shm namespace. --ipc=host stays (MPS daemon + cuda_ipc need it).

Fold the two real WHY notes (no IB, seccomp/apparmor for gdb) into
three lines. Drop the line documenting the removed /tmp/nvidia-mps
bind - that was a ghost comment about absent code.

Labels

WIP-DNM Work in progress / Do not review
