Gtest roce infra by NirWolfer · Pull Request #11490 · openucx/ucx

NirWolfer · 2026-05-26T12:09:20Z

No description provided.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

Gtest failures when running in k8s env, as unlimites max_threads cause resource exhaustion. Setting the CPU affinity will limit max_threads to 2 dynamically. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

RoCE on Blossom hits a crash in rdma_get_cm_event during test_ucp_sockaddr.ep_query that doesn't reproduce on Azure RoCE. Land IB (CX8) coverage first; RoCE will return in a separate PR after the rdmacm path is triaged. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

NirWolfer · 2026-05-26T12:09:32Z

/build

NirWolfer · 2026-05-26T13:01:34Z

/build

The matrix runner pinned gtest to 2 CPUs to keep thread counts bounded in k8s. Under multi-threaded variants the UCS async thread (one per process) contended on those 2 cores with the gtest main thread, worker threads, and UCP progress threads. Async starvation grew the missed events queue and pushed rdmacm events through __ucs_async_poll_missed, exposing a residual race in uct_rdmacm_cm_event_handler that segfaulted inside rdma_get_cm_event. Widen the affinity mask to up to 4 CPUs (clamped to the cgroup's allowed range), still well under the pod's 10-core request. Keeps the thread-count rationale intact while giving async breathing room. Signed-off-by: NirWolfer <nwolfer@nvidia.com>

NirWolfer · 2026-05-27T11:08:44Z

@svc-nvidia-pr-review

svc-nvidia-pr-review · 2026-05-27T11:08:48Z

🤖 Starting review — findings will be posted here when done.

svc-nvidia-pr-review · 2026-05-27T11:11:10Z

OK. UCS_ASYNC_BLOCK is recursive in UCX. Good.

Now I have a complete picture. Let me also check whether the mixing of bash variable expansion with the Jenkins YAML interpolation in timeout: "${STEP_TIMEOUT_MINUTES}" works. That's a YAML/Jenkins concern, not a code concern.

One more concern about the bash CPU pinning in test_matrix.yaml: it uses [[ syntax requiring bash. if [[ "${first_chunk}" == *-* ]] etc. The shebang isn't shown but Jenkins typically runs under bash. Acceptable.

Now I'll compile findings. The PR mixes two distinct changes: CI infra (.ci/pipeline/test_matrix.yaml) and a real correctness fix in rdmacm. Per REVIEW.md, "Bug fix PRs — must contain only the fix and its tests". Mixing CI changes with a bug fix is borderline, though arguably the CI changes are needed to expose/test the fix (the comment in the YAML mentions "exposed a poll_missed race in rdmacm"). This is worth flagging as a scope question but not a blocker.

Let me also check the comment on line 488-490 "Hold the cm's async block over all librdmacm calls on this channel so they cannot race with rdma_get_cm_event() running in the cm event handler." The cm event handler is uct_rdmacm_cm_event_handler which calls rdma_get_cm_event BEFORE acquiring the block. So the block does NOT actually protect against rdma_get_cm_event racing — it protects against uct_rdmacm_cm_process_event which is what dispatches per-event handlers. The comment is slightly inaccurate.

Looking again at the event handler:

ret = rdma_get_cm_event(cm->ev_ch, &event);
...
UCS_ASYNC_BLOCK(uct_rdmacm_cm_get_async(cm));
uct_rdmacm_cm_process_event(cm, event);
UCS_ASYNC_UNBLOCK(...)

So rdma_get_cm_event itself runs unblocked. The block only protects event processing. The comments in the diff at lines 488-490, 598-600, 52-55 are technically misleading: blocking does NOT prevent races with rdma_get_cm_event. It prevents races with process_event (the handlers that mutate cep state). However, since librdmacm's own data structures are presumably thread-safe internally, the actual fix is preventing the EP/listener state from being mutated concurrently.

This is a meaningful documentation accuracy issue worth flagging.

Let me finalize findings.

[
{"file": ".ci/pipeline/test_matrix.yaml", "line": 45, "severity": "minor", "comment": "renaming ${axis_index} to ${worker} is fine for this matrix where the only axis is worker, but test_gpu_matrix.yaml already uses the same ${name}/${arch}/w${worker} form — pls double-check the matrix engine actually exposes the axis name as a placeholder here (the previous code used ${axis_index} which suggests it might not)."},
{"file": ".ci/pipeline/test_matrix.yaml", "line": 57, "severity": "minor", "comment": "the previous timeout was 360 (minutes, matching the 90 min job timeout x4), now it's 180. With timeout_minutes: 240 on the job and STEP_TIMEOUT_MINUTES: 180 on the step, a single slow worker can still consume most of the job budget. Is 180 enough headroom given valgrind/asan disabled but MT tests now exposed?"},
{"file": ".ci/pipeline/test_matrix.yaml", "line": 83, "severity": "minor", "comment": "the taskset -c mask is built from the first cpu range only; if Cpus_allowed_list looks like 2,5,7-10 we drop everything after the first comma and bind to a single cpu, which silently shrinks parallelism. Worth either documenting the assumption or handling the comma-separated case."},
{"file": ".ci/pipeline/test_matrix.yaml", "line": null, "severity": "minor", "comment": "scope: this PR mixes a real correctness fix in rdmacm with a CI infra rework (new HCA images, CPU pinning, timeout/limit changes, ASAN_CHECK removal, taskName rename). Per REVIEW.md bug-fix PRs should contain only the fix and its tests — can the CI changes be split into a separate PR, keeping only the minimum needed to reproduce the rdmacm race here?"},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 491, "severity": "info", "comment": "the comment says the block prevents races with rdma_get_cm_event(), but the cm event handler calls rdma_get_cm_event() before taking UCS_ASYNC_BLOCK (see uct_rdmacm_cm_event_handler in rdmacm_cm.c). What this block actually serializes is uct_rdmacm_cm_process_event() and the per-ep state it mutates — pls update the wording so future readers don't look for a guarantee that isn't there. Same comment applies at rdmacm_cm_ep.c:600 and rdmacm_listener.c:55."},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 637, "severity": "minor", "comment": "the comment on L598-600 says server_send_priv_data takes the same recursive block on the success path, suggesting continuous coverage — but here we UNBLOCK before calling it, so there is a small unblocked window between the two. Either move the UNBLOCK after the call (relying on recursion) or drop the misleading wording."},
{"file": "src/uct/ib/rdmacm/rdmacm_listener.c", "line": 140, "severity": "info", "comment": "minor: the introduced status local can be avoided by keeping the existing return uct_rdmacm_cm_ack_event(event); pattern if you UNBLOCK before the return — though as written it's clearer. Up to you."}
]

Wait, I need to re-check if worker is actually bound from the axis. Let me drop my speculation since I can't verify, and tighten findings.

Also, I should not include an item about ${worker} since test_gpu_matrix.yaml already uses it successfully on line 64. So this likely works.

[
{"file": ".ci/pipeline/test_matrix.yaml", "line": 83, "severity": "minor", "comment": "the affinity mask is built from first_chunk only — if Cpus_allowed_list is something like 2,5,7-10 we silently drop everything after the first comma and bind to a single cpu, shrinking parallelism. Either document the assumption (k8s gives a single contiguous range) or handle the comma case."},
{"file": ".ci/pipeline/test_matrix.yaml", "line": null, "severity": "minor", "comment": "scope: this PR bundles a real rdmacm race fix with a fairly large CI infra rework (new HCA RoCE/IB images, CPU pinning, timeout/limit changes, taskName rename, ASAN_CHECK removal). Per REVIEW.md, bug-fix PRs should contain only the fix and its tests — can the CI changes be split out, keeping only what's strictly needed to reproduce the race here?"},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 491, "severity": "info", "comment": "the comment says the block prevents races with rdma_get_cm_event(), but in uct_rdmacm_cm_event_handler rdma_get_cm_event() runs outside the block — only uct_rdmacm_cm_process_event() is covered. What we actually serialize against is the per-ep state mutated by the dispatched handlers. Pls reword so readers don't look for a guarantee that isn't there (same applies to the comment at rdmacm_cm_ep.c:600 and rdmacm_listener.c:55)."},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 637, "severity": "minor", "comment": "the comment above says server_send_priv_data takes the same recursive block on the success path, implying continuous coverage — but here we UNBLOCK before calling it, so there's a brief unblocked window. Either keep the block held across the call (relying on recursion in server_send_priv_data) or drop the misleading wording."}
]

[
  {"file": ".ci/pipeline/test_matrix.yaml", "line": 83, "severity": "minor", "comment": "the affinity mask is built from `first_chunk` only — if `Cpus_allowed_list` is e.g. `2,5,7-10` we silently drop everything after the first comma and bind to a single cpu, shrinking parallelism. Either document the assumption (k8s gives a single contiguous range) or handle the comma case."},
  {"file": ".ci/pipeline/test_matrix.yaml", "line": null, "severity": "minor", "comment": "scope: this PR bundles a real rdmacm race fix with a fairly large CI infra rework (new HCA RoCE/IB images, CPU pinning, timeout/limit changes, taskName rename, ASAN_CHECK removal). Per REVIEW.md, bug-fix PRs should contain only the fix and its tests — can the CI changes be split out, keeping only what's strictly needed to reproduce the race here?"},
  {"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 491, "severity": "info", "comment": "the comment says the block prevents races with `rdma_get_cm_event()`, but in `uct_rdmacm_cm_event_handler` the `rdma_get_cm_event()` call runs *outside* the block — only `uct_rdmacm_cm_process_event()` is covered. What we actually serialize against is the per-ep state mutated by the dispatched handlers. Pls reword so readers don't look for a guarantee that isn't there (same applies to the comment at rdmacm_cm_ep.c:600 and rdmacm_listener.c:55)."},
  {"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 637, "severity": "minor", "comment": "the comment above says `server_send_priv_data` takes the same recursive block on the success path, implying continuous coverage — but here we UNBLOCK before calling it, so there's a brief unblocked window. Either keep the block held across the call (relying on recursion in `server_send_priv_data`) or drop the misleading wording."}
]

Signed-off-by: NirWolfer <nwolfer@nvidia.com>

NirWolfer · 2026-06-10T13:37:43Z

@pr-test

svc-nvidia-pr-review · 2026-06-10T13:37:50Z

🤖 Starting review — findings will be posted here when done.

svc-nvidia-pr-review · 2026-06-10T13:40:29Z

Code Review

ℹ️ INFO

scope: the in-line comment on L63-66 says CPU pinning was needed because a 2-cpu mask exposed a poll_missed race in rdmacm. If this PR also carries the rdmacm fix, per REVIEW.md bug-fix PRs should contain only the fix and its tests — can the CI rework (new HCA RoCE/IB images, cpu/memory limit changes, taskName rename, ASAN_CHECK removal, ulimit/taskset wrapper) be split out, keeping only what's strictly needed to reproduce the race here?

NirWolfer · 2026-06-10T14:11:01Z

@pr-test

svc-nvidia-pr-review · 2026-06-10T14:11:12Z

🤖 Starting review — findings will be posted here when done.

svc-nvidia-pr-review · 2026-06-10T14:13:08Z

-
+  limits: "{memory: 16Gi, cpu: 10000m}"
+  requests: "{memory: 16Gi, cpu: 10000m}"
+# mofed-26.04-0.7.6.0


minor: this header comment says mofed-26.04-0.7.6.0 but the image URLs on L20 and L31 are mofed-24.10-3.2.5.0. Pls update the comment to match the actual image tag, or update the images if 26.04 was the intent.

NirWolfer · 2026-06-10T14:15:31Z

@pr-test

svc-nvidia-pr-review · 2026-06-10T14:15:35Z

🤖 Starting review — findings will be posted here when done.

svc-nvidia-pr-review

Code Review

3 findings

3 findings posted as inline comments.

NirWolfer · 2026-06-10T14:19:46Z

@pr-test

svc-nvidia-pr-review · 2026-06-10T14:19:54Z

🤖 Starting review — findings will be posted here when done.

svc-nvidia-pr-review · 2026-06-10T14:22:19Z

-
+  limits: "{memory: 16Gi, cpu: 10000m}"
+  requests: "{memory: 16Gi, cpu: 10000m}"
+# mofed-26.04-0.7.6.0


minor: the # mofed-26.04-0.7.6.0 header comment doesn't match the image tags below (both URLs use mofed-24.10-3.2.5.0) — pls remove or update it so future readers aren't misled about which mofed is actually pulled.

svc-nvidia-pr-review · 2026-06-10T14:22:19Z

+      caps_add: "[ IPC_LOCK, NET_RAW ]"
+    }
+  # HCA IB (CX8)
+  - {


why drop ASAN_CHECK: "no" here when test_gpu_matrix.yaml and test_dl_matrix.yaml still set it? keeping it makes the default explicit and the matrices consistent.

svc-nvidia-pr-review · 2026-06-10T14:22:19Z

+  # HCA IB (CX8)
+  - {
+      name: "hca-ib",
+      url: "harbor.mellanox.com/hpcx/x86_64/ubuntu24.04/builder:mofed-24.10-3.2.5.0",


minor: STEP_TIMEOUT_MINUTES is referenced only on the very next step. either inline the literal 180 or rename to TEST_TIMEOUT to match the sibling matrices (test_gpu_matrix.yaml / test_dl_matrix.yaml).

Alexey-Rivkin added 11 commits May 19, 2026 12:05

CI: Target IB-capable nodes for tests

ab6316b

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: Add Bluefield tests

a61f46d

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: Dynamiclayy set CPU affinity

5348ca5

Gtest failures when running in k8s env, as unlimites max_threads cause resource exhaustion. Setting the CPU affinity will limit max_threads to 2 dynamically. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: Test SRIOV settings

c24856b

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: Rename to IB as opposed to RoCE

a49719f

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: address PR comments

430f99f

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: remove ASAN-only override from gtest matrix run

4c3d5ce

CI: document CPU affinity limitation in gtest matrix

d45deed

CI: Add RoCE (CX7) test target alongside IB (CX8)

21fad89

Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

CI: Add RoCE (CX7) test target alongside IB (CX8)

4d8ef2f

Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

NirWolfer force-pushed the gtest_roce_infra branch from 56cfcf8 to 85aa477 Compare May 26, 2026 12:59

NirWolfer force-pushed the gtest_roce_infra branch from 85aa477 to 5cac0af Compare May 26, 2026 14:26

NirWolfer force-pushed the gtest_roce_infra branch from d6bc784 to 36911bc Compare May 27, 2026 10:14

NirWolfer force-pushed the gtest_roce_infra branch from 36911bc to 4b3f6a2 Compare May 27, 2026 11:14

Upgrade builder docker to newer version

09f0b65

Signed-off-by: NirWolfer <nwolfer@nvidia.com>

NirWolfer force-pushed the gtest_roce_infra branch from 4b3f6a2 to 09f0b65 Compare May 27, 2026 11:44

CI: set ulimit -l unlimited for gtest

75f78f5

Signed-off-by: NirWolfer <nwolfer@nvidia.com>

NirWolfer force-pushed the gtest_roce_infra branch 2 times, most recently from 4ec8f8d to 75f78f5 Compare May 28, 2026 09:26