Gtest roce infra#11490
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Gtest fails when running in a k8s env, as unlimited max_threads causes resource exhaustion. Setting the CPU affinity will limit max_threads to 2 dynamically. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
RoCE on Blossom hits a crash in rdma_get_cm_event during test_ucp_sockaddr.ep_query that doesn't reproduce on Azure RoCE. Land IB (CX8) coverage first; RoCE will return in a separate PR after the rdmacm path is triaged. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Run RoCE first, then IB. Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
/build
56cfcf8 to 85aa477
/build
85aa477 to 5cac0af
The matrix runner pinned gtest to 2 CPUs to keep thread counts bounded in k8s. Under multi-threaded variants the UCS async thread (one per process) contended on those 2 cores with the gtest main thread, worker threads, and UCP progress threads. Async starvation grew the missed events queue and pushed rdmacm events through __ucs_async_poll_missed, exposing a residual race in uct_rdmacm_cm_event_handler that segfaulted inside rdma_get_cm_event. Widen the affinity mask to up to 4 CPUs (clamped to the cgroup's allowed range), still well under the pod's 10-core request. Keeps the thread-count rationale intact while giving async breathing room. Signed-off-by: NirWolfer <nwolfer@nvidia.com>
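The widening described in this commit message can be sketched in Python as follows. This is an illustration only, not the matrix runner's actual step (which is a bash snippet in test_matrix.yaml): the helper names `pick_affinity` and `pin_current_process` and the `max_cpus` parameter are hypothetical, and the allowed set is assumed to come from `os.sched_getaffinity(0)`, i.e. the CPUs the process may currently run on (Linux only).

```python
import os


def pick_affinity(allowed, max_cpus=4):
    """Return up to max_cpus CPU ids from the allowed set, lowest ids first.

    If fewer than max_cpus CPUs are allowed, the result is clamped to
    whatever is available instead of failing.
    """
    if max_cpus < 1:
        raise ValueError("max_cpus must be at least 1")
    return set(sorted(allowed)[:max_cpus])


def pin_current_process(max_cpus=4):
    """Hypothetical helper: narrow this process to at most max_cpus CPUs."""
    allowed = os.sched_getaffinity(0)   # CPUs we are allowed to use now
    target = pick_affinity(allowed, max_cpus)
    os.sched_setaffinity(0, target)     # never asks for a CPU outside `allowed`
    return target
```

Because the target is always a subset of the currently allowed CPUs, a pod that is confined to fewer than 4 CPUs simply keeps what it has, which is the "clamped to the cgroup's allowed range" behaviour the commit message describes.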
d6bc784 to 36911bc
🤖 Starting review — findings will be posted here when done. |
OK. Now I have a complete picture.
Let me also check whether the mixing of bash variable expansion with the Jenkins YAML interpolation in
One more concern about the bash CPU pinning in test_matrix.yaml: it uses
Now I'll compile findings. The PR mixes two distinct changes: CI infra (
Let me also check the comment on line 488-490 "Hold the cm's async block over all librdmacm calls on this channel so they cannot race with rdma_get_cm_event() running in the cm event handler."
The cm event handler is
Looking again at the event handler:
ret = rdma_get_cm_event(cm->ev_ch, &event);
...
UCS_ASYNC_BLOCK(uct_rdmacm_cm_get_async(cm));
uct_rdmacm_cm_process_event(cm, event);
UCS_ASYNC_UNBLOCK(...)
So
This is a meaningful documentation accuracy issue worth flagging. Let me finalize findings.
[
Wait, I need to re-check if
Also, I should not include an item about
[
[
{"file": ".ci/pipeline/test_matrix.yaml", "line": 83, "severity": "minor", "comment": "the affinity mask is built from `first_chunk` only — if `Cpus_allowed_list` is e.g. `2,5,7-10` we silently drop everything after the first comma and bind to a single cpu, shrinking parallelism. Either document the assumption (k8s gives a single contiguous range) or handle the comma case."},
{"file": ".ci/pipeline/test_matrix.yaml", "line": null, "severity": "minor", "comment": "scope: this PR bundles a real rdmacm race fix with a fairly large CI infra rework (new HCA RoCE/IB images, CPU pinning, timeout/limit changes, taskName rename, ASAN_CHECK removal). Per REVIEW.md, bug-fix PRs should contain only the fix and its tests — can the CI changes be split out, keeping only what's strictly needed to reproduce the race here?"},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 491, "severity": "info", "comment": "the comment says the block prevents races with `rdma_get_cm_event()`, but in `uct_rdmacm_cm_event_handler` the `rdma_get_cm_event()` call runs *outside* the block — only `uct_rdmacm_cm_process_event()` is covered. What we actually serialize against is the per-ep state mutated by the dispatched handlers. Pls reword so readers don't look for a guarantee that isn't there (same applies to the comment at rdmacm_cm_ep.c:600 and rdmacm_listener.c:55)."},
{"file": "src/uct/ib/rdmacm/rdmacm_cm_ep.c", "line": 637, "severity": "minor", "comment": "the comment above says `server_send_priv_data` takes the same recursive block on the success path, implying continuous coverage — but here we UNBLOCK before calling it, so there's a brief unblocked window. Either keep the block held across the call (relying on recursion in `server_send_priv_data`) or drop the misleading wording."}
]
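The first finding above (only the first comma-separated chunk of `Cpus_allowed_list` is used) can be illustrated with a small comma-aware parser. This is a sketch in Python rather than the bash used in test_matrix.yaml; the function names `parse_cpu_list` and `read_allowed_cpus` are hypothetical, and the only assumption about the system is that `/proc/self/status` carries a `Cpus_allowed_list:` line, as it does on Linux.

```python
def parse_cpu_list(text):
    """Expand a kernel-style CPU list such as "2,5,7-10" into sorted CPU ids."""
    cpus = set()
    for chunk in text.strip().split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        if "-" in chunk:
            # A range like "7-10" is inclusive on both ends.
            low, high = chunk.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(chunk))
    return sorted(cpus)


def read_allowed_cpus(status_path="/proc/self/status"):
    """Hypothetical helper: read Cpus_allowed_list for the current process."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return []
```

With the reviewer's example, `parse_cpu_list("2,5,7-10")` yields six CPUs, whereas taking only the first chunk would yield just CPU 2.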
36911bc to 4b3f6a2
Signed-off-by: NirWolfer <nwolfer@nvidia.com>
4b3f6a2 to 09f0b65
Signed-off-by: NirWolfer <nwolfer@nvidia.com>
4ec8f8d to 75f78f5
Code Review
ℹ️ INFO scope: the in-line comment on L63-66 says CPU pinning was needed because a 2-cpu mask
| limits: "{memory: 16Gi, cpu: 10000m}" | ||
| requests: "{memory: 16Gi, cpu: 10000m}" | ||
| # mofed-26.04-0.7.6.0 |
minor: this header comment says mofed-26.04-0.7.6.0 but the image URLs on L20 and L31 are mofed-24.10-3.2.5.0. Pls update the comment to match the actual image tag, or update the images if 26.04 was the intent.
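A mismatch like this one can be caught mechanically. The sketch below is only an illustration, not part of the PR: the function name `stale_comment_tags` is hypothetical, and it assumes the header comments have the form `# mofed-<version>` and the image URLs end in `builder:mofed-<version>`, as in the lines quoted in this review.

```python
import re


def stale_comment_tags(yaml_text):
    """Return mofed tags named in comments that no image URL actually uses."""
    comment_tags = set(re.findall(r"#\s*(mofed-[\w.\-]+)", yaml_text))
    image_tags = set(re.findall(r"builder:(mofed-[\w.\-]+)", yaml_text))
    return sorted(comment_tags - image_tags)
```

Run against the matrix file's text, an empty result would mean every `# mofed-...` comment matches a tag that is really pulled.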
svc-nvidia-pr-review
left a comment
Code Review
3 findings
3 findings posted as inline comments.
| limits: "{memory: 16Gi, cpu: 10000m}" | ||
| requests: "{memory: 16Gi, cpu: 10000m}" | ||
| # mofed-26.04-0.7.6.0 |
minor: the # mofed-26.04-0.7.6.0 header comment doesn't match the image tags below (both URLs use mofed-24.10-3.2.5.0) — pls remove or update it so future readers aren't misled about which mofed is actually pulled.
| caps_add: "[ IPC_LOCK, NET_RAW ]" | ||
| } | ||
| # HCA IB (CX8) | ||
| - { |
why drop ASAN_CHECK: "no" here when test_gpu_matrix.yaml and test_dl_matrix.yaml still set it? keeping it makes the default explicit and the matrices consistent.
| # HCA IB (CX8) | ||
| - { | ||
| name: "hca-ib", | ||
| url: "harbor.mellanox.com/hpcx/x86_64/ubuntu24.04/builder:mofed-24.10-3.2.5.0", |
minor: STEP_TIMEOUT_MINUTES is referenced only on the very next step. either inline the literal 180 or rename to TEST_TIMEOUT to match the sibling matrices (test_gpu_matrix.yaml / test_dl_matrix.yaml).
No description provided.