Conversation

@iyastreb (Contributor) commented Oct 6, 2025:

What?

Fixed 4 races in progress, plus a bug fix and a new test:

1. Read-after-write race on ep->sq_wqe_pi. By the time we read ep->sq_wqe_pi, it may already have been updated by another thread and may be larger than the current wqe_cnt. Subtracting sq_wqe_pi from wqe_cnt then gives a negative result, so we advance sq_wqe_pi by almost 64K and get wrong completion results (see the worked example after this list).

2. CQE race.
   The maximum CQ size = WQ size (=1024 on rock).
   The problem happens when window_size * num_threads > CQ size. The root cause is that the CQ is polled simultaneously from multiple threads, and some threads advance much further than the others:

   • Thread1: starts processing CQE1
   • Thread2: starts processing CQE2
   • Thread2: posts a new WQE => hardware reuses CQE1
   • Thread1: reads from stale CQE1, which already belongs to hardware

   The fix is to validate the CQE after reading its content.
   Also increase the CQ size to reduce the probability of CQ overflow.

3. FC race.
   We used to check the PI for detecting forced completion:

   fc = doca_gpu_dev_verbs_wqe_idx_inc_mask(ep->sq_wqe_pi, ep->sq_wqe_num / 2);

   This is not reliable with many producing threads (256), because the PI is constantly updated by other threads and we can sometimes skip the forced update (when the PI jumps by more than half the queue). This leads to progress getting stuck for tests without the NODELAY flag and without completion.

4. Race between progress & reservation on sw_wqe_pi.
   Fix 1 alone does not fully solve the problem, because fetch_max does not work reliably when the counter is updated by other threads. The issue is solved with 2 counters; the performance impact will be checked in the ucx_perftest PR.

5. Fixed a bug with an uninitialized value for wqe_cnt:

   wqe_base = __shfl_sync(0xffffffff, wqe_base, 0);

   This may sync only the lower 32 bits of the value, while the upper 32 bits of wqe_base remain garbage, which leads to syndrome 68.

6. Added a litmus test for the fixes above.
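
Below is the worked example referenced in item 1, with illustrative numbers (not taken from a real run), showing how a stale read of the producer index turns into an almost-64K jump:

```cuda
#include <stdint.h>

/* Race 1, illustrative numbers: another thread has already advanced
 * ep->sq_wqe_pi past the wqe_counter of the CQE we are processing. */
static uint64_t race1_example(void)
{
    uint16_t wqe_cnt   = 0x0010; /* wqe_counter read from the CQE           */
    uint64_t sq_wqe_pi = 0x0012; /* already advanced by a concurrent thread */

    /* (wqe_cnt - sq_wqe_pi) & 0xffff == 0xfffe: the "negative" delta wraps
     * around, so the naive update advances the PI by almost 64K (to 0x10011) */
    return ((wqe_cnt - sq_wqe_pi) & 0xffff) + sq_wqe_pi + 1;
}
```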

Performance impact:

ucp_put_single_bw latency
Threads  BEFORE  AFTER  DIFF
1        7.6     7.7    1.3%
32       159     166    4.4%
64       217     225    3.7%
256      767     769    0.3%
512      1888    1890   0.1%

@iyastreb changed the title from "UCP/PERF: Fixed race in progress" to "UCP/PERF: Fixed races in progress" on Oct 6, 2025
sq_wqe_pi = ((wqe_cnt - sq_wqe_pi) & 0xffff) + sq_wqe_pi + 1;
/* Skip CQE if it's older than current producer index, could be already
* processed by another thread */
if (wqe_cnt < (sq_wqe_pi & 0xffff)) {
Contributor:

If sq_wqe_pi gets the value 0xffff, all completions with wqe_count != 0xffff will be dropped. Can't this get progress stuck?

Contributor Author:

In this scenario getting stuck is possible if for some reason we miss the update with wqe_count=0xffff.
This shouldn't happen; according to the algorithm we must not miss this update, because we consume CQ elements in serial order.
But I'm adding a litmus test that runs with a large number of threads and checks the invariants at the end, to make sure that this is not possible.

Contributor Author:

The litmus test actually proves that this change does get progress stuck in the scenario where we push many requests without completion: wqe_cnt jumps by 512, not by one, so we miss the update with the exact number and get stuck.
I'm modifying the algorithm.
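
For reference, a worked example with illustrative numbers of the batched-completion case that a delta-based update handles and an exact-index check misses:

```cuda
#include <stdint.h>

/* Illustrative numbers: 512 WQEs are posted without completions and all get
 * acknowledged by a single CQE, so wqe_cnt jumps far ahead of the cached PI.
 * Waiting for the exact next index misses this; the masked delta does not. */
static uint64_t batch_completion_example(void)
{
    uint16_t sq_wqe_pi = 0x0100;                         /* cached PI (low 16 bits)    */
    uint16_t wqe_cnt   = 0x02ff;                         /* wqe_counter from the CQE   */
    uint16_t delta     = (wqe_cnt - sq_wqe_pi) & 0xffff; /* 0x01ff earlier WQEs done   */

    return (uint64_t)sq_wqe_pi + delta + 1;              /* 0x0300: one past the batch */
}
```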


/* Add extra space for CQE to avoid override. This extra space is equal to
* the number of threads in a warp = maximum amount of threads that can
* read/write CQE simultaneously */
Contributor:

What if CQEs are being read from multiple warps?

Contributor Author:

At WARP level I don't see any problem (tested with up to 16 warps, 512 threads in total), because only the first thread of a warp does progress and there are no other warps running concurrently. We have the maximum concurrency at THREAD level.

Contributor:

Can you confirm this behavior with some documentation, or is it a purely experimental suggestion? Would it be safer to make cq_length twice as large as wq_length?

Contributor Author:

Yes, it would be safer to have a larger size.
The litmus test that I'm working on should reveal the potential issues; currently it gets stuck in some cases.

This logic is mostly driven by experiments, but I found confirmation in the PRM:

Producer Counter – A counter maintained by hardware and is incremented
for every CQE that is written to the CQ.
If CQ is full while another CQE needs to be posted to the CQ, and if overflow detection is
disabled, then old CQEs may be overwritten.

We do explicitly disable overflow detection:

cq_attr.flags      |= UCT_IB_MLX5_CQ_IGNORE_OVERRUN;

@ofirfarjun7 (Contributor) commented Oct 7, 2025:

Maybe we should start with synchronized post-WQE & process-CQE, maybe with a simple lock. It's probably bad for performance, but at least we'd know for sure that it solves the issue (if our analysis is correct).

I'm not saying we will merge it (we'd need to check performance), but once we see that it's stable, it will prove we identified all the problems correctly, and then we can start thinking of ways to improve it.

Contributor Author:

I think I have identified all the issues; I fixed 2 more races and will push soon.
IMO a lock is not an option.
What's important is that we now have a test that exercises all the flows (for now only for a single API, but I will extend it) under heavy load and indicates whether there are issues. So if my fixes are not enough, it will show that.

Contributor Author:

Discussed the lock approach with @Artemy-Mellanox, implemented it and evaluated it.
Basically the performance of the lock approach is not terribly slower.
In the worst case (each post comes with a completion, 256 threads) the difference is:

(1165 ms) with CAS
(1798 ms) with lock   +50%

For warps and workflows without explicit completions I don't see much difference, maybe within +5% max.
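
For illustration, a minimal sketch of the kind of lock-based variant that was evaluated, assuming a plain device-scope spinlock; the names and structure are hypothetical, not the code of this PR:

```cuda
/* Hypothetical device-wide lock serializing WQE posting and CQ polling. */
__device__ int gda_progress_lock = 0;

__device__ void gda_locked_progress(void)
{
    /* Spin until we own the lock; relies on independent thread scheduling
     * (Volta+) so that spinning threads cannot starve the lock holder. */
    while (atomicCAS(&gda_progress_lock, 0, 1) != 0) {
        /* busy wait */
    }
    __threadfence(); /* acquire: order the critical section after the lock */

    /* ... post WQE and/or poll the CQ here, fully serialized ... */

    __threadfence(); /* release: publish all updates before dropping the lock */
    atomicExch(&gda_progress_lock, 0);
}
```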


uint16_t completed_delta = (wqe_cnt - (uint16_t)sq_wqe_pi) & 0xffff;
new_pi = sq_wqe_pi + completed_delta + 1;
} while (!pi_ref.compare_exchange_weak(sq_wqe_pi, new_pi,
Contributor:

Assigning a value to sq_wqe_pi without checking the opcode for errors may lead to reporting success for a request that actually failed.
Also, this while loop looks strange: if sq_wqe_pi changed since it was loaded, it won't go back, so why retry?

Contributor Author:

Right, I didn't spend much time thinking about the error code; I mostly fixed the happy path.
Maybe we should use a separate variable for the error code to keep it simple?

Regarding the while loop, I'm not sure I understood the question.
The goal is to correctly increase the sq_wqe_pi counter based on the latest value, not on a stale copy. That is what CAS does: on failure it also updates sq_wqe_pi to the latest value.
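
For reference, a self-contained sketch of the retry loop being discussed, mirroring the quoted snippet (the cuda::atomic_ref setup is an assumption about how pi_ref is obtained, not taken from the PR):

```cuda
#include <stdint.h>
#include <cuda/atomic>

/* Advance the 64-bit producer index from the 16-bit wqe_counter of a CQE.
 * On CAS failure, compare_exchange_weak reloads sq_wqe_pi with the latest
 * value, so completed_delta is always recomputed against fresh data. */
__device__ void gda_advance_pi(uint64_t *pi, uint16_t wqe_cnt)
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_device> pi_ref(*pi);
    uint64_t sq_wqe_pi = pi_ref.load(cuda::std::memory_order_relaxed);
    uint64_t new_pi;

    do {
        uint16_t completed_delta = (wqe_cnt - (uint16_t)sq_wqe_pi) & 0xffff;
        new_pi = sq_wqe_pi + completed_delta + 1;
    } while (!pi_ref.compare_exchange_weak(sq_wqe_pi, new_pi,
                                           cuda::std::memory_order_relaxed));
}
```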

  */
  init_attr.cq_len[UCT_IB_DIR_TX] = iface->super.super.config.tx_qp_len *
-                                   UCT_IB_MLX5_MAX_BB;
+                                   UCT_IB_MLX5_MAX_BB * 2;
Contributor:

Not sure it completely solves the issue. This fix assumes that the distance between consumed entries can't exceed the queue size, but what makes this assumption correct? AFAIU, if consumers are not synchronized, the distance is not bounded.

Contributor Author:

Yes, this fix is not perfect.
As written above, the assumption is that the CQ length must be at least [WQ len + num_threads].
This is because we need to give concurrent readers/writers enough space so that they never intersect.

Maybe we fix it like:

#define MAX_THREADS 1024

cq_len = (tx_qp_len * MAX_BB) + MAX_THREADS;

Contributor Author:

Yes, you are absolutely right: faster threads may post multiple WQEs while slower ones are still inside the progress.
We need to find a real solution; this one (increasing the CQ size) does not fully solve it.

Contributor Author:

I have implemented the real solution: validate the CQE after reading its content.
I tested it with the original CQ size = TX size, and it always works.

Contributor:

So do we still need the CQ with extended length? From what I see, it's still in the code.

Contributor Author:

Yes, I updated the comment above to explain the reason:

/* CQE is being read/updated simultaneously by multiple threads.
     * Overflow detection is disabled for CQ, but overflow is handled in
     * progress by validating the CQE after reading the content.
     * We give CQ extra space (x2) to reduce the probability of CQ overflows.
     */

Contributor:

"We give CQ extra space (x2) to reduce the probability of CQ overflows" - so it's not required because we validate CQE after reading right?
Does keep the CQ size larger improve performance?

Contributor Author:

In fact I didn't manage to see any performance improvement from this size, so I reverted this change.

- uint64_t sq_wqe_pi = ep->sq_wqe_pi;
- sq_wqe_pi = ((wqe_cnt - sq_wqe_pi) & 0xffff) + sq_wqe_pi + 1;
+ uint64_t sq_wqe_pi = pi_ref.load(cuda::std::memory_order_relaxed);
+ uint64_t new_qwe_pi;
Contributor:

minor: new_wqe_pi?

Contributor Author:

fixed


/* Prevent reordering, validate that CQE is still valid */
__threadfence_block();
uint8_t op_owner_check = READ_ONCE(cqe64->op_own);
@ofirfarjun7 (Contributor) commented Oct 12, 2025:

Not sure I understand how it helps.
HW can take ownership of this CQE while/after we read wqe_cnt and then release it back to SW, so I'm not sure we can consider it "valid" in that case.

Maybe we can read cqe64->op_own and cqe64->wqe_counter atomically in a single op? If yes, maybe it's better/safer?

struct mlx5_cqe64 {
	u8		outer_l3_tunneled;
...
...
...
	__be16		wqe_counter;
	u8		signature;
	u8		op_own;
};
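
A minimal sketch of what such a single-load read could look like (the helper name is hypothetical; offsets follow the struct mlx5_cqe64 layout quoted above, with wqe_counter at byte 60 and op_own at byte 63 of the 64-byte, naturally aligned CQE):

```cuda
#include <stdint.h>

/* Read the last 8 bytes of the 64-byte CQE with one 64-bit load, so that
 * wqe_counter and op_own are taken from the same snapshot of the entry. */
__device__ void gda_read_cqe_tail(const void *cqe64, uint16_t *wqe_cnt,
                                  uint8_t *op_own)
{
    const volatile uint64_t *tail_ptr =
        (const volatile uint64_t *)((const uint8_t *)cqe64 + 56);
    uint64_t tail       = *tail_ptr;
    uint16_t wqe_cnt_be = (uint16_t)(tail >> 32); /* bytes 60-61, big-endian */

    /* CQE fields are big-endian, the GPU is little-endian */
    *wqe_cnt = (uint16_t)((wqe_cnt_be >> 8) | (wqe_cnt_be << 8));
    *op_own  = (uint8_t)(tail >> 56);             /* byte 63 */
}
```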

Contributor Author:

That's a great idea that can simplify the code and improve performance.
Implementing it

Contributor Author:

It greatly simplified the code, but surprisingly it does not bring any performance boost.

Contributor Author:

Actually it made things faster at lower thread counts.
With 128-256 threads the latency stays the same, but with 1-16 threads it reduced the overhead of this PR by half.
I updated the perf impact numbers in the PR description.

if (lane_id == 0) {
wqe_base = uct_rc_mlx5_gda_reserv_wqe_thread(ep, count);
} else {
/* Initialize with 0, because __shfl_sync may set only 32 bits */
Contributor:

Shouldn't we shuffle the higher 32 bits too?

Contributor Author:

ok, done
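
For reference, a sketch of the resulting broadcast: shuffle both 32-bit halves of the 64-bit wqe_base from lane 0 to the rest of the warp (an illustration of the fix discussed in this thread, using the names from the quoted snippet above, not the exact code):

```cuda
/* Broadcast the 64-bit wqe_base reserved by lane 0 to the whole warp by
 * shuffling both 32-bit halves, so the upper bits are no longer garbage. */
uint64_t wqe_base = 0; /* initialize as in the quoted snippet */
if (lane_id == 0) {
    wqe_base = uct_rc_mlx5_gda_reserv_wqe_thread(ep, count);
}

uint32_t lo = (uint32_t)wqe_base;
uint32_t hi = (uint32_t)(wqe_base >> 32);
lo          = __shfl_sync(0xffffffff, lo, 0);
hi          = __shfl_sync(0xffffffff, hi, 0);
wqe_base    = ((uint64_t)hi << 32) | lo;
```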
