Conversation

@iyastreb (Contributor) commented Oct 6, 2025:

What?

Fixed 4 races in progress, plus a bug fix and a new test:

1. Read-after-write race on ep->sq_wqe_pi. By the time we read ep->sq_wqe_pi, it may already have been updated by another thread and may be larger than the current wqe_cnt. Subtracting sq_wqe_pi from wqe_cnt then gives a negative result, so we advance sq_wqe_pi by almost 64K and get wrong completion results (see the worked example after this list).

2. CQE race.
   The maximum CQ size = WQ size (=1024 on rock).
   The problem happens when window_size * num_threads > CQ size. The root cause is that the CQ is polled simultaneously from multiple threads, and some threads advance much further than the others:

   • Thread1: starts processing CQE1
   • Thread2: starts processing CQE2
   • Thread2: posts a new WQE => hardware reuses CQE1
   • Thread1: reads from stale CQE1, which already belongs to hardware

   The fix is to validate the CQE after reading its content.
   Also increase the CQ size to reduce the probability of CQ overflow.

3. FC race.
   We used to check the PI for detecting forced completion:

   fc = doca_gpu_dev_verbs_wqe_idx_inc_mask(ep->sq_wqe_pi, ep->sq_wqe_num / 2);

   This is not reliable with many producing threads (256), because the PI is constantly updated by other threads and we can sometimes skip the forced update (when the PI jumps by more than half the queue). This leads to progress getting stuck for tests without the NODELAY flag and without completion.

4. Race between progress & reservation on sw_wqe_pi.
   Fix 1 alone does not fully solve the problem, because fetch_max does not work reliably when the counter is updated by other threads. The issue is solved with 2 counters; the performance impact will be checked in the ucx_perftest PR.

5. Fixed a bug with an uninitialized value for wqe_cnt:

   wqe_base = __shfl_sync(0xffffffff, wqe_base, 0);

   This may sync only the lower 32 bits of the value, while the upper 32 bits of wqe_base remain garbage, which leads to syndrome 68.

6. Added a litmus test for the fixes above.
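
Below is the worked example referenced in item 1, with illustrative numbers (not taken from a real run), showing how a stale read of the producer index turns into an almost-64K jump:

```cuda
#include <stdint.h>

/* Race 1, illustrative numbers: another thread has already advanced
 * ep->sq_wqe_pi past the wqe_counter of the CQE we are processing. */
static uint64_t race1_example(void)
{
    uint16_t wqe_cnt   = 0x0010; /* wqe_counter read from the CQE           */
    uint64_t sq_wqe_pi = 0x0012; /* already advanced by a concurrent thread */

    /* (wqe_cnt - sq_wqe_pi) & 0xffff == 0xfffe: the "negative" delta wraps
     * around, so the naive update advances the PI by almost 64K (to 0x10011) */
    return ((wqe_cnt - sq_wqe_pi) & 0xffff) + sq_wqe_pi + 1;
}
```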

Performance impact:

ucp_put_single_bw latency
Threads  BEFORE  AFTER  DIFF
1        7.6     7.7    1.3%
32       159     166    4.4%
64       217     225    3.7%
256      767     769    0.3%
512      1888    1890   0.1%

@iyastreb changed the title from "UCP/PERF: Fixed race in progress" to "UCP/PERF: Fixed races in progress" on Oct 6, 2025
sq_wqe_pi = ((wqe_cnt - sq_wqe_pi) & 0xffff) + sq_wqe_pi + 1;
/* Skip CQE if it's older than current producer index, could be already
* processed by another thread */
if (wqe_cnt < (sq_wqe_pi & 0xffff)) {
Contributor:

If sq_wqe_pi gets the value 0xffff, all completions with wqe_count != 0xffff will be dropped. Can't this get progress stuck?

Contributor Author:

In this scenario getting stuck is possible if for some reason we miss the update with wqe_count=0xffff.
This shouldn't happen; according to the algorithm we must not miss this update, because we consume CQ elements in serial order.
But I'm adding a litmus test that runs with a large number of threads and checks the invariants at the end, to make sure that this is not possible.

Contributor Author:

The litmus test actually proves that this change does get progress stuck in the scenario where we push many requests without completion: wqe_cnt jumps by 512, not by one, so we miss the update with the exact number and get stuck.
I'm modifying the algorithm.
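
For reference, a worked example with illustrative numbers of the batched-completion case that a delta-based update handles and an exact-index check misses:

```cuda
#include <stdint.h>

/* Illustrative numbers: 512 WQEs are posted without completions and all get
 * acknowledged by a single CQE, so wqe_cnt jumps far ahead of the cached PI.
 * Waiting for the exact next index misses this; the masked delta does not. */
static uint64_t batch_completion_example(void)
{
    uint16_t sq_wqe_pi = 0x0100;                         /* cached PI (low 16 bits)    */
    uint16_t wqe_cnt   = 0x02ff;                         /* wqe_counter from the CQE   */
    uint16_t delta     = (wqe_cnt - sq_wqe_pi) & 0xffff; /* 0x01ff earlier WQEs done   */

    return (uint64_t)sq_wqe_pi + delta + 1;              /* 0x0300: one past the batch */
}
```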


/* Add extra space for CQE to avoid override. This extra space is equal to
* the number of threads in a warp = maximum amount of threads that can
* read/write CQE simultaneously */
Contributor:

What if CQEs are being read from multiple warps?

Contributor Author:

At WARP level I don't see any problem (tested with up to 16 warps, 512 threads in total), because only the first thread of a warp does progress and there are no other warps running concurrently. We have the maximum concurrency at THREAD level.

Contributor:

Can you confirm this behavior with some documentation, or is it a purely experimental suggestion? Would it be safer to make cq_length twice as large as wq_length?

Contributor Author:

Yes, it would be safer to have a larger size.
The litmus test that I'm working on should reveal the potential issues; currently it gets stuck in some cases.

This logic is mostly driven by experiments, but I found confirmation in the PRM:

Producer Counter – A counter maintained by hardware and is incremented
for every CQE that is written to the CQ.
If CQ is full while another CQE needs to be posted to the CQ, and if overflow detection is
disabled, then old CQEs may be overwritten.

We do explicitly disable overflow detection:

cq_attr.flags      |= UCT_IB_MLX5_CQ_IGNORE_OVERRUN;

@ofirfarjun7 (Contributor) commented Oct 7, 2025:

Maybe we should start with synchronized post-WQE & process-CQE, maybe with a simple lock. It's probably bad for performance, but at least we'd know for sure that it solves the issue (if our analysis is correct).

I'm not saying we will merge it (we'd need to check performance), but once we see that it's stable, it will prove we identified all the problems correctly, and then we can start thinking of ways to improve it.

Contributor Author:

I think I have identified all the issues; I fixed 2 more races and will push soon.
IMO a lock is not an option.
What's important is that we now have a test that exercises all the flows (for now only for a single API, but I will extend it) under heavy load and indicates whether there are issues. So if my fixes are not enough, it will show that.

Contributor Author:

Discussed the lock approach with @Artemy-Mellanox, implemented it and evaluated it.
Basically the performance of the lock approach is not terribly slower.
In the worst case (each post comes with a completion, 256 threads) the difference is:

(1165 ms) with CAS
(1798 ms) with lock   +50%

For warps and workflows without explicit completions I don't see much difference, maybe within +5% max.
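
For illustration, a minimal sketch of the kind of lock-based variant that was evaluated, assuming a plain device-scope spinlock; the names and structure are hypothetical, not the code of this PR:

```cuda
/* Hypothetical device-wide lock serializing WQE posting and CQ polling. */
__device__ int gda_progress_lock = 0;

__device__ void gda_locked_progress(void)
{
    /* Spin until we own the lock; relies on independent thread scheduling
     * (Volta+) so that spinning threads cannot starve the lock holder. */
    while (atomicCAS(&gda_progress_lock, 0, 1) != 0) {
        /* busy wait */
    }
    __threadfence(); /* acquire: order the critical section after the lock */

    /* ... post WQE and/or poll the CQ here, fully serialized ... */

    __threadfence(); /* release: publish all updates before dropping the lock */
    atomicExch(&gda_progress_lock, 0);
}
```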


uint16_t completed_delta = (wqe_cnt - (uint16_t)sq_wqe_pi) & 0xffff;
new_pi = sq_wqe_pi + completed_delta + 1;
} while (!pi_ref.compare_exchange_weak(sq_wqe_pi, new_pi,
Contributor:

Assigning a value to sq_wqe_pi without checking the opcode for errors may lead to reporting success for a request that actually failed.
Also, this while loop looks strange: if sq_wqe_pi changed since it was loaded, it won't go back, so why retry?

Contributor Author:

Right, I didn't spend much time thinking about the error code; I mostly fixed the happy path.
Maybe we should use a separate variable for the error code to keep it simple?

Regarding the while loop, I'm not sure I understood the question.
The goal is to correctly increase the sq_wqe_pi counter based on the latest value, not on a stale copy. That is what CAS does: on failure it also updates sq_wqe_pi to the latest value.
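
For reference, a self-contained sketch of the retry loop being discussed, mirroring the quoted snippet (the cuda::atomic_ref setup is an assumption about how pi_ref is obtained, not taken from the PR):

```cuda
#include <stdint.h>
#include <cuda/atomic>

/* Advance the 64-bit producer index from the 16-bit wqe_counter of a CQE.
 * On CAS failure, compare_exchange_weak reloads sq_wqe_pi with the latest
 * value, so completed_delta is always recomputed against fresh data. */
__device__ void gda_advance_pi(uint64_t *pi, uint16_t wqe_cnt)
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_device> pi_ref(*pi);
    uint64_t sq_wqe_pi = pi_ref.load(cuda::std::memory_order_relaxed);
    uint64_t new_pi;

    do {
        uint16_t completed_delta = (wqe_cnt - (uint16_t)sq_wqe_pi) & 0xffff;
        new_pi = sq_wqe_pi + completed_delta + 1;
    } while (!pi_ref.compare_exchange_weak(sq_wqe_pi, new_pi,
                                           cuda::std::memory_order_relaxed));
}
```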

  */
  init_attr.cq_len[UCT_IB_DIR_TX] = iface->super.super.config.tx_qp_len *
-                                   UCT_IB_MLX5_MAX_BB;
+                                   UCT_IB_MLX5_MAX_BB * 2;
Contributor:

Not sure it completely solves the issue. This fix assumes that the distance between consumed entries can't exceed the queue size, but what makes this assumption correct? AFAIU, if consumers are not synchronized, the distance is not bounded.

Contributor Author:

Yes, this fix is not perfect.
As written above, the assumption is that the CQ length must be at least [WQ len + num_threads].
This is because we need to give concurrent readers/writers enough space so that they never intersect.

Maybe we fix it like:

#define MAX_THREADS 1024

cq_len = (tx_qp_len * MAX_BB) + MAX_THREADS;

Contributor Author:

Yes, you are absolutely right: faster threads may post multiple WQEs while slower ones are still inside the progress.
We need to find a real solution; this one (increasing the CQ size) does not fully solve it.

Contributor Author:

I have implemented the real solution: validate the CQE after reading its content.
I tested it with the original CQ size = TX size, and it always works.

Contributor:

So do we still need the CQ with extended length? From what I see, it's still in the code.

Contributor Author:

Yes, I updated the comment above to explain the reason:

/* CQE is being read/updated simultaneously by multiple threads.
     * Overflow detection is disabled for CQ, but overflow is handled in
     * progress by validating the CQE after reading the content.
     * We give CQ extra space (x2) to reduce the probability of CQ overflows.
     */

Contributor:

"We give CQ extra space (x2) to reduce the probability of CQ overflows" - so it's not required because we validate CQE after reading right?
Does keep the CQ size larger improve performance?

Contributor Author:

In fact I didn't manage to see any performance improvement from this size, so I reverted this change.

- uint64_t sq_wqe_pi = ep->sq_wqe_pi;
- sq_wqe_pi = ((wqe_cnt - sq_wqe_pi) & 0xffff) + sq_wqe_pi + 1;
+ uint64_t sq_wqe_pi = pi_ref.load(cuda::std::memory_order_relaxed);
+ uint64_t new_qwe_pi;
Contributor:

minor: new_wqe_pi?

Contributor Author:

fixed


/* Prevent reordering, validate that CQE is still valid */
__threadfence_block();
uint8_t op_owner_check = READ_ONCE(cqe64->op_own);
@ofirfarjun7 (Contributor) commented Oct 12, 2025:

Not sure I understand how it helps.
HW can take ownership of this CQE while/after we read wqe_cnt and then release it back to SW, so I'm not sure we can consider it "valid" in that case.

Maybe we can read cqe64->op_own and cqe64->wqe_counter atomically in a single op? If yes, maybe it's better/safer?

struct mlx5_cqe64 {
	u8		outer_l3_tunneled;
...
...
...
	__be16		wqe_counter;
	u8		signature;
	u8		op_own;
};
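
A minimal sketch of what such a single-load read could look like (the helper name is hypothetical; offsets follow the struct mlx5_cqe64 layout quoted above, with wqe_counter at byte 60 and op_own at byte 63 of the 64-byte, naturally aligned CQE):

```cuda
#include <stdint.h>

/* Read the last 8 bytes of the 64-byte CQE with one 64-bit load, so that
 * wqe_counter and op_own are taken from the same snapshot of the entry. */
__device__ void gda_read_cqe_tail(const void *cqe64, uint16_t *wqe_cnt,
                                  uint8_t *op_own)
{
    const volatile uint64_t *tail_ptr =
        (const volatile uint64_t *)((const uint8_t *)cqe64 + 56);
    uint64_t tail       = *tail_ptr;
    uint16_t wqe_cnt_be = (uint16_t)(tail >> 32); /* bytes 60-61, big-endian */

    /* CQE fields are big-endian, the GPU is little-endian */
    *wqe_cnt = (uint16_t)((wqe_cnt_be >> 8) | (wqe_cnt_be << 8));
    *op_own  = (uint8_t)(tail >> 56);             /* byte 63 */
}
```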

Contributor Author:

That's a great idea that can simplify the code and improve performance.
Implementing it

Contributor Author:

It greatly simplified the code, but surprisingly it does not bring any performance boost.

Contributor Author:

Actually it made things faster at lower thread counts.
With 128-256 threads the latency stays the same, but with 1-16 threads it reduced the overhead of this PR by half.
I updated the perf impact numbers in the PR description.

if (lane_id == 0) {
wqe_base = uct_rc_mlx5_gda_reserv_wqe_thread(ep, count);
} else {
/* Initialize with 0, because __shfl_sync may set only 32 bits */
Contributor:

Shouldn't we shuffle the higher 32 bits too?

Contributor Author:

ok, done
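
For reference, a sketch of the resulting broadcast: shuffle both 32-bit halves of the 64-bit wqe_base from lane 0 to the rest of the warp (an illustration of the fix discussed in this thread, using the names from the quoted snippet above, not the exact code):

```cuda
/* Broadcast the 64-bit wqe_base reserved by lane 0 to the whole warp by
 * shuffling both 32-bit halves, so the upper bits are no longer garbage. */
uint64_t wqe_base = 0; /* initialize as in the quoted snippet */
if (lane_id == 0) {
    wqe_base = uct_rc_mlx5_gda_reserv_wqe_thread(ep, count);
}

uint32_t lo = (uint32_t)wqe_base;
uint32_t hi = (uint32_t)(wqe_base >> 32);
lo          = __shfl_sync(0xffffffff, lo, 0);
hi          = __shfl_sync(0xffffffff, hi, 0);
wqe_base    = ((uint64_t)hi << 32) | lo;
```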
