UCP/PERF: Fixed races in progress #10935
Conversation
src/uct/ib/mlx5/gdaki/gdaki.cuh
Outdated
sq_wqe_pi = ((wqe_cnt - sq_wqe_pi) & 0xffff) + sq_wqe_pi + 1;
/* Skip CQE if it's older than current producer index, could be already
 * processed by another thread */
if (wqe_cnt < (sq_wqe_pi & 0xffff)) {
If sq_wqe_pi gets the value 0xffff, all completions with wqe_count != 0xffff will be dropped. Can't this get progress stuck?
In that scenario a stall is possible only if for some reason we miss the update with wqe_count = 0xffff.
That shouldn't happen: according to the algorithm we must not miss this update, because we consume CQ elements in serial order.
But I'm adding a litmus test that runs with a large number of threads and checks the invariants at the end, to make sure this is not possible.
The litmus test actually proves that this change can stall progress: when we push many requests without completions, wqe_cnt jumps by 512 rather than by one, so we miss the update with the exact number and get stuck.
I'm modifying the algorithm.
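For reference, here is a minimal standalone sketch of the arithmetic being discussed (plain C, illustrative only; advance_pi is a hypothetical helper mirroring the snippet above, not the gdaki code):

#include <stdint.h>
#include <stdio.h>

/* Advance of the 64-bit producer index from the 16-bit CQE counter,
 * mirroring the snippet above (illustration only). */
static uint64_t advance_pi(uint64_t sq_wqe_pi, uint16_t wqe_cnt)
{
    return ((wqe_cnt - sq_wqe_pi) & 0xffff) + sq_wqe_pi + 1;
}

int main(void)
{
    /* Completions can be batched: one CQE may report wqe_cnt = 511 after
     * 512 posts, so the index jumps by 512, not by 1, and any check that
     * waits for an exact counter value is skipped. */
    uint64_t pi = advance_pi(0, 511);
    printf("pi = %llu\n", (unsigned long long)pi);   /* 512 */

    /* Once pi & 0xffff == 0xffff, the check
     * "skip if wqe_cnt < (pi & 0xffff)" drops every CQE whose counter is
     * below 0xffff -- the stuck scenario raised above. */
    pi = advance_pi(0xfffe, 0xfffe);
    printf("pi = 0x%llx\n", (unsigned long long)pi); /* 0xffff */
    return 0;
}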
src/uct/ib/mlx5/gdaki/gdaki.c
Outdated
/* Add extra space for CQE to avoid override. This extra space is equal to
 * the number of threads in a warp = maximum amount of threads that can
 * read/write CQE simultaneously */
What if CQEs are being read from multiple warps?
At WARP level I don't see any problem (tested up to 16 warps, 512 threads in total), because only the first thread of a warp does progress and there are no other warps running it concurrently. The maximum concurrency is at THREAD level.
Can you confirm this behavior with some documentation, or is it a purely experimental conclusion? Would it be safer to make cq_length twice as large as wq_length?
Yes, it would be safer to have a larger size.
The litmus test that I'm working on should reveal the potential issues. Currently it gets stuck in some cases.
This logic is mostly driven by experiments, but I found confirmation in the PRM:
Producer Counter – A counter maintained by hardware and is incremented
for every CQE that is written to the CQ.
If CQ is full while another CQE needs to be posted to the CQ, and if overflow detection is
disabled, then old CQEs may be overwritten.
We do explicitly disable overflow detection:
cq_attr.flags |= UCT_IB_MLX5_CQ_IGNORE_OVERRUN;
Maybe we should start with synchronized post-WQE & process-CQE, perhaps with a simple lock. It's probably bad for performance, but at least we would know for sure that it solves the issue (if our analysis is correct).
I'm not saying we would merge it (we'd need to check performance), but once we see that it's stable, it will prove we identified all the problems correctly, and then we can start thinking of ways to improve it.
I think I have identified all the issues, fixed 2 more races, and will push soon.
IMO a lock is not an option.
What's important is that we now have a test that exercises all the flows (for now only for a single API, but I will extend it) under heavy load and indicates whether there are issues. So if my fixes are not enough, it will show that.
Discussed the lock approach with @Artemy-Mellanox, implemented it and evaluated it.
Basically the performance of the lock approach is not terribly slower.
In the worst case (each post comes with a completion, 256 threads) the difference is:
1165 ms with CAS
1798 ms with lock (+50%)
For warps and workflows without explicit completions I don't see much difference, maybe within +5% at most.
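For context, the lock variant measured here boils down to serializing post/progress behind a per-EP device spinlock. A minimal sketch of that pattern follows (the helper names and the per-EP lock field are illustrative, not the actual gdaki code):

/* Minimal device spinlock sketch; 'lock' is a per-EP int initialized to 0. */
__device__ void ep_lock(int *lock)
{
    /* Spin until we flip 0 -> 1. */
    while (atomicCAS(lock, 0, 1) != 0) {
        /* busy-wait */
    }
    __threadfence();   /* acquire: order subsequent reads after taking the lock */
}

__device__ void ep_unlock(int *lock)
{
    __threadfence();   /* release: flush writes before dropping the lock */
    atomicExch(lock, 0);
}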
src/uct/ib/mlx5/gdaki/gdaki.cuh
Outdated
uint16_t completed_delta = (wqe_cnt - (uint16_t)sq_wqe_pi) & 0xffff;
new_pi = sq_wqe_pi + completed_delta + 1;
} while (!pi_ref.compare_exchange_weak(sq_wqe_pi, new_pi,
Assigning a value to sq_wqe_pi without checking the opcode for errors may lead to reporting success on a request that actually failed.
Also, this while loop looks strange: if sq_wqe_pi changed since it was loaded, it won't go back, so why retry?
Right, I didn't spend much time thinking about the error code; I mostly fixed the happy path.
Maybe we should use a separate variable for the error code to keep it simple?
Regarding the while loop, I'm not sure I understood the question. The goal is to properly increase the sq_wqe_pi counter based on the latest value, not on a stale copy. That is exactly what CAS does, and on failure it also refreshes sq_wqe_pi with the latest value.
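In other words, the retry pattern is roughly the condensed sketch below (assuming pi_ref is a cuda::atomic_ref over ep->sq_wqe_pi, as in the snippet above; error handling omitted, so this is illustrative rather than the exact code):

#include <cstdint>
#include <cuda/atomic>

/* Condensed sketch of the CAS retry discussed above. */
__device__ void gda_advance_pi(
        cuda::atomic_ref<uint64_t, cuda::thread_scope_device> pi_ref,
        uint16_t wqe_cnt)
{
    uint64_t sq_wqe_pi = pi_ref.load(cuda::std::memory_order_relaxed);
    uint64_t new_pi;

    do {
        /* Delta between the CQE counter and the current producer index. */
        uint16_t completed_delta = (wqe_cnt - (uint16_t)sq_wqe_pi) & 0xffff;
        new_pi = sq_wqe_pi + completed_delta + 1;
        /* On failure compare_exchange_weak reloads sq_wqe_pi with the
         * current value, so the delta is recomputed against fresh state
         * rather than a stale copy. */
    } while (!pi_ref.compare_exchange_weak(sq_wqe_pi, new_pi,
                                           cuda::std::memory_order_relaxed));
}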
src/uct/ib/mlx5/gdaki/gdaki.c
Outdated
*/
init_attr.cq_len[UCT_IB_DIR_TX] = iface->super.super.config.tx_qp_len *
-                                 UCT_IB_MLX5_MAX_BB;
+                                 UCT_IB_MLX5_MAX_BB * 2;
Not sure it completely solves the issue: this fix assumes that the distance between consumed entries can't exceed the queue size, but what makes this assumption correct? AFAIU, if consumers are not synchronized, the distance is not bounded.
Yes, this fix is not perfect.
As written above, the assumption is that the CQ length must be at least [WQ len + num_threads], because we need to give concurrent readers/writers enough space so that they never intersect.
Maybe we fix it like:
#define MAX_THREADS 1024
cq_len = (tx_qp_len * MAX_BB) + MAX_THREADS;
Yes, you are absolutely right: faster threads may post multiple WQEs while slower ones are still inside the progress.
We need to find a real solution; this one (increasing the CQ size) does not fully solve it.
I have implemented the real solution: validate the CQE after reading its content.
I tested it with the original CQ size = TX size, and it always works.
So do we still need the CQ with the extended length? From what I see it's still in the code.
Yes, I updated the comment above to explain the reason:
/* CQE is being read/updated simultaneously by multiple threads.
* Overflow detection is disabled for CQ, but overflow is handled in
* progress by validating the CQE after reading the content.
* We give CQ extra space (x2) to reduce the probability of CQ overflows.
*/
"We give CQ extra space (x2) to reduce the probability of CQ overflows" - so it's not required because we validate CQE after reading right?
Does keep the CQ size larger improve performance?
In fact I didn't manage to see any performance improvement from the larger size, so I'm reverting this change.
src/uct/ib/mlx5/gdaki/gdaki.cuh
Outdated
-    uint64_t sq_wqe_pi = ep->sq_wqe_pi;
-    sq_wqe_pi = ((wqe_cnt - sq_wqe_pi) & 0xffff) + sq_wqe_pi + 1;
+    uint64_t sq_wqe_pi = pi_ref.load(cuda::std::memory_order_relaxed);
+    uint64_t new_qwe_pi;
minor: new_wqe_pi?
fixed
src/uct/ib/mlx5/gdaki/gdaki.cuh
Outdated
/* Prevent reordering, validate that CQE is still valid */
__threadfence_block();
uint8_t op_owner_check = READ_ONCE(cqe64->op_own);
Not sure I understand how it helps.
HW can take ownership of this CQE while/after we read wqe_cnt and release it back to SW, so I'm not sure we should consider it "valid" in that case.
Maybe we can read cqe64->op_own and cqe64->wqe_counter atomically in a single op? If yes, maybe it's better/safer?
struct mlx5_cqe64 {
u8 outer_l3_tunneled;
...
...
...
__be16 wqe_counter;
u8 signature;
u8 op_own;
};
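For illustration, the single-op idea could look roughly like the sketch below. This is only a sketch: it assumes the standard mlx5_cqe64 layout quoted above, where wqe_counter (be16), signature and op_own occupy the last 4 bytes of the 64-byte CQE, and a little-endian GPU.

#include <cstdint>

/* Read wqe_counter and op_own with one aligned 4-byte load so that both
 * come from the same snapshot of the CQE tail. */
__device__ void gda_read_cqe_tail(const void *cqe64, uint16_t *wqe_counter,
                                  uint8_t *op_own)
{
    /* Single load covering the last dword: wqe_counter, signature, op_own. */
    uint32_t v = *(const volatile uint32_t *)((const uint8_t *)cqe64 + 60);

    /* The GPU is little-endian, so byte 63 (op_own) lands in the top byte;
     * wqe_counter is big-endian on the wire and needs a byte swap. */
    *op_own      = (uint8_t)(v >> 24);
    *wqe_counter = (uint16_t)(((v & 0xff) << 8) | ((v >> 8) & 0xff));
}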
That's a great idea that can simplify the code and improve performance.
Implementing it
It greatly simplified the code, but surprisingly it does not bring any performance boost.
Actually it made it faster at lower thread counts.
With 128-256 threads the latency stays the same, but with 1-16 threads it reduced the overhead of this PR by half.
I updated the perf impact numbers in the PR description.
src/uct/ib/mlx5/gdaki/gdaki.cuh
Outdated
if (lane_id == 0) {
    wqe_base = uct_rc_mlx5_gda_reserv_wqe_thread(ep, count);
} else {
    /* Initialize with 0, because __shfl_sync may set only 32 bits */
shouldn't we shuffle higher 32 bits too?
ok, done
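For reference, one way to broadcast the full 64-bit wqe_base from lane 0 is to use the 64-bit __shfl_sync overload, so both halves travel together. The sketch below is illustrative only (the helper is hypothetical, not the actual gdaki code):

/* Broadcast the 64-bit value held by lane 0 to the whole warp. */
__device__ unsigned long long
gda_broadcast_wqe_base(unsigned lane_id, unsigned long long wqe_base_lane0)
{
    /* Initialize non-zero lanes with 0; the __shfl_sync overload for
     * unsigned long long then broadcasts all 64 bits from lane 0. */
    unsigned long long v = (lane_id == 0) ? wqe_base_lane0 : 0ull;
    return __shfl_sync(0xffffffffu, v, 0);
}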
What?
Fixed 4 races in progress:
1. Read-after-write for ep->sq_wqe_pi. When we read ep->sq_wqe_pi, it might already have been updated by another thread and may be larger than the current wqe_cnt, so subtracting sq_wqe_pi from wqe_cnt gives a negative result, we advance sq_wqe_pi by almost 64K, and we get wrong completion results (see the numeric sketch below).
2. CQE race. The maximum CQ size = WQ size (=1024 on rock). The problem happens when window_size * num_threads > CQ size. The root cause is that the CQ is polled simultaneously from multiple threads; some threads advance much further than others, so a slower thread can read a CQE that has already been overwritten. The fix is to validate the CQE after reading its content. Also increase the CQ size to reduce the probability of CQ overflow.
3. We used to check the PI for detecting forced completion. This is not reliable with many producing threads (256), because the PI is constantly being updated by other threads and sometimes we can skip the forced update (when the PI jumps by more than a half). This leads to progress getting stuck for tests without the NODELAY flag and without completion.
4. Race between progress & reservation on sq_wqe_pi. Fix #1 alone does not fully solve the problem, because fetch_max does not work reliably when the counter is updated from other threads. The issue is solved with 2 counters; will check the performance impact in the ucx_perftest PR.
Also fixed a bug with an uninitialized value for wqe_cnt: __shfl_sync may sync only 32 bits of the value, while the MSB of wqe_base remains garbage, which leads to syndrome 68.
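Numeric sketch for race (1) above, standalone and illustrative only:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Race (1): another thread already advanced sq_wqe_pi past this CQE. */
    uint64_t sq_wqe_pi = 100;   /* value observed after the concurrent update */
    uint16_t wqe_cnt   = 98;    /* counter taken from the (older) CQE */

    /* (98 - 100) wraps to 0xfffe, so the index jumps by ~64K instead of
     * staying put -- the wrong-completion scenario described above. */
    uint64_t bad_pi = ((wqe_cnt - sq_wqe_pi) & 0xffff) + sq_wqe_pi + 1;
    printf("bad_pi = %llu\n", (unsigned long long)bad_pi); /* 100 + 65534 + 1 */
    return 0;
}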
Performance impact: