Description
@tkf recently ran into a deadlock caused by synchronizing from branches that ended up divergent because the compiler performed union splitting. To remedy this, we could use `barrier.sync` instead of `barrier.sync.aligned` (aka. `bar.sync`, aka. `syncthreads`):
> Instruction barrier has optional .aligned modifier. When specified, it indicates that all threads in CTA will execute the same barrier instruction. In conditionally executed code, an aligned barrier instruction should only be used if it is known that all threads in CTA evaluate the condition identically, otherwise behavior is undefined.
I'm not sure it's safe to just have `syncthreads` use an unaligned barrier, because we might then inadvertently match unrelated barriers:
```julia
if foo
    sync_threads()  # thread 1 (foo is true) waits here
end
sync_threads()      # thread 2 (foo is false) waits here
```
Maybe that only matters for code that is already broken, though, since thread 1 would be released together with thread 2 and then just get stuck at the second barrier?
Also interesting: at the PTX level, barriers are identified by a 'name' (an integer in `0:15`) and take an optional count of participating threads, so we could expose a more fine-grained `sync` for use in divergent branches. Maybe we could use this to differentiate barriers, but then we'd need some sort of lexical information to assign a barrier name, and I'd rather not move to a `@sync_threads` macro.
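To make the named-barrier idea concrete, here is a minimal sketch in CUDA C++ using inline PTX. It is not a proposal for the actual API: `named_barrier_sync`, the kernel, and the choice of barrier names 1 and 2 are all hypothetical, and it assumes a toolchain that accepts `barrier.sync` (PTX ISA 6.0 or later). Each half of the block synchronizes on its own barrier from a different branch, so the two barriers cannot be matched against each other.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr unsigned GROUP = 64;   // threads per group (two warps)
constexpr unsigned BLOCK = 128;  // threads per block (two groups)

// Hypothetical helper: non-aligned named barrier via inline PTX.
// `name` must be in 0..15 (0 is the barrier used by __syncthreads());
// `count` is the number of participating threads, a multiple of the warp size.
__device__ __forceinline__ void named_barrier_sync(unsigned name, unsigned count) {
    asm volatile("barrier.sync %0, %1;" : : "r"(name), "r"(count) : "memory");
}

__global__ void exchange(int *out) {
    __shared__ int buf[BLOCK];
    const unsigned tid     = threadIdx.x;
    const unsigned base    = (tid / GROUP) * GROUP;
    // Partner lives in the other warp of the same group.
    const unsigned partner = base + (tid - base + 32) % GROUP;

    buf[tid] = static_cast<int>(tid);

    // The two groups take different branches and use different barrier names.
    if (tid < GROUP) {
        named_barrier_sync(1, GROUP);
    } else {
        named_barrier_sync(2, GROUP);
    }

    out[tid] = buf[partner];
}

int main() {
    int h_out[BLOCK];
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return 1;

    exchange<<<1, BLOCK>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every thread should have read the value written by its partner.
    for (unsigned i = 0; i < BLOCK; ++i) {
        const unsigned base = (i / GROUP) * GROUP;
        const int expected  = static_cast<int>(base + (i - base + 32) % GROUP);
        if (h_out[i] != expected) {
            std::printf("mismatch at thread %u\n", i);
            return 1;
        }
    }
    std::printf("ok\n");
    return 0;
}
```

Note that the names here are picked by hand, which is exactly the part that would need lexical information (or a macro) to automate. In this sketch whole warps take the same branch; barriers reached from branches that diverge within a warp additionally rely on the non-aligned semantics of recent architectures.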
cc @vchuravy