[FPSAN] Make scratch load/stores uncached, add cta barrier to global BarrierOp (#10055)
Conversation
FPSan uses profile scratch as an intra-CTA communication buffer when emulating dot and related operations. Make scratch accesses use volatile loads and write-through stores, and lower NVIDIA global-memory barriers to a CTA-scope membar before the CTA execution barrier. Co-authored-by: Codex <noreply@openai.com>
Why `.cta`? Does this work for multi-CTA kernels?
```cpp
patterns.add<BarrierOpConversion>(typeConverter,
                                  PatternBenefit(benefit.getBenefit() + 1));
```
Bit of a nit, but maybe it should go in third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/MemoryOpToLLVM.cpp so that there can't be a mismatch in benefits and it will always be higher priority than the common pattern.
good catch, not a nit at all!
```cpp
return StoreOp::create(b, loc, ptrTensor, tensor, Value(), CacheModifier::WT,
                       EvictionPolicy::NORMAL,
                       /*ignore_cta=*/true);
```
Are cache hints actually guaranteed to create synchronization? I think for correctness we should have an acquire-release pattern like:

```
store scratch
fence.release.cta
(bar.sync or other kernel-provided synchronization)
fence.acquire.cta
load scratch
```
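Spelled out at the PTX level, the suggested sequence would look roughly like the sketch below (register names, the `.b32` width, and the `[scratch]` address are illustrative placeholders, not taken from the PR; the `.wt` store hint and `ld.volatile` mirror what the patch emits, while the fences follow the reviewer's suggestion):

```
st.global.wt.b32 [scratch], %r0;   // write-through store of the producer's value
fence.release.cta;                 // release: make prior writes visible within the CTA
bar.sync 0;                        // CTA execution barrier (all threads arrive)
fence.acquire.cta;                 // acquire: order subsequent reads after the barrier
ld.volatile.global.b32 %r1, [scratch];  // uncached read of the other thread's value
```

The fences make the ordering explicit rather than relying on the cache hints alone, which is the reviewer's point: `.wt`/`.cg` are performance hints, not synchronization primitives.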
```cpp
Location loc = op.getLoc();
if (op.hasGlobalRead() || op.hasGlobalWrite()) {
  PTXBuilder ptxBuilder;
  auto &membar = *ptxBuilder.create("membar.cta");
```
The PTX spec for bar.sync states that:

> The barrier{.cta}.sync or barrier{.cta}.red or barrier{.cta}.arrive instruction guarantees that when the barrier completes, prior memory accesses requested by this thread are performed relative to all threads participating in the barrier. The barrier{.cta}.sync and barrier{.cta}.red instruction further guarantees that no new memory access is requested by this thread before the barrier completes.
Which to me suggests that an additional fence shouldn't be required. Is it possible that the issues are only happening in warp-specialized code where you can have the reads and writes happening in different warp partitions?
Interesting, thanks for digging that out! Doing more experiments to see what exactly is causing the failure.
So it seems that a fence release/acquire is sufficient for the flakiness to go away. But WS does not explain the issue, because test_dot_fma was flaking while not using WS at all. Not sure what to make of it.
There were two issues in fpsan that could cause racy behavior and nondeterministic failures:

- Scratch load/stores were cached.
- For BarrierOps with `ttg::AddrSpace::GlobalRead | ttg::AddrSpace::GlobalWrite` flags, the NVIDIA lowering was missing a memory barrier.

This PR:

- Makes scratch loads volatile and scratch stores write-through
- Emits `membar.cta` for BarrierOps with GlobalRead/GlobalWrite flags

With these changes, random failures in fpsan tests are no longer observed.