[FPSAN] Make scratch load/stores uncached, add cta barrier to global BarrierOp (#10055)
Conversation
FPSan uses profile scratch as an intra-CTA communication buffer when emulating dot and related operations. Make scratch accesses use volatile loads and write-through stores, and lower NVIDIA global-memory barriers to a CTA-scope membar before the CTA execution barrier. Co-authored-by: Codex <noreply@openai.com>
Why `.cta`? Does this work for multi-CTA kernels?
```cpp
patterns.add<BarrierOpConversion>(typeConverter,
                                  PatternBenefit(benefit.getBenefit() + 1));
```
Bit of a nit, but maybe it should go in third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/MemoryOpToLLVM.cpp so that there can't be a mismatch in benefits and it will always be higher priority than the common pattern.
good catch, not a nit at all!
```cpp
return StoreOp::create(b, loc, ptrTensor, tensor, Value(), CacheModifier::WT,
                       EvictionPolicy::NORMAL,
                       /*ignore_cta=*/true);
```
Are cache hints actually guaranteed to create synchronization? I think for correctness we should have an acquire-release pattern like:

```
store scratch
fence.release.cta
(bar.sync or other kernel-provided synchronization)
fence.acquire.cta
load scratch
```
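Spelled out at the PTX level, the suggested sequence would look roughly like the sketch below (register names, the `.b32` width, and the `[scratch]` address are illustrative placeholders, not taken from the PR; the `.wt` store hint and `ld.volatile` mirror what the patch emits, while the fences follow the reviewer's suggestion):

```
st.global.wt.b32 [scratch], %r0;   // write-through store of the producer's value
fence.release.cta;                 // release: make prior writes visible within the CTA
bar.sync 0;                        // CTA execution barrier (all threads arrive)
fence.acquire.cta;                 // acquire: order subsequent reads after the barrier
ld.volatile.global.b32 %r1, [scratch];  // uncached read of the other thread's value
```

The fences make the ordering explicit rather than relying on the cache hints alone, which is the reviewer's point: `.wt`/`.cg` are performance hints, not synchronization primitives.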
```cpp
Location loc = op.getLoc();
if (op.hasGlobalRead() || op.hasGlobalWrite()) {
  PTXBuilder ptxBuilder;
  auto &membar = *ptxBuilder.create("membar.cta");
```
The PTX spec for bar.sync states that:

> The barrier{.cta}.sync or barrier{.cta}.red or barrier{.cta}.arrive instruction guarantees that when the barrier completes, prior memory accesses requested by this thread are performed relative to all threads participating in the barrier. The barrier{.cta}.sync and barrier{.cta}.red instruction further guarantees that no new memory access is requested by this thread before the barrier completes.
Which to me suggests that an additional fence shouldn't be required. Is it possible that the issues are only happening in warp-specialized code where you can have the reads and writes happening in different warp partitions?
Interesting, thanks for digging that out! Doing more experiments to see what exactly is causing the failure.
So it seems that a fence release/acquire is sufficient for the flakiness to go away. But WS does not explain the issue, because test_dot_fma was flaking while not using WS at all. Not sure what to make of it.
There were two issues in fpsan that could cause racy behavior and nondeterministic failures:

- Scratch load/stores were cached.
- For BarrierOps with `ttg::AddrSpace::GlobalRead | ttg::AddrSpace::GlobalWrite` flags, the NVIDIA lowering was missing a memory barrier.

This PR:

- Makes scratch loads volatile and scratch stores write-through
- Emits `membar.cta` for BarrierOps with GlobalRead/GlobalWrite flags

With these changes, random failures in fpsan tests are no longer observed.