[FPSAN] Make scratch load/stores uncached, add cta barrier to global BarrierOp#10055

Open
pawelszczerbuk wants to merge 2 commits into triton-lang:main from pawelszczerbuk:pawel/fpsan_scratch_sync_fix

Conversation

Contributor

@pawelszczerbuk pawelszczerbuk commented Apr 16, 2026

There were two issues in fpsan that could cause racy behavior and nondeterministic failures:

  1. The barriers that fpsan inserts after scratch loads and stores were synchronizing execution, but despite carrying the ttg::AddrSpace::GlobalRead | ttg::AddrSpace::GlobalWrite flags, the NVIDIA lowering was missing a memory barrier.
  2. Loads and stores were going through the cache, so a write from one thread might not have been observable by another thread even when correctly synchronized with mbarriers.

This PR:

  • Adds a membar.cta for BarrierOps with GlobalRead/GlobalWrite flags
  • Uses write-through stores for scratch access
  • Uses volatile reads for scratch access

With these changes, the random failures in the fpsan tests are no longer observed.

root and others added 2 commits April 16, 2026 08:52
FPSan uses profile scratch as an intra-CTA communication buffer when emulating dot and related operations. Make scratch accesses use volatile loads and write-through stores, and lower NVIDIA global-memory barriers to a CTA-scope membar before the CTA execution barrier.

Co-authored-by: Codex <noreply@openai.com>
@lezcano
Contributor

lezcano commented Apr 16, 2026

Why .cta? Does this work for multi-CTA kernels?

Comment on lines +590 to +591
patterns.add<BarrierOpConversion>(typeConverter,
PatternBenefit(benefit.getBenefit() + 1));
Collaborator


Bit of a nit, but maybe it should go in third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/MemoryOpToLLVM.cpp so that there can't be a mismatch in benefits and it will always be higher priority than the common pattern.

Contributor Author


good catch, not a nit at all!

Comment on lines +407 to 409
return StoreOp::create(b, loc, ptrTensor, tensor, Value(), CacheModifier::WT,
EvictionPolicy::NORMAL,
/*ignore_cta=*/true);
Contributor


Are cache hints actually guaranteed to create synchronization? I think for correctness we should have an acquire-release pattern like:

store scratch
fence.release.cta
(bar.sync or other kernel provided synchronization)
fence.acquire.cta
load scratch

Location loc = op.getLoc();
if (op.hasGlobalRead() || op.hasGlobalWrite()) {
PTXBuilder ptxBuilder;
auto &membar = *ptxBuilder.create("membar.cta");
Contributor


The ptx spec for bar.sync states that:

The barrier{.cta}.sync or barrier{.cta}.red or barrier{.cta}.arrive instruction guarantees that when the barrier completes, prior memory accesses requested by this thread are performed relative to all threads participating in the barrier. The barrier{.cta}.sync and barrier{.cta}.red instruction further guarantees that no new memory access is requested by this thread before the barrier completes.

Which to me suggests that an additional fence shouldn't be required. Is it possible that the issues are only happening in warp-specialized code where you can have the reads and writes happening in different warp partitions?

Contributor Author


Interesting, thanks for digging that out! Doing more experiments to see what exactly is causing the failure.

Contributor Author


So it seems that the fence release/acquire is sufficient for the flakiness to go away. But warp specialization does not explain the issue, because test_dot_fma was flaking while not using WS at all. Not sure what to make of it.
