CLC gets its own partition, running over threads 48-63.
We model CLC as we model TMA writes, via a Barrier::EffectWrites.
In this mode we link all the writes of the op to the barrier. We also
record in the `barrierWriteRecipients` table which CTAs' writes become
visible once we wait on the associated barrier.
We also document when each tracking mode applies.
`BarrierTrackingMode::Frontier` should be used when we have a
commit/arrive/expect op that affects anything in flight before it.
By contrast, we use `BarrierTrackingMode::EffectWrites` when the PTX op
accepts a barrier, so the barrier only signals the completion of that
op's particular write.
We also add a flag `bool diagonalEffectRecipientCTAs`. It distinguishes
TMA, where after waiting on the barrier you see the writes from all the
CTAs in the multicast group, from the diagonal case, as in CLC, where
waiting on CTAi only makes the thread see CTAi's memory.
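
The difference between the two recipient shapes can be sketched as follows. This is a minimal illustration, not code from the patch: the helper name `effectRecipientCTAs`, the `groupMask` parameter, and the "bit i means CTA i" encoding are assumptions made for the example; only the flag name `diagonalEffectRecipientCTAs` comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the patch). Returns the bitset of CTAs whose
// writes become visible after waiting on the barrier that belongs to
// `waitingCta`. Assumed encoding: bit i of `groupMask` set => CTA i takes
// part in the op.
inline uint32_t effectRecipientCTAs(uint32_t groupMask, unsigned waitingCta,
                                    bool diagonalEffectRecipientCTAs) {
  if (diagonalEffectRecipientCTAs) {
    // Diagonal case (as described for CLC): waiting on CTA i's barrier
    // only publishes CTA i's memory.
    return groupMask & (uint32_t{1} << waitingCta);
  }
  // Multicast case (as described for TMA): the wait publishes the writes
  // of every CTA in the group.
  return groupMask;
}

// Small usage check for a four-CTA group, waiting on CTA 2.
inline void exampleRecipients() {
  assert(effectRecipientCTAs(0xFu, 2, /*diagonal=*/true) == 0x4u);
  assert(effectRecipientCTAs(0xFu, 2, /*diagonal=*/false) == 0xFu);
}
```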
include/triton/Dialect/TritonInstrument/IR/TritonInstrument.md
5 additions & 5 deletions
@@ -9,10 +9,10 @@ Auxiliary state is kept in distributed tensors and global scratch memory, with t
 ### Thread model

 - Base threads: 16 warp-specialization (WS) threads (allowing for up to 16 partitions).
-- Peer classes: +16 Tensor Core (TC) threads and +16 TMA threads to model lack of ordering with base threads.
-- Total logical threads: 48. Bitmasks are sized to the next power of two: 64.
+- Peer classes: +16 TMA threads, +16 Tensor Core (TC) threads, and +16 CLC threads to model lack of ordering with base threads.
+- Total logical threads: 64.

-Indexing uses a logical thread id in [0, 48), with column vectors sized to 64 for layout convenience.
+Indexing uses a logical thread id in [0, 64), with column vectors sized to 64 for layout convenience.

 ## Auxiliary data structures
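
The updated thread model can be sketched as a simple id mapping. This is an illustration under stated assumptions: the commit message places CLC at threads 48-63 and base threads presumably start at 0, but the TMA-then-TC ordering of the middle two blocks is inferred from the order in the doc text, and the names `ThreadClass`, `kThreadsPerClass`, and `logicalThreadId` are hypothetical.

```cpp
#include <cassert>

// Assumed class order; only the CLC range (48-63) is stated in the commit.
enum class ThreadClass : int { Base = 0, TMA = 1, TC = 2, CLC = 3 };

constexpr int kThreadsPerClass = 16;
constexpr int kTotalLogicalThreads = 4 * kThreadsPerClass;

// Hypothetical mapping from (class, partition) to a logical thread id.
constexpr int logicalThreadId(ThreadClass cls, int partition) {
  return static_cast<int>(cls) * kThreadsPerClass + partition;
}

static_assert(kTotalLogicalThreads == 64, "ids fill the 64-bit bitmask");
static_assert(logicalThreadId(ThreadClass::CLC, 0) == 48, "first CLC id");
static_assert(logicalThreadId(ThreadClass::CLC, 15) == 63, "last CLC id");

// Small usage check: base partition 5 keeps id 5.
inline void exampleThreadIds() {
  assert(logicalThreadId(ThreadClass::Base, 5) == 5);
}
```

With four classes of 16 threads, the logical ids fill the 64-bit mask exactly, which is why the "next power of two" remark is no longer needed in the doc.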
@@ -21,7 +21,7 @@ All types are generated on-demand (per partition) based on:
 - B: number of tracked buffers (power-of-two padded)
 - K: number of mbarriers (power-of-two padded)
 - T_bits: 64 (bitmask width)
-- T_commits: 16 (base threads; commit counters do not apply to TC/TMA helpers)
+- T_commits: 16 (base threads; commit counters do not apply to TC/TMA/CLC helpers)

 “tensor” means a distributed Triton tensor; “scratch” means a pointer into global scratch memory. Shapes below are logical; actual encodings are partition-local blocked layouts.
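
The "power-of-two padded" sizes for B and K can be sketched as below. The helper `padToPowerOfTwo` is hypothetical and only shows the rounding; how ConSan actually computes the padded sizes is not given in this diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest power of two >= n (returns 1 for n == 0).
// Assumes n <= 2^31 so that the shift cannot overflow.
inline uint32_t padToPowerOfTwo(uint32_t n) {
  uint32_t p = 1;
  while (p < n)
    p <<= 1;
  return p;
}

// Small usage check: 5 tracked buffers would be padded to 8 rows.
inline void examplePadding() {
  assert(padToPowerOfTwo(5) == 8);
  assert(padToPowerOfTwo(8) == 8);
}
```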
@@ -53,7 +53,7 @@ ConSan separates “tracking” from “visibility transfer”:
 - experimental_set_read_visibility / experimental_set_write_visibility updates the appropriate visibility table for the current thread and buffer.
 - experimental_track_visible_reads / experimental_track_visible_writes snapshots current per-buffer visibility into readTracking/writeTracking for the given barrier.
 - At arrive/commit sites (e.g., tc commit, arrive on mbarrier): ConSan emits the track ops for both reads and writes.
-- At waits: experimental_transfer_visible_reads / experimental_transfer_visible_writes propagates tracked visibility from the barrier back into the waiting thread’s visibility, and this transfer is repeated to peer threads (base, TMA, TC) to keep the three classes consistent.
+- At waits: experimental_transfer_visible_reads / experimental_transfer_visible_writes propagates tracked visibility from the barrier back into the waiting thread’s visibility, and this transfer is repeated to peer threads (base, TMA, TC, CLC) to keep the classes consistent.
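
The wait-side transfer to peer threads can be sketched on the host as plain bitmask updates. This is a deliberately simplified model, not the emitted IR: it assumes peers of partition p sit at ids p, p+16, p+32, and p+48, represents the barrier's tracking as one flag per buffer, and uses hypothetical names (`peerMask`, `transferVisibleWrites`, `trackedByBarrier`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kPeerStride = 16;     // threads per class (from the doc)
constexpr int kNumPeerClasses = 4;  // base, TMA, TC, CLC

// Bitmask with one bit per peer of `partition` (assumed id layout).
inline uint64_t peerMask(int partition) {
  uint64_t mask = 0;
  for (int cls = 0; cls < kNumPeerClasses; ++cls)
    mask |= uint64_t{1} << (partition + cls * kPeerStride);
  return mask;
}

// Simplified transfer at a wait: every buffer whose writes the barrier
// tracked becomes visible to the waiting partition and all of its peers.
inline void transferVisibleWrites(std::vector<uint64_t> &writeVisibility,
                                  const std::vector<bool> &trackedByBarrier,
                                  int waitingPartition) {
  const uint64_t mask = peerMask(waitingPartition);
  for (std::size_t b = 0; b < writeVisibility.size(); ++b)
    if (trackedByBarrier[b])
      writeVisibility[b] |= mask;
}

// Small usage check: partition 0 and its three peers.
inline void examplePeerMask() {
  assert(peerMask(0) == 0x0001000100010001ULL);
}
```

Adding CLC as a fourth class only widens the peer mask from three bits to four per partition; the transfer loop itself is unchanged in this model.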
lib/Dialect/TritonInstrument/Transforms/ConcurrencySanitizer.cpp
19 additions & 3 deletions
@@ -23,7 +23,7 @@
 // buffers | tensor | <C x B x i64> | Base pointers of all (sub)buffers
 // barriers | tensor | <C x K x i64> | Pointers to all individual mbarriers
 // barrierStates | scratch | <C x K x i64> | Packed barrier phase (bit 0), arrival counts (bits[1..20] init, [21..40] current), and signed tx-count (bits[41..61]); zero means invalid/uninitialized
-// barrierWriteRecipients | scratch | <C x K x i32> | CTA bitsets of write-tracking rows reached by outstanding TMA effects on each barrier
+// barrierWriteRecipients | scratch | <C x K x i32> | CTA bitsets of EffectWrites rows published by each barrier
 // waiting | scratch | <C x K x i32> | Two bits per thread: waiting flag bit (LSB), stored phase bit (bit 1)
 // writeVisibility | scratch | <C x B x i64> | Per-buffer thread-visibility bitmask (bit i => thread i visible)
 // readVisibility | scratch | <C x B x T x i64> | Per-buffer, per-thread visibility lanes (row-updated; values are bitmasks)
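
The `barrierStates` packing described in the table comment can be sketched with a pack/unpack pair. The bit ranges are taken from the comment; the struct `BarrierState`, the helper names, and the two's-complement encoding of the signed tx-count field are assumptions for illustration, not the pass's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical unpacked view of one barrierStates entry.
struct BarrierState {
  bool phase;             // bit 0
  uint32_t initCount;     // bits [1..20]
  uint32_t currentCount;  // bits [21..40]
  int32_t txCount;        // bits [41..61], signed (assumed two's complement)
};

constexpr uint64_t kCountMask = (uint64_t{1} << 20) - 1;  // 20-bit fields
constexpr uint64_t kTxMask = (uint64_t{1} << 21) - 1;     // 21-bit field

inline uint64_t packBarrierState(const BarrierState &s) {
  const uint64_t tx = static_cast<uint64_t>(static_cast<int64_t>(s.txCount));
  return (s.phase ? uint64_t{1} : uint64_t{0}) |
         ((uint64_t{s.initCount} & kCountMask) << 1) |
         ((uint64_t{s.currentCount} & kCountMask) << 21) |
         ((tx & kTxMask) << 41);
}

inline BarrierState unpackBarrierState(uint64_t bits) {
  BarrierState s;
  s.phase = (bits & 1) != 0;
  s.initCount = static_cast<uint32_t>((bits >> 1) & kCountMask);
  s.currentCount = static_cast<uint32_t>((bits >> 21) & kCountMask);
  const uint32_t tx = static_cast<uint32_t>((bits >> 41) & kTxMask);
  // Sign-extend the 21-bit field: bit 20 of the field is the sign bit.
  s.txCount = (tx & (1u << 20)) ? static_cast<int32_t>(tx) - (1 << 21)
                                : static_cast<int32_t>(tx);
  return s;
}

// Small usage check: a negative tx-count survives a round trip.
inline void exampleRoundTrip() {
  const BarrierState s{true, 4, 3, -128};
  const BarrierState r = unpackBarrierState(packBarrierState(s));
  assert(r.phase && r.initCount == 4 && r.currentCount == 3);
  assert(r.txCount == -128);
}
```

In this sketch, a state with a nonzero initial arrival count never packs to zero, which is consistent with zero being reserved for invalid or uninitialized entries.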