This is a user guide for the roofline-gemm-attn skill.
To use this skill in another local VS Code project, copy the entire folder:
roofline-gemm-attn/
into the target workspace at:
.github/skills/
so the final path becomes:
.github/skills/roofline-gemm-attn/
Then reopen or reload the workspace in VS Code so Copilot / Agent can rescan the skill directory.
Important:
- copy the whole folder, not just SKILL.md
- keep the folder name roofline-gemm-attn unchanged
- keep the referenced scripts/ and references/ files together with the skill
Use this skill when you want to:
- classify a GEMM or Attention kernel as compute bound, L1 bound, LLC bound, or memory bound
- estimate arithmetic intensity from shape and dtype
- reason about tiled GEMM design using Global -> Block -> Tile -> MMA Atom
- judge whether a tiledMMA configuration is statically L1 bound
- judge whether a work-group / block configuration is statically LLC bound
- get tuning suggestions when a tile is too L1-heavy
The skill folder contains:
- SKILL.md: skill definition
- scripts/roofline.py: roofline calculator
- references/hardware_presets.json: hardware presets
- references/hardware_sheet_raw_paste.txt: raw hardware notes
- references/gemm_tiling_design.md: three-level tiling notes
- references/l1_bound_judgement_and_tuning.md: subgroup-level L1-bound logic and tuning rules
- references/llc_bound_judgement_and_tuning.md: block-level LLC-bound logic and tuning rules
- references/dram_bound_judgement_and_tuning.md: operator-level DRAM-bound logic and tuning rules
To classify an operator, provide:
- operator type: gemm or attn
- shape
- dtype
- hardware preset, or explicit roofs
- optional profiler data: time_ms, bytes_l1, bytes_llc, bytes_mem
To judge whether a tiledMMA configuration is statically L1 bound, provide:
- tiledMMA shape (M_tile, N_tile, K_tile)
- subgroup layout, especially how subgroups split along M
- dtype
- target platform preset
If tiledMMA shape is unknown, provide the kernel source or config snippet that defines it.
To judge whether a work-group / block configuration is statically LLC bound, provide:
- work-group / block tile shape (M_wg, N_wg, K_wg)
- dtype
- target platform preset
- whether the kernel is a plain GEMM or a fused kernel with loop-carried reuse
- if fused, which operands are loaded once versus streamed each outer-loop step
If block tile shape is unknown, provide the kernel source or config snippet that defines it.
To judge whether an operator is DRAM bound, provide:
- full operator shape
- dtype
- target platform preset
- whether the kernel is plain or fused
- whether profiler bytes_mem is available
- if not, the schedule or reuse assumptions needed to estimate memory-side traffic
Example invocations:

python3 .github/skills/roofline-gemm-attn/scripts/roofline.py gemm --m 4096 --n 4096 --k 4096 --dtype bf16 --preset bmg580
python3 .github/skills/roofline-gemm-attn/scripts/roofline.py attn --b 1 --h 32 --s 4096 --d 128 --dtype bf16 --preset bmg580
python3 .github/skills/roofline-gemm-attn/scripts/roofline.py gemm --m 4096 --n 4096 --k 4096 --dtype bf16 --peak-tflops 100 --bw-mem-gbs 456
python3 .github/skills/roofline-gemm-attn/scripts/roofline.py gemm --m 4096 --n 4096 --k 4096 --dtype bf16 --preset bmg580 --frequency-mhz 2000

For compute roof selection, the skill uses this priority:
- user-provided peak_tflops
- user-provided frequency_mhz, which scales the preset peak when possible
- preset default peak_tflops_by_dtype
Preset values are defaults, not hard overrides.
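The priority above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the preset field names, and the example numbers are assumptions, not the skill's actual API.

```python
# Sketch of the compute-roof priority: explicit user roof wins,
# then a frequency-scaled preset peak, then the preset default.
# All names and numbers below are illustrative assumptions.

def select_peak_tflops(preset, dtype, peak_tflops=None, frequency_mhz=None):
    """Pick the compute roof following the documented priority order."""
    if peak_tflops is not None:
        return peak_tflops                       # 1. user-provided peak_tflops
    base = preset["peak_tflops_by_dtype"][dtype]
    if frequency_mhz is not None and preset.get("base_frequency_mhz"):
        # 2. scale the preset peak linearly with the requested clock
        return base * frequency_mhz / preset["base_frequency_mhz"]
    return base                                  # 3. preset default

preset = {"peak_tflops_by_dtype": {"bf16": 100.0}, "base_frequency_mhz": 1800}
print(select_peak_tflops(preset, "bf16"))                      # 100.0
print(select_peak_tflops(preset, "bf16", frequency_mhz=2000))  # ~111.1
```

This matches "defaults, not hard overrides": the preset value is only used when no user-supplied roof or frequency is given.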
When judging whether a tile is L1 bound, the skill does not use the final block output shape directly.
Instead, it uses:
- tiledMMA shape
- subgroup layout
- dtype
- platform compute/L1 balance point
The workflow is:
- derive the subgroup-level tile
- compute subgroup FLOPs
- compute subgroup L1 read bytes from A and B
- compute subgroup-level arithmetic intensity AI_L1_sg
- compare AI_L1_sg against the platform balance point Peak / BW_L1
If output tile shape and tiledMMA shape differ, that means second-level tiling (Block -> Tile), but the L1-bound judgement still uses the tiledMMA shape.
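The subgroup-level workflow above can be sketched as follows. The tile shape, the bf16 element size, and the roof numbers are assumptions chosen for illustration, not values taken from a real preset.

```python
# Sketch of the subgroup-level L1 judgement: FLOPs of one subgroup
# tile divided by its L1 read bytes from A and B, compared against
# the platform balance point Peak / BW_L1. Numbers are illustrative.

def ai_l1_subgroup(m_sg, n_sg, k_tile, dtype_bytes):
    """Subgroup-level arithmetic intensity against L1 reads of A and B."""
    flops = 2 * m_sg * n_sg * k_tile                          # multiply-accumulate
    bytes_l1 = (m_sg * k_tile + k_tile * n_sg) * dtype_bytes  # A and B reads only
    return flops / bytes_l1

ai = ai_l1_subgroup(m_sg=8, n_sg=16, k_tile=32, dtype_bytes=2)  # bf16
balance = 100e12 / 10e12  # Peak / BW_L1 (assumed: 100 TFLOP/s, 10 TB/s)
verdict = "compute-leaning" if ai >= balance else "statically L1 bound"
print(f"AI_L1_sg = {ai:.2f}, balance = {balance:.1f}, {verdict}")
```

Note that K_tile cancels out of the ratio when A and B are both streamed per K step, which is why the tuning rules below focus on M_sg and N_tile.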
When judging whether a tile is LLC bound, the skill does not use subgroup tile shape as the primary unit.
Instead, it uses:
- work-group / block tile shape
- dtype
- platform compute/LLC balance point
- the kernel's actual outer-loop dataflow
The workflow is:
- derive the block-level work unit
- compute block FLOPs for the relevant outer-loop step
- compute client-side LLC bytes for that step
- compute block-level arithmetic intensity AI_LLC_wg
- compare AI_LLC_wg against the platform balance point Peak / BW_LLC
For fused kernels, the first step and steady state may differ, so the skill may report both.
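For a plain GEMM outer K loop, the block-level workflow above can be sketched like this. The helper name, the simple A-slab-plus-B-slab byte model, and the roof numbers are assumptions for illustration.

```python
# Sketch of the block-level LLC judgement for one outer-loop step of
# a plain GEMM: the work-group reads an (M_wg x K_wg) slab of A and a
# (K_wg x N_wg) slab of B from the LLC. Numbers are illustrative.

def ai_llc_block(m_wg, n_wg, k_wg, dtype_bytes):
    """Block-level arithmetic intensity against client-side LLC bytes."""
    flops = 2 * m_wg * n_wg * k_wg
    bytes_llc = (m_wg * k_wg + k_wg * n_wg) * dtype_bytes  # A and B slabs
    return flops / bytes_llc

ai = ai_llc_block(m_wg=256, n_wg=256, k_wg=32, dtype_bytes=2)  # bf16
balance = 100e12 / 3e12  # Peak / BW_LLC (assumed: 100 TFLOP/s, 3 TB/s)
print(f"AI_LLC_wg = {ai:.1f} vs balance {balance:.1f}")
```

For a fused kernel, the first step would add the bytes of operands loaded once, while the steady state counts only the streamed operands, which is why the two can get separate reports.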
When judging whether an operator is DRAM bound, the skill uses the full operator and memory-side traffic as the primary unit.
It uses:
- full operator shape
- dtype
- platform compute/memory balance point
- either measured bytes_mem or a lower-bound / schedule-aware memory model
The workflow is:
- compute operator FLOPs
- compute or estimate memory-side bytes Bytes_mem
- compute operator arithmetic intensity AI_mem
- compare AI_mem against the platform balance point Peak / BW_mem
If only shape is available, the skill first uses a best-case lower-bound memory model and says so explicitly.
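The operator-level workflow with the best-case lower-bound model can be sketched as below. The function name and the roof numbers are assumptions; the byte model is simply "A, B, and C each touch DRAM exactly once".

```python
# Sketch of the operator-level DRAM judgement for a plain GEMM.
# Prefers measured bytes_mem; otherwise falls back to a best-case
# lower bound (read A once, read B once, write C once).

def ai_mem_gemm(m, n, k, dtype_bytes, bytes_mem=None):
    """Operator arithmetic intensity against memory-side bytes."""
    flops = 2 * m * n * k
    if bytes_mem is None:
        # best-case lower-bound memory model, stated explicitly
        bytes_mem = (m * k + k * n + m * n) * dtype_bytes
    return flops / bytes_mem

ai = ai_mem_gemm(4096, 4096, 4096, dtype_bytes=2)  # bf16, lower-bound model
balance = 100e12 / 456e9  # Peak / BW_mem (assumed: 100 TFLOP/s, 456 GB/s)
print(f"AI_mem = {ai:.1f}, balance = {balance:.1f}")
```

Measured bytes_mem can only be larger than this lower bound, so if the lower-bound AI is already below the balance point, the operator is DRAM bound under any schedule.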
Use this order:
- increase M_sg first
- then choose the smallest N_tile that crosses the L1 threshold
- do not expect K_tile alone to fix static L1-bound behavior
Candidate tiles should be ranked by a simple accumulator-pressure proxy: among tiles that are no longer L1 bound, smaller is preferred unless stronger MMA or layout constraints dominate.
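The ranking rule above can be sketched as a short filter-and-sort. The AI formula reuses the A-plus-B read model from the L1 judgement, and the proxy (M_sg * N_tile accumulator elements) is an assumption standing in for whatever proxy the skill actually uses.

```python
# Sketch of the L1-tuning ranking: keep candidate tiles whose
# subgroup-level AI crosses the balance point, then prefer the one
# with the smallest accumulator footprint. Proxy is an assumption.

def rank_candidates(candidates, dtype_bytes, balance):
    """Filter tiles by AI_L1 >= balance, rank by accumulator pressure."""
    def ai(m_sg, n_tile, k_tile):
        return (2 * m_sg * n_tile * k_tile) / ((m_sg + n_tile) * k_tile * dtype_bytes)
    ok = [c for c in candidates if ai(*c) >= balance]
    return sorted(ok, key=lambda c: c[0] * c[1])  # accumulator-pressure proxy

tiles = [(8, 16, 32), (16, 16, 32), (32, 32, 32), (32, 64, 32)]
print(rank_candidates(tiles, dtype_bytes=2, balance=10.0))
# -> [(32, 32, 32), (32, 64, 32)]
```

Here the two small tiles fail the threshold, and of the survivors the (32, 32, 32) tile wins because it is no longer L1 bound yet has the smaller accumulator footprint.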
Use this order:
- increase block-level reuse first
- improve block scheduling and locality across neighboring work-groups
- use K_wg as a secondary knob for pipeline and residency tuning, not as the first fix for static LLC pressure
Use this order:
- reduce true memory traffic first
- improve inter-block reuse and traversal locality
- treat block and subgroup tiling as secondary levers unless they materially change memory traffic
When tuning after a roofline diagnosis, use this order:
- identify the active level: L1, LLC, or DRAM
- choose the matching control unit:
  - L1: tiledMMA / subgroup tile
  - LLC: block / work-group tile
  - DRAM: full operator and schedule / reuse model
- state the main tradeoff before suggesting a larger tile:
  - larger tiledMMA shapes may improve L1 arithmetic intensity but increase GRF pressure and may reduce occupancy
  - larger block tiles may improve LLC arithmetic intensity but increase the outer-cache working set, may reduce hit rate, and may reduce resident blocks
  - DRAM improvements should be justified by lower memory-side bytes, not just by larger tiles
- prefer the smallest change that crosses the relevant roofline threshold while preserving occupancy or cache residency
- confirm the hypothesis with the matching profiler bytes counter whenever possible
The calculator prints:
- estimated FLOPs
- estimated bytes
- arithmetic intensity
- roof ceilings
- a high-level bound label
For L1 / LLC / memory attribution, profiler bytes are still preferred.
- Cache-level roofs are modeled as read-only roofs.
- Intel pre-Xe4 style reasoning is described in terms of L1-oriented staging and prefetch, not CUDA-style cp.async assumptions.
- Static tile-level judgement is a screening tool, not a final proof. Profiler validation is still recommended.