[gpt-oss] attention decode optimizations #37190

Merged
sraizada-tt merged 3 commits into main from gpt-attn-optimizations on Feb 6, 2026

Conversation

sraizada-tt (Contributor) commented on Feb 5, 2026

Added padding to o_proj weights and bias to ensure tile-aligned dimensions in CCL operations, avoiding expensive untilize-pad-tilize cycles

https://github.com/tenstorrent/tt-metal/actions/runs/21719380426
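At decode time the expensive part is reshaping activations on device every step; padding the weights once at load time moves that cost to host. A minimal host-side sketch of the idea, with illustrative shapes and a hypothetical pad_to_tile helper (not the PR's actual code):

```python
import torch

TILE = 32  # tt-metal tensors are tiled in 32x32 blocks

def pad_to_tile(dim: int, tile: int = TILE) -> int:
    # Round a dimension up to the next tile multiple.
    return -(-dim // tile) * tile

# Illustrative only: an o_proj whose output columns per CCL shard are not
# tile-aligned, e.g. hidden size 2880 split 8 ways -> 360 columns (11.25 tiles).
w = torch.randn(2880, 360)
b = torch.randn(360)

pad = pad_to_tile(w.shape[-1]) - w.shape[-1]  # 384 - 360 = 24 zero columns
# Zero columns add nothing to the matmul output or to the allreduce sum, so
# padding the weights once on host replaces a per-step untilize-pad-tilize.
w_padded = torch.nn.functional.pad(w, (0, pad))  # (2880, 384)
b_padded = torch.nn.functional.pad(b, (0, pad))  # (384,)
```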

sraizada-tt requested a review from uaydonat as a code owner February 5, 2026 14:31
Copilot AI review requested due to automatic review settings February 5, 2026 14:31
sraizada-tt requested review from a team, handrewsTT and mtairum as code owners February 5, 2026 14:31
Copilot AI left a comment

Pull request overview

This PR introduces optimizations for GPT attention operations, focusing on reducing overhead in tensor parallelism (TP) scenarios through padding for tile alignment and a fused QK RoPE kernel for decode mode.

Changes:

  • Added padding to o_proj weights and bias to ensure tile-aligned dimensions in CCL operations, avoiding expensive untilize-pad-tilize cycles
  • Implemented fused QK RoPE operation and fused KV cache update for decode mode when batch size ≤ 32
  • Added a slice operation after allreduce to remove the padding and restore the original hidden dimensions (see the sketch below)
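Conceptually, the unpadding after the allreduce is just dropping the zero columns again. A minimal sketch in plain torch with illustrative shapes (the PR performs this step on device with a slice op):

```python
import torch

hidden, padded_hidden = 2880, 3072  # illustrative: 2880 padded so each of
                                    # 8 CCL shards is 384 columns = 12 tiles
x = torch.randn(1, 1, 32, padded_hidden)  # decode activation after allreduce

# The trailing columns were zeros on every device, so their allreduce sum is
# still zero; slicing them off restores the original hidden dimension exactly.
x = x[..., :hidden]
assert x.shape[-1] == hidden
```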

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File — Description

  • models/demos/gpt_oss/tt/attention/weights.py — Adds padding logic to o_proj weight and bias for tile alignment in TP operations; updates cache keys to reflect the padding
  • models/demos/gpt_oss/tt/attention/operations.py — Adds a slice operation after allreduce to remove the padding added to support tile-aligned CCL
  • models/demos/gpt_oss/tt/attention/decode.py — Implements the fused QK RoPE optimization for batch_size ≤ 32; adds reshape logic for padded dimensions before allreduce
  • models/demos/gpt_oss/tt/attention/__init__.py — Pre-creates the fused transformation matrix for fused QK RoPE to avoid host writes during trace (see the sketch below)
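On the __init__.py change: tt-metal's rotary-embedding kernels take a small constant "transformation matrix" that implements the rotate-half pairing as a matmul, so building it once at init keeps host writes out of the captured trace. A sketch of such a matrix, reconstructed from the general pattern in tt-metal's transformer models rather than copied from this PR:

```python
import torch

def rot_transformation_mat(dhead: int = 32) -> torch.Tensor:
    # Constant matrix M such that x @ M pairs adjacent elements the way
    # RoPE's rotate-half does: (x0, x1) -> (-x1, x0).
    m = torch.zeros(1, 1, dhead, dhead)
    m[..., torch.arange(0, dhead, 2), torch.arange(1, dhead, 2)] = 1.0
    m[..., torch.arange(1, dhead, 2), torch.arange(0, dhead, 2)] = -1.0
    return m

# RoPE then reduces to matmul-friendly elementwise ops:
#   out = x * cos + (x @ M) * sin
x = torch.randn(1, 1, 32, 32)
cos, sin = torch.randn(1, 1, 32, 32), torch.randn(1, 1, 32, 32)
out = x * cos + (x @ rot_transformation_mat()) * sin
```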

sraizada-tt changed the title from "Gpt attn optimizations" to "[gpt-oss] attention decode optimizations" on Feb 5, 2026
sraizada-tt added this pull request to the merge queue Feb 6, 2026
Merged via the queue into main with commit e1c0a22 Feb 6, 2026
96 of 104 checks passed
sraizada-tt deleted the gpt-attn-optimizations branch February 6, 2026 09:50
handrewsTT added a commit that referenced this pull request Feb 7, 2026
handrewsTT added a commit that referenced this pull request Feb 9, 2026
adrian-pascual-bernal pushed a commit that referenced this pull request Feb 10, 2026
ssundaramTT pushed a commit that referenced this pull request Feb 10, 2026
handrewsTT added a commit that referenced this pull request Feb 16, 2026
handrewsTT added a commit that referenced this pull request Feb 16, 2026