
[gpt-oss] perf optimization: all to all ops with tokens on dim -2 #36720

Merged
handrewsTT merged 6 commits into main from gpt-128-optimizations on Jan 30, 2026

Conversation


@sraizada-tt (Contributor) commented on Jan 29, 2026

  • (Galaxy) unit tests
  • (Galaxy) demo tests

Copilot AI review requested due to automatic review settings January 29, 2026 12:18

Copilot AI left a comment


Pull request overview

This PR optimizes the GPT MoE expert throughput implementation by reducing memory access latency and minimizing tensor reshape operations during decode.

Changes:

  • Switched decode memory configuration from DRAM to L1 for improved throughput
  • Refactored decode forward pass to maintain tokens on seq_len dimension (dim -2) throughout the pipeline, reducing reshape operations
  • Updated all_to_all dispatch/combine configurations to use output_concat_dim=2 and output_shard_dim=2 for consistency with the new token dimension strategy (see the sketch after this list)
  • Reduced prefill chunk_size from 2048 to 512 as a workaround for diverging outputs (GitHub issue #36335)
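
For intuition, here is a minimal sketch of the layout change in plain torch (illustrative shapes only; the pre-PR layout is assumed from the docstring discussed later in this review, and the actual code uses ttnn collectives rather than these placeholders):

```python
import torch

# Illustrative sizes only, not taken from the PR.
batch_per_device, seq_len, hidden = 32, 1, 1024

# Assumed pre-PR decode layout: [batch_per_device, 1, seq_len, hidden], so tokens
# had to be folded onto dim -2 before each all_to_all and unfolded afterwards.
x = torch.randn(batch_per_device, 1, seq_len, hidden)
x = x.reshape(1, 1, batch_per_device * seq_len, hidden)  # reshape before the collective
# ... all_to_all dispatch/combine over the token dimension ...
x = x.reshape(batch_per_device, 1, seq_len, hidden)      # reshape after the collective

# Post-PR decode layout: tokens stay on dim -2 for the whole pipeline, so collectives
# configured with output_concat_dim=2 / output_shard_dim=2 consume and produce this
# layout directly and the surrounding reshapes disappear.
y = torch.randn(1, 1, batch_per_device * seq_len, hidden)
# ... all_to_all dispatch/combine over dim 2, no reshapes needed ...
```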

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| models/demos/gpt_oss/tt/mlp.py | Changed decode_memory_config from DRAM to L1 for better decode throughput (illustrated below) |
| models/demos/gpt_oss/tt/experts_throughput/prefill.py | Reduced chunk_size to 512 as a temporary workaround for the divergence issue |
| models/demos/gpt_oss/tt/experts_throughput/decode.py | Major refactor to keep tokens on the seq_len dimension, reducing reshape operations and improving performance |
| models/demos/gpt_oss/tt/experts_throughput/config.py | Updated all_to_all configs to use dim 2 for both concat and shard operations, consistent with the decode changes |
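
As a side note on the mlp.py row, the DRAM-to-L1 switch amounts to roughly the following (these are ttnn's standard memory-config constants; the surrounding variable name mirrors the table row, but this is not the actual gpt_oss source):

```python
import ttnn

# Keep decode activations in on-chip L1 rather than DRAM, removing DRAM
# round-trips from the latency-bound decode path.
decode_memory_config = ttnn.L1_MEMORY_CONFIG  # previously ttnn.DRAM_MEMORY_CONFIG
```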

Comment on lines 316 to 317
while len(expert_output_sparse.shape) > 4:
    expert_output_sparse = ttnn.squeeze(expert_output_sparse, 0)

Copilot AI Jan 29, 2026


This while loop to squeeze extra dimensions seems defensive but may indicate uncertainty about the output shape of the sparse_matmul. Consider documenting what conditions would cause expert_output_sparse to have more than 4 dimensions, or if this can be simplified to a fixed number of squeeze operations if the shape is always predictable.

Suggested change
-while len(expert_output_sparse.shape) > 4:
-    expert_output_sparse = ttnn.squeeze(expert_output_sparse, 0)
+# sparse_matmul may introduce a leading singleton batch dimension, yielding:
+# [1, total_tokens/block, experts, block, H]. We only expect at most one such
+# dimension here; enforce that contract explicitly rather than squeezing in a loop.
+if len(expert_output_sparse.shape) == 5:
+    expert_output_sparse = ttnn.squeeze(expert_output_sparse, 0)
+elif len(expert_output_sparse.shape) != 4:
+    raise RuntimeError(
+        f"Unexpected expert_output_sparse rank {len(expert_output_sparse.shape)}; "
+        "expected 4D or 5D with a leading singleton batch dimension."
+    )

    program_config: ThroughputProgramConfig,
    mesh_device,
-   chunk_size: int = 2048,
+   chunk_size: int = 512,  # TODO: increasing this causes diverging outputs for last mesh row (https://github.com/tenstorrent/tt-metal/issues/36335)

Copilot AI Jan 29, 2026


The chunk_size reduction from 2048 to 512 is a workaround for diverging outputs rather than an optimization. This TODO references issue #36335, suggesting this is a temporary fix. Consider adding a more prominent warning or tracking mechanism to ensure this gets reverted once the underlying issue is resolved, as the smaller chunk size may impact performance.
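
For a sense of the cost, a rough sketch of a chunked prefill loop follows (names and structure assumed, not the actual prefill.py code); cutting chunk_size from 2048 to 512 roughly quadruples the number of iterations over the prefill sequence:

```python
import torch

def prefill_in_chunks(hidden_states: torch.Tensor, forward_chunk, chunk_size: int = 512) -> torch.Tensor:
    """Run prefill chunk-by-chunk along the sequence dimension.

    hidden_states: [1, 1, seq_len, hidden_size]; forward_chunk processes one slice.
    """
    seq_len = hidden_states.shape[-2]
    outputs = []
    for start in range(0, seq_len, chunk_size):
        outputs.append(forward_chunk(hidden_states[:, :, start:start + chunk_size, :]))
    return torch.cat(outputs, dim=-2)
```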

@@ -112,60 +112,68 @@
    Returns:
        Output tensor [batch_size_per_device, 1, seq_len, hidden_size]

Copilot AI Jan 29, 2026


The docstring states the output tensor has shape [batch_size_per_device, 1, seq_len, hidden_size], but the actual implementation returns [1, 1, tokens_per_device, hidden_size] where tokens_per_device = batch_size_per_device * seq_len. The docstring should be updated to reflect the actual output shape, or the implementation should reshape the output back to match the documented shape if that's what callers expect.

Suggested change
-        Output tensor [batch_size_per_device, 1, seq_len, hidden_size]
+        Output tensor [1, 1, tokens_per_device, hidden_size], where
+        tokens_per_device = batch_size_per_device * seq_len.
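
If the second option the comment mentions were taken instead, reshaping the output back to the documented layout would look roughly like this illustrative torch helper (not the actual decode.py code):

```python
import torch

def unflatten_tokens(out: torch.Tensor, batch_size_per_device: int, seq_len: int) -> torch.Tensor:
    # [1, 1, batch_size_per_device * seq_len, hidden_size] -> [batch_size_per_device, 1, seq_len, hidden_size]
    hidden_size = out.shape[-1]
    return out.reshape(batch_size_per_device, 1, seq_len, hidden_size)
```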

@handrewsTT force-pushed the gpt-128-optimizations branch from 3e304a9 to 6f05076 on January 30, 2026 11:03
@sraizada-tt changed the title from "Gpt 128 optimizations" to "[gpt-oss] perf optimization: all to all ops with tokens on dim -2" on Jan 30, 2026
@handrewsTT added this pull request to the merge queue on Jan 30, 2026
Merged via the queue into main with commit cc4e0bf on Jan 30, 2026
79 checks passed
@handrewsTT deleted the gpt-128-optimizations branch on January 30, 2026 22:22