
[gpt-oss] perf optimization: all to all ops with tokens on dim -2 #36720

Merged
handrewsTT merged 6 commits into main from gpt-128-optimizations on Jan 30, 2026

Conversation


@sraizada-tt (Contributor) commented on Jan 29, 2026

  • (Galaxy) unit tests
  • (Galaxy) demo tests

Copilot AI review requested due to automatic review settings January 29, 2026 12:18

Copilot AI left a comment


Pull request overview

This PR optimizes the GPT MoE expert throughput implementation by reducing memory access latency and minimizing tensor reshape operations during decode.

Changes:

  • Switched decode memory configuration from DRAM to L1 for improved throughput
  • Refactored decode forward pass to maintain tokens on seq_len dimension (dim -2) throughout the pipeline, reducing reshape operations
  • Updated all_to_all dispatch/combine configurations to use output_concat_dim=2 and output_shard_dim=2 for consistency with the new token dimension strategy (see the sketch after this list)
  • Reduced prefill chunk_size from 2048 to 512 as a workaround for diverging outputs (GitHub issue #36335)
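
For intuition, here is a minimal sketch of the layout change in plain torch (illustrative shapes only; the pre-PR layout is assumed from the docstring discussed later in this review, and the actual code uses ttnn collectives rather than these placeholders):

```python
import torch

# Illustrative sizes only, not taken from the PR.
batch_per_device, seq_len, hidden = 32, 1, 1024

# Assumed pre-PR decode layout: [batch_per_device, 1, seq_len, hidden], so tokens
# had to be folded onto dim -2 before each all_to_all and unfolded afterwards.
x = torch.randn(batch_per_device, 1, seq_len, hidden)
x = x.reshape(1, 1, batch_per_device * seq_len, hidden)  # reshape before the collective
# ... all_to_all dispatch/combine over the token dimension ...
x = x.reshape(batch_per_device, 1, seq_len, hidden)      # reshape after the collective

# Post-PR decode layout: tokens stay on dim -2 for the whole pipeline, so collectives
# configured with output_concat_dim=2 / output_shard_dim=2 consume and produce this
# layout directly and the surrounding reshapes disappear.
y = torch.randn(1, 1, batch_per_device * seq_len, hidden)
# ... all_to_all dispatch/combine over dim 2, no reshapes needed ...
```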

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| models/demos/gpt_oss/tt/mlp.py | Changed decode_memory_config from DRAM to L1 for better decode throughput (illustrated below) |
| models/demos/gpt_oss/tt/experts_throughput/prefill.py | Reduced chunk_size to 512 as a temporary workaround for the divergence issue |
| models/demos/gpt_oss/tt/experts_throughput/decode.py | Major refactor to keep tokens on the seq_len dimension, reducing reshape operations and improving performance |
| models/demos/gpt_oss/tt/experts_throughput/config.py | Updated all_to_all configs to use dim 2 for both concat and shard operations, consistent with the decode changes |
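
As a side note on the mlp.py row, the DRAM-to-L1 switch amounts to roughly the following (these are ttnn's standard memory-config constants; the surrounding variable name mirrors the table row, but this is not the actual gpt_oss source):

```python
import ttnn

# Keep decode activations in on-chip L1 rather than DRAM, removing DRAM
# round-trips from the latency-bound decode path.
decode_memory_config = ttnn.L1_MEMORY_CONFIG  # previously ttnn.DRAM_MEMORY_CONFIG
```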

Comment on lines 316 to 317
while len(expert_output_sparse.shape) > 4:
    expert_output_sparse = ttnn.squeeze(expert_output_sparse, 0)

Copilot AI Jan 29, 2026


This while loop to squeeze extra dimensions seems defensive but may indicate uncertainty about the output shape of the sparse_matmul. Consider documenting what conditions would cause expert_output_sparse to have more than 4 dimensions, or if this can be simplified to a fixed number of squeeze operations if the shape is always predictable.

Suggested change
-while len(expert_output_sparse.shape) > 4:
-    expert_output_sparse = ttnn.squeeze(expert_output_sparse, 0)
+# sparse_matmul may introduce a leading singleton batch dimension, yielding:
+# [1, total_tokens/block, experts, block, H]. We only expect at most one such
+# dimension here; enforce that contract explicitly rather than squeezing in a loop.
+if len(expert_output_sparse.shape) == 5:
+    expert_output_sparse = ttnn.squeeze(expert_output_sparse, 0)
+elif len(expert_output_sparse.shape) != 4:
+    raise RuntimeError(
+        f"Unexpected expert_output_sparse rank {len(expert_output_sparse.shape)}; "
+        "expected 4D or 5D with a leading singleton batch dimension."
+    )

    program_config: ThroughputProgramConfig,
    mesh_device,
-   chunk_size: int = 2048,
+   chunk_size: int = 512,  # TODO: increasing this causes diverging outputs for last mesh row (https://github.com/tenstorrent/tt-metal/issues/36335)

Copilot AI Jan 29, 2026


The chunk_size reduction from 2048 to 512 is a workaround for diverging outputs rather than an optimization. This TODO references issue #36335, suggesting this is a temporary fix. Consider adding a more prominent warning or tracking mechanism to ensure this gets reverted once the underlying issue is resolved, as the smaller chunk size may impact performance.
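
For a sense of the cost, a rough sketch of a chunked prefill loop follows (names and structure assumed, not the actual prefill.py code); cutting chunk_size from 2048 to 512 roughly quadruples the number of iterations over the prefill sequence:

```python
import torch

def prefill_in_chunks(hidden_states: torch.Tensor, forward_chunk, chunk_size: int = 512) -> torch.Tensor:
    """Run prefill chunk-by-chunk along the sequence dimension.

    hidden_states: [1, 1, seq_len, hidden_size]; forward_chunk processes one slice.
    """
    seq_len = hidden_states.shape[-2]
    outputs = []
    for start in range(0, seq_len, chunk_size):
        outputs.append(forward_chunk(hidden_states[:, :, start:start + chunk_size, :]))
    return torch.cat(outputs, dim=-2)
```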

@@ -112,60 +112,68 @@
    Returns:
        Output tensor [batch_size_per_device, 1, seq_len, hidden_size]

Copilot AI Jan 29, 2026


The docstring states the output tensor has shape [batch_size_per_device, 1, seq_len, hidden_size], but the actual implementation returns [1, 1, tokens_per_device, hidden_size] where tokens_per_device = batch_size_per_device * seq_len. The docstring should be updated to reflect the actual output shape, or the implementation should reshape the output back to match the documented shape if that's what callers expect.

Suggested change
-        Output tensor [batch_size_per_device, 1, seq_len, hidden_size]
+        Output tensor [1, 1, tokens_per_device, hidden_size], where
+        tokens_per_device = batch_size_per_device * seq_len.
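
If the second option the comment mentions were taken instead, reshaping the output back to the documented layout would look roughly like this illustrative torch helper (not the actual decode.py code):

```python
import torch

def unflatten_tokens(out: torch.Tensor, batch_size_per_device: int, seq_len: int) -> torch.Tensor:
    # [1, 1, batch_size_per_device * seq_len, hidden_size] -> [batch_size_per_device, 1, seq_len, hidden_size]
    hidden_size = out.shape[-1]
    return out.reshape(batch_size_per_device, 1, seq_len, hidden_size)
```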

@handrewsTT force-pushed the gpt-128-optimizations branch from 3e304a9 to 6f05076 on January 30, 2026 11:03
@sraizada-tt changed the title from "Gpt 128 optimizations" to "[gpt-oss] perf optimization: all to all ops with tokens on dim -2" on Jan 30, 2026
@handrewsTT added this pull request to the merge queue on Jan 30, 2026
Merged via the queue into main with commit cc4e0bf on Jan 30, 2026
79 checks passed
@handrewsTT deleted the gpt-128-optimizations branch on January 30, 2026 22:22