Revert "A balanced traffic pattern for AG minimal. (#36607)"#37832
Revert "A balanced traffic pattern for AG minimal. (#36607)"#37832rdraskicTT merged 1 commit intomainfrom
Conversation
This reverts commit 039d07a.
Pull request overview
Reverts the previously introduced “balanced traffic pattern” behavior in the minimal all-gather-async implementation, motivated by incorrect outputs and hangs observed in Llama-model workloads.
Changes:
- Removes split-forwarding (“balanced traffic”) logic from minimal all-gather async reader/writer device kernels.
- Removes a fabric packet-size vs. page-size TT_FATAL guard in the minimal default program factory.
- Deletes a WAN-focused ring perf/trace test block and related helpers/imports from the nightly test.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ttnn/cpp/ttnn/operations/experimental/ccl/all_gather_async/device/kernels/minimal_default_writer.cpp | Reverts split-forwarding logic and restores simpler per-slice forwarding behavior. |
| ttnn/cpp/ttnn/operations/experimental/ccl/all_gather_async/device/kernels/minimal_default_reader.cpp | Reverts split-forwarding logic and simplifies forwarding/wait behavior accordingly. |
| ttnn/cpp/ttnn/operations/experimental/ccl/all_gather_async/device/all_gather_async_default_program_factory.cpp | Removes validation tying fabric packet payload size to tensor page size. |
| tests/nightly/tg/ccl/test_minimal_all_gather_async.py | Removes WAN ring perf/trace test and related helpers/imports; leaves some imports now unused. |
Comments suppressed due to low confidence (1)
ttnn/cpp/ttnn/operations/experimental/ccl/all_gather_async/device/all_gather_async_default_program_factory.cpp:393
- Removing the `packet_size_bytes >= page_size` validation can lead to `num_pages_per_packet == 0`, which makes `num_tiles_to_write_per_packet == 0`. That results in zero-sized CBs and/or infinite loops in the reader/writer kernels (`tiles_read` never advances), i.e. hangs. Please restore the guard (`TT_FATAL`) or otherwise enforce `num_tiles_to_write_per_packet >= 1` with a clear error when the fabric packet payload is smaller than a tensor page size.
```cpp
// L1 Scratch CB Creation
const size_t packet_size_bytes = tt::tt_fabric::get_tt_fabric_channel_buffer_size_bytes();
uint32_t l1_scratch_cb_page_size_bytes = page_size;
// scatter-write currently supports 4 distinct noc addresses
uint32_t max_target_noc_addresses_per_packet = 4;
// for bfloat8_b with tile_num_per_link=6 we need to send 2 packets, but they can be of size 3 instead of 4
uint32_t num_pages_per_packet = packet_size_bytes / l1_scratch_cb_page_size_bytes;
uint32_t num_tiles_to_write_per_packet = std::min(max_target_noc_addresses_per_packet, num_pages_per_packet);
uint32_t cb_num_pages = 3 * num_tiles_to_write_per_packet;  // triple buffering
```
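A minimal sketch of what a restored guard could look like, reusing the variable names from the snippet above; the placement and the exact TT_FATAL message are assumptions for illustration, not the code that was reverted:

```cpp
// Hypothetical guard (message text is illustrative): fail fast when the fabric
// packet payload cannot hold even one tensor page. Without it,
// num_pages_per_packet truncates to 0, num_tiles_to_write_per_packet becomes 0,
// the CBs below get sized to 0 pages, and tiles_read never advances in the
// reader/writer kernels, producing a hang instead of a clear error.
TT_FATAL(
    packet_size_bytes >= l1_scratch_cb_page_size_bytes,
    "Fabric packet payload ({} bytes) is smaller than the tensor page size ({} bytes); "
    "num_tiles_to_write_per_packet would be 0",
    packet_size_bytes,
    l1_scratch_cb_page_size_bytes);
```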
tests/nightly/tg/ccl/test_minimal_all_gather_async.py:

```python
from tests.nightly.t3000.ccl.test_minimal_all_gather_async import run_all_gather_impl
from tests.ttnn.multidevice_perf_tests.sweep_all_gather_hyperparameters_t3000 import get_max_chunks_per_sync
from models.common.utility_functions import skip_for_blackhole, skip_for_wormhole_b0
from tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs import comp_equal, comp_pcc
```
`comp_equal` and `comp_pcc` are imported but only referenced in commented-out code, so they are currently unused. If the repo runs ruff/flake8 in CI, this will fail with an unused-import error; please remove the unused imports or re-enable the assertions that use them.
@llongTT fyi
/codeowners bypass |
tenstorrent-github-bot left a comment
✅ CodeOwners bypass approval granted by @dpopovTT (metalium-developers-infra team)
Causes incorrect outputs and hangs for Llama models.
Example failing runs: https://github.com/tenstorrent/tt-metal/actions/runs/21972598335/job/63477567742