Revert "A balanced traffic pattern for AG minimal. (#36607)"#37832
Revert "A balanced traffic pattern for AG minimal. (#36607)"#37832rdraskicTT merged 1 commit intomainfrom
Conversation
This reverts commit 039d07a.
Pull request overview
Reverts the previously introduced “balanced traffic pattern” behavior in the minimal all-gather-async implementation, motivated by incorrect outputs and hangs observed in Llama-model workloads.
Changes:
- Removes split-forwarding (“balanced traffic”) logic from minimal all-gather async reader/writer device kernels.
- Removes a fabric packet-size vs. page-size TT_FATAL guard in the minimal default program factory.
- Deletes a WAN-focused ring perf/trace test block and related helpers/imports from the nightly test.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ttnn/cpp/ttnn/operations/experimental/ccl/all_gather_async/device/kernels/minimal_default_writer.cpp | Reverts split-forwarding logic and restores simpler per-slice forwarding behavior. |
| ttnn/cpp/ttnn/operations/experimental/ccl/all_gather_async/device/kernels/minimal_default_reader.cpp | Reverts split-forwarding logic and simplifies forwarding/wait behavior accordingly. |
| ttnn/cpp/ttnn/operations/experimental/ccl/all_gather_async/device/all_gather_async_default_program_factory.cpp | Removes validation tying fabric packet payload size to tensor page size. |
| tests/nightly/tg/ccl/test_minimal_all_gather_async.py | Removes WAN ring perf/trace test and related helpers/imports; leaves some imports now unused. |
Comments suppressed due to low confidence (1)
ttnn/cpp/ttnn/operations/experimental/ccl/all_gather_async/device/all_gather_async_default_program_factory.cpp:393
- Removing the `packet_size_bytes >= page_size` validation can lead to `num_pages_per_packet == 0`, which makes `num_tiles_to_write_per_packet == 0`. That results in zero-sized CBs and/or infinite loops in the reader/writer kernels (`tiles_read` never advances), i.e. hangs. Please restore the guard (`TT_FATAL`) or otherwise enforce `num_tiles_to_write_per_packet >= 1` with a clear error when the fabric packet payload is smaller than a tensor page size.
```cpp
// L1 Scratch CB Creation
const size_t packet_size_bytes = tt::tt_fabric::get_tt_fabric_channel_buffer_size_bytes();
uint32_t l1_scratch_cb_page_size_bytes = page_size;
// scatter-write currently supports 4 distinct noc addresses
uint32_t max_target_noc_addresses_per_packet = 4;
// for bfloat8_b with tile_num_per_link=6 we need to send 2 packets, but they can be of size 3 instead of 4
uint32_t num_pages_per_packet = packet_size_bytes / l1_scratch_cb_page_size_bytes;
uint32_t num_tiles_to_write_per_packet = std::min(max_target_noc_addresses_per_packet, num_pages_per_packet);
uint32_t cb_num_pages = 3 * num_tiles_to_write_per_packet;  // triple buffering
```
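A minimal sketch of what a restored guard could look like, reusing the variable names from the snippet above; the placement and the exact TT_FATAL message are assumptions for illustration, not the code that was reverted:

```cpp
// Hypothetical guard (message text is illustrative): fail fast when the fabric
// packet payload cannot hold even one tensor page. Without it,
// num_pages_per_packet truncates to 0, num_tiles_to_write_per_packet becomes 0,
// the CBs below get sized to 0 pages, and tiles_read never advances in the
// reader/writer kernels, producing a hang instead of a clear error.
TT_FATAL(
    packet_size_bytes >= l1_scratch_cb_page_size_bytes,
    "Fabric packet payload ({} bytes) is smaller than the tensor page size ({} bytes); "
    "num_tiles_to_write_per_packet would be 0",
    packet_size_bytes,
    l1_scratch_cb_page_size_bytes);
```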
tests/nightly/tg/ccl/test_minimal_all_gather_async.py:

```python
from tests.nightly.t3000.ccl.test_minimal_all_gather_async import run_all_gather_impl
from tests.ttnn.multidevice_perf_tests.sweep_all_gather_hyperparameters_t3000 import get_max_chunks_per_sync
from models.common.utility_functions import skip_for_blackhole, skip_for_wormhole_b0
from tests.tt_eager.python_api_testing.sweep_tests.comparison_funcs import comp_equal, comp_pcc
```
`comp_equal` and `comp_pcc` are imported but only referenced in commented-out code, so they are currently unused. If the repo runs ruff/flake8 in CI, this will fail with an unused-import error; please remove the unused imports or re-enable the assertions that use them.
@llongTT fyi
/codeowners bypass |
tenstorrent-github-bot left a comment
✅ CodeOwners bypass approval granted by @dpopovTT (metalium-developers-infra team)
Causes incorrect outputs and hangs for Llama models.
Example failing runs: https://github.com/tenstorrent/tt-metal/actions/runs/21972598335/job/63477567742