Fix trace_region_size #35271

Open
hschoi4448 wants to merge 10 commits into main from hschoi/fix_trace_region_size

@hschoi4448 commented Jan 5, 2026

Ticket

Link to Github Issue
#35259

Problem description

Currently, when trace_region_size is set to 100 MB, the init_one_bank_per_channel function allocates 100 MB of memory per DRAM bank. Consequently, with 12 DRAM banks, a total of 1200 MB is allocated on the device, implying that trace_region_size is treated as the size per bank.

However, the implementation of the populate_mesh_buffer function, which checks for sufficient trace_region_size, treats it as the total memory size per device. This inconsistency leads to logic errors during memory validation.

void MeshTrace::populate_mesh_buffer(MeshCommandQueue& mesh_cq, std::shared_ptr<MeshTraceBuffer>& trace_buffer) {
    uint64_t unpadded_size = trace_buffer->desc->total_trace_size;
    auto num_banks = mesh_cq.device()->allocator()->get_num_banks(BufferType::DRAM);
    size_t page_size = trace_dispatch::compute_interleaved_trace_buf_page_size(unpadded_size, num_banks);
    size_t padded_size = round_up(unpadded_size, page_size * num_banks);

    const auto current_trace_buffers_size = mesh_cq.device()->get_trace_buffers_size();
    mesh_cq.device()->set_trace_buffers_size(current_trace_buffers_size + padded_size);
    auto trace_region_size = mesh_cq.device()->allocator()->get_config().trace_region_size;
    TT_FATAL(
        mesh_cq.device()->get_trace_buffers_size() <= trace_region_size,
        "Creating trace buffers of size {}B on MeshDevice {}, but only {}B is allocated for trace region.",
        mesh_cq.device()->get_trace_buffers_size(),
        mesh_cq.device()->id(),
        trace_region_size);

std::size_t compute_interleaved_trace_buf_page_size(uint32_t buf_size, const uint32_t num_banks);

In the compute_interleaved_trace_buf_page_size function, the buf_size parameter represents the buffer size per device.

The definition of trace_region_size should be unified across both functions, as either a per-DRAM bank or a per-device value.
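The mismatch can be illustrated with a minimal sketch. It assumes 12 DRAM banks (inferred from 1200 MB / 100 MB above) and takes 1 MB as 2^20 bytes; `kMB`, `reserved_on_device`, and `passes_trace_check` are hypothetical names used only for illustration and are not tt-metal APIs.

```cpp
#include <cstdint>

// Assumption: 1 MB is taken as 2^20 bytes for this illustration.
constexpr uint64_t kMB = 1024ull * 1024ull;

// Allocation side (old behavior as described above): every DRAM bank
// reserves the full trace_region_size, so the device total scales with
// the number of banks.
constexpr uint64_t reserved_on_device(uint64_t trace_region_size, uint64_t num_banks) {
    return trace_region_size * num_banks;
}

// Validation side: the accumulated trace buffer size is compared against
// trace_region_size itself, i.e. the value is read as a per-device total.
constexpr bool passes_trace_check(uint64_t trace_buffers_size, uint64_t trace_region_size) {
    return trace_buffers_size <= trace_region_size;
}
```

Under these assumptions, `reserved_on_device(100 * kMB, 12)` is 1200 MB, while `passes_trace_check(150 * kMB, 100 * kMB)` is false: the device reserves twelve times more trace memory than the check allows traces to use.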

What's changed

  1. void AllocatorImpl::init_one_bank_per_channel()
    Previously, each bank was allocated a full trace_region_size. This has been updated to distribute the size across banks, allocating trace_region_size / num_banks per bank.

  2. void MeshTrace::populate_mesh_buffer
    Fixed the alignment of trace_region_size. In an interleaved memory format, the size must be aligned with page_size * num_banks, whereas it was previously only aligned with page_size.

  3. AllocatorConfig L1BankingAllocator::generate_config()
    When the user set a large, unaligned value for trace_region_size, it could cause a hang. To prevent this, trace_region_size is now aligned to DRAM_alignment * num_banks here.
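The three changes above can be sketched roughly as follows. This is not the code from the PR: `round_up_to`, `align_trace_region_size`, `trace_region_size_per_bank`, and `padded_trace_size` are hypothetical helpers, and rounding up (rather than down) in the alignment step is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for a round-up helper: smallest multiple of `multiple`
// that is >= `value`. `multiple` must be non-zero.
constexpr uint64_t round_up_to(uint64_t value, uint64_t multiple) {
    return ((value + multiple - 1) / multiple) * multiple;
}

// Change 3 (assumed to round up): align the requested trace_region_size
// to dram_alignment * num_banks so it divides evenly across banks.
constexpr uint64_t align_trace_region_size(
    uint64_t requested, uint64_t dram_alignment, uint64_t num_banks) {
    return round_up_to(requested, dram_alignment * num_banks);
}

// Change 1: each bank receives an equal share of the per-device total.
inline uint64_t trace_region_size_per_bank(uint64_t trace_region_size, uint64_t num_banks) {
    assert(num_banks > 0 && trace_region_size % num_banks == 0);
    return trace_region_size / num_banks;
}

// Change 2: an interleaved buffer occupies whole pages on every bank, so
// the padded size is a multiple of page_size * num_banks.
constexpr uint64_t padded_trace_size(
    uint64_t unpadded_size, uint64_t page_size, uint64_t num_banks) {
    return round_up_to(unpadded_size, page_size * num_banks);
}
```

For example, with a hypothetical 64-byte DRAM alignment and 12 banks, a requested size of 1000 bytes becomes 1536 bytes, which splits into 128 bytes per bank.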

Checklist

  • All post-commit tests
  • Blackhole Post commit
  • cpp-unit-tests
  • New/Existing tests provide coverage for changes

Model tests

If your changes cover model-related code, you should run tests corresponding to affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers models-mandatory and models-extended presets.
The former includes a minimal set of tests, to be run always. The latter extends that with additional ones - use your best judgement in deciding which is the most appropriate for your PR.

@hschoi4448 hschoi4448 self-assigned this Jan 5, 2026
Copilot AI review requested due to automatic review settings January 5, 2026 14:59
@hschoi4448 hschoi4448 added the bug Something isn't working label Jan 5, 2026

Copilot AI left a comment

Pull request overview

This PR fixes an inconsistency in how trace_region_size is interpreted across the codebase. Previously, it was ambiguously treated as both a per-DRAM-bank size and a per-device size, leading to incorrect memory allocation and validation logic errors.

Key Changes:

  • Standardized trace_region_size to represent total size per device (not per bank)
  • Fixed per-bank allocation by dividing total size by number of DRAM channels
  • Corrected padding alignment for interleaved buffers to account for all banks

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tt_metal/impl/allocator/l1_banking_allocator.cpp | Adds alignment of trace_region_size to DRAM_alignment * num_banks to prevent hangs with large unaligned values |
| tt_metal/impl/allocator/allocator.cpp | Divides trace_region_size by num_dram_channels to allocate the correct per-bank size, fixing the per-device vs per-bank semantic inconsistency |
| tt_metal/distributed/mesh_trace.cpp | Fixes padding calculation to align to page_size * num_banks instead of just page_size, matching the interleaved buffer layout |

@@ -34,7 +34,8 @@ void AllocatorImpl::validate_bank_assignments() const {

void AllocatorImpl::init_one_bank_per_channel() {
// DRAM bank is between unreserved start and trace_region start: UNRESERVED | DRAM BANK | TRACE REGION
DeviceAddr dram_bank_size = config_->dram_bank_size - config_->dram_unreserved_base - config_->trace_region_size;
auto trace_region_size_per_bank = config_->trace_region_size / config_->num_dram_channels;

We should assert that config_->trace_region_size % config_->num_dram_channels == 0.

    TT_FATAL(
        config_->trace_region_size % config_->num_dram_channels == 0,
        "config_->trace_region_size {} should be multiple of config_->num_dram_channels {}",
        config_->trace_region_size,
        config_->num_dram_channels);

I added the assert.
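As a rough illustration of why the check matters: integer division truncates, so an indivisible trace_region_size would leave a few bytes that no bank receives. `unassigned_bytes` below is a hypothetical helper for illustration only, not part of tt-metal.

```cpp
#include <cstdint>

// Bytes of the per-device total that are lost when the total is split
// across channels with truncating integer division.
// `num_dram_channels` must be non-zero.
constexpr uint64_t unassigned_bytes(uint64_t trace_region_size, uint64_t num_dram_channels) {
    const uint64_t per_bank = trace_region_size / num_dram_channels;
    return trace_region_size - per_bank * num_dram_channels;
}
```

With 12 channels, a total of 100 bytes leaves 4 bytes unassigned, while a total of 1200 bytes leaves none.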

@abhullar-tt abhullar-tt left a comment

lgtm, please run the pipelines before merging

@hschoi4448

TT has many pipelines. Could you let me know which one I should run? @abhullar-tt

@hschoi4448 hschoi4448 force-pushed the hschoi/fix_trace_region_size branch from 4c00e2a to aaff07e Compare January 9, 2026 02:12
@abhullar-tt

In addition to all post-commit, I think the Fast dispatch unit tests + model unit tests from https://github.com/tenstorrent/tt-metal/actions/workflows/tt-metal-l2-nightly.yaml?query=branch:hschoi/fix_trace_region_size would be good to run.

@skhorasganiTT do you know of specific model pipelines that specify a trace region?

@skhorasganiTT

@abhullar-tt @hschoi4448 Most model demos specify trace_region_size, so it would be good to run the single card demos, t3k demos, and galaxy demos.

@jbaumanTT

/codeowners ping

@github-actions bot commented Jan 13, 2026

CodeOwners Group Analysis

This PR requires approval from one member of each of the following groups:

Summary: 2 pending groups, 3 approved groups

Group Information:




  • tt_metal/distributed/ (Group) - Members: Aditya Saigal, Allan Liu, John Bauman, Joseph Chu, Nigel Huang | Approved by: John Bauman

    📁 Files owned by this group (1 files)

Note: At least one approval from each group is sufficient.

@github-actions

Hi Ashai Reddy Ginuga [@arginugaTT], Joseph Chu [@cfjchu], Dalar Vartanians [@dvartaniansTT], Evan Smal [@esmalTT], Aditya Saigal [@tt-asaigal], Utku Aydonat [@uaydonat], this PR Fix trace_region_size by Choi HyungSuk(최형석) [@hschoi4448] needs your approval/review to merge this.

@jbaumanTT

I recently bumped UNET trace region sizes in another patch. You'll need to rebase your patch, and may need to bump trace region sizes again.

@hschoi4448 hschoi4448 force-pushed the hschoi/fix_trace_region_size branch from e59be70 to e46214e Compare January 14, 2026 07:16
@ihnpark551

Hi @jbaumanTT, we rebased and made the related changes you recommended. Could the TT team review this PR so it can be merged? Thanks! cc @zzigler-tt

UNET_FULL_MODEL_PCC_BH = 0.99780

-UNET_TRACE_REGION_SIZE = 768 * 1024
+UNET_TRACE_REGION_SIZE = 540672

Please don't reduce the size here.


I reverted the change.

@jbaumanTT

/codeowners ping

@github-actions

🔄 CodeOwners Summary Updated

CodeOwners summary updated here

💡 Tip: Use /codeowners new to post a fresh summary comment instead of updating the existing one.

@github-actions

Hi Joseph Chu [@cfjchu], Dalar Vartanians [@dvartaniansTT], Evan Smal [@esmalTT], Mohamed Bahnas [@mbahnasTT], Sudhanshu Singhal [@ssinghalTT], Aditya Saigal [@tt-asaigal], Utku Aydonat [@uaydonat], this PR Fix trace_region_size by Choi HyungSuk(최형석) [@hschoi4448] needs your approval/review to merge this.

@ihnpark551

Hi TT team, could you review this PR?
If you have any comments, please let us know.
cc. @jbaumanTT @zzigler-tt

@hschoi4448

I can't run the "t3k demos" and "galaxy demos" pipelines because they have been broken for a few weeks.

t3k demos: https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-demo-tests.yaml
galaxy demos: https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-demo-tests.yaml

Ideally, we’d wait for these two tests to be fixed, but I’ll likely be leaving the company before that happens. Is there any other way to get this PR merged sooner?

@skhorasganiTT

@skhorasganiTT commented Jan 20, 2026

@hschoi4448 You don't need to wait for the pipelines to be fully green to run them or to merge this PR. You just need to check that there are no new failing jobs/tests.

@hschoi4448 hschoi4448 force-pushed the hschoi/fix_trace_region_size branch from c5a2fe1 to d93d70f Compare January 23, 2026 02:11
@hschoi4448

I ran the t3k demo and galaxy demo pipelines.
There are no new failing tests.

t3k demo

@skhorasganiTT cc: @ihnpark551

@csehydrogen csehydrogen force-pushed the hschoi/fix_trace_region_size branch from f74d74d to 962460f Compare February 2, 2026 06:53
@csehydrogen csehydrogen requested a review from a team as a code owner February 2, 2026 06:53
@csehydrogen csehydrogen enabled auto-merge February 3, 2026 03:36
@csehydrogen

/codeowners ping

@tenstorrent-github-bot

🔄 CodeOwners Summary Updated

CodeOwners summary updated here

💡 Tip: Use /codeowners new to post a fresh summary comment instead of updating the existing one.

@tenstorrent-github-bot

Hi Ashai Reddy Ginuga (@arginugaTT), Dalar Vartanians (@dvartaniansTT), this PR Fix trace_region_size by Choi HyungSuk(최형석) (@hschoi4448) needs your approval/review to merge this.

@ihnpark551

Hi team, it looks like only a few simple reviews remain before this PR can be merged.
Could you take a look at this PR?
cc. @jbaumanTT @zzigler-tt

@zzigler-tt

Hi @tenstorrent/cse-developer-ttnn @cfjchu @tt-asaigal @aliuTT @dvartaniansTT, can one of you please review and provide comments or approve? Thanks.
