Fix `trace_region_size` by hschoi4448 · Pull Request #35271 · tenstorrent/tt-metal

hschoi4448 · 2026-01-05T14:59:55Z

Ticket

Link to Github Issue
#35259

Problem description

Currently, when trace_region_size is set to 100 MB, the init_one_bank_per_channel function allocates 100 MB of memory per DRAM bank. Consequently, a total of 1200 MB is allocated on the device, implying that trace_region_size is treated as the size per bank.

However, the implementation of the populate_mesh_buffer function, which checks for sufficient trace_region_size, treats it as the total memory size per device. This inconsistency leads to logic errors during memory validation.

void MeshTrace::populate_mesh_buffer(MeshCommandQueue& mesh_cq, std::shared_ptr<MeshTraceBuffer>& trace_buffer) {
    uint64_t unpadded_size = trace_buffer->desc->total_trace_size;
    auto num_banks = mesh_cq.device()->allocator()->get_num_banks(BufferType::DRAM);
    size_t page_size = trace_dispatch::compute_interleaved_trace_buf_page_size(unpadded_size, num_banks);
    size_t padded_size = round_up(unpadded_size, page_size * num_banks);

    const auto current_trace_buffers_size = mesh_cq.device()->get_trace_buffers_size();
    mesh_cq.device()->set_trace_buffers_size(current_trace_buffers_size + padded_size);
    auto trace_region_size = mesh_cq.device()->allocator()->get_config().trace_region_size;
    TT_FATAL(
        mesh_cq.device()->get_trace_buffers_size() <= trace_region_size,
        "Creating trace buffers of size {}B on MeshDevice {}, but only {}B is allocated for trace region.",
        mesh_cq.device()->get_trace_buffers_size(),
        mesh_cq.device()->id(),
        trace_region_size);

std::size_t compute_interleaved_trace_buf_page_size(uint32_t buf_size, const uint32_t num_banks) {

In the compute_interleaved_trace_buf_page_size function, the buf_size parameter represents the buffer size per device.

The definition of trace_region_size should be unified across both functions, as either a per-DRAM bank or a per-device value.
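
To make the mismatch concrete, the sketch below contrasts the two readings of trace_region_size using the 100 MB example above. It is an illustration only: the function names are hypothetical and not part of tt-metal, the bank count of 12 is inferred from the 1200 MB total, and MB is treated as MiB for the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; these are not tt-metal APIs.

constexpr std::uint64_t MiB = 1024ull * 1024ull;

// Per-bank reading (what the allocator did before this PR): every DRAM bank
// reserves the full trace_region_size, so the device total scales with banks.
constexpr std::uint64_t device_total_if_per_bank(std::uint64_t trace_region_size, std::uint64_t num_banks) {
    return trace_region_size * num_banks;
}

// Per-device reading (what the size check in populate_mesh_buffer assumes):
// trace_region_size already is the device total.
constexpr std::uint64_t device_total_if_per_device(std::uint64_t trace_region_size) {
    return trace_region_size;
}

// Bytes reserved by the per-bank allocation that the per-device check never sees.
inline std::uint64_t unaccounted_bytes(std::uint64_t trace_region_size, std::uint64_t num_banks) {
    assert(num_banks > 0);
    return device_total_if_per_bank(trace_region_size, num_banks) -
           device_total_if_per_device(trace_region_size);
}
```

With trace_region_size = 100 * MiB and 12 banks, the per-bank reading reserves 1200 MiB while the validation only accounts for 100 MiB.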

What's changed

void AllocatorImpl::init_one_bank_per_channel()
Previously, each bank was allocated a full trace_region_size. This has been updated to distribute the size across banks, allocating trace_region_size / num_banks per bank.
void MeshTrace::populate_mesh_buffer
Fixed the alignment of trace_region_size. In an interleaved memory format, the size must be aligned to page_size * num_banks, whereas it was previously aligned only to page_size.
AllocatorConfig L1BankingAllocator::generate_config()
When a user set a large, unaligned value for trace_region_size, it could cause a hang. To ensure stability, trace_region_size is now aligned to DRAM_alignment * num_banks here.
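
The alignment steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tt-metal implementation: align_up, padded_trace_size, and aligned_trace_region_size are hypothetical names, and the sketch rounds values up, whereas this description does not say whether the PR rounds trace_region_size up or only validates it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round `value` up to the next multiple of `multiple`.
inline std::uint64_t align_up(std::uint64_t value, std::uint64_t multiple) {
    assert(multiple != 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// Interleaved layout: pages are striped across all banks, so padding to a
// multiple of page_size * num_banks gives every bank the same number of pages.
inline std::uint64_t padded_trace_size(
    std::uint64_t unpadded_size, std::uint64_t page_size, std::uint64_t num_banks) {
    return align_up(unpadded_size, page_size * num_banks);
}

// Aligning the region to dram_alignment * num_banks means it divides evenly
// across banks and each per-bank share is itself a multiple of dram_alignment.
inline std::uint64_t aligned_trace_region_size(
    std::uint64_t requested, std::uint64_t dram_alignment, std::uint64_t num_banks) {
    return align_up(requested, dram_alignment * num_banks);
}
```

For example, with an assumed DRAM alignment of 64 bytes and 12 banks, a request of 1000000 bytes becomes 1000704 bytes, which is 83392 bytes per bank.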

Checklist

New/Existing tests provide coverage for changes

Model tests

If your changes cover model-related code, you should run tests corresponding to affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers models-mandatory and models-extended presets.
The former includes a minimal set of tests, to be run always. The latter extends that with additional ones - use your best judgement in deciding which is the most appropriate for your PR.

Copilot

Pull request overview

This PR fixes an inconsistency in how trace_region_size is interpreted across the codebase. Previously, it was ambiguously treated as both a per-DRAM-bank size and a per-device size, leading to incorrect memory allocation and validation logic errors.

Key Changes:

Standardized trace_region_size to represent total size per device (not per bank)
Fixed per-bank allocation by dividing total size by number of DRAM channels
Corrected padding alignment for interleaved buffers to account for all banks

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tt_metal/impl/allocator/l1_banking_allocator.cpp | Adds alignment of `trace_region_size` to `DRAM_alignment * num_banks` to prevent hangs with large unaligned values |
| tt_metal/impl/allocator/allocator.cpp | Divides `trace_region_size` by `num_dram_channels` to allocate the correct per-bank size, fixing the per-device vs per-bank semantic inconsistency |
| tt_metal/distributed/mesh_trace.cpp | Fixes padding calculation to align to `page_size * num_banks` instead of just `page_size`, matching the interleaved buffer layout |

jbaumanTT · 2026-01-08T19:41:42Z

tt_metal/impl/allocator/allocator.cpp

@@ -34,7 +34,8 @@ void AllocatorImpl::validate_bank_assignments() const {

 void AllocatorImpl::init_one_bank_per_channel() {
    // DRAM bank is between unreserved start and trace_region start: UNRESERVED | DRAM BANK | TRACE REGION
-    DeviceAddr dram_bank_size = config_->dram_bank_size - config_->dram_unreserved_base - config_->trace_region_size;
+    auto trace_region_size_per_bank = config_->trace_region_size / config_->num_dram_channels;


We should assert that config_->trace_region_size % config_->num_dram_channels == 0.

TT_FATAL(
    config_->trace_region_size % config_->num_dram_channels == 0,
    "config_->trace_region_size {} should be multiple of config_->num_dram_channels {}",
    config_->trace_region_size,
    config_->num_dram_channels);

I added the assert (commit "add assert").
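
The check suggested in this thread can be sketched outside of tt-metal as below. TT_FATAL is replaced with a standard exception so the sketch has no tt-metal dependency, and split_trace_region is a hypothetical name used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the per-bank split in init_one_bank_per_channel.
// Throws std::invalid_argument where the real code would call TT_FATAL.
inline std::uint64_t split_trace_region(std::uint64_t trace_region_size, std::uint64_t num_dram_channels) {
    assert(num_dram_channels != 0);
    if (trace_region_size % num_dram_channels != 0) {
        throw std::invalid_argument(
            "trace_region_size " + std::to_string(trace_region_size) +
            " should be a multiple of num_dram_channels " + std::to_string(num_dram_channels));
    }
    // Safe to divide: no remainder is silently dropped.
    return trace_region_size / num_dram_channels;
}
```

Without the check, integer division would quietly truncate, and the per-bank regions would add up to less than the size the validation assumes.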

abhullar-tt

lgtm, please run the pipelines before merging

hschoi4448 · 2026-01-09T01:33:39Z

TT has many pipelines. Could you let me know which one I should run? @abhullar-tt

abhullar-tt · 2026-01-09T18:31:28Z

In addition to all post-commit, I think the Fast dispatch unit tests + model unit tests from https://github.com/tenstorrent/tt-metal/actions/workflows/tt-metal-l2-nightly.yaml?query=branch:hschoi/fix_trace_region_size would be good to run.

@skhorasganiTT do you know of specific model pipelines that specify a trace region?

skhorasganiTT · 2026-01-09T19:28:18Z

@abhullar-tt @hschoi4448 most model demos specify trace_region_size, so it would be good to run single card demos, t3k demos, and galaxy demos.

jbaumanTT · 2026-01-13T02:36:55Z

/codeowners ping

github-actions · 2026-01-13T02:38:20Z

CodeOwners Group Analysis

This PR requires approval from one member of each of the following groups:

Summary: 2 pending groups, 3 approved groups

Group Information:

⏳ tenstorrent/cse-developer-ttnn (Team) - Members: Mohamed Bahnas, Dalar Vartanians, Ashai Reddy Ginuga, Sudhanshu Singhal | Pending approval
📁 Files owned by this team (1 files)
- models/demos/vision/segmentation/vanilla_unet/tt/common.py

⏳ models/demos/vision/segmentation/vanilla_unet (Group) - Members: Dalar Vartanians | Pending approval
📁 Files owned by this group (1 files)
- models/demos/vision/segmentation/vanilla_unet/tt/common.py

✅ tenstorrent/metalium-developers-metal-distributed (Team) - Members: Austin Ho, Brian Liu, Joseph Chu, Aditya Saigal, Allan Liu | Approved by: Austin Ho
📁 Files owned by this team (1 files)
- tests/tt_metal/distributed/test_end_to_end_eltwise.cpp

✅ tt_metal/distributed/ (Group) - Members: Aditya Saigal, Allan Liu, John Bauman, Joseph Chu, Nigel Huang | Approved by: John Bauman
📁 Files owned by this group (1 files)
- tt_metal/distributed/mesh_trace.cpp

✅ tt_metal/impl/allocator/ (Group) - Members: Almeet Bhullar, Austin Ho | Approved by: Almeet Bhullar, Austin Ho
📁 Files owned by this group (2 files)
- tt_metal/impl/allocator/allocator.cpp
- tt_metal/impl/allocator/l1_banking_allocator.cpp

Note: At least one approval from each group is sufficient.

github-actions · 2026-01-13T02:38:28Z

Hi Ashai Reddy Ginuga [@arginugaTT], Joseph Chu [@cfjchu], Dalar Vartanians [@dvartaniansTT], Evan Smal [@esmalTT], Aditya Saigal [@tt-asaigal], Utku Aydonat [@uaydonat], this PR Fix trace_region_size by Choi HyungSuk(최형석) [@hschoi4448] needs your approval/review to merge this.

jbaumanTT · 2026-01-13T19:50:22Z

I recently bumped UNET trace region sizes in another patch. You'll need to rebase your patch, and may need to bump trace region sizes again.

ihnpark551 · 2026-01-16T07:38:00Z

Hi @jbaumanTT, we rebased and did the related work you recommended. Could the TT team review this PR so it can be merged? Thanks! cc. @zzigler-tt

jbaumanTT · 2026-01-16T17:10:46Z

models/experimental/functional_unet/tests/common.py

 UNET_FULL_MODEL_PCC_BH = 0.99780

-UNET_TRACE_REGION_SIZE = 768 * 1024
+UNET_TRACE_REGION_SIZE = 540672


Please don't reduce the size here.

I reverted the change.

jbaumanTT · 2026-01-16T17:11:12Z

/codeowners ping

github-actions · 2026-01-16T17:12:46Z

🔄 CodeOwners Summary Updated

✅ CodeOwners summary updated here

💡 Tip: Use /codeowners new to post a fresh summary comment instead of updating the existing one.

github-actions · 2026-01-16T17:12:54Z

Hi Joseph Chu [@cfjchu], Dalar Vartanians [@dvartaniansTT], Evan Smal [@esmalTT], Mohamed Bahnas [@mbahnasTT], Sudhanshu Singhal [@ssinghalTT], Aditya Saigal [@tt-asaigal], Utku Aydonat [@uaydonat], this PR Fix trace_region_size by Choi HyungSuk(최형석) [@hschoi4448] needs your approval/review to merge this.

ihnpark551 · 2026-01-20T05:12:51Z

Hi TT team, can you review this PR?
If you have any comments, please let us know.
cc. @jbaumanTT @zzigler-tt

hschoi4448 · 2026-01-20T05:54:26Z

I can't run the "t3k demos" and "galaxy demos" pipelines because they have been broken for a few weeks.

t3k demos: https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-demo-tests.yaml
galaxy demos: https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-demo-tests.yaml

Ideally, we’d wait for these two tests to be fixed, but I’ll likely be leaving the company before that happens. Is there any other way to get this PR merged sooner?

@skhorasganiTT

skhorasganiTT · 2026-01-20T16:24:04Z

@hschoi4448 You don't need to wait for the pipelines to be fully green to run them or to merge this PR. You just need to check that there are no new failing jobs/tests.

hschoi4448 · 2026-01-23T06:51:17Z

I ran the t3k demo and galaxy demo pipelines.
There are no new failing tests.

t3k demo
main: https://github.com/tenstorrent/tt-metal/actions/runs/21269822192
this pr: https://github.com/tenstorrent/tt-metal/actions/runs/21272218938

galaxy demo
main: https://github.com/tenstorrent/tt-metal/actions/runs/21275477949
this pr: https://github.com/tenstorrent/tt-metal/actions/runs/21272222901

@skhorasganiTT cc: @ihnpark551

csehydrogen · 2026-02-02T06:58:02Z

Rebased to resolve conflicts.
Galaxy tests: https://github.com/tenstorrent/tt-metal/actions/runs/21580374213
T3K tests: https://github.com/tenstorrent/tt-metal/actions/runs/21580376390

csehydrogen · 2026-02-11T05:23:42Z

/codeowners ping

tenstorrent-github-bot · 2026-02-11T05:24:26Z

🔄 CodeOwners Summary Updated

✅ CodeOwners summary updated here

💡 Tip: Use /codeowners new to post a fresh summary comment instead of updating the existing one.

tenstorrent-github-bot · 2026-02-11T05:24:32Z

Hi Ashai Reddy Ginuga (@arginugaTT), Dalar Vartanians (@dvartaniansTT), this PR Fix trace_region_size by Choi HyungSuk(최형석) (@hschoi4448) needs your approval/review to merge this.

ihnpark551 · 2026-02-13T04:16:05Z

Hi Team, it looks like only a few simple reviews are left before this PR can be merged.
Can you check this PR?
cc. @jbaumanTT @zzigler-tt

zzigler-tt · 2026-02-13T05:02:17Z

Hi @tenstorrent/cse-developer-ttnn @cfjchu @tt-asaigal @aliuTT @dvartaniansTT, can one of you please review and provide comments or approve? Thanks

hschoi4448 self-assigned this Jan 5, 2026

Copilot AI review requested due to automatic review settings January 5, 2026 14:59

hschoi4448 added the bug Something isn't working label Jan 5, 2026

hschoi4448 requested review from a team, abhullar-tt, aliuTT, cfjchu, jbaumanTT, nhuang-tt, tt-aho and tt-asaigal as code owners January 5, 2026 14:59

github-project-automation bot added this to External Requests and Reports Jan 5, 2026

github-project-automation bot moved this to 🆕 New in External Requests and Reports Jan 5, 2026

Copilot started reviewing on behalf of hschoi4448 January 5, 2026 15:00

Copilot AI reviewed Jan 5, 2026

jbaumanTT approved these changes Jan 8, 2026

abhullar-tt approved these changes Jan 9, 2026

hschoi4448 force-pushed the hschoi/fix_trace_region_size branch from 4c00e2a to aaff07e Compare January 9, 2026 02:12

hschoi4448 requested review from a team, dvartaniansTT, esmalTT and uaydonat as code owners January 9, 2026 03:03

hschoi4448 force-pushed the hschoi/fix_trace_region_size branch from e59be70 to e46214e Compare January 14, 2026 07:16

jbaumanTT reviewed Jan 16, 2026

hschoi4448 force-pushed the hschoi/fix_trace_region_size branch from c5a2fe1 to d93d70f Compare January 23, 2026 02:11

hschoi4448 added 10 commits February 2, 2026 15:51

fix trace_region_size (7257934)
fix dram_bank_size (b65c073)
fix trace_region_size alignemnt (012dc4e)
fix lint (416306f)
change var name (c983dbb)
add assert (c133357)
fix trace_region_size (a38c517)
fix trace_region_size (ecd8970)
resolve conflict (7bc134e)
revert unet trace region size (962460f)

csehydrogen force-pushed the hschoi/fix_trace_region_size branch from f74d74d to 962460f Compare February 2, 2026 06:53

csehydrogen requested a review from a team as a code owner February 2, 2026 06:53

tt-aho approved these changes Feb 2, 2026

csehydrogen enabled auto-merge February 3, 2026 03:36
