Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes an inconsistency in how trace_region_size is interpreted across the codebase. Previously, it was ambiguously treated as both a per-DRAM-bank size and a per-device size, leading to incorrect memory allocation and validation logic errors.
Key Changes:
- Standardized
trace_region_sizeto represent total size per device (not per bank) - Fixed per-bank allocation by dividing total size by number of DRAM channels
- Corrected padding alignment for interleaved buffers to account for all banks
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tt_metal/impl/allocator/l1_banking_allocator.cpp | Adds alignment of trace_region_size to DRAM_alignment * num_banks to prevent hangs with large unaligned values |
| tt_metal/impl/allocator/allocator.cpp | Divides trace_region_size by num_dram_channels to allocate the correct per-bank size, fixing the per-device vs per-bank semantic inconsistency |
| tt_metal/distributed/mesh_trace.cpp | Fixes padding calculation to align to page_size * num_banks instead of just page_size, matching the interleaved buffer layout |
| @@ -34,7 +34,8 @@ void AllocatorImpl::validate_bank_assignments() const { | |||
|
|
|||
| void AllocatorImpl::init_one_bank_per_channel() { | |||
| // DRAM bank is between unreserved start and trace_region start: UNRESERVED | DRAM BANK | TRACE REGION | |||
| DeviceAddr dram_bank_size = config_->dram_bank_size - config_->dram_unreserved_base - config_->trace_region_size; | |||
| auto trace_region_size_per_bank = config_->trace_region_size / config_->num_dram_channels; | |||
There was a problem hiding this comment.
We should assert that config_->trace_region_size % config_->num_dram_channels == 0.
There was a problem hiding this comment.
TT_FATAL(
config_->trace_region_size % config_->num_dram_channels == 0,
"config_->trace_region_size {} should be multiple of config_->num_dram_channels {}",
config_->trace_region_size,
config_->num_dram_channels);I added assert
add assert
abhullar-tt
left a comment
There was a problem hiding this comment.
lgtm, please run the pipelines before merging
TT has many pipelines. Could you let me know which one I should run? @abhullar-tt |
4c00e2a to
aaff07e
Compare
In addition to all post commit. I think the Fast dispatch unit tests + model unit tests from https://github.com/tenstorrent/tt-metal/actions/workflows/tt-metal-l2-nightly.yaml?query=branch:hschoi/fix_trace_region_size would be good to run. @skhorasganiTT do you know of specific model pipelines that specify a trace region? |
@abhullar-tt @hschoi4448 most model demos specify trace_region_size so it would be good to run single card demos, t3k demos, galaxy demos |
|
/codeowners ping |
CodeOwners Group AnalysisThis PR requires approval from one member of each of the following groups: Summary: 2 pending groups, 3 approved groups Group Information:
Note: At least one approval from each group is sufficient. |
|
Hi Ashai Reddy Ginuga [@arginugaTT], Joseph Chu [@cfjchu], Dalar Vartanians [@dvartaniansTT], Evan Smal [@esmalTT], Aditya Saigal [@tt-asaigal], Utku Aydonat [@uaydonat], this PR Fix |
|
I recently bumped UNET trace region sizes in another patch. You'll need to rebase your patch, and may need to bump trace region sizes again. |
e59be70 to
e46214e
Compare
|
Hi @jbaumanTT , we did rebase and related works you recommended. Could TT team can review this ticket to be merged? Thanks! cc. @zzigler-tt |
| UNET_FULL_MODEL_PCC_BH = 0.99780 | ||
|
|
||
| UNET_TRACE_REGION_SIZE = 768 * 1024 | ||
| UNET_TRACE_REGION_SIZE = 540672 |
There was a problem hiding this comment.
Please don't reduce the size here.
There was a problem hiding this comment.
I revert change.
|
/codeowners ping |
🔄 CodeOwners Summary Updated✅ CodeOwners summary updated here 💡 Tip: Use |
|
Hi Joseph Chu [@cfjchu], Dalar Vartanians [@dvartaniansTT], Evan Smal [@esmalTT], Mohamed Bahnas [@mbahnasTT], Sudhanshu Singhal [@ssinghalTT], Aditya Saigal [@tt-asaigal], Utku Aydonat [@uaydonat], this PR Fix |
|
Hi TT team, can you review for this PR? |
I can't run "t3k demos", "galaxy demos" because they are broken from few weeks ago. t3k demos: https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-demo-tests.yaml Ideally, we’d wait for these two tests to be fixed, but I’ll likely be leaving the company before that happens. Is there any other way to get this PR merged sooner? |
@hschoi4448 You don't need to wait for the pipelines to be fully green to run them or to merge this PR. Just need to check that there are no new failing jobs/tests. |
c5a2fe1 to
d93d70f
Compare
I runned t3k demo and galaxy demo. t3k demo
|
f74d74d to
962460f
Compare
|
Rebased to resolve conflicts. |
|
/codeowners ping |
🔄 CodeOwners Summary Updated✅ CodeOwners summary updated here 💡 Tip: Use |
|
Hi Ashai Reddy Ginuga (@arginugaTT), Dalar Vartanians (@dvartaniansTT), this PR Fix |
|
Hi Team, it looks like some simple reviews are left before this PR being merged. |
|
Hi, @tenstorrent/cse-developer-ttnn @cfjchu @tt-asaigal @aliuTT @dvartaniansTT can one of you please review and provide comments / approve. Thanks |
Ticket
Link to Github Issue
#35259
Problem description
Provide context for the problem.
Currently, when
trace_region_sizeis set to 100 MB, theinit_one_bank_per_channelfunction allocates 100 MB of memory per DRAM bank. Consequently, a total of 1200 MB is allocated on the device, implying thattrace_region_sizeis treated as the sizeper bank.However, the implementation of the
populate_mesh_bufferfunction, which checks for sufficienttrace_region_size, treats it as the total memory sizeper device. This inconsistency leads to logic errors during memory validation.In the
compute_interleaved_trace_buf_page_sizefunction, the buf_size parameter represents the buffer size per deviceThe definition of
trace_region_sizeshould be unified across both functions, as either aper-DRAM bankor aper-devicevalue.What's changed
Describe the approach used to solve the problem.
void AllocatorImpl::init_one_bank_per_channel(),Previously, each bank was allocated a full
trace_region_size. This has been updated to distribute the size across banks, allocating trace_region_size / num_banks per bank.void MeshTrace::populate_mesh_bufferFixed the alignment of
trace_region_size. In an interleaved memory format, the size must be aligned withpage_size * num_banks, whereas it was previously only aligned withpage_size.AllocatorConfig L1BankingAllocator::generate_config()When user set large and not aligned valud for
trace_region_size, it cause hang.So I put alignment with
DRAM * num_bankshere.Addressed a potential system hang caused by large, unaligned
trace_region_sizevalues. I have implemented an alignment requirement of DRAM_alignment * num_banks to ensure stability.Checklist
Model tests
If your changes cover model-related code, you should run tests corresponding to affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers
models-mandatoryandmodels-extendedpresets.The former includes a minimal set of tests, to be run always. The latter extends that with additional ones - use your best judgement in deciding which is the most appropriate for your PR.
models-mandatorypreset (runs:Device perf regressions and
Frequent model and ttnn tests)
Devcie perf regressions : https://github.com/tenstorrent/tt-metal/actions/runs/20719760081
models-extendedpreset (runs:the mandatory tests, plus Demo
and Model perf tests)
demo: https://github.com/tenstorrent/tt-metal/actions/runs/20840256203
model perf: https://github.com/tenstorrent/tt-metal/actions/runs/20839766567
models-mandatorypreset (runs:Unit tests)
models-extendedpreset(runs:
the mandatory tests, plus Demo and
Model perf tests)
models-mandatorypreset (runs:Quick tests)
models-extendedpreset(runs:
the mandatory tests, plus Demo and
Model perf tests)