Releases: tenstorrent/tt-metal
v0.66.0-dev20260116
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not the versions on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21051111065
📦 Uncategorized
- Rename ops params and input structs
- PR: #35857
- fix(ttml): resolve nanobind duplicate type registration errors
- PR: #35423
- [skip ci] Add download-artifact-with-retry action to fix corrupted .deb downloads
- PR: #35853
- Add Qwen Image CI tests
- PR: #35800
- Adding pi0 model to TTNN
- PR: #35833
- Add support for other dtypes and L1 for multicore pad OP
- PR: #35869
- Revert changes to create_arange_vector_of_bfloat16
- PR: #35715
- Add support of choosing position_ids in testing MLA
- PR: #35789
- #35313 fix sdpa with attn sinks
- PR: #35817
- [UPSAMPLE] Add floating point scale factor support to TTNN upsample
- PR: #35508
- Move uv to base stage
- PR: #35896
- [TT-Transformers] Enable fused rotary and paged cache update ops in attention module
- PR: #35111
- Fix Wan postprocess spatial output
- PR: #35871
- Check for disallowed params combination in chunked SDPA
- PR: #35811
- Use distributed LN in TT-DiT models
- PR: #35831
- Add support for paged KV cache and chunked prefill to ring distributed sdpa
- PR: #35742
- migrate to HF cross attention vision transformer of mLlama
- PR: #35750
- Add checks for cgroup memory since Docker uses namespaces to limit things
- PR: #35450
- Allow docs deployment to be from main
- PR: #35910
- #0: Add actual device perf check in ops post commit
- PR: #35473
- Add user configurable max packet size to fabric
- PR: #35848
- L2 nightly test failure with ttnn.where()
- PR: #35879
- [Bug fix] Altering ALU config from TRISC0
- PR: #35090
- CCL Program Cache Updates
- PR: #35400
- Data Movement Program Cache Fixes
- PR: #35429
- [Fabric] Pkt hdr updates - support for up to 4X64 mesh
- PR: #35494
- Z Router device changes
- PR: #34561
- [skip ci] Delete ttnn/api/ttnn/Untitled
- PR: #35951
- Enable MeshWorkload in ttnn.generic op
- PR: #35323
- Fix Quasar FW compilation
- PR: #35926
- Allow Logical to Physical Pinnings in MGD
- PR: #34996
v0.66.0-dev20260115
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21014818709
📦 Uncategorized
- Clean up CMakeLists.txt messages and flags
- PR: #35553
- [skip ci] Fix uv command
- PR: #35799
- #34755: Add Fused Implementation of Deepseek MOE Gate for Deepseek B1
- PR: #35759
- [skip ci] Add suite-level device fixture to reduce CI overhead
- PR: #35783
- Remove things related to old device operation
- PR: #35720
- Improve trace tracking to fix device perf
- PR: #35791
- [skip ci] Fix dead store warning in topology_mapper.cpp
- PR: #35693
- Add Wan and Flux BH LB configuration
- PR: #35360
- 34250 issue: include of test header to the production codebase
- PR: #34406
- 1136 bug: tensor to torch workflow implementation through host conversion
- PR: #35266
- Add Functional Qwen-Image on WH
- PR: #34627
- [skip CI] Fixes for t3k perf pipeline changes
- PR: #35801
- Migrate legacy tt_metal tests to gtest framework
- PR: #35680
- #33778: Add uint16 support for bitwise shift ops
- PR: #35164
- [skip ci] Fix create_venv.sh and finish uv propagation
- PR: #35804
- Add stability test suite for BH GLX 2D Torus (1D and 2D)
- PR: #35718
- Add metal api to all enqueue_read into PinnedMemory.
- PR: #28957
- [skip ci] Remove CCL sharded address generator sweep tests (infinite speedup)
- PR: #35797
- fix: Add [[maybe_unused]] to benchmark loop variables to silence clang static analyzer
- PR: #35794
- #34880: Add llk kernel for addcmul
- PR: #35221
- [skip ci] Update the description for the Eth link status check
- PR: #35796
- Add ttnn.experimental.isin to TTNN Python and C++ APIs (2nd attempt)
- PR: #29607
- Don't conditionally dispatch on individual devices during ttnn.paged_update_cache
- PR: #35656
- Config Tensors in DRAM for Pool2D
- PR: #35212
- #32879: Simple accurate softplus op
- PR: #33766
- LLK uninits for BH
- PR: #35645
- Gemma3-27b DP4 on TG added to vLLM-nightly
- PR: #35752
- [skip ci] Fix download artifacts script
- PR: #35824
- [skip ci] Fix mismatched model name in T3K unit pipeline
- PR: #35672
- Revert "Don't conditionally dispatch on individual devices during ttnn.paged_update_cache (#35656)"
- PR: #35829
- [DM]: Removing unused mesh_device parameter
- PR: #35786
- Add OWL-ViT model using TTNN APIs
- PR: #35461
- #35572: Use TensorAccessor for sharded untilize
- PR: #35686
- [Fabric] Fix ccl tests after pkt hdr updates
- PR: #35559
- Fix models_common_unit_tests in t3000 e2e tests CI
- PR: #35803
- Update op perf report reading to support new op type format
- PR: #35821
- Fix wormhole llk_uninit missing default values error
- PR: #35822
- Fix variable shadowing and improve error handling in pad RM multi-core
- PR: #35700
- [skip ci] auto-generate owners from pipeline reorg
- PR: #35777
- Restore test_clean_init as standalone executable
- PR: #35835
- 2erisc coordinated retrain on BH
- PR: #35666
- [tt-train] SDPA Backward Pass operation
- PR: #29259
- Avoid including dataflow_api.h in firmware builds.
- PR: #35345
- [skip ci] Search multiple pip indexes
- PR: #35840
- Update blackhole golden dispatch file
- PR: #35782
- SDXL Img2img accuracy
- PR: #35737
- Remove redundant return-type usings from device ops
- PR: #35808
- #32998: Use bcast scalar with dest reuse for RMSNorm
- PR: #35843
- Cache step independent computations in Wan2.2 pipeline
- PR: #34237
- Capture src/dst addr and useful NoC counters in NoC Debug Packets
- PR: #35682
- Remove dead store for num_cores in embeddings_fused_program_factory
- PR: #35703
- Fix reading into pinned memory on tunneled devices
- PR: #35810
- SFPI 7.16.0 168
- PR: #35849
- [skip ci] make workflow yaml as template for analyzing ND failures workflow
- PR: #35860
- [skip_ci] Add CODEOWNERS entry for llk_api/llk_sfpu
- PR: #35695
- [UMD Bump] Automated UMD Bump 08.01.2026
- PR: #35440
- [TT Transformers] DRAM Prefetcher Bring up on BH with Ring MM Unit test
- PR: #35709
- [skip ci] Remove docker-job subdirectory workaround (phase 1)
- PR: #35867
- [skip ci] Move install_uv and update create_venv
- PR: #35862
v0.66.0-dev20260114
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/20977570832
📦 Uncategorized
- [tt-train] Revert test skip in NIGHTLY_UnusedParametersInModuleSGD
- PR: #35524
- Added Fabric Benchmark Upload Guards
- PR: #35677
- Fix Qwen T3k demo + perplexity tests due to missing seq len cutoff in warm up and incorrect max_seq_len
- PR: #35255
- Update Mixtral model tests to use HF as reference
- PR: #35644
- [skip ci] Fix calling of deploy docs
- PR: #35735
- Update conv2d performance targets and threshold
- PR: #35729
- #0: Add missing python comparison operator for CoreRangeSet
- PR: #35756
- Fix sliding_window SDPA program caching
- PR: #35749
- Adding stallwaits to first batch of uninits
- PR: #35287
- Fix qwen25_vl unit tests
- PR: #35754
- [skip ci] Add philei-tt & jmalone-tt to tt-train codeowners
- PR: #35764
- Fabric tests were missing from merge gate status checks
- PR: #35717
- Add 6u cyclic multiprocess tests to CI
- PR: #35472
- [skip ci] Add check-prs cursor command for PR status monitoring
- PR: #35712
- [skip ci] Add @mateusznowakTT to CODEOWNERS
- PR: #35768
- [skip ci] Switch from pip to uv pip
- PR: #35707
- Bump versions of deps that are so old pip is compiling it from scratch
- PR: #35767
- Improve triage debug messages
- PR: #35676
- Adding Warning when downgrading Mesh shape because of Connectivity
- PR: #35771
- Cluster validation updates for characterizing BH Link Health
- PR: #35714
- Replace assert()/TT_ASSERT() with reliable checks in tests
- PR: #35665
- #28087 revert the binary compute core optimization revert and more changes
- PR: #35420
- Allow multiple output tensors
- PR: #32193
- declaring rta and crta thread_local, fixing linker values
- PR: #35545
- [skip ci] Metal Profiler Tech Report Update
- PR: #35412
- Topology Solver: Adjacency Graphs and Constraints API
- PR: #35769
- Added codeowners for docs without owners
- PR: #35763
- Support two risc in UDM mode
- PR: #35327
- Add specialized Distributed Layernorm for DiT models
- PR: #35657
- #35670: create a new job to determine the runner labels for Git Dispatch workflow
- PR: #35671
- Add a script to count the number of pytest including parametrize expansions given a path
- PR: #35705
- Fix N150 profiler
- PR: #35787
- ci: Change pr-gate default build-type from ASanCoverage to ASan
- PR: #35779
v0.66.0-dev20260113
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/20939760521
📦 Uncategorized
- [Fabric] Disable 80B header for 2D
- PR: #35584
- Add override_output_sharding_config param to BlockShardedStrategyConfiguration
- PR: #35465
- [skip ci] Enhance run_conv2d_short_sweep function to accept additional params
- PR: #35640
- Deepseek module changes to ensure compatibility with higher sequence lengths
- PR: #35370
- Improving error messages across owned scripts
- PR: #35649
- Ring Attention datamovement optimization
- PR: #34929
- Improve performance of accurate exponential
- PR: #32968
- Fix dead store warnings in ternary_program_factory.cpp
- PR: #35583
- Optimize SD Profiler Reads
- PR: #35581
- [skip CI] Fixes for t3k demo pipeline changes
- PR: #35668
- Add DeepSeekV3 unit tests to T3K unit and APC pipelines
- PR: #35409
- Trigger 2x WH GLX similar to T3K multihosts
- PR: #35554
- Remove '_no_pack' Tilize Variants
- PR: #35557
- Fix parameter shadowing bug in BlockRep constructor
- PR: #35579
- Use multicast when initializing metal context
- PR: #35188
- Increased timeout for t3k integration llama3 test
- PR: #35674
- Add dst addrs to NoC async read/write debug packets
- PR: #35414
- Optimize page size in traces for performance.
- PR: #34752
- Test fixes after moving 2.0 into experimental
- PR: #35675
- 33696: Remove sub_device_manager_tracker from device
- PR: #35452
- [skip ci] upstream image: give other users r/o permissions to the home directory
- PR: #35642
- fix tracy .str conversion for when special_parent_text col is empty
- PR: #35687
- Update tt-logger version to 1.1.7
- PR: #35599
- Move DPRINT parsing logic to separate class
- PR: #33161
- Fix Qwen garbage output
- PR: #35555
- Cleanup dispatch_core_common.hpp
- PR: #35489
- Remove metal_soc_descriptor.h from public Runtime API
- PR: #34178
- Fix OOM in XQKV prefill matmul on P100 Llama 8b
- PR: #35683
v0.66.0-dev20260112
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/20904408807
📦 Uncategorized
- Improve accuracy of atan/atan2
- PR: #35470
- Fix static analyzer false positive in device_operation.hpp
- PR: #35588
- Modify unary_bcast API in metal to add new data formats
- PR: #35304
- Fix ring matmul runtime arg hang and bad outputs in llama70b
- PR: #35368
- [skip CI] Fixes for t3k pipeline changes
- PR: #35602
v0.66.0-dev20260111
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/20886677172
📦 Uncategorized
- [Fabric] Add device freq validation for perf modes
- PR: #35301
- Cleanup rms norm function
- PR: #35455
- [skip ci] Cleanup "using" in llrt.hpp
- PR: #35568
- Migrate op to new infra: matmul
- PR: #34466
- Vit bh combined tech report
- PR: #35567
- Strip out unused symbols for the bfloat utilities
- PR: #34364
- Add time budget controls for t3k pipelines and rename frequent, nightly, model perf to integration, e2e, perf
- PR: #35551
- fix-matmul-wrong-clang-tidy-fix
- PR: #35582
- Trace Deepseek V3 on 1x Galaxy
- PR: #35507
- Fix clang-tidy misc-unused-params warnings
- PR: #35433
v0.66.0-dev20260110
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/20869506581
📦 Uncategorized
- Migrate op to new infra: FusedRMSNormPreAllGather
- PR: #35488
- Force Single ERISC Kernel Execution in run_cluster_validation
- PR: #35476
- Telemetry: Add fabric bandwidth telemetry metrics (v2)
- PR: #34816
- Fix models/common/tests CI failure
- PR: #35482
- Increase Vovnet threshold
- PR: #35267
- [skip ci] #35313 [GPT-OSS] disable 4k prefill unit tests
- PR: #35505
- Relax T3K Qwen2.5-Coder-32B CI target
- PR: #35324
- Fix T3K Mixtral Perplexity tests with missing is_mixture_of_experts flag
- PR: #35254
- #32289: remove duplicate file
- PR: #32637
- Fixes after prefix caching
- PR: #35503
- [tt-train] Add KV cache support to tt-train's LLaMA
- PR: #33169
- Changing DM PCC check to bitwise comparison
- PR: #35403
- Improve accuracy of tanh on float32
- PR: #34927
- migrate vision encoder unit test to HF
- PR: #35448
- #35236: remove deepseek blitz ops tests from models unit tests
- PR: #35518
- add missing pytest import
- PR: #35517
- Sagarwal/profiler noc trace bug
- PR: #33730
- Update on profiler CI options and remove other nightlies
- PR: #35513
- Consolidate fabric init postcodes and telemetry status
- PR: #35481
- PR: Fix rotary_embedding_llama sweep test with proper golden function
- PR: #34742
- Mbahnas/vit bh hires 1211
- PR: #35426
- Updating CB doc to indicate there is only 1 reader and 1 writer
- PR: #34282
- Generalize timeStampedData function
- PR: #35479
- Moving 2.0 apis into experimental and updating compute kernels to use CB abstraction
- PR: #35495
- [Fabric] Fix ubench pipeline
- PR: #35471
- TT-Transformers version 2 modules -- MLP
- PR: #35095
- #35342: Revert PR #32045
- PR: #35492
- Bump ttsim version to v1.2.0
- PR: #35562
- use subordinate_sync correctly
- PR: #35566
v0.65.1
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/20869526990
📦 Uncategorized
- Remove prefetcher dangling reference from previous test
- PR: #35061
- Fix batched prefill pcc issue
- PR: #35059
- Llama-3.1-8B decode TSU optimizations
- PR: #35142
- [skip ci] Re-gen Docker containers (#35305)
- PR: #35438
What's Changed
- MM/Fused/Reduce Docs Touchups by @edwinleeTT in #33510
- [skip ci] TG Resnet50 test_perf_trace_2cqs tweak by @astancovTT in #33672
- Making check arc robust to firmware version when calculating uptime by @adjordjevic-TT in #33592
- [skip ci] Update tt_transformers docs (and comments) to remove mentions of LLAMA_DIR by @gwangTT in #33669
- Migrate op to new infra: send_async by @kevinwuTT in #33005
- Migrate op to new infra: recv_async by @kevinwuTT in #33200
- #33584: SDXL demo/accuracy test e2e time reporting by @ipotkonjak-tt in #33646
- split_work_to_cores pybind in ttnn module by @arichinsTT in #32997
- Migrate op to new infra: rotary_embedding_llama_fused_qk by @philei-tt in #33655
- Fix GraphQL query in set-opened-on workflow by @jbakerTT in #33687
- Adding 1x glx demo test + testing in CI for Deepseek V3 by @yalrawwashTT in #33508
- Fix motif galaxy demo test by @sadesoyeTT in #33688
- [Reduce APC] Remove cpp N150 runs from APC by @kkabilarTT in #33683
- Fix torch reference tensor in sharded layernorm tests by @rmillerTT in #33564
- #32710: Migrate op to new infra: nlp_create_qkv_heads_decode by @ssundaramTT in #33516
- Removing devicePool from the API by @mfiltser-TT in #33668
- add more cores to 2x harvesting to support further harvested P150s by @yugaoTT in #33613
- Migrate op to new infra: conv2d by @awliu-TT in #33019
- Enable 2 ERISC mode on bh glx upstream tests by @nnyamagoudar-TT in #33571
- chore: update LLK submodule to 91fa6c2 by @fvranicTT in #33685
- Migrate op to new infra: nlp_create_qkv_heads_vit by @philei-tt in #33658
- Migrate op to new infra: nlp_kv_cache_load_slice by @philei-tt in #33650
- [skip ci] updating auto triage token by @ebanerjeeTT in #33700
- #0: Fix (Galaxy) Demo - Motif job by adding NO_PROMPT variable by @dimitri-tenstorrent in #33712
- 1D to support 1x32 chip routing by @daminakaTT in #32575
- Fix check_noc_status on non-default setups by @jbaumanTT in #33522
- Increasing timeout for manual hang detection in triage tests, enable logging when test fails by @adjordjevic-TT in #33670
- Migrate op to new infra: nlp_create_qkv_heads_segformer by @philei-tt in #33664
- Added a fix to the invalid test in test_strided_all_gather_minimal_matmul_async for t3k by @jvegaTT in #33717
- fix to layout regression by @jvegaTT in #33665
- add missing write barrier after noc_semaphore_set by @kpaigwar in #33710
- [skip ci] Fix and simplify set-opened-on workflow by @jbakerTT in #33724
- [skip ci] Add libc++ to CMakePresets by @afuller-TT in #33727
- Add dynamic power throttling to BH by @rdjogoTT in #33627
- Migrate op to new infra: bcast by @vtsilytskyiTT in #33657
- Remove checkout from setup-job action to eliminate SHA-pinned rollout hazard by @Copilot in #33691
- Add additional argument handling in graph serializer by @dgomezTT in #29563
- New Model: TG Qwen3-32b with TG Llama3-70b Optimizations at 65 t/s/u by @ricozhu-TT in #31018
- Migrate op to new infra: prod_nc by @shutovilyaep in #33562
- Migrate op to new infra: prod_all by @shutovilyaep in #33568
- Migrate op to new infra: argmax by @shutovilyaep in #33310
- enhance fetching time from dram to l1 cmddat_q through prefetching by @mdingTT in #33537
- Whisper - decoder and encoder optimization by @mbahnasTT in #33450
- [skip ci] Adding support for 4x timeshare by @akirby-TT in #33693
- Remove silent defaults from DeepSeekV3 demo and tests by @esmalTT in #33686
- optimize bh fabric rx ack credit path by @SeanNijjar in #33524
- #32626: Remove Moreh operations from docs by @mgajewskiTT in #33765
- Set auto triage to false and revert 8dfb324 by @dpopovTT in #33774
- Add ClosetBox fabric test configuration by @jpanasiukTT in #33596
- Implement in-memory GSD/FSD validation to avoid disk I/O by @jpanasiukTT in #33055
- Re-enable Triage tests by @afuller-TT in #33529
- Autopacketization support for fabric data movement - Part 1 by @tlevinTT in #33081
- Revert "Implement in-memory GSD/FSD validation to avoid disk I/O (#33055)" by @tt-rkim in #33792
- Fix set-opened-on workflow: null handling, token validation, and pagination by @jbakerTT in #33781
- Run paged llama attention prefill unit test instead of default attention by @alingTT in #33681
- Decreasing low threshold for heartbeat per seconds to avoid ND test fail by @adjordjevic-TT in #33788
- Add full grid worker forwarding channels for UDM Mux [4/n] by @yugaoTT in #33364
- SFPI 7.12.0 by @nathan-TT in #33720
- [skip ci] Fix GraphQL date type in set-opened-on workflow by @jbakerTT in #33797
- Update Falcon7b PCC and expected output jsons after layernorm op changes broke tests by @skhorasganiTT in #33786
- Fixing print tests with ND failures in tt-sim by @kstevensTT in #33520
- Enabling tt-triage in APC by @tt-vjovanovic in #33798
- Allow the profiler DRAM buffer size to be dynamically allocated depending on a user-specified op count by @sagarwalTT in #33004
- remove deprecated fabric latency tests by @SeanNijjar in #33762
- Fixing 1D Mapping Algorithm in Mesh Device for flipped coords by @Riddy21 in #33499
- [Reduce APC] Run ccl from cpp-unit-tests on merge gate by @kkabilarTT in #33731
- Updated CODEOWNERS to include all files in ttnn/cpp/ttnn/deprecated/ by @fplavecTT in #33807
- [Fabric] fabric unicast scatter multi-chunk by @daminakaTT in #32395
- Optimize fused strided all gather and minimal matmul to read local slice from AG input by @jonathansuTT in #33703
- Improve accuracy of accurate sigmoid_tile by @nmauriceTT in #31266
- [skip ci] Remove cron jobs from t3k and galaxy demo tests by @dpopovTT in #33814
- [Fabric] telemetry to be controlled by env var in more detail by @daminakaTT in #33523
- [skip ci] Optimize set-opened-on workflow by hardcoding IDs by @jbakerTT in #33810
- TT-Fabric Intermesh Traffic VC (VC1) Support [1/n] by @ubcheema in #33750
- [skip ci] Refactor wheel building CI job by @afuller-TT in #33530
- Mi...
v0.65.0
TT-Metal v0.65.0 Release Notes
This release contains significant improvements and new features.
Changes
See CHANGELOG.txt for detailed commit history.
Installation
Refer to INSTALLING.md for installation instructions.
Model Updates
New
- New Op Infrastructure Enablement for LLM & Diffusion Models
Core transformer execution paths (QKV, rotary embeddings, SDPA decode) migrated to the new op infra, forming the backbone for scalable LLM and diffusion support.
PR #33209 – Migrate op to new infra: sdpa_decode
Model Performance & Accuracy Updates
- Stable Diffusion / SDXL Accuracy Fix
Corrected SDXL VAE accuracy issues that impacted image quality and downstream validation.
PR #33156 – SDXL vae batch encode accuracy fix
Improvements and New Features
- Sub-Core Grid Scaling Across Ops
Enabled sub-core grid support for core unary ops, unblocking better utilization and scaling on large devices.
PR #33157 – Add sub_core_grids to unary infra and ops
- Numerical Accuracy Fixes in Core Math Ops
Fixed accuracy issues in exponential-related ops that directly affect model convergence and output quality.
PR #33139 – Fix expm1 accuracy
- Large-Kernel Support
Added support for huge kernels, enabling execution of larger and more complex workloads without fragmentation.
PR #32956 – Huge kernel support
- Improved Error Propagation in Build System
Ensured exceptions in build threads correctly propagate to the main thread, preventing silent failures.
PR #33205 – Ensure exceptions in build threads are propagated
- Fabric Router Heartbeat
Added heartbeat support to the fabric router, significantly improving detection of stalled or unhealthy links.
PR #31255 – Fabric router heartbeat feature
- Telemetry Firmware Visibility
Exposed remaining firmware versions via telemetry, improving fleet visibility and debugging.
PR #33158 – Telemetry: Expose remaining firmware versions
- CI & Workflow Hardening
Embedded pytest commands directly into Galaxy workflows, reducing CI flakiness and improving debuggability.
PR #32991 – Embed Pytest commands in Galaxy workflows
Full Changelog: v0.64.5...v0.65.0
v0.66.0-dev20260109
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/20836658890
📦 Uncategorized
- [skip ci] Add timeout to package installation step
- PR: #35417
- Expanding module tests to ensure added seq len functionality for Deepseek 671B model
- PR: #34177
- Fix BH performance: Remove unnecessary NOC_BRCST_EXCLUDE resets
- PR: #34368
- Graph tracing improvement
- PR: #34263
- #35326: Add Deepseek Blitz unit tests to CI
- PR: #35344
- Make allocate_tensor_on_device private, use create_device_tensor instead
- PR: #32948
- [Fabric] Add infra for dynamic packet header sizing
- PR: #34976
- Add sweeps for new model traced ops
- PR: #35361
- Improve Out of Memory Error Message
- PR: #32150
- [skip ci] update gpt-oss README
- PR: #35398
- Add teacher forcing demo test for Deepseek 671B model
- PR: #33967
- [DM] Update data movement tests
- PR: #35026
- #34947: ttnn_tracer_model ttnn tutorial fix
- PR: #35372
- Add memory usage tracking for DRAM & L1 in training loop
- PR: #35316
- #0: [skip ci] Add P100 support in git bisect
- PR: #35427
- Update ttexalens reference version to 0.2.0
- PR: #35451
- [skip ci] Enable t3k demo tests cron job
- PR: #35453
- adds TT_METAL_JIT_ANALYTICS environment variable
- PR: #35388
- Add support for Automatic Prefix Caching in TT-Transformers
- PR: #33883
- Reenable fabric manager tests in Galaxy Quick
- PR: #35402
- #32983: Remove some initial calls to test_system_health as it's being deprecated
- PR: #35094
- Expose Hyperparams to Standard Namespace AG & RS
- PR: #35322
- Strip unused symbols in sub_device.hpp
- PR: #34348
- Launch dispatch kernels in parallel on multiple devices
- PR: #34750
- [skip ci] Update Wheel Artifact Naming Convention in CI
- PR: #35432
- Reduce channel count when not all channels are needed.
- PR: #35155
- allow subordinate_sync_t per architecture
- PR: #35399
- [skip ci] Add bh demo tests and bh multi card test to release testing
- PR: #35469
- [skip ci] Optimize clang-tidy presets: disable tt-train and switch to Debug config
- PR: #35475
- Apascual/30094 test mixtral decoder against hf
- PR: #35138
- [skip ci] update merge gate alerts
- PR: #35478
- [TT-Train] GSM8K Finetuning example with dashboard and Galaxy support
- PR: #31108
- Fix swapped BASE_DIRS in kernel_helper_functions CMakeLists.txt
- PR: #35477
- Moved get_batch_size to shape file
- PR: #32873
- Move compute_flat_indices to shape
- PR: #32862
- Add owners of vLLM integration tech report
- PR: #35480
- feat: refactor import_tracy_op_logs
- PR: #35310
- Migrate op to new infra: all_gather_async
- PR: #34975
- [skip ci] zstd for .debs
- PR: #35466
- #35441: Fix ttnn.visualize_tensor() crash on multi-host systems
- PR: #35464
- Haibo sun/issue#29156
- PR: #35349