Releases: tenstorrent/tt-metal

v0.67.0-dev20260210

11 Feb 00:55
Immutable release. Only release title and notes can be modified.
807ee3d

Pre-release

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not those on the main branch. There may be differences between the latest main and the previous release.

The changelog follows, showing the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21846858892

📦 Uncategorized

  • [skip ci] Disable pytest timeout for Stable Diffusion device perf tests
  • Implement KV store-and-forward chain optimization for non-causal SDPA
  • [GPT-OSS] Add high throughput model to vLLM nightly
  • Update ResNet50 batch_size=32 performance target for Blackhole
  • Improve tracing tooling to provide the whole inputs for ttnn operations
  • SDXL Relax encoder2 perf targets
  • chore: update LLK submodule to 7e7cf4f
  • Set medgemma's max_prefill_chunk_size the same as gemma-3
  • Fix setuptools pkg_resources issue
  • Update SDXL VAE device perf targets after SDPA KV chain forwarding optimization
  • In post sdpa op, mcast to 13x10 grid
  • Removed program cache when no_dispatch
  • Fix FP32 precision loss in untilize for wide tensors
  • [skip ci] Remove Fabric Sanity Benchmark from BH post-commit tests
  • DeepSeek Blitz MOE routed expert
  • Bump blackhole deepseek blitz op tests timeout
  • Add fix for Qsr packet_tag breaking compilation
  • Use up-to-date main() declaration in all kernels & docs
  • [skip ci] #0: remove mamba from perf models yaml
  • Increase coverage of unpack reconfig
  • Add 4 chunks scatter_write and extra ring optimization to all_to_all_async_generic
  • Optimize decode for Llama3-70B for TG
  • [skip ci] Optimize pkg-resources patch
  • Add the accuracy_tips tech report

v0.66.0-rc10

10 Feb 20:45
Immutable release. Only release title and notes can be modified.
719fbb9

Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21846883882

📦 Uncategorized

  • Remove sending lm_head persistent_buffer to DRAM
  • Optimize decode for Llama3-70B for TG for stable branch

v0.67.0-dev20260209

10 Feb 00:52
Immutable release. Only release title and notes can be modified.
44e8cdc

Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21808434016

📦 Uncategorized

  • Add Multi-Threaded Test using H <-> D Sockets
  • Migrate device headers from tt_metal/include to tt_metal/hw/inc

v0.66.0-rc9

10 Feb 10:34
Immutable release. Only release title and notes can be modified.

Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21820952145

  • no changes

v0.67.0-dev20260208

09 Feb 00:58
Immutable release. Only release title and notes can be modified.
2a9fa01

Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21789607116

📦 Uncategorized

  • Bump ttsim version to v1.3.4
  • Add Support For Fused Shared-Expert Kernel
  • Remove assert on ARCH_NAME in data collection step in workflows
  • Sparse checkout optimizations for workflows
  • Fix various problems seen in test_prefetcher and test_dispatcher
  • chore: update LLK submodule to 02a4c57

v0.67.0-dev20260207

08 Feb 00:52
Immutable release. Only release title and notes can be modified.
430d1f4

Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21770820481

📦 Uncategorized

  • [skip ci] Update vLLM nightly to test sampling
  • Fuse TP/SP Broadcast into pre_sdpa
  • [skip ci] set-cpu-governor VM handling
  • Move tt_dit out of experimental directory
  • chore: update LLK submodule to e050aab
  • #36149: Add native llk kernel for addcdiv
  • Fix Clang Static Analyzer warning: virtual call in GraphProcessor constructor
  • [gpt-oss] attention decode optimizations
  • Fix Blackhole op performance model FPU and DRAM utilization calculations
  • [Gemma3] Test fix: Ref MLP uses float32 for long sequences
  • Fix undefined behavior in fabric worker memory allocation
  • Encapsulate noc non blocking reads in cq into separate files
  • Expose Parameters in all gather
  • [Watcher] In order to get tt-train-cpp-unit group of tests green there was a need to skip some tests with watcher
  • SFPI 7.23.0 243
  • Using known interval to calculate uptime in check arc
  • Fix ttnn.{gcd,lcm} docs.
  • [Watcher] ttnn-unit-test group skips with watcher
  • Add 32x4 quad BH rankbindings file
  • move fabric benchmark test and update golden
  • #37259: add ifdef guard for layernorm kernels
  • [Quasar DFB]: Add support for multi-threaded producer/consumer + make blocked consumer use remapper
  • Add time budget controls for Galaxy model perf -> Galaxy perf pipeline
  • [skip ci] bring back BH GLX tests in CI
  • Add fabric telemetry neighbor node id exchange
  • Remove harvesting info from build_key when coordinate virtualization is enabled
  • [skip ci] Update CODEOWNERS for programming_examples
  • #0: add models timeout for bh
  • Make perf test timeout explicit for stable_diffusion_1_4 model
  • Fix DeepSeek V3 config loading when model path is a symlink
  • Refactor TTNN tests to use shared config for CI and TTSim
  • Add time budget controls for Galaxy demo pipeline
  • Remove tests/scripts/run_tests.sh and stress-fast-dispatch-build-and-unit-tests.yaml pipeline
  • #36852 BinaryNg kernel deadlocks with reshard
  • Fix segfaults on ttnn.ones, ttnn.zeros, ttnn.empty
  • [Merge stable to main] Llama3.3-70b and 3.1-8b - Fix sampling parameters
  • D2H Sockets
  • Fuse Post SDPA with TP All Reduce.
  • [skip ci] Increase timeout for blackhole deepseek blitz tests
  • #36881: add validation check for sharded softmax

v0.67.0-dev20260206

07 Feb 03:50
Immutable release. Only release title and notes can be modified.
0474b44

Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21734080303

📦 Uncategorized

  • Topology Mapper Pinning Regression Tests
  • Remove deprecated Grayskull (tt::ARCH::GRAYSKULL) architecture support
  • latency result superset export
  • [BEVFormer] Update PCC
  • SDXL Refiner Matmul memory configs optimization
  • Fix conv2d reader kernel runtime arg mismatch for height-sharded conv
  • [TT-Transformers] Reduce batch-32 prompt length to avoid some tokenizers going over 1024 tokens
  • SDXL override global timeout
  • Remove unused type declarations in tt_metal identified by clangd
  • [skip ci] Reduce CMake install message verbosity for incremental builds
  • #36225: Handle Binary_op_type for mixed dtypes - FPU for EQ
  • SDXL disable timeout
  • add per core compile args
  • Add fused post sdpa op
  • #28532: Add Installer validation as a CI workflow
  • Add time budget controls for Galaxy frequent pipeline -> now Galaxy integration pipeline
  • [skip ci] fix(copilot-autofix-clangsa): fix broken pipe error in jq query
  • Improve custom_mm to performantly cover more shapes and enable transpose
  • [skip ci] Add workflow comparison script for CI analysis
  • [tt-train] TP+DP Llama training
  • #23354 more data type support for llk bcast
  • Fix noc debugging tool test when run back to back
  • [skip ci] Increase timeouts for longer running BH multicard model tests
  • Fix hard-coded action hash causing CI errors
  • Add versioning system to fabric telemetry
  • Increase hang detection timeout for data movement tests
  • #0 - Tests scripts update
  • [deepseek] Fix test_model decode reference for non‑zero position ids
  • Lower Tensor Utilities to Runtime Staging Area
  • Add commands to do packed large linear reads/unicast writes
  • [skip ci] Add pytest timeout flags to long-running model tests in CI
  • Add new all_to_all_dispatch variant for DeepSeek that supports multiple algorithms, fabric mux variants, and persistent buffer/semaphore optimizations
  • Add scattered core support for gather operation
  • Support for local tile reduce using DST accum
  • Add support for 'export TT_METAL_DISABLE_SFPLOADMACRO=1'
  • Remove unused types from ttnn
  • Optimized number of workers for ReduceScatterMinimalMatmul for Llama 70B on Galaxy
  • Extend CB tests

v0.67.0-dev20260205

06 Feb 03:11
Immutable release. Only release title and notes can be modified.
c1416c2

Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21694041502

📦 Uncategorized

  • TTNN Tensor Creation APIs - Update Docs, Create Example
  • [skip ci]: Adding additional codeowners to dataflow buffers
  • [skip ci] Split out file lists from infra definitions for code ownership purposes
  • H2D Sockets
  • Move blitz CCL to generic framework and fuse CCL Broadcast and RMSNorm
  • Add fabric API for querying neighboring devices
  • Fuse KV Cache to Main
  • Add option to pass chunk_start_idx as tensor.
  • [UMD Bump] Automated UMD Bump 02.02.2026
  • Fix demo test workflows failing on schedule triggers
  • [ROTATE] Implementation of rotate bilinear operation
  • chore: update LLK submodule to ace8fa5
  • [#36026] Improve analyze_validation_results.sh for faster operator triage
  • #37098: add missing unused runtime arg to softmax
  • [#37052] Merge cluster configs and allow overlapping hostnames for intra pod config merge
  • [skip ci] temp skip BH GLX tests until another glx is back in CI
  • [tt-train] AdamW as a fused operation
  • Implemented tilize support for width sharded case
  • Simplifying untilize ND sharding kernel logic
  • Matmul - Port Batched DRAM MM to BH
  • Quad GLX Deepseek CI Improvements
  • Make perf test timeout explicit for DiT models
  • Update docker to use zstd / OCI
  • TT-Train TTML python module: prefer CPP artifacts built by build_metal.sh, fallback to standalone 'uv pip install' artifacts
  • #36225: inf/-inf fix for ttnn.eq
  • [skip ci] revamp clang tidy job
  • don't use cache write around when not needed
  • Add performance tests for mla deepseek
  • Re-enable saving T3K perf data

v0.66.0-rc8

06 Feb 01:03
Immutable release. Only release title and notes can be modified.

Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21694062915

  • no changes

v0.67.0-dev20260204

04 Feb 11:19
Immutable release. Only release title and notes can be modified.
099f579

Pre-release

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/21653460697

📦 Uncategorized

  • Bump ttsim version to v1.3.3
  • Remove pytest-xdist and custom watchdog system
  • [skip ci] Remove unnecessary TT_METAL_HOME export instructions from programming examples
  • Add SfpuType::[unary_]{max,min}_[u]int32.
  • fix(sweep): correct schedule mappings for lead models and model trace…
  • Exabox CPU only Mock Tests
  • #35535: Fuse QRoPE and QNoPE (matmul3) into fused kernel
  • [skip ci] Fix broken sweep workflow
  • Move RTA sentinel constants
  • [skip ci] Comment out reset_tensix call in teardown
  • Refactor gemma model config
  • Use transaction IDs in kernels with unaligned access
  • Fix non-deterministic pytest test collection in ttnn unit tests
  • #36094: enable large tensor rms norm
  • Bump exalens version
  • Skipping cores in reset when dumping debug bus signals in triage
  • Fix CI failures related to #35077
  • [gpt-oss] fix long context demo
  • move possibly unused var inside usage scope
  • [#36107] Cluster Validation Performance Improvements
  • Add more watcher fields to triage DispatcherCoreData
  • Support writing to device from pinned memory
  • Update targets for galaxy Whisper test
  • Update CI tests for wan2.2 with support for image to video
  • Disabling dumping debug bus signals in CI
  • Add documentation for NOC debug dump. rename env to TT_METAL_NOC_DEBUG_DUMP
  • Revert "#36094: enable large tensor rms norm"
  • Remove OFT and 20 core PDL from CI due to OOM issue
  • Add support for 1x16 fabric testing
  • chore: update LLK submodule to 80e7617
  • #37021: use preferred read/write nocs in ema
  • issue:32603 - memcpy functions should be renamed. resolved
  • #37020: add ifndef to ln kernel to avoid reading undefined vars
  • Fix DPRINT << TSLICE so it prints correct tiles in a loop
  • DeepSeek reduce_to_one