v0.67.0-dev20260214

Pre-release

github-actions released this 14 Feb 07:56
· 32 commits to main since this release
Immutable release. Only release title and notes can be modified.
bfd3b81

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not the versions on the main branch. The latest main may differ from the release you are installing.

The changelog follows, showing the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/22007703479

📦 Uncategorized

  • [Refactor] Math Fidelity enum
  • Restore max torch threads for multi-host runs
  • [skip ci] #0: update matmul test timeout for tt-sim
  • Add scatter to sdpa_reduce_to_all OP
  • Update SDPA to use the new fast approximate exponential function
  • Mark init functions as deprecated
  • Fix LM head norm config for qwen3vl
  • Exclude graph_argument_serializer.cpp from unity builds to reduce build time
  • Extend multi-user isolation tests with ASIC/TP2 scenarios
  • Adding Qwen3-VL vllm support
  • Implement models.experimental.ops.composite.launch()
  • Add precompiled headers to tt-train for faster compilation
  • Pipeline reorg cleanups
  • Simplified the way to validate operations
  • Move CRTAs to Kernel groups
  • Fix moreh failure
  • Adding uneven output shard support to untilize
  • PDL: PCC drop on instance embedding fix
  • [WATCHER] Increasing timeout for bh post commit with watcher enabled
  • Add Custom Init for Packing Contiguous Block from DEST
  • round up mem config shapes
  • Fix minor typos in unary max/min comments.
  • Move prefetcher pytest option to avoid breaking CI tests
  • [Gemma3] Fix for gemma3 failing unit tests
  • [GPT-OSS] Add fused op unit tests for MoE
  • Disable stable_diffusion model perf test on blackhole (#37617)
  • Add program configs for Matmul ops in Embedding block to run across 40 cores in the SDXL Refiner
  • [tt-train] Add training log comparison plotting script
  • [skip ci] Enable watcher apc nightly debug
  • Adding test harness to check cache on device compatibility for Deepseek 671B
  • [Watcher] tt-train-cpp-unit tests have new watcher enabled fails due to recent changes
  • chore: update LLK submodule to 346a830
  • removes meta lib dependencies
  • [WATCHER] Following issues are detected when watcher is enabled on BH post commit
  • [skip ci] Add P300-viommu to BHPC multi card fast tests
  • SGLang generator
  • [tt-train] Complete nanoGPT Python impl
  • Add new CI pipeline for Deepseek to test long seq lens and refactor tests
  • Topology Mapper Integration with Topology Solver API
  • Make TP All reduce optional in Post SDPA
  • Fix misleading comment in dataflow_api for multicasts
  • [skip ci] Update llama demo upstream test id's
  • Enable multi-host neighbor-pad and RingAttentionAllGather CCLs
  • LLK API support for 8x32 tilize
  • Upgrade Pillow -> 12.1.1 to fix CVE-2026-25990
  • Fix moreh kernel runtime arg bounds issues (#37193, #37040)

  • Convert Sparse Multicast Static Asserts to Runtime Asserts
  • Do not use internal bh name in builtins
  • Quasar compute API bringup V1.0
  • [Deepseek Blitz] Split q a proj mm on inner dim
  • Reduce to one generic op and fusing it with moe routed expert
  • [TTTv2] Add attention_1d module with comprehensive unit tests
  • Matmul - Add Support for 2D DRAM interleaved in0 + batched height sharded in1
  • Changes for quad module tests CI
  • Subtract grid offset when computing 0-based indices in sharded LN factory
  • Decouple Cluster initialization from HAL
  • Switch llama 8b to DP=4 in vllm nightly
  • A balanced traffic pattern for AG minimal.
  • [skip ci] Remove t3k select pipeline extra-tag inputs
  • #36982: create_q_heads tilizes to 8x32 tiles
  • Enable (very) basic compute kernels
  • Migrate conv operations to free function style
  • Migrate fast dispatch frequent tests to CIv2 runners
  • reduction: migrate to free function binding + generic cleanup
  • Use gh_run_number for Superset dashboard links in Slack notifications
  • Fix race condition in parallel multi-source jit build
  • chore: update LLK submodule to f7cf929
  • Move SDPA and MLA tests from tt_eager/misc to ttnn/operations/sdpa
  • Revert "A balanced traffic pattern for AG minimal. (#36607)"
  • [skip ci] Fix galaxy perf tests yaml (bad merge)
  • [DM] Update data movement multi_interleaved tests
  • SDXL clip encoder perf targets updated
  • Fix timeouts in vllm nightly
  • DeepSeek Blitz moe fusion
  • Generate Welford reciprocals in Python and pass into distributed layernorm ops
  • Fix TTTv2 MLP 1d from model args mismatch + BH Stress test pytest id
  • [skip ci] Fix Package and release workflow
  • Update compute kernel API to reflect new changes to fast tilize
  • Fix timeouts for qwen in vllm nightly
  • [skip ci] Add back missing schedule to BH demos
  • Pool2D Alignment Fixes for Watcher
  • Add LLK_ASSERTs for verifying tile index in dest accumulator
  • Make mm respect first core from subdevice
  • Add TTTv2 rmsnorm module unit tests to T3K e2e pipeline
  • Unify kernel and firmware JIT build deduplication into JitBuildCache