Release v0.65.1 · tenstorrent/tt-metal

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not the versions on the main branch. There may be differences between the latest main and the previous release.

The changelog follows, showing the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/20869526990

📦 Uncategorized

Remove prefetcher dangling reference from previous test
- PR: #35061
Fix batched prefill pcc issue
- PR: #35059
Llama-3.1-8B decode TSU optimizations
- PR: #35142
[skip ci] Re-gen Docker containers (#35305)
- PR: #35438

What's Changed

MM/Fused/Reduce Docs Touchups by @edwinleeTT in #33510
[skip ci] TG Resnet50 test_perf_trace_2cqs tweak by @astancovTT in #33672
Making check arc robust to firmware version when calculating uptime by @adjordjevic-TT in #33592
[skip ci] Update tt_transformers docs (and comments) to remove mentions of LLAMA_DIR by @gwangTT in #33669
Migrate op to new infra: send_async by @kevinwuTT in #33005
Migrate op to new infra: recv_async by @kevinwuTT in #33200
#33584: SDXL demo/accuracy test e2e time reporting by @ipotkonjak-tt in #33646
split_work_to_cores pybind in ttnn module by @arichinsTT in #32997
Migrate op to new infra: rotary_embedding_llama_fused_qk by @philei-tt in #33655
Fix GraphQL query in set-opened-on workflow by @jbakerTT in #33687
Adding 1x glx demo test + testing in CI for Deepseek V3 by @yalrawwashTT in #33508
Fix motif galaxy demo test by @sadesoyeTT in #33688
[Reduce APC] Remove cpp N150 runs from APC by @kkabilarTT in #33683
Fix torch reference tensor in sharded layernorm tests by @rmillerTT in #33564
#32710: Migrate op to new infra: nlp_create_qkv_heads_decode by @ssundaramTT in #33516
Removing devicePool from the API by @mfiltser-TT in #33668
add more cores to 2x harvesting to support further harvested P150s by @yugaoTT in #33613
Migrate op to new infra: conv2d by @awliu-TT in #33019
Enable 2 ERISC mode on bh glx upstream tests by @nnyamagoudar-TT in #33571
chore: update LLK submodule to 91fa6c2 by @fvranicTT in #33685
Migrate op to new infra: nlp_create_qkv_heads_vit by @philei-tt in #33658
Migrate op to new infra: nlp_kv_cache_load_slice by @philei-tt in #33650
[skip ci] updating auto triage token by @ebanerjeeTT in #33700
#0: Fix (Galaxy) Demo - Motif job by adding NO_PROMPT variable by @dimitri-tenstorrent in #33712
1D to support 1x32 chip routing by @daminakaTT in #32575
Fix check_noc_status on non-default setups by @jbaumanTT in #33522
Increasing timeout for manual hang detection in triage tests, enable logging when test fails by @adjordjevic-TT in #33670
Migrate op to new infra: nlp_create_qkv_heads_segformer by @philei-tt in #33664
Added a fix to the invalid test in test_strided_all_gather_minimal_matmul_async for t3k by @jvegaTT in #33717
fix to layout regression by @jvegaTT in #33665
add missing write barrier after noc_semaphore_set by @kpaigwar in #33710
[skip ci] Fix and simplify set-opened-on workflow by @jbakerTT in #33724
[skip ci] Add libc++ to CMakePresets by @afuller-TT in #33727
Add dynamic power throttling to BH by @rdjogoTT in #33627
Migrate op to new infra: bcast by @vtsilytskyiTT in #33657
Remove checkout from setup-job action to eliminate SHA-pinned rollout hazard by @Copilot in #33691
Add additional argument handling in graph serializer by @dgomezTT in #29563
New Model: TG Qwen3-32b with TG Llama3-70b Optimizations at 65 t/s/u by @ricozhu-TT in #31018
Migrate op to new infra: prod_nc by @shutovilyaep in #33562
Migrate op to new infra: prod_all by @shutovilyaep in #33568
Migrate op to new infra: argmax by @shutovilyaep in #33310
enhance fetching time from dram to l1 cmddat_q through prefetching by @mdingTT in #33537
Whisper - decoder and encoder optimization by @mbahnasTT in #33450
[skip ci] Adding support for 4x timeshare by @akirby-TT in #33693
Remove silent defaults from DeepSeekV3 demo and tests by @esmalTT in #33686
optimize bh fabric rx ack credit path by @SeanNijjar in #33524
#32626: Remove Moreh operations from docs by @mgajewskiTT in #33765
Set auto triage to false and revert 8dfb324 by @dpopovTT in #33774
Add ClosetBox fabric test configuration by @jpanasiukTT in #33596
Implement in-memory GSD/FSD validation to avoid disk I/O by @jpanasiukTT in #33055
Re-enable Triage tests by @afuller-TT in #33529
Autopacketization support for fabric data movement - Part 1 by @tlevinTT in #33081
Revert "Implement in-memory GSD/FSD validation to avoid disk I/O (#33055)" by @tt-rkim in #33792
Fix set-opened-on workflow: null handling, token validation, and pagination by @jbakerTT in #33781
Run paged llama attention prefill unit test instead of default attention by @alingTT in #33681
Decreasing low threshold for heartbeat per seconds to avoid ND test fail by @adjordjevic-TT in #33788
Add full grid worker forwarding channels for UDM Mux [4/n] by @yugaoTT in #33364
SFPI 7.12.0 by @nathan-TT in #33720
[skip ci] Fix GraphQL date type in set-opened-on workflow by @jbakerTT in #33797
Update Falcon7b PCC and expected output jsons after layernorm op changes broke tests by @skhorasganiTT in #33786
Fixing print tests with ND failures in tt-sim by @kstevensTT in #33520
Enabling tt-triage in APC by @tt-vjovanovic in #33798
Allow the profiler DRAM buffer size to be dynamically allocated depending on a user-specified op count by @sagarwalTT in #33004
remove deprecated fabric latency tests by @SeanNijjar in #33762
Fixing 1D Mapping Algorithm in Mesh Device for flipped coords by @Riddy21 in #33499
[Reduce APC] Run ccl from cpp-unit-tests on merge gate by @kkabilarTT in #33731
Updated CODEOWNERS to include all files in ttnn/cpp/ttnn/deprecated/ by @fplavecTT in #33807
[Fabric] fabric unicast scatter multi-chunk by @daminakaTT in #32395
Optimize fused strided all gather and minimal matmul to read local slice from AG input by @jonathansuTT in #33703
Improve accuracy of accurate sigmoid_tile by @nmauriceTT in #31266
[skip ci] Remove cron jobs from t3k and galaxy demo tests by @dpopovTT in #33814
[Fabric] telemetry to be controlled by env var in more detail by @daminakaTT in #33523
[skip ci] Optimize set-opened-on workflow by hardcoding IDs by @jbakerTT in #33810
TT-Fabric Intermesh Traffic VC (VC1) Support [1/n] by @ubcheema in #33750
[skip ci] Refactor wheel building CI job by @afuller-TT in #33530
Migrate op to new infra: neighbor_pad_async by @ayerofieiev-tt in #33631
#33225: cleanup tanh accurate by @KalaivaniMCW in #33226
Convert some compile time arguments to runtime arguments for dispatch kernels by @mpiseTT in #33379
Lightweight kernel asserts by @tt-vjovanovic in #33451
Enable PinnedMemory on Wormhole by @jbaumanTT in #33583
Modified galaxy CI tests to account for torus links by @jvegaTT in #33822
[skip ci] Fix create draft release condition by @dpopovTT in #33846
[skip ci] Update install example sw versions in INSTALLING.md by @gsarabandoTT in #33615
#28593: [skip ci] Move CCL BH GLX tests to using torus as it's now available on CI and we want SysEng to run it by @tt-rkim in #33757
Strip unused symbols out of hal.hpp in Runtime Host API by @riverwuTT in #32115
[skip ci] Speed up MultiProducerCommandQueueTest by @blozano-tt in #33839
concat 1D tensors with tile or rm layout by @jungeunlim-TT in #33634
Add 2x galaxy DeepSeekV3 module tests to 4x galaxy workflow by @esmalTT in #33862
[skip ci] Speedup MultiProducerCommandQueueTest.EventSync by @blozano-tt in #33864
Make Test2CQMultiDevicePrograms* Tests Faster by @blozano-tt in #33843
make t3k fabric BW test tolerances bigger to only catch large violations by @SeanNijjar in #33868
Implement DRAM Slicing for conv_transpose2d by @sankarmanoj-tt in #33136
#32965: Follow up work on ternary sharding by @mouliraj-mcw in #33151
[Fabric] Build refactor and initial support for switch builder by @aagarwalTT in #33541
chore: update LLK submodule to d1d37ed by @fvranicTT in #33812
[skip ci] Fix vllm nightly workflow by @dpopovTT in #33884
[skip ci] Speed up test_async_runtime by @blozano-tt in #33867
[skip ci] Make ttnn-core team owner of nd-reshard by @philei-tt in #33887
[skip ci] Disable unity builds for CodeCoverage build types by @Copilot in #33874
Migrate op to new infra: reshard by @philei-tt in #33270
Migrate run_cluster_validation to cxxopts by @jpanasiukTT in #33574
#33805: add noc async write barriers to some kernels by @bbradelTT in #33811
Fix block-sharded matmul tile calculation by @mvasiljevicTT in #33777
Fix Reduce Scatter Composite Ring by @jvegaTT in #33840
Removed stable_diffusion temporarily from blackhole APC by @astancovTT in #33907
Improve accuracy of non-Welford layernorm reduce kernels by @rmillerTT in #33120
#32694: Migrate op to new infra: layernorm_pre_all_gather_op by @ssundaramTT in #33526
[skip ci] Add initial copilot instructions to enhance review by @blozano-tt in #29611
[skip ci] Add rate limit diagnostics to set-opened-on workflow by @jbakerTT in #33836
[skip ci] Make Auto-Triage Automatically Run on Regressions by @ebanerjeeTT in #33909
Migrate op to new infra: sdpa by @awliu-TT in #33733
Fixed moreh loss BH alignment issues by @fplavecTT in #33569
[skip ci] Disable warmup in smoke tests by @blozano-tt in #33926
[Reduce apc] Move All C++, dispatch, distributed, and tools in cpp-unit-tests from APC to L2-Nightly by @kkabilarTT in #33847
Migrate op to new infra: concat by @vtsilytskyiTT in #33779
TT-Fabric Intermesh Traffic VC (VC1) Support [2/n] by @ubcheema in #33876
[DM] Write After Read Transaction ID Testing by @ryanzhuTT in #33091
SDXL Add refiner accuracy and refactor sdxl accuracy tests by @jmitrovicTT in #33773
use fromPaddedShape while compute output specs for Clone Op by @hkwonTT in #33593
[skip ci] Speedup tt-metalium-validation-smoke tests by @blozano-tt in #33918
Cleanup Allocator.hpp by @riverwuTT in #31105
Fix reduce scatter cache issue by @sjameelTT in #33808
fix: add constexpr to sfpu calculate_gelu by @fvranicTT in #33954
Fix demo testing when building with fresh sandbox by @akirby-TT in #33946
Revert "removing AG hang workaround for Deepseek V3 (#31025)" by @yalrawwashTT in #33953
TT-Triage - Summarize running operations across cores by @miacim in #33936
[skip ci] make auto triage send nested slack messages by @ebanerjeeTT in #33960
[TTT] fix batch resetting by @sraizada-tt in #33772
Refactor GPT-OSS codebase by @sraizada-tt in #33648
Move lightmetal into experimental by @riverwuTT in #33948
Adjusted Glx perf for Whisper by @atupe-tt in #33962
Added support for prompt param by @atupe-tt in #33460
[skip ci] readability-reference-to-constructed-temporary by @blozano-tt in #33969
Bug fixes for transposed conv2d by @pavlejosipovic in #33465
improve PCC of ttnn.experimental.intimg on client's architecture by @jbbieniekTT in #33642
Add sharded support for ttnn.clone operation by @mradosavljevicTT in #33471
refactor: avoid branching in hardmish using sfpi::vec_min_max by @fvranicTT in #33984
fix bug in metal-exalens remapping by @dzivanovicTT in #33674
fix dependencie for metal_device_id_mapping, no inspector rpc data, t… by @dzivanovicTT in #33899
Revert "Bug fixes for transposed conv2d (#33465)" by @dpopovTT in #33990
Implement in-memory GSD/FSD validation to avoid disk I/O by @jpanasiukTT in #33800
Speeding up triage by caching dispatcher data by @tt-vjovanovic in #33959
Update TT-NN Visualizer link in TTNN tools by @dcblundell in #33923
[skip ci] Fix promotion to prerelease and build without tracy by @dpopovTT in #33989
ttnn.log calls wrong recip init on Blackhole for float32 by @nmauriceTT in #33803
Fix model trace sweep tests by @Aswinmcw in #33021
disable dcache usage in fabric on BH by @SeanNijjar in #33509
[skip ci] bugfixing pipeline status tracker by @ebanerjeeTT in #33998
Migrate op to new infra: ring_distributed_sdpa by @awliu-TT in #33863
fix bug 33919 by @mdingTT in #33939
[skip ci] New Preset to export clang-tidy fixes in parallel by @blozano-tt in #33979
adding deepseek demo test back to CI by @yalrawwashTT in #34017
Fixed errors in the all broadcast functions that only appeared in Galaxy by @jvegaTT in #33930
Add Missing Functionality in Cluster Validation by @jpanasiukTT in #33573
Wan 2.2 Image-to-Video by @ricozhu-TT in #33850
Remove obsolete models/tt_transformers/requirements.txt by @gwangTT in #33921
[skip ci] Update ETA for release 0.65.0 by @bbeggsTT in #33206
Moving DevicePool into MetalContext by @mfiltser-TT in #33598
Revert removal of get_pcie_alignment in hal by @riverwuTT in #33981
[skip ci] Remove YOLO entries from models/README by @gsarabandoTT in #33454
SFPI 7.13.0 by @nathan-TT in #34016
Clean up quad galaxy CI health check options by @aliuTT in #33952
#0: Fix test_distributed_layernorm_pre_allgather.py by @ssundaramTT in #34012
bugprone-unused-local-non-trivial-variable by @blozano-tt in #33956
9974: test_transpose_hc misalignment test fix by @bzimmermanTT in #33054
[skip ci] restore timeout times by @subinleeTT in #34029
use fromPaddedShape while compute Output specs for Unary Ops by @hkwonTT in #33949
Const qualify global pointers by @nathan-TT in #34015
Quick fix for broken test: test_qwen_accuracy by @ricozhu-TT in #34046
Migrate op to new infra: ccl/all_broadcast by @awliu-TT in #33940
TT-Fabric Intermesh Traffic VC (VC1) Support [3/n] by @ubcheema in #33950
Fix ttnn nightly L2 moreh tests by @fplavecTT in #34041
Remove matmul_batched_weights by @aliaksei-sala in #33857
fix kernel compile issue when watcher is enabled by @SeanNijjar in #34045
Pjosipovic/restore transposed conv2d fix by @pavlejosipovic in #33999
Override HF download for stable diffusion on BAPC by @astancovTT in #33995
modernize-concat-nested-namespaces by @blozano-tt in #33965
Test system health enable visible devices by @mbezuljTT in #34043
Fix all_reduce_async hang in Galaxy nightly CI by @itarabanTT in #34068
Add fixed version of legacy noc non_blocking api by @vvukomanovicTT in #33654
#33644: ttnn.sort indices not producing uint32 output fix by @mgajewskiTT in #33983
#0: TopK implementation docs added. by @mgajewskiTT in #34064
Allow creating MeshDevice spanning subset of ranks by @pstankiewiczTT in #32651
chore: update LLK submodule to a01054d by @fvranicTT in #34074
Add deadlock avoidance for UDM [5/n] by @yugaoTT in #33611
Fix t3k_llama3_70b_tests in t3k demo tests by @dpopovTT in #34069
[skip ci] Update perf and latest features for llm models (Dec 8) by @skhorasganiTT in #34023
Revert instrn_buffer to initialized pointer by @nathan-TT in #34078
#33714: Add Deepseek micro-ops/benchmarks for blitz decode by @TT-BrianLiu in #33263
Refactor warmup traces to be called from vLLM before first healthy signal by @nostojicTT in #33143
performance-faster-string-find by @blozano-tt in #34067
Making test clone compatible with the blackhole grid selection by @jvegaTT in #34082
Migrate op to new infra: all_reduce_create_qkv_heads by @shutovilyaep in #33826
[skip ci] Fix CODEOWNERS validation errors for non-existent users by @Copilot in #34095
TT-Train: Enable SIMD RNG by default, fix tests by @athompsonTT in #33500
Add TT_METAL_DISABLE_BACKTRACE env var to skip backtrace generation by @rpavlovicTT in #34027
Remove cb_get_tile and cb_release_tile by @pavlejosipovic in #33889
Adding stateful 2.0 read and write apis by @abhullar-tt in #33121
Update defaults for reduce scatter and all gather CCL parameters by @jvegaTT in #34032
Add DiT dashboard metrics by @sosborne-TT in #33622
Revert "Remove matmul_batched_weights (#33857)" by @aliaksei-sala in #34101
MGD Auto Discovery When MGD not provided by @Riddy21 in #33204
Migrate op to new infra: all_gather_concat_heads_fused by @awliu-TT in #34051
Adding env variable to ensure CI passes for deepseek-demo by @yalrawwashTT in #34084
Fix profiler events key error by @mo-tenstorrent in #34110
[skip ci] Revert "Fix variable shadowing in test_waypoint.cpp" by @blozano-tt in #34106
modernize-use-bool-literals by @blozano-tt in #34120
refactor: remove unused parameters from llk functions by @fvranicTT in #33992
[skip ci] Implement bypass approval command in GitHub Actions workflow by @Aswinmcw in #34131
Migrate op to new infra: data_movement/repeat by @bklockiewiczTT in #33397
refactor: avoid branching in hardtanh by using TTI macros directly by @fvranicTT in #33985
Gemma3 refactorization of model_config.py primarily by @pmilojevicTT in #34007
Cross attention cache for Whisper by @atupe-tt in #34099
reduce 120b glx ttft by @handrewsTT in #34085
Reduce to root op by @nardoTT in #34057
[skip ci] Use disable profiler flag in git bisect by @dpopovTT in #33986
High accuracy fp32 exp by @nmauriceTT in #33563
Added Neighbour Exchange Fabric Topology by @jhaiTT in #33201
[Reduce APC] Move 2 tests from ttnn misc to L2Nightly by @kkabilarTT in #34116
Update vllm gen to align with demo gen by @pprajapatiTT in #34044
[skip ci] Add TT_TRIAGE_JOB_HANG failure signature for device timeout detection by @dpopovTT in #34165
Abstract away memory areas from linker script by @nathan-TT in #34096
#32430: Support tuneable block size to reduce L1 usage in ttnn.convert_to_hwc by @esmalTT in #33866
[skip ci] making profiler artifact mismatch errors more clear by @ebanerjeeTT in #34171
Fix SDPA decode for Q heads greater than 32 (GQA support) by @alingTT in #34113
feat: support LLK_ASSERT via rtoptions by @fvranicTT in #34153
Migrate op to new infra: all_reduce_async by @kevinwuTT in #33705
chore: update LLK submodule to 5218b2c by @fvranicTT in #34155
[Reduce APC] Move the last 2 cpp jobs to NightlyL2 and remove cpp completely from APC by @kkabilarTT in #34169
Migrate op to new infra: copy by @vtsilytskyiTT in #33243
TT-Train flatbuffers by @athompsonTT in #32326
Fixes for ring attention with bfloat8_b and bfloat4_b data types by @sosborne-TT in #33628
[skip ci] Update issue templates: bug report and bounty model templates by @minaliuTT in #34087
Pool2D Race Condition by @wransom-TT in #34115
SDPA - minor compute optimization by @cglagovichTT in #34092
PCC threshold change for the new integral image implementation by @ddjekicTT in #34154
Fix pack untilize for cache update for ct dim > 8 to use regular untilize by @alingTT in #33932
[skip ci] Simplify clang-tidy config by @blozano-tt in #34187
Migrate op to new infra: data_movement/transpose by @shutovilyaep in #33423
#33711: Add semaphore ID to program descriptor by @ssundaramTT in #33735
Migrate op to new infra: pad op by @MaximArtemovEPAM in #33552
Reduce num traffic iterations for BH Galaxy health check by @tt-asaigal in #34188
Re-enable system config based MGD lookup for Multi-Host systems by @tt-asaigal in #34199
Adding T3K schedule by @akirby-TT in #34201
Add Multi-Mesh/Proc Pipeline on BH Galaxy by @tt-asaigal in #33638
[skip ci] Update unary operation documentation example tests by @Aswinmcw in #34065
[skip ci] Update documentation for GCD and LCM binary op by @Aswinmcw in #34063
[skip ci] Update model tracer README.md to include currently traced models by @Aswinmcw in #34150
Fix OOM caused by training on 1x32 with 2D config by @daminakaTT in #34058
bugprone-unhandled-self-assignment by @wilderfield in #34208
Fix sdpa reduce copy init parameter issue on BH by @alingTT in #34206
Extend usage of UNITY_BUILD to more projects by @pavlejosipovic in #33972
chore: update LLK submodule to 6d67375 by @fvranicTT in #34217
Make prefill trace optional for GPT-OSS by @handrewsTT in #34223
Automated Fabric Test Config Generation by @jpanasiukTT in #33595
#0: Revert layernorm all gather changes for new infra to un-hang a model on BH QB GE by @tt-rkim in #34157
Add log probs feature to TT-Transformers on T3K by @djordje-tt in #33343
fix: include risc_attribs.h in watcher_common.h to fix tt_l1_ptr undefined error by @fvranicTT in #34231
[skip ci] Enable a custom pipeline workflow for model and model-adjacent PRs by @pbaraTT in #32372
Change the way sequence lengths are padded in tt_transformers + model warmup sequence length changes by @nostojicTT in #34114
Revert "Automated Fabric Test Config Generation" by @ebanerjeeTT in #34241
migrate meta's image attention to HF equivalent by @epam-ioannis-alexiou in #33553
Enable sharding support in all_to_al_async_generic by @itarabanTT in #34124
[skip ci] turn on pinging for auto-triage by @ebanerjeeTT in #34243
Fix trace only run by @mo-tenstorrent in #33662
Exposing local variables and arguments in lightweight asserts by @tt-vjovanovic in #34218
[Deepseek] Add workaround for prefill hang by @pprajapatiTT in #34030
[skip ci] adding option to disable slack pinging by @ebanerjeeTT in #34264
[Reduce APC] Move N150 profiler to L2Nightly by @kkabilarTT in #34269
Adding welford support, Gamma beta Tile support, Sliding window Computation to Distributed Layernorm by @vsureshTT in #31702
Use prebuilt binaries for TT-Sim by @afuller-TT in #34203
feat: start using fix: use calculate_square from tt-llk by @fvranicTT in #34191
Add a version of tile reshape that does not cache mappings on device by @nardoTT in #33359
[skip-ci] Fix: Scheduled t3000 tests do not run by @pbaraTT in #34270
Migrate op to new infra: all_gather_matmul_async by @awliu-TT in #34109
Changed attn_out memory_config to skip_mem_cfg by @maksim-tsishkouski-epam in #33917
#28087 binary op sharding performance optimization by @dchenTT in #34132
bugprone-crtp-constructor-accessibility by @wilderfield in #34238
Add minimal broadcast for deepseek batch1 by @nardoTT in #34168
[skip ci] ViT-p150 Readme Update batch size from 8 to 10 in README by @mbahnasTT in #34280
add z-router support to fabric builder by @SeanNijjar in #33975
[skip ci] Retag the image we use to cut the release by @blozano-tt in #31752
Add lm head to Galaxy unit tests by @alingTT in #34207
chore: update LLK submodule to c2ed028 by @fvranicTT in #34284
[skip ci] Add lightweight asserts and llk asserts to setup job by @dpopovTT in #34221
Fix which sequence lengths will be warmed up for a model by @nostojicTT in #34248
Migrate op to new infra: sharded_to_interleaved/sharded_to_interleaved_partial by @philei-tt in #33996
Set reset_batch=False as default by @rdraskicTT in #34272
#30261: Migrate logit as a device op by @mouliraj-mcw in #31296
Inline [] operator in ShapeBase by @rpavlovicTT in #34035
[skip ci] ci: provide more information when doing the llk uplift (added issue information) by @fvranicTT in #34224
Migrate op to new infra: move by @vtsilytskyiTT in #33173
Revert "#28087 binary op sharding performance optimization (#34132)" by @mbezuljTT in #34300
E2E demo for Panoptic DeepLab on 20 cores by @ianastasijevicTT in #34081
readability-container-data-pointer by @wilderfield in #34311
[CONV] Relaxing the pcc threshold for the recently regressed test case by @dstoiljkovicTT in #34317
Updates to rounding ops. by @jasondavies in #34149
#30047: Refine conv2d function documentation by @bbeggsTT in #33728
#33538: Move swish from composite infra to unary infra by @mouliraj-mcw in #33641
28564: set runtime args for all cores in override_runtime_args_mc_hc_tiled_interleaved by @bbradelTT in #34260
Setting new values needed by Quasar linker by @arikTT in #34189
[skip ci] Temporarily skip failing tests to reduce glx queue times in CI by @subinleeTT in #34257
readability-non-const-parameter by @wilderfield in #34321
Move DeviceManager to MetalContext by @mfiltser-TT in #34056
Fix syntax error in CI configuration file by @sosborne-TT in #34329
Adjust qwen25 CI perf thresholds by @yieldthought in #33902
[TT-Transformers] Fix long context test cases with max_seq_len override by @gwangTT in #34136
fix: adjust const char* size checks using strlen for assertions by @fvranicTT in #34318
[skip ci] Fixing Slack Messaging Error in Auto Triage by @ebanerjeeTT in #34332
Changed perf targets for OFT by @ddjekicTT in #34304
[skip-ci] T3000 model perf tests: fix JSON object in generate-matrix step by @pbaraTT in #34341
#32775: Migrate op to new infra: reduce_scatter_minimal_async by @ssundaramTT in #33929
bugprone-forwarding-reference-overload by @wilderfield in #34253
deepseek_v3: make transfer_row safe for sharded tensors by @yieldthought in #34334
modernize-use-default-member-init by @blozano-tt in #34176
Add sub grid support to many of the to-layout conditions and to tilize with padding, untilize, untilize with unpadding by @jvegaTT in #34268
Integrate watcher sanitize into safe L1 accessor by @nhuang-tt in #34275
Fix blackhole multi card demo test failure due to incorrect max_seq_len with proper prefill tracing enabled by @alingTT in #34285
Add flag to enable benchmark mode in fabric test kernels by @nnyamagoudar-TT in #34173
Fix Qwen prefetcher perf tests in Llama galaxy unit test CI (overflowed global cb max num pages) by @alingTT in #34293
Added tracing for Whisper by @atupe-tt in #33927
Make permute/transpose consistent with nullopt by @nsextonTT in #34292
[skip ci] Add new user mappings in codeowners-group-analysis.yaml by @Aswinmcw in #34392
adding a skip, since cb_wait_front is not working in distributed_laye… by @vsureshTT in #34390
Fix Padded prefill end idx {some number} exceeds max seq len {some number} during prefill warmup by @nostojicTT in #34277
Nanobind support by @ThisIsFineTM in #23160
Revert "Nanobind support" by @nsextonTT in #34395
#23667: Move test_matmul_benchmark.cpp to benchmark directory by @mgajewskiTT in #34230
[skip ci] Delete models/docs/MODEL_GRADUATION.md by @bbeggsTT in #25921
[skip ci] Create .rst file for TT-SMI by @bbeggsTT in #31847
[skip ci] Updating Advanced Perf doc MCQ section. by @bbeggsTT in #33087
[skip ci] Add changelog for release v0.65.0-dev20251129 by @bbeggsTT in #34052
[skip ci] add changelog file for release v0.65.0-dev20251205 by @bbeggsTT in #34049
chore: update LLK submodule to dfd8dc0 by @fvranicTT in #34374
Update submodules when new ref is checked out for release models image by @dpopovTT in #34402
test: disable constantly failing layernorm test by @fvranicTT in #34411
[skip ci] Delete models/docs/MODEL_ADD.md by @bbeggsTT in #25920
test: turn off test_move_op on BH P100 by @fvranicTT in #34418
Cleanup the low level tt/tti instructions to move them into tt-llk by @CodeMan62 in #28465
[Reduce APC] Move n150 tt-train-cpp-unit-tests from APC to L2Nightly by @kkabilarTT in #34419
meta's lib cross attention block replaced to HF equivalent by @epam-ioannis-alexiou in #34242
fix output coreranges ordering in all reduce by @kpaigwar in #34198
Copy from device after every trace execution in tt-transformers prefill by @rdraskicTT in #34597
fix penalties bugs by @sraizada-tt in #34697
SDXL perf targets reverted (#34779) by @mbezuljTT in #34788
SDXL TP=2 prompt batch hotfix by @mbezuljTT in #34429
force batched prefill for users>=16 by @sraizada-tt in #34701
Add logprobs for llama3.3-70b Galaxy by @djordje-tt in #34676
penalties fixes for llama 8b by @sraizada-tt in #34776
Device sampling in prefill for llama70b by @tchedaTT in #34972
Add prefill sampling support to TTT models by @sraizada-tt in #35021
Remove prefetcher dangling reference from previous test by @djordje-tt in #35061
Fix batched prefill pcc issue by @rdraskicTT in #35059
Llama-3.1-8B decode TSU optimizations by @jonathansuTT in #35142
[skip ci] Re-gen Docker containers (#35305) by @acvejicTT in #35438

New Contributors

@arichinsTT made their first contribution in #32997
@bklockiewiczTT made their first contribution in #33397
@CodeMan62 made their first contribution in #28465

Full Changelog: v0.65.0-rc14...v0.65.1
