v0.65.1
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not the versions on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/20869526990
📦 Uncategorized
- Remove prefetcher dangling reference from previous test
- PR: #35061
- Fix batched prefill pcc issue
- PR: #35059
- Llama-3.1-8B decode TSU optimizations
- PR: #35142
- [skip ci] Re-gen Docker containers (#35305)
- PR: #35438
What's Changed
- MM/Fused/Reduce Docs Touchups by @edwinleeTT in #33510
- [skip ci] TG Resnet50 test_perf_trace_2cqs tweak by @astancovTT in #33672
- Making check arc robust to firmware version when calculating uptime by @adjordjevic-TT in #33592
- [skip ci] Update tt_transformers docs (and comments) to remove mentions of LLAMA_DIR by @gwangTT in #33669
- Migrate op to new infra: send_async by @kevinwuTT in #33005
- Migrate op to new infra: recv_async by @kevinwuTT in #33200
- #33584: SDXL demo/accuracy test e2e time reporting by @ipotkonjak-tt in #33646
- split_work_to_cores pybind in ttnn module by @arichinsTT in #32997
- Migrate op to new infra: rotary_embedding_llama_fused_qk by @philei-tt in #33655
- Fix GraphQL query in set-opened-on workflow by @jbakerTT in #33687
- Adding 1x glx demo test + testing in CI for Deepseek V3 by @yalrawwashTT in #33508
- Fix motif galaxy demo test by @sadesoyeTT in #33688
- [Reduce APC] Remove cpp N150 runs from APC by @kkabilarTT in #33683
- Fix torch reference tensor in sharded layernorm tests by @rmillerTT in #33564
- #32710: Migrate op to new infra: nlp_create_qkv_heads_decode by @ssundaramTT in #33516
- Removing devicePool from the API by @mfiltser-TT in #33668
- add more cores to 2x harvesting to support further harvested P150s by @yugaoTT in #33613
- Migrate op to new infra: conv2d by @awliu-TT in #33019
- Enable 2 ERISC mode on bh glx upstream tests by @nnyamagoudar-TT in #33571
- chore: update LLK submodule to 91fa6c2 by @fvranicTT in #33685
- Migrate op to new infra: nlp_create_qkv_heads_vit by @philei-tt in #33658
- Migrate op to new infra: nlp_kv_cache_load_slice by @philei-tt in #33650
- [skip ci] updating auto triage token by @ebanerjeeTT in #33700
- #0: Fix (Galaxy) Demo - Motif job by adding NO_PROMPT variable by @dimitri-tenstorrent in #33712
- 1D to support 1x32 chip routing by @daminakaTT in #32575
- Fix check_noc_status on non-default setups by @jbaumanTT in #33522
- Increasing timeout for manual hang detection in triage tests, enable logging when test fails by @adjordjevic-TT in #33670
- Migrate op to new infra: nlp_create_qkv_heads_segformer by @philei-tt in #33664
- Added a fix to the invalid test in test_strided_all_gather_minimal_matmul_async for t3k by @jvegaTT in #33717
- fix to layout regression by @jvegaTT in #33665
- add missing write barrier after noc_semaphore_set by @kpaigwar in #33710
- [skip ci] Fix and simplify set-opened-on workflow by @jbakerTT in #33724
- [skip ci] Add libc++ to CMakePresets by @afuller-TT in #33727
- Add dynamic power throttling to BH by @rdjogoTT in #33627
- Migrate op to new infra: bcast by @vtsilytskyiTT in #33657
- Remove checkout from setup-job action to eliminate SHA-pinned rollout hazard by @Copilot in #33691
- Add additional argument handling in graph serializer by @dgomezTT in #29563
- New Model: `TG Qwen3-32b` with `TG Llama3-70b` Optimizations at 65 t/s/u by @ricozhu-TT in #31018
- Migrate op to new infra: prod_nc by @shutovilyaep in #33562
- Migrate op to new infra: prod_all by @shutovilyaep in #33568
- Migrate op to new infra: argmax by @shutovilyaep in #33310
- enhance fetching time from dram to l1 cmddat_q through prefetching by @mdingTT in #33537
- Whisper - decoder and encoder optimization by @mbahnasTT in #33450
- [skip ci] Adding support for 4x timeshare by @akirby-TT in #33693
- Remove silent defaults from DeepSeekV3 demo and tests by @esmalTT in #33686
- optimize bh fabric rx ack credit path by @SeanNijjar in #33524
- #32626: Remove `Moreh` operations from docs by @mgajewskiTT in #33765
- Set auto triage to false and revert 8dfb324 by @dpopovTT in #33774
- Add ClosetBox fabric test configuration by @jpanasiukTT in #33596
- Implement in-memory GSD/FSD validation to avoid disk I/O by @jpanasiukTT in #33055
- Re-enable Triage tests by @afuller-TT in #33529
- Autopacketization support for fabric data movement - Part 1 by @tlevinTT in #33081
- Revert "Implement in-memory GSD/FSD validation to avoid disk I/O (#33055)" by @tt-rkim in #33792
- Fix set-opened-on workflow: null handling, token validation, and pagination by @jbakerTT in #33781
- Run paged llama attention prefill unit test instead of default attention by @alingTT in #33681
- Decreasing low threshold for heartbeat per seconds to avoid ND test fail by @adjordjevic-TT in #33788
- Add full grid worker forwarding channels for UDM Mux [4/n] by @yugaoTT in #33364
- SFPI 7.12.0 by @nathan-TT in #33720
- [skip ci] Fix GraphQL date type in set-opened-on workflow by @jbakerTT in #33797
- Update Falcon7b PCC and expected output jsons after layernorm op changes broke tests by @skhorasganiTT in #33786
- Fixing print tests with ND failures in tt-sim by @kstevensTT in #33520
- Enabling tt-triage in APC by @tt-vjovanovic in #33798
- Allow the profiler DRAM buffer size to be dynamically allocated depending on a user-specified op count by @sagarwalTT in #33004
- remove deprecated fabric latency tests by @SeanNijjar in #33762
- Fixing 1D Mapping Algorithm in Mesh Device for flipped coords by @Riddy21 in #33499
- [Reduce APC] Run ccl from cpp-unit-tests on merge gate by @kkabilarTT in #33731
- Updated CODEOWNERS to include all files in ttnn/cpp/ttnn/deprecated/ by @fplavecTT in #33807
- [Fabric] fabric unicast scatter multi-chunk by @daminakaTT in #32395
- Optimize fused strided all gather and minimal matmul to read local slice from AG input by @jonathansuTT in #33703
- Improve accuracy of accurate sigmoid_tile by @nmauriceTT in #31266
- [skip ci] Remove cron jobs from t3k and galaxy demo tests by @dpopovTT in #33814
- [Fabric] telemetry to be controlled by env var in more detail by @daminakaTT in #33523
- [skip ci] Optimize set-opened-on workflow by hardcoding IDs by @jbakerTT in #33810
- TT-Fabric Intermesh Traffic VC (VC1) Support [1/n] by @ubcheema in #33750
- [skip ci] Refactor wheel building CI job by @afuller-TT in #33530
- Migrate op to new infra: neighbor_pad_async by @ayerofieiev-tt in #33631
- #33225: cleanup tanh accurate by @KalaivaniMCW in #33226
- Convert some compile time arguments to runtime arguments for dispatch kernels by @mpiseTT in #33379
- Lightweight kernel asserts by @tt-vjovanovic in #33451
- Enable PinnedMemory on Wormhole by @jbaumanTT in #33583
- Modified galaxy CI tests to account for torus links by @jvegaTT in #33822
- [skip ci] Fix create draft release condition by @dpopovTT in #33846
- [skip ci] Update install example sw versions in INSTALLING.md by @gsarabandoTT in #33615
- #28593: [skip ci] Move CCL BH GLX tests to using torus as it's now available on CI and we want SysEng to run it by @tt-rkim in #33757
- Strip unused symbols out of `hal.hpp` in Runtime Host API by @riverwuTT in #32115
- [skip ci] Speed up MultiProducerCommandQueueTest by @blozano-tt in #33839
- concat 1D tensors with tile or rm layout by @jungeunlim-TT in #33634
- Add 2x galaxy DeepSeekV3 module tests to 4x galaxy workflow by @esmalTT in #33862
- [skip ci] Speedup MultiProducerCommandQueueTest.EventSync by @blozano-tt in #33864
- Make Test2CQMultiDevicePrograms* Tests Faster by @blozano-tt in #33843
- make t3k fabric BW test tolerances bigger to only catch large violations by @SeanNijjar in #33868
- Implement DRAM Slicing for conv_transpose2d by @sankarmanoj-tt in #33136
- #32965: Follow up work on ternary sharding by @mouliraj-mcw in #33151
- [Fabric] Build refactor and initial support for switch builder by @aagarwalTT in #33541
- chore: update LLK submodule to d1d37ed by @fvranicTT in #33812
- [skip ci] Fix vllm nightly workflow by @dpopovTT in #33884
- [skip ci] Speed up test_async_runtime by @blozano-tt in #33867
- [skip ci] Make ttnn-core team owner of nd-reshard by @philei-tt in #33887
- [skip ci] Disable unity builds for CodeCoverage build types by @Copilot in #33874
- Migrate op to new infra: reshard by @philei-tt in #33270
- Migrate run_cluster_validation to cxxopts by @jpanasiukTT in #33574
- #33805: add noc async write barriers to some kernels by @bbradelTT in #33811
- Fix block-sharded matmul tile calculation by @mvasiljevicTT in #33777
- Fix Reduce Scatter Composite Ring by @jvegaTT in #33840
- Removed stable_diffusion temporarily from blackhole APC by @astancovTT in #33907
- Improve accuracy of non-Welford layernorm reduce kernels by @rmillerTT in #33120
- #32694: Migrate op to new infra: layernorm_pre_all_gather_op by @ssundaramTT in #33526
- [skip ci] Add initial copilot instructions to enhance review by @blozano-tt in #29611
- [skip ci] Add rate limit diagnostics to set-opened-on workflow by @jbakerTT in #33836
- [skip ci] Make Auto-Triage Automatically Run on Regressions by @ebanerjeeTT in #33909
- Migrate op to new infra: sdpa by @awliu-TT in #33733
- Fixed moreh loss BH alignment issues by @fplavecTT in #33569
- [skip ci] Disable warmup in smoke tests by @blozano-tt in #33926
- [Reduce apc] Move All C++, dispatch, distributed, and tools in cpp-unit-tests from APC to L2-Nightly by @kkabilarTT in #33847
- Migrate op to new infra: concat by @vtsilytskyiTT in #33779
- TT-Fabric Intermesh Traffic VC (VC1) Support [2/n] by @ubcheema in #33876
- [DM] Write After Read Transaction ID Testing by @ryanzhuTT in #33091
- SDXL Add refiner accuracy and refactor sdxl accuracy tests by @jmitrovicTT in #33773
- use fromPaddedShape while compute output specs for Clone Op by @hkwonTT in #33593
- [skip ci] Speedup tt-metalium-validation-smoke tests by @blozano-tt in #33918
- Cleanup `Allocator.hpp` by @riverwuTT in #31105
- Fix reduce scatter cache issue by @sjameelTT in #33808
- fix: add constexpr to sfpu calculate_gelu by @fvranicTT in #33954
- Fix demo testing when building with fresh sandbox by @akirby-TT in #33946
- Revert "removing AG hang workaround for Deepseek V3 (#31025)" by @yalrawwashTT in #33953
- TT-Triage - Summarize running operations across cores by @miacim in #33936
- [skip ci] make auto triage send nested slack messages by @ebanerjeeTT in #33960
- [TTT] fix batch resetting by @sraizada-tt in #33772
- Refactor GPT-OSS codebase by @sraizada-tt in #33648
- Move lightmetal into experimental by @riverwuTT in #33948
- Adjusted Glx perf for Whisper by @atupe-tt in #33962
- Added support for prompt param by @atupe-tt in #33460
- [skip ci] readability-reference-to-constructed-temporary by @blozano-tt in #33969
- Bug fixes for transposed conv2d by @pavlejosipovic in #33465
- improve PCC of `ttnn.experimental.intimg` on client's architecture by @jbbieniekTT in #33642
- Add sharded support for ttnn.clone operation by @mradosavljevicTT in #33471
- refactor: avoid branching in hardmish using `sfpi::vec_min_max` by @fvranicTT in #33984
- fix bug in metal-exalens remapping by @dzivanovicTT in #33674
- fix dependencie for metal_device_id_mapping, no inspector rpc data, t… by @dzivanovicTT in #33899
- Revert "Bug fixes for transposed conv2d (#33465)" by @dpopovTT in #33990
- Implement in-memory GSD/FSD validation to avoid disk I/O by @jpanasiukTT in #33800
- Speeding up triage by caching dispatcher data by @tt-vjovanovic in #33959
- Update TT-NN Visualizer link in TTNN tools by @dcblundell in #33923
- [skip ci] Fix promotion to prerelease and build without tracy by @dpopovTT in #33989
- ttnn.log calls wrong recip init on Blackhole for float32 by @nmauriceTT in #33803
- Fix model trace sweep tests by @Aswinmcw in #33021
- disable dcache usage in fabric on BH by @SeanNijjar in #33509
- [skip ci] bugfixing pipeline status tracker by @ebanerjeeTT in #33998
- Migrate op to new infra: ring_distributed_sdpa by @awliu-TT in #33863
- fix bug 33919 by @mdingTT in #33939
- [skip ci] New Preset to export clang-tidy fixes in parallel by @blozano-tt in #33979
- adding deepseek demo test back to CI by @yalrawwashTT in #34017
- Fixed errors in the all broadcast functions that only appeared in Galaxy by @jvegaTT in #33930
- Add Missing Functionality in Cluster Validation by @jpanasiukTT in #33573
- Wan 2.2 Image-to-Video by @ricozhu-TT in #33850
- Remove obsolete models/tt_transformers/requirements.txt by @gwangTT in #33921
- [skip ci] Update ETA for release 0.65.0 by @bbeggsTT in #33206
- Moving DevicePool into MetalContext by @mfiltser-TT in #33598
- Revert removal of `get_pcie_alignment` in hal by @riverwuTT in #33981
- [skip ci] Remove YOLO entries from models/README by @gsarabandoTT in #33454
- SFPI 7.13.0 by @nathan-TT in #34016
- Clean up quad galaxy CI health check options by @aliuTT in #33952
- #0: Fix test_distributed_layernorm_pre_allgather.py by @ssundaramTT in #34012
- bugprone-unused-local-non-trivial-variable by @blozano-tt in #33956
- 9974: test_transpose_hc misalignment test fix by @bzimmermanTT in #33054
- [skip ci] restore timeout times by @subinleeTT in #34029
- use fromPaddedShape while compute Output specs for Unary Ops by @hkwonTT in #33949
- Const qualify global pointers by @nathan-TT in #34015
- Quick fix for broken test: `test_qwen_accuracy` by @ricozhu-TT in #34046
- Migrate op to new infra: ccl/all_broadcast by @awliu-TT in #33940
- TT-Fabric Intermesh Traffic VC (VC1) Support [3/n] by @ubcheema in #33950
- Fix ttnn nightly L2 moreh tests by @fplavecTT in #34041
- Remove matmul_batched_weights by @aliaksei-sala in #33857
- fix kernel compile issue when watcher is enabled by @SeanNijjar in #34045
- Pjosipovic/restore transposed conv2d fix by @pavlejosipovic in #33999
- Override HF download for stable diffusion on BAPC by @astancovTT in #33995
- modernize-concat-nested-namespaces by @blozano-tt in #33965
- Test system health enable visible devices by @mbezuljTT in #34043
- Fix all_reduce_async hang in Galaxy nightly CI by @itarabanTT in #34068
- Add fixed version of legacy noc non_blocking api by @vvukomanovicTT in #33654
- #33644: `ttnn.sort` indices not producing `uint32` output fix by @mgajewskiTT in #33983
- #0: TopK implementation docs added. by @mgajewskiTT in #34064
- Allow creating MeshDevice spanning subset of ranks by @pstankiewiczTT in #32651
- chore: update LLK submodule to a01054d by @fvranicTT in #34074
- Add deadlock avoidance for UDM [5/n] by @yugaoTT in #33611
- Fix t3k_llama3_70b_tests in t3k demo tests by @dpopovTT in #34069
- [skip ci] Update perf and latest features for llm models (Dec 8) by @skhorasganiTT in #34023
- Revert instrn_buffer to initialized pointer by @nathan-TT in #34078
- #33714: Add Deepseek micro-ops/benchmarks for blitz decode by @TT-BrianLiu in #33263
- Refactor warmup traces to be called from vLLM before first healthy signal by @nostojicTT in #33143
- performance-faster-string-find by @blozano-tt in #34067
- Making test clone compatible with the blackhole grid selection by @jvegaTT in #34082
- Migrate op to new infra: all_reduce_create_qkv_heads by @shutovilyaep in #33826
- [skip ci] Fix CODEOWNERS validation errors for non-existent users by @Copilot in #34095
- TT-Train: Enable SIMD RNG by default, fix tests by @athompsonTT in #33500
- Add TT_METAL_DISABLE_BACKTRACE env var to skip backtrace generation by @rpavlovicTT in #34027
- Remove cb_get_tile and cb_release_tile by @pavlejosipovic in #33889
- Adding stateful 2.0 read and write apis by @abhullar-tt in #33121
- Update defaults for reduce scatter and all gather CCL parameters by @jvegaTT in #34032
- Add DiT dashboard metrics by @sosborne-TT in #33622
- Revert "Remove matmul_batched_weights (#33857)" by @aliaksei-sala in #34101
- MGD Auto Discovery When MGD not provided by @Riddy21 in #33204
- Migrate op to new infra: all_gather_concat_heads_fused by @awliu-TT in #34051
- Adding env variable to ensure CI passes for deepseek-demo by @yalrawwashTT in #34084
- Fix profiler events key error by @mo-tenstorrent in #34110
- [skip ci] Revert "Fix variable shadowing in test_waypoint.cpp" by @blozano-tt in #34106
- modernize-use-bool-literals by @blozano-tt in #34120
- refactor: remove unused parameters from llk functions by @fvranicTT in #33992
- [skip ci] Implement bypass approval command in GitHub Actions workflow by @Aswinmcw in #34131
- Migrate op to new infra: data_movement/repeat by @bklockiewiczTT in #33397
- refactor: avoid branching in hardtanh by using TTI macros directly by @fvranicTT in #33985
- Gemma3 refactorization of model_config.py primarily by @pmilojevicTT in #34007
- Cross attention cache for Whisper by @atupe-tt in #34099
- reduce 120b glx ttft by @handrewsTT in #34085
- Reduce to root op by @nardoTT in #34057
- [skip ci] Use disable profiler flag in git bisect by @dpopovTT in #33986
- High accuracy fp32 exp by @nmauriceTT in #33563
- Added Neighbour Exchange Fabric Topology by @jhaiTT in #33201
- [Reduce APC] Move 2 tests from ttnn misc to L2Nightly by @kkabilarTT in #34116
- Update vllm gen to align with demo gen by @pprajapatiTT in #34044
- [skip ci] Add TT_TRIAGE_JOB_HANG failure signature for device timeout detection by @dpopovTT in #34165
- Abstract away memory areas from linker script by @nathan-TT in #34096
- #32430: Support tuneable block size to reduce L1 usage in `ttnn.convert_to_hwc` by @esmalTT in #33866
- [skip ci] making profiler artifact mismatch errors more clear by @ebanerjeeTT in #34171
- Fix SDPA decode for Q heads greater than 32 (GQA support) by @alingTT in #34113
- feat: support LLK_ASSERT via rtoptions by @fvranicTT in #34153
- Migrate op to new infra: all_reduce_async by @kevinwuTT in #33705
- chore: update LLK submodule to 5218b2c by @fvranicTT in #34155
- [Reduce APC] Move the last 2 cpp jobs to NightlyL2 and remove cpp completely from APC by @kkabilarTT in #34169
- Migrate op to new infra: copy by @vtsilytskyiTT in #33243
- TT-Train flatbuffers by @athompsonTT in #32326
- Fixes for ring attention with bfloat8_b and bfloat4_b data types by @sosborne-TT in #33628
- [skip ci] Update issue templates: bug report and bounty model templates by @minaliuTT in #34087
- Pool2D Race Condition by @wransom-TT in #34115
- SDPA - minor compute optimization by @cglagovichTT in #34092
- PCC threshold change for the new integral image implementation by @ddjekicTT in #34154
- Fix pack untilize for cache update for ct dim > 8 to use regular untilize by @alingTT in #33932
- [skip ci] Simplify clang-tidy config by @blozano-tt in #34187
- Migrate op to new infra: data_movement/transpose by @shutovilyaep in #33423
- #33711: Add semaphore ID to program descriptor by @ssundaramTT in #33735
- Migrate op to new infra: pad op by @MaximArtemovEPAM in #33552
- Reduce num traffic iterations for BH Galaxy health check by @tt-asaigal in #34188
- Re-enable system config based MGD lookup for Multi-Host systems by @tt-asaigal in #34199
- Adding T3K schedule by @akirby-TT in #34201
- Add Multi-Mesh/Proc Pipeline on BH Galaxy by @tt-asaigal in #33638
- [skip ci] Update unary operation documentation example tests by @Aswinmcw in #34065
- [skip ci] Update documentation for GCD and LCM binary op by @Aswinmcw in #34063
- [skip ci] Update model tracer README.md to include currently traced models by @Aswinmcw in #34150
- Fix OOM happened by training on 1x32 with 2D config by @daminakaTT in #34058
- bugprone-unhandled-self-assignment by @wilderfield in #34208
- Fix sdpa reduce copy init parameter issue on BH by @alingTT in #34206
- Extend usage of UNITY_BUILD to more projects by @pavlejosipovic in #33972
- chore: update LLK submodule to 6d67375 by @fvranicTT in #34217
- Make prefill trace optional for GPT-OSS by @handrewsTT in #34223
- Automated Fabric Test Config Generation by @jpanasiukTT in #33595
- #0: Revert layernorm all gather changes for new infra to un-hang a model on BH QB GE by @tt-rkim in #34157
- Add log probs feature to TT-Transformers on T3K by @djordje-tt in #33343
- fix: include `risc_attribs.h` in `watcher_common.h` to fix `tt_l1_ptr` undefined error by @fvranicTT in #34231
- [skip ci] Enable a custom pipeline workflow for model and model-adjacent PRs by @pbaraTT in #32372
- Change the way sequence lengths are padded in `tt_transformers` + model warmup sequence length changes by @nostojicTT in #34114
- Revert "Automated Fabric Test Config Generation" by @ebanerjeeTT in #34241
- migrate meta's image attention to HF equivalent by @epam-ioannis-alexiou in #33553
- Enable sharding support in all_to_al_async_generic by @itarabanTT in #34124
- [skip ci] turn on pinging for auto-triage by @ebanerjeeTT in #34243
- Fix trace only run by @mo-tenstorrent in #33662
- Exposing local variables and arguments in lightweight asserts by @tt-vjovanovic in #34218
- [Deepseek] Add workaround for prefill hang by @pprajapatiTT in #34030
- [skip ci] adding option to disable slack pinging by @ebanerjeeTT in #34264
- [Reduce APC] Move N150 profiler to L2Nightly by @kkabilarTT in #34269
- Adding welford support, Gamma beta Tile support, Sliding window Computation to Distributed Layernorm by @vsureshTT in #31702
- Use prebuilt binaries for TT-Sim by @afuller-TT in #34203
- feat: start using fix: use `calculate_square` from tt-llk by @fvranicTT in #34191
- Add a version of tile reshape that does not cache mappings on device by @nardoTT in #33359
- [skip-ci] Fix: Scheduled t3000 tests do not run by @pbaraTT in #34270
- Migrate op to new infra: all_gather_matmul_async by @awliu-TT in #34109
- Changed attn_out memory_config to skip_mem_cfg by @maksim-tsishkouski-epam in #33917
- #28087 binary op sharding performance optimization by @dchenTT in #34132
- bugprone-crtp-constructor-accessibility by @wilderfield in #34238
- Add minimal broadcast for deepseek batch1 by @nardoTT in #34168
- [skip ci] ViT-p150 Readme Update batch size from 8 to 10 in README by @mbahnasTT in #34280
- add z-router support to fabric builder by @SeanNijjar in #33975
- [skip ci] Retag the image we use to cut the release by @blozano-tt in #31752
- Add lm head to Galaxy unit tests by @alingTT in #34207
- chore: update LLK submodule to c2ed028 by @fvranicTT in #34284
- [skip ci] Add lightweight asserts and llk asserts to setup job by @dpopovTT in #34221
- Fix which sequence lengths will be warmed up for a model by @nostojicTT in #34248
- Migrate op to new infra: sharded_to_interleaved/sharded_to_interleaved_partial by @philei-tt in #33996
- Set `reset_batch=False` as default by @rdraskicTT in #34272
- #30261: Migrate logit as a device op by @mouliraj-mcw in #31296
- Inline [] operator in ShapeBase by @rpavlovicTT in #34035
- [skip ci] ci: provide more information when doing the llk uplift (added issue information) by @fvranicTT in #34224
- Migrate op to new infra: move by @vtsilytskyiTT in #33173
- Revert "#28087 binary op sharding performance optimization (#34132)" by @mbezuljTT in #34300
- E2E demo for Panoptic DeepLab on 20 cores by @ianastasijevicTT in #34081
- readability-container-data-pointer by @wilderfield in #34311
- [CONV] Relaxing the pcc threshold for the recently regressed test case by @dstoiljkovicTT in #34317
- Updates to rounding ops. by @jasondavies in #34149
- #30047: Refine conv2d function documentation by @bbeggsTT in #33728
- #33538: Move swish from composite infra to unary infra by @mouliraj-mcw in #33641
- 28564: set runtime args for all cores in override_runtime_args_mc_hc_tiled_interleaved by @bbradelTT in #34260
- Setting new values needed by Quasar linker by @arikTT in #34189
- [skip ci] Temporarily skip failing tests to reduce glx queue times in CI by @subinleeTT in #34257
- readability-non-const-parameter by @wilderfield in #34321
- Move DeviceManager to MetalContext by @mfiltser-TT in #34056
- Fix syntax error in CI configuration file by @sosborne-TT in #34329
- Adjust qwen25 CI perf thresholds by @yieldthought in #33902
- [TT-Transformers] Fix long context test cases with max_seq_len override by @gwangTT in #34136
- fix: adjust const char* size checks using strlen for assertions by @fvranicTT in #34318
- [skip ci] Fixing Slack Messaging Error in Auto Triage by @ebanerjeeTT in #34332
- Changed perf targets for OFT by @ddjekicTT in #34304
- [skip-ci] T3000 model perf tests: fix JSON object in generate-matrix step by @pbaraTT in #34341
- #32775: Migrate op to new infra: reduce_scatter_minimal_async by @ssundaramTT in #33929
- bugprone-forwarding-reference-overload by @wilderfield in #34253
- deepseek_v3: make transfer_row safe for sharded tensors by @yieldthought in #34334
- modernize-use-default-member-init by @blozano-tt in #34176
- Add sub grid support to many of the to-layout conditions and to tilize with padding, untilize, untilize with unpadding by @jvegaTT in #34268
- Integrate watcher sanitize into safe L1 accessor by @nhuang-tt in #34275
- Fix blackhole multi card demo test failure due to incorrect max_seq_len with proper prefill tracing enabled by @alingTT in #34285
- Add flag to enable benchmark mode in fabric test kernels by @nnyamagoudar-TT in #34173
- Fix Qwen prefetcher perf tests in Llama galaxy unit test CI (overflowed global cb max num pages) by @alingTT in #34293
- Added tracing for Whisper by @atupe-tt in #33927
- Make permute/transpose consistent with nullopt by @nsextonTT in #34292
- [skip ci] Add new user mappings in codeowners-group-analysis.yaml by @Aswinmcw in #34392
- adding a skip, since cb_wait_front is not working in distributed_laye… by @vsureshTT in #34390
- Fix `Padded prefill end idx {some number} exceeds max seq len {some number}` during prefill warmup by @nostojicTT in #34277
- Nanobind support by @ThisIsFineTM in #23160
- Revert "Nanobind support" by @nsextonTT in #34395
- #23667: Move `test_matmul_benchmark.cpp` to benchmark directory by @mgajewskiTT in #34230
- [skip ci] Delete models/docs/MODEL_GRADUATION.md by @bbeggsTT in #25921
- [skip ci] Create .rst file for TT-SMI by @bbeggsTT in #31847
- [skip ci] Updating Advanced Perf doc MCQ section. by @bbeggsTT in #33087
- [skip ci] Add changelog for release v0.65.0-dev20251129 by @bbeggsTT in #34052
- [skip ci] add changelog file for release v0.65.0-dev20251205 by @bbeggsTT in #34049
- chore: update LLK submodule to dfd8dc0 by @fvranicTT in #34374
- Update submodules when new ref is checked out for release models image by @dpopovTT in #34402
- test: disable constantly failing layernorm test by @fvranicTT in #34411
- [skip ci] Delete models/docs/MODEL_ADD.md by @bbeggsTT in #25920
- test: turn off test_move_op on BH P100 by @fvranicTT in #34418
- Cleanup the low level tt/tti instructions to move them into tt-llk by @CodeMan62 in #28465
- [Reduce APC] Move n150 tt-train-cpp-unit-tests from APC to L2Nightly by @kkabilarTT in #34419
- meta's lib cross attention block replaced to HF equivalent by @epam-ioannis-alexiou in #34242
- fix output coreranges ordering in all reduce by @kpaigwar in #34198
- Copy from device after every trace execution in tt-transformers prefill by @rdraskicTT in #34597
- fix penalties bugs by @sraizada-tt in #34697
- SDXL perf targets reverted (#34779) by @mbezuljTT in #34788
- SDXL TP=2 prompt batch hotfix by @mbezuljTT in #34429
- force batched prefill for users>=16 by @sraizada-tt in #34701
- Add logprobs for llama3.3-70b Galaxy by @djordje-tt in #34676
- penalties fixes for llama 8b by @sraizada-tt in #34776
- Device sampling in prefill for llama70b by @tchedaTT in #34972
- Add prefill sampling support to TTT models by @sraizada-tt in #35021
- Remove prefetcher dangling reference from previous test by @djordje-tt in #35061
- Fix batched prefill pcc issue by @rdraskicTT in #35059
- Llama-3.1-8B decode TSU optimizations by @jonathansuTT in #35142
- [skip ci] Re-gen Docker containers (#35305) by @acvejicTT in #35438
New Contributors
- @arichinsTT made their first contribution in #32997
- @bklockiewiczTT made their first contribution in #33397
- @CodeMan62 made their first contribution in #28465
Full Changelog: v0.65.0-rc14...v0.65.1