v0.48.0
📦 Uncategorized
- #7744: Add support for non-4D tensor in moreh_sum, moreh_sum_backward
- PR: #7745
- #5544: Add output tensors parameter to moreh_nll_loss op
- PR: #7194
- #5544: Add output tensors parameter to moreh_sgd op
- PR: #7193
- #5544: Fix package build error
- PR: #7818
- #5544: Add output tensors parameter to moreh_linear op
- PR: #7147
- #5544: Prevent eager unit test failures
- PR: #7835
- #7997: Support non-4D tensor in moreh_softmax
- PR: #7998
- #7816: Bump SD perf target
- PR: #8140
- #8098: Remove temp buffer copying when reading from hugepage to host buffer
- PR: #8138
- #0: Specify DEBUG_STATUS as a string literal instead of multiple chars
- PR: #7981
- #8212: Fix uneven shards for interleaved_to_sharded op
- PR: #8259
- #0: Refactor unpad tile to modify rt args in place and remove dynamic…
- PR: #8308
- #7838: Add support for non-4D tensor in moreh_linear OPs
- PR: #8388
- #0: Use split_work_for_tilize in both tilize and untilize
- PR: #8470
- #8131: resnet-50 fix for b20.
- PR: #8283
- Add support for multiple parameters in EltwiseUnary
- PR: #8398
- #7625: Enable multicore for tilize with padding by default
- PR: #8527
- Trace Support
- PR: #8572
- #0: Switch set runtime args assertion for if kernel was placed on core to TT_ASSERT
- PR: #8645
- #7179: enabling test case. The issue was not reproducible on 8.12 dri…
- PR: #8613
- #4625: Multicore runs for untilize with unpadding on interleaved tensors
- PR: #8622
- #0: Cache program cmds, convert cb configs from write linear to write packed
- PR: #8604
- #0: Make skip and xfail optional in defining sweep tests
- PR: #8687
- Shwetank tt/bcast op
- PR: #8058
- #8364: Disable implicit fallback for ttnn.pad
- PR: #8742
- #8513: Add slack notifications to several more pipelines
- PR: #8685
- #0: Update common RT args to use no stride flag for packed cmd.
- PR: #8696
- #0: Option to write compile_commands.json from CMake
- PR: #8761
- #8718: eltwise testing for bfloat8
- PR: #8753
- Add support for bfloat8 input tensors in Mamba SSM block custom kernels
- PR: #8733
- #8460: Enable Clang-17
- PR: #8516
- #0: Remove overhead in calling functions wrapped in tensor_impl_wrapper
- PR: #8840
- #0: Updating the perf threshold to incorporate the "Merge back uneven reshard" commit.
- PR: #8849
- #6365: Add ttnn host tests
- PR: #8210
- #6365: Revert "#6365: Add ttnn host tests (#8210)"
- PR: #8879
- #4382: fix GH reported vulnerabilities
- PR: #8876
- #0: bump C++ timeout limit to 45 minutes
- PR: #8882
- update unpad doc for slice generality
- PR: #8878
- Convert Falcon7b tt_lib ops and tensors to ttnn.experimental
- PR: #8870
- #6365: Fix ttnn host wheel tests
- PR: #8897
- Add git bisect script
- PR: #8894
- #0: Move falcon40b ci unit tests to different pipeline
- PR: #8891
- #8437: remove default matmul program config
- PR: #8772
- #0: Add myself to ttnn codeowners
- PR: #8905
- #0: Update README.md to include mention of TTNN_CONFIG_OVERRIDES (a usage sketch follows this list)
- PR: #8909
- #0: Fix typos and add TTNN_CONFIG_OVERRIDES parameter descriptions to readme
- PR: #8910
- #0: Add basic sanity checks during matmul program config creation
- PR: #8875
- #8907: Sweep tests for tilize/untilize
- PR: #8908
- #8902: Fixed program caching bug in nlp load slice op and added additional test cases for the op
- PR: #8913
- #8917: Add sweep test for the fold op
- PR: #8918
- #0: Properly support trivial single core case for 1D matmuls
- PR: #8915
- #6343: updated test_perf with test for bloom causal_lm
- PR: #8391
- #6343: Add functional_bloom test_demo
- PR: #8431
- Update README.md
- PR: #8927
- Enable optimised attention by default in falcon prefill.
- PR: #8892
- Replace FreeList shared_ptr with local_shared_ptr
- PR: #8798
- Add dummy_weights mode for mixtral tests
- PR: #8864
- Refactor operation calls: Replace operation::run() with operation::launch_op()
- PR: #8893
- Use HiFi2 to bump Falcon7b prefill PCC
- PR: #8719
- #8902: add input and attn_mask del
- PR: #8928
- #8930: Disable llama perf test
- PR: #8935
- #0: Add third codeowner to matmul path
- PR: #8934
- #0: Add create_venv.sh as environment option in installation instructions
- PR: #8898
- #7083: Composite conv fix for relu called after matmul
- PR: #8919
- #7525: Skip batch 7 metal BERT on WH B0 because it still hangs too often
- PR: #8938
- #8871: Add initial infra/support for dram sharding
- PR: #8901
- #8531: delete all makefiles
- PR: #8546
- #0: Delete dead code from work_split.hpp
- PR: #8950
- #8853: Uplift SFPI to latest w/ BH support
- PR: #8854
- #8725: Warn user if kernel cache is enabled
- PR: #8951
- #0: Minor test_prefetcher fixes
- PR: #8955
- #5389: Move ttnn.repeat to c++
- PR: #8911
- #8131: temp fix for PCC issue on W0.
- PR: #8948
- Optimize e2e perf Falcon40b modifying layernorm
- PR: #8969
- #0: Relax Falcon7b perf target
- PR: #8972
- #0: Resolve segfault in llama async mode
- PR: #8963
- Resnet Optimizations
- PR: #8933
- Create Falcon7b perplexity test and utility functions for text-gen datasets
- PR: #8960
- Revert "#8131: temp fix for PCC issue on W0."
- PR: #8984
- bmm dram sharded opt
- PR: #8947
- #8943: Clean up profiler python_env build flow
- PR: #8949
- #8904: Add slack notifications for T3000 unit-tests
- PR: #8906
- Add unet shallow functional, performance and demo test files
- PR: #8884
- #8932: Multi-Device Mixtral Argmax Support
- PR: #8990
- #8264: Worker thread optimizations:
- PR: #8778
- TTNN tests for bf8 with mk tiled scalar
- PR: #8485
- Ihamer/7468 inject noc delays
- PR: #8889
- Support changed csv row orderings in Mixtral's op_perf_results.py
- PR: #8999
- Correct merge issue in op_perf_results.py
- PR: #9001
- #0: Add kernel groups to test_pgm_dispatch
- PR: #8992
- #0: Add docs requirements to python env cache key because it can change the environment as well
- PR: #9010
- #0: Add helper function to create CBs
- PR: #8991
- #8973: Remove TT_METAL_ENV because we don't need it anymore
- PR: #8974
- #5773: Move SD model to demo folder
- PR: #8294
- #6938: Implement softplus as a single kernel
- PR: #8249
- Model team/rotary embeddings llama
- PR: #8812
- #8735: Fix hw/inc/blackhole files for compilation
- PR: #8880
- Improve Mixtral perf with ttlib
- PR: #8971
- Update README.md
- PR: #9014
- #3712: fix old version of GN test
- PR: #9017
- #0: Don't error on unused functions in compiler call
- PR: #9018
- Revert " #8904: Add slack notifications for T3000 unit-tests"
- PR: #9023
- Rtawfik/bh llk api
- PR: #8809
- #0: Added interactive demo
- PR: #9020
- Move Falcon7b before Mixtral in demo pipeline to workaround issue
- PR: #9034
- #8112: Add support for ND tensors to matmul
- PR: #9004
- #0: fix dram read benchmark
- PR: #9019
- Fix bug in utility_functions::Profiler
- PR: #9025
- Remove 1x1 matmul fallback on convolution and generalize convo…
- PR: #8886
- #5389: Remove ttnn.split
- PR: #9027
- #8767: decouple build folder name from build.cpp
- PR: #8780
- #8735: Update common flags for BH build after sfpi module update
- PR: #9024
- #8895: Fix ttnn.as_tensor(..) method for placing tensors on-device
- PR: #8964
- #8539: Add cq_id to run_operation function args
- PR: #9039
- #8632: Support fp32 dest acc en in moreh_sum and moreh_sum_backward
- PR: #8724
- #5044: Add optional output tensor and remove autoformat in eltwise binary ops
- PR: #8394
- #8895: Fix failing regression test in dump_tensor(...) API
- PR: #9040
- More Resnet Optimizations
- PR: #8993
- #4858: add typecast fp32 to uint32 op
- PR: #9033
- #8995: refactoring moreh arange
- PR: #8996
- #0: Add ccache option to build_metal.sh
- PR: #9015
- Update Mixtral perf figures
- PR: #9048
- #8349: Use BFP4_B for attention mask in falcon7b optimised prefill.
- PR: #9047
- #0: Add CODEOWNERS for build_metal.sh
- PR: #9053
- Rtawfik/add binary reuse metal
- PR: #8727
- Update watcher.rst - use double backticks
- PR: #9054
- Falcon40b tt_lib to ttnn.experimental
- PR: #9008
- #0: fix dram sharded program cache
- PR: #9031
- #7083: New halo fix for enabled program cache
- PR: #8987
- #9051: Enable Llama model perf test
- PR: #9052
- #8764: Single card WH demo tests
- PR: #9058
- #8764: Various docs fixes for WH release
- PR: #8975
- #0: Correct script locations for nightly single card
- PR: #9062
- #8764: Use new device_l1_small_size fixture for SD demo interactive test
- PR: #9063
- #9059: Update matmul test pcc
- PR: #9061
- #0: Ensure weka mount is active for demo tests, otherwise they won't run
- PR: #9069
- #0: remove reserve to avoid bad alloc
- PR: #9067
- #8764: Separate n150/n300 demo tests to not run BERT 11 on N150
- PR: #9073
- Remove unnecessary llk sfpu param files
- PR: #9065
- #9059: Add fallback for getting matmul program config
- PR: #9077
- Add grouped convolution support
- PR: #8341
- #8282: Support non-4d tensor and fp32_dest_acc_en for moreh nllloss backward
- PR: #8966
- #8976: moreh_getitem receive signed integer index tensors
- PR: #8978
- #9049: fix moreh_sgd callback and add callback test
- PR: #9050
- #0: Remove argmax multi-device test due to segfault
- PR: #9089
- #7724: Add prototype for autonomous streams for use in tunneller
- PR: #8207
- #9036: GS & BH --> Combine llk param files using variable args
- PR: #9078
- #0: optimize allgather for small tensor sizes
- PR: #9087
- Enable weight caching for long running Mamba tests
- PR: #9002
- #5389: removed early return from validate when enable_fast_runtime_mo…
- PR: #8983
- Removed unnecessary ttnn.to_device() from Mixtral code
- PR: #9097
- Add 2 cq implementation for Resnet
- PR: #9057
- #9084: Rename dockerfile and added virtualenv installation
- PR: #9085
- #0: Watcher interval to not include polling time
- PR: #9038
- #0: Revert "#8264: Worker thread optimizations:"
- PR: #9107
- #5389: disabled failing moreh tests
- PR: #9116
- #5389: disabled failing moreh tests
- PR: #9119
- #5389: disabled failing moreh tests
- PR: #9121
- #0: Update Resnet perf numbers
- PR: #9120
- Split dispatcher commands into packets + prefetcher relay_linear bug fix and test improvements
- PR: #8814
- #6448: re-enable all-gather bidir for dim 0,1
- PR: #9104
- #8890: Reduce size of pack_src|dst_format constexprs
- PR: #9115
- #0: merge all kernels into one group
- PR: #9125
- #7724: Disable a test to reduce runtime
- PR: #9129
- ttnn multi-chip changes for galaxy support
- PR: #9090
- #9026: Fix FD dispatcher wait on wrapped value
- PR: #9113
- #0: Add back Async Mode optimizations
- PR: #9130
- Add support for bfloat8 activations in Mamba
- PR: #8768
- #9118: Fix moreh getitem, moreh nllloss validation error
- PR: #9131
- Update ViT E2E number in README.md
- PR: #9136
- #4858: enable typecast fp16b to uint16
- PR: #9132
- #8540: Upgrade eltwise binary ops to support queue_id / output_tensor / uint output dtype
- PR: #9071
- #9095: implement callback helper function
- PR: #9096
- #5044: Add optional output to where op
- PR: #9055
- #0: enable multi-device tensor support for moreh sum op
- PR: #9126
- #5337: Mixtral dense matmul after all-gather
- PR: #9155
- Update Mamba decode performance metrics
- PR: #9134
- #8683: Add Unary right shift
- PR: #8921
- Snijjar/issue 7724
- PR: #9138
- #5044: add optional output to BW ops EQ, add, addalpha, mul
- PR: #8671
- build UMD with same compiler used to compile metal and remove clang 6 as a dependency
- PR: #9133
- #0: change silicon param to session scope
- PR: #9162
- Mo/8223 fd2 dispatch core profiler support
- PR: #8609
- #9006: single-core topk extension to include larger width and height
- PR: #9139
- #9088: fix ttnn_falcon_7b single-device regression in decoder module
- PR: #9166
- #7586: Create unstable branch of WH single card nightly FD
- PR: #9122
- #9143: BH -> Remove unused reduce args
- PR: #9175
- #8563: sweep split_query_key_value_and_split_heads, split and concat
- PR: #8610
- #8407: Remove 1x1 matmul fallback on convolution and generalize convo…
- PR: #9056
- #4252: Update to C++20
- PR: #9070
- #9110: Move typecast to ttnn
- PR: #9146
- Update TTNN sweeps - concatenate heads, embeddings
- PR: #8863
- #9016: adjust nightly t3000 demo test pipeline to run Mon/Wed/Fri
- PR: #9081
- #9088: fix ttnn_falcon_7b single-device regression in attention
- PR: #9183
- #9167: sped up compute program hash
- PR: #9169
- #9109: Add q_id to Eltwise binary EQ
- PR: #9177
- #8662: add initial argmax op single core kernel implementation
- PR: #9180
- #8424: Add new llk-wormhole-b0 commit: remove assert for fp32 zeroacc
- PR: #9188
- #9059: adjust matmul parameters for rounding up in some scenarios
- PR: #9105
- #5389: Move ttnn.repeat_interleave to c++
- PR: #8961
- #9167: updated llama3 ops to not use attributes method and instead to use attribute_names + attributes_values
- PR: #9185
- #8681: Add Floor , Trunc dependant ops
- PR: #8285
- Fuse Mamba block residual projection with activation
- PR: #9187
- #9167: sped up compute program hash
- PR: #9201
- Add trace 2cq version of Resnet
- PR: #9178
- #9167: changed program cache to use unique_any as the value type
- PR: #9203
- #8683: Add Unary left shift
- PR: #8712
- Mixtral: Add EoS token stop to demo
- PR: #9207
- #0: Update Falcon7b CODEOWNERS
- PR: #9204
- #8764: Part 2 fixes for docs for wormhole readiness
- PR: #9170
- Correctly block for the current EP when blocking=true
- PR: #9202
- Applying Llama2 Decode and Prefill Kernels to experimentals folder
- PR: #9214
- #9198: Fix minor regression in some nightly tests due to small packet optimization
- PR: #9199
- Fix softmax sharded program cache hit
- PR: #9212
- #0: add support for in1 dram sharded matmul2d
- PR: #9182
- #0: Fix repack_weights.py script for llama writing params.json contents using out_dir as a file
- PR: #9222
- #8965: deallocate all buffers on device when closing
- PR: #9220
- #0: Update noc_async_read/write docs to not specify only dram coords
- PR: #9225
- #9137: clean target will now remove entire built folder
- PR: #9184
- #9142: BH -> Fix pack api, add constant vector
- PR: #9181
- Standardize llk sfpu inits
- PR: #9260
- #0: Fix jupyterlab pinned to two different versions
- PR: #9262
- #4858: add uint16 to fp16b typecast support
- PR: #9265
- #0: pad subblock size, allow Mixtral shapes to reach 240GB/s
- PR: #9264
- #7083: conv config cleanup in python and c++ changes
- PR: #9075
- #0: Add option to validate program binaries on device before enqueuing program in debug mode
- PR: #9216
- #7822: Fix conditionals for bmm multi core reuse optimized for when to update rt args
- PR: #9273
- #8764: Set TTNN_CONFIG_OVERRIDES if it exists in the ttnn workflow
- PR: #9257
- #9270: tracy linking error fix
- PR: #9274
- #9200: Use project paths in CMake
- PR: #9259
- #0: Make numa node based binding opt-in
- PR: #9172
- #5337: Add extrapolation and skipping to op_perf_results
- PR: #9282
- Update Mistral perf figures
- PR: #9284
- Improve mistral perf test for 1024 seqlen and on-device profiling
- PR: #9283
- Fix log message typo (importting -> importing)
- PR: #9281
- #7586: Move current wh b0 only single-card nightly tests to the ln model
- PR: #9215
- [Falcon7b] Add support for 2k kv-cache size for decode l1-sharded configuration
- PR: #9219
- #0: Update Llama experimental readme
- PR: #9292
- #8725: Update warning for persistent kernel cache
- PR: #9285
- [Falcon7b] Add option to run huggingface model in perplexity test, and add perplexity test to demo ci
- PR: #9266
- #0: Skip failing resnet tests
- PR: #9301
- #8658: Migrate composite unary ops to C++
- PR: #8810
- #5389: updated ShardSpec to use attribute_names + attribute_values instead of attributes
- PR: #9278
- #8764: Run ttnn ipynb tutorials on N150/N300
- PR: #9299
- #8837: Fix Resnet trace 2cq version to write inputs on cq 1
- PR: #9293
- #753: Syncing device host times for tracy profiler
- PR: #8101
- #8940: Get rid of source code directories in local environment to ensure that end to end environment is valid
- PR: #8899
- Fix Mixtral ttnn.eq dtype
- PR: #9306
- #8764: ttnn examples in ci
- PR: #9304
- Binary dest accumulation
- PR: #9272
- Move program configs out of runtime codepath.
- PR: #9320
- #0: Fix import error for skipping ttnn resnet tests
- PR: #9326
- #0: opt dram u-bench to 267GB/s
- PR: #9311
- Add ttnn argmax op
- PR: #9300
- #0: Cleanup bmm multi core reuse optimized ORTAs
- PR: #9327
- #9080: Migrate pipeline owners
- PR: #9310
- TTNN split removal fix
- PR: #9308
- Update sweeps documentation
- PR: #9157
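
For reference on the TTNN_CONFIG_OVERRIDES entries above (#8909, #8910, #9257): the variable is an environment-level override for ttnn configuration, supplied as a JSON string. Below is a minimal sketch of setting it before importing ttnn; the specific keys shown (`enable_fast_runtime_mode`, `enable_logging`) are assumptions used for illustration, and the supported options are described in the README.

```python
import json
import os

# Minimal sketch (not authoritative): TTNN_CONFIG_OVERRIDES is read as a JSON
# string of configuration overrides. The keys below are assumptions; consult
# the README for the documented set of options.
os.environ["TTNN_CONFIG_OVERRIDES"] = json.dumps(
    {
        "enable_fast_runtime_mode": False,  # assumed key: keep extra validation enabled
        "enable_logging": True,             # assumed key: enable per-op logging
    }
)

import ttnn  # overrides are picked up when ttnn loads its configuration
```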