v0.45.0
🚀 Features
- #6204: added support for num_users < 32 for update cache op.
- PR: #6213
- #6247 Llama2 Galaxy MLP implementation
- PR: #6265
📦 Uncategorized
- #4736: Add support for moreh_norm op
- PR: #4864
- Fix moreh_layernorm rstd
- PR: #5616
- #5508: Change test_moreh_layernorm.py for debugging
- PR: #5619
- #4686: add infra for sharing global struct among ops
- PR: #5456
- #5592: Fix pcc on Falcon 7b prefill by turning on l1 packer on MLP 4h-to-h matmul
- PR: #5686
- Fix layernorm beta data format reconfig
- PR: #5760
- Add linked support for in0 in1 mcast in matmul
- PR: #5759
- #4957: optimizing construct_2d_padded_tensor_list
- PR: #5614
- #4003: added ttnn.as_tensor and enabled support for caching torch tensors (see the sketch at the end of these notes)
- PR: #5809
- Revert "#0: Fix for fail in asinh backward"
- PR: #5886
- #5829: Use moreh_common.hpp for data movement kernels across moreh OPs
- PR: #5833
- Barsic/ttnn ops
- PR: #5892
- #6030: Update resnet performance metrics
- PR: #6030
- #5876: pytest & c++ test logging cleanup
- PR: #5987
- #0: Use both 2x2 and 2x4 machines on every scheduled run
- PR: #6091
- Add single core matmul benchmark
- PR: #5997
- #6079: Update FORCE_INLINE to be a no-op when watcher is enabled
- PR: #6092
- #5980: Fix a hard-coded bounds check in dprint
- PR: #6028
- #5389: merged ttl and ttnn tensor classes into one
- PR: #6051
- Initial Performance Model
- PR: #6025
- fix ci
- PR: #6089
- TTNN RN50 :: on the road to match perf with TTLIB version
- PR: #6046
- #4438: Optimized single-core fold op
- PR: #5999
- #5589: Add repeat-interleave and addcmul sweeps
- PR: #6102
- #6055: Add square backward support
- PR: #6071
- #6057: Add backward support for lgamma
- PR: #6059
- #6056: Add backward support for frac and trunc
- PR: #6065
- #6066: Add support for backward log sigmoid
- PR: #6069
- #6002: Add backward support for binary maximum
- PR: #6003
- Ngrujic/improve conversion to bfloat8b in sweeps
- PR: #6068
- #5829: Use moreh_common.hpp for compute kernels across moreh OPs
- PR: #6122
- #0: Remove post-commit label from multi device pipeline because it's not actually post commit
- PR: #6142
- Add pack l1 acc to resnet conv
- PR: #6054
- #6144: Skip 512x512 cross attn 2d upblock for now in nightly because it hangs
- PR: #6145
- #6061: Add tanhshrink, threshold, Unary EQ backward ops support
- PR: #6137
- Width Sharded Concat for Unet
- PR: #5776
- #5184: uncommenting various moreh test cases
- PR: #6143
- Fix compute kernel config arg for resnet50
- PR: #6147
- Nsmith/untilize unit test
- PR: #6105
- Revert "Revert "#5389: merged ttl and tensor classes into one""
- PR: #6158
- #4438: Do not use the new fold op in Resnet tests
- PR: #6153
- Remove corerangeset that does not work on wormhole
- PR: #6156
- #6129: Expose kernel config attrs and use 4 dst tiles for fp32 configs
- PR: #6134
- #5391: Add device perf
- PR: #5875
- #0: Use multiplier for wormhole b0 mulsi3
- PR: #6160
- #4003: removed ttnn.Tensor autoclass from tensor.rst
- PR: #6170
- TTNN MultiDevice Support
- PR: #6131
- build artifacts
- PR: #6111
- #4947: Add noc alignment checks to watcher
- PR: #5998
- Add ttnn multi-chip unit test for checking device shards
- PR: #6179
- Nsmith/fix unet
- PR: #6141
- #6043: Random program stress test of command queues
- PR: #6044
- Logit and logiteps backward support (see the reference sketch at the end of these notes)
- PR: #6016
- Backward support for log2
- PR: #6064
- Add missing ttnn tests and disable broken tests until issues are fixed
- PR: #6186
- Fix Events feature for FD1.3 (out-of-order event ids, events feature missing) #6093
- PR: #6181
- #5873: make top-level post commit workflow re-useable
- PR: #6188
- #5589: add groupnorm for ttnn sweeps
- PR: #6167
- Ngrujic/ttnn sweeps 4
- PR: #6135
- Add ethernet datamover (EDM) - a foundational ethernet transfer engine
- PR: #5718
- #6116: Add backward support for softshrink
- PR: #6118
- #0: Add verbose make logs to artifact and make nicer name on metal
- PR: #6199
- #0: Only use 2x4 setup for multi-card WH CI as 2x2 does not provide us good feedback
- PR: #6202
- #4809 dprint tensix regs
- PR: #6072
- #4003: fixed bloom perf test
- PR: #6208
- #6187: Conv bugfix
- PR: #6205
- #0: concat RM: support variable stick widths across inputs
- PR: #6207
- TTNN RN50 on WHB0
- PR: #6173
- #6084: Lower thresholds slightly after using proper configs for device resnet
- PR: #6214
- Fast dispatch 2.0 proof of concept
- PR: #6176
- #6218: add pytest for matmul 1d 2d
- PR: #6219
- #6177: use `is_tensor_storage_on_device` so it works for MultiDeviceStorage
- PR: #6178
- #6082: support workers + eth cores in one program
- PR: #6172
- #6215: Rename TensorToMeshMapper/MeshToTensorComposer
- PR: #6220
- #6164: Update test_noc_unicast_vs_multicast_to_single_core_latency to not use same cores for producer and consumer on WH
- PR: #6224
- #6117: Add backward support for softplus
- PR: #6128
- #6223: remove redundant call to context switch
- PR: #6225
- Integrate EDM with all-gather.
- PR: #6169
- #6136: Add backward support for unary LE and GE
- PR: #6138
- #5398: fix unicast binaries
- PR: #6231
- Barsic/ttnn ops 2
- PR: #6070
- #5380: Add wormhole_b0 model perf tests, only falcon7b in ttlib for now
- PR: #6216
- #5372: Updated README.md file for demo
- PR: #6060
- #4003: updated ttnn.concat to have a registered fallback
- PR: #6127
- Llama2 functional bringup
- PR: #6087
- #5589: Add working BFLOAT8_B sweeps to working folder
- PR: #6192
- FD2.0 rename HostQ->PrefetchQ, add multi-core capability, fix NOC coords
- PR: #6229
- #0: bugfix in ttnn resnet caught by nightly
- PR: #6251
- #0: fix tt_bisect build bug
- PR: #6256
- Watcher Asserts
- PR: #6175
- #6183: add unit test for sd matmul ops
- PR: #6246
- #6254: Make program cache per device (see the sketch at the end of these notes)
- PR: #6255
- #5394: Add functional version of Mamba architecture
- PR: #5948
- #6257: Add temporary convenience script for 800MHz / new eth reset dependent CI
- PR: #6258
- #5661: Enable gtests for fast dispatch + R chip
- PR: #6110
- Alex/metal/bmm large block untilize out
- PR: #6201
- #5389: made tensor attributes public and use ttnn::Shape instead of tt::tt_metal::Shape for storing shape
- PR: #6261
- Revert "#6183: add unit test for sd matmul ops"
- PR: #6278
- #4003: print all of the L1 buffers using ttnn.print_l1_buffer_state
- PR: #6268
- #4003: print all of the L1 buffers using ttnn.print_l1_buffers
- PR: #6279
- #4438: Implement sharded multi-core fold op for Resnet50
- PR: #6275
- #6149: disabled the check comparing the generated report with GOLDEN_L1_BUFFER_REPORT because on pipelines it looks different than when running locally
- PR: #6280
- FD2.0 fixes+mcast support for write and packed_write
- PR: #6263
- Shwetank tt/config
- PR: #5843
- #0: Change order of device and use_program_cache fixture in remaining pytests
- PR: #6269
- Softplus with beta and threshold param (see the reference sketch at the end of these notes)
- PR: #6239
- Build tests during artifact creation
- PR: #6286
- #6149: disabled test_print_l1_buffers_of_add_operation
- PR: #6299
- #4003: updated ttnn.to_torch to work with bfloat8_b tensors that are not a multiple of tile size without tile padding (see the sketch at the end of these notes)
- PR: #6277
- #0: add to/from L1 reshard test
- PR: #6309
- #0: Add back deleted shape assertions for interleaved concat
- PR: #6307
- test errors flagged by watcher
- PR: #6320
- #0: fix incremental build
- PR: #6103
- Merge xuncai/llama-attention-galaxy to main: First version of llama-attention galaxy on emulated chips
- PR: #6297
- #6329: Fixing a bug causing mismatch on indices
- PR: #6330
- #6321: Test which sweeps read/write buffer and just checks that the e…
- PR: #6322
- Support moreh_getitem forward
- PR: #6227
- #6125: Update in0_block_w to be full shard width for sharded 2D systolic matmul
- PR: #6262
- #6107: Add softsign, sign, unary ceil backward support
- PR: #6191
- #6226: Add backward support for div
- PR: #6235
- #6234: Add backward support for rdiv
- PR: #6238
- #6236: Add backward support for fmod and remainder
- PR: #6240
- #4003: added positional embeddings to bert and updated ttnn_sharded_optimized_bert to run with batch size of 12
- PR: #6327
- Indexed Fill
- PR: #6328
- #5589: remove dtype in gen function sweep tests where needed
- PR: #6249
- #6347: Print built-in defines once only
- PR: #6351
- #0: Add Mo as code owner on profiler code
- PR: #6352
- #0: Simplify tt_lib.scripts package by adding a specific tt_eager/scripts directory and putting the production scripts in there, whereas development scripts will stay in /scripts
- PR: #6324
- #0: Fixture reorder changes reverted for falcon_7b perf test
- PR: #6318
- #5424: remove metal_ckernel_sfpu
- PR: #5665
- #0: Update remaining tt_lib.program_cache calls to use device APIs
- PR: #6357
- #6183: add unit test for sd matmul ops
- PR: #6323
- #6289: fix dispatcher page calculation
- PR: #6340
- #5924: Enable unet on wormhole_b0 changes
- PR: #6198
- #6325: skip test_multi_device.py for grayskull arch
- PR: #6332
- Alex/metal/pack untilize no repack
- PR: #6371
- #6144: Not hanging on GS or WH with or without Watcher
- PR: #6373
- Agrebenisan/swq hwq cardinality cleanup
- PR: #6369
- #6146: Add backward support for conj
- PR: #6272
- #0: bug fix in UTWH: use div_up instead of div trunc when calculating CB sizes
- PR: #6367
- Fix To/From Sharded Bug
- PR: #6381
- #6206: Fix resharding page mapping
- PR: #6379
- #5733: ttnn/cpp: run_operation for multi-device
- PR: #6376
- #5589: TTNN - l1 loss sweep and unit tests
- PR: #6375
- Add Support to Allow Input Batch Offset for Update Cache when Users < 32
- PR: #6331
- Npetrovic/ttnn bin ops
- PR: #6045
- Use/dprint configuration registers
- PR: #6287
- #5629: Don't create new threads during `CompileProgram`, use tf to manage threadpool instead
- PR: #5860
- Revert "Npetrovic/ttnn bin ops"
- PR: #6399
- #6385: Update ttnn.create_sharded_memory_config to correctly determine shard shape for height/width sharding (see the sketch at the end of these notes)
- PR: #6386
- TestPrintEthCores fix
- PR: #6389
- #6266: Refactored Llama 2 MLP & attention
- PR: #6358
- Bteng/fdworkflow cleanup
- PR: #6337
- Initial perf model for WH
- PR: #6283
- #6363: Fix so remote does not try direct write to completion queue
- PR: #6398
- Add support for BFP4_b format
- PR: #6395
- #6378: Disable failing test for now
- PR: #6417
- fix alignment issue for indexed fill reading in batch_ids
- PR: #6409
- #4003: added register_pre_operation_hook and register_post_operation_hook
- PR: #6396
- #6349: Add missing asserts for concat op. Minor improvement to concat kernel setup code
- PR: #6401
- #0: remove printf
- PR: #6421
- add post-commit ttnn and model pipelines
- PR: #6413
- re-direct to same internal yaml from top-level fd, ttnn, or model workflows
- PR: #6431
- Bteng/ttnn model artifact dep
- PR: #6432
- #4003: remove inner ops from pre and post hooks
- PR: #6422
- #5163: Support optional output tensors in moreh groupnorm
- PR: #6407
- #6424: Split TestPrintEthCores into two kernels as workaround.
- PR: #6426
- Support moreh arange row major output
- PR: #6435
- #6284: Add backward support for imag and real
- PR: #6354
- #5163: Change are_needed_outputs -> are_required_outputs
- PR: #6438
- #5163: Update MorehGroupNormBackwardGammaBetaGrad
- PR: #6441
- Ngrujic/ttnn sweeps 1
- PR: #6393
- #0: fix clang build
- PR: #6408
- Update cache op optimizations
- PR: #6372
- #6281: Skip 2 Non-Deterministic failing Events tests for GS
- PR: #6455
- Asarje/ttnn rn50 wh bfp8
- PR: #6291
- #6453: Add watcher asserts to perform CB bounds checking
- PR: #6456
- #6313 Llama 2 Galaxy Decoder implementation
- PR: #6420
- #5733: ttnn multi-device cleanup memory management
- PR: #6434
- #6436: fix ttnn.to_layout() to correctly return RuntimeError
- PR: #6437
- #4957: split ttnn tests into 2 groups
- PR: #6457
- #4957: 3-way ttnn test split
- PR: #6461
- #6410: Encapsulate tensor attributes inside a shared_ptr
- PR: #6411
- #5589: TTNN mse loss sweeps
- PR: #6425
- #6363: observe max tensix slots in bidir tunneller
- PR: #6458
- #6075: add reshard support to the halo op
- PR: #6335
- updates to bring post-commit pipeline time to < 30 minutes
- PR: #6479
- #6123: Add support for backward mvlgamma
- PR: #6242
- #6390: L1 loss pcc issue
- PR: #6492
- #6040: enable bidirectional support for all-gather
- PR: #6416
- #6496: No longer gate upload release step on the frequent pipelines passing, and just let them run for convenience
- PR: #6497
- TTNN sweeps: binary ops and fixes
- PR: #6483
- #0: Tag name for eager - Package workflow, which is the impl of the main version, with appropriate qualifiers to avoid confusing people
- PR: #6498
- fix for WH
- PR: #6485
- #6414: Ensure we run single and multicore/multi device sfpu tests. Lo…
- PR: #6507
- FD2.0 CQ_DISPATCH_CMD_WRITE_PAGED initial implementation and tests
- PR: #6486
- #6510: Support enqueue write-only and read-only tests
- PR: #6511
- integrate fd multiqueue post commit into post commit
- PR: #6230
- #6513: move multi-device files under tt-metal/impl/device
- PR: #6514
- #0: ttnn-falcon: add packer_l1_acc to MLP module
- PR: #6515
- Add new frequent pipeline for multi nebula CI
- PR: #6512
- Non-zero indices op
- PR: #6473
- Add native repeat op and RM concat
- PR: #6490
- Add llama2_70b into multi-nebula frequent ci pipeline
- PR: #6521
- #6493: update backward softplus with beta and threshold param
- PR: #6380
- Jrock/falcon op tests
- PR: #6509
- Jrock/falcon40b utility test update
- PR: #6506
- Ngrujic/debug yaml based sweep tests
- PR: #6446
- #6241: Prefill on 8 chips
- PR: #6502
- #6503: Llama 2 Refactor All Test files, allow repro on any device
- PR: #6504
- #5480: Fix memory address hack in FD2 test
- PR: #6554
- #5592: Interleaved2ShardedPartialOp, Sharded2InterleavedPartialOp, Matmul1d height sharding + padding fixes
- PR: #6508
- #0: Modify Bert Large Perf test to delete intermediates at the end of each iteration
- PR: #6558
- Alex/metal/max pool dm perf
- PR: #6552
- #6524: clean up the to/from_device_mesh functions
- PR: #6525
- #5075: Watcher pause feature initial implementation
- PR: #6339
- #6562: Fix ttnn falcon7b by using arch-specific ComputeKernelConfig
- PR: #6564
- #6374: Fix to ensure that we never get an odd number of pages in our …
- PR: #6557
- Aliu/erisc launch msg
- PR: #6523
- #0: Remove temporary frequent pipeline api tests as that was meant to be a temporary stop gap for people wanting to add T3K tests until we got real CI for it
- PR: #6555
- #0: Delete llama_old models and their tests because we have no need for them anymore in light of WH-only T3K llama
- PR: #6561
- #4584: Demo file for functional whisper
- PR: #6489
- Ngrujic/ttnn sweeps
- PR: #6494
- Silu op for Sharded layout
- PR: #6459
- moreh getitem supports tilized input and row-major index
- PR: #6447
- #6568: Add lm-evaluation-harness support for Mamba reference model
- PR: #6569
- Barsic/ttnn ops 3
- PR: #6500
- Alex/metal/max pool remove init
- PR: #6600
- #0: Fix Falcon40B tests for CI
- PR: #6573
- FD2 test fixes
- PR: #6604
- #6450: compile fix for main
- PR: #6608
- #6377: Split perf models pipeline by arch and model collection type, as we need very specific ownership of models for Javelin
- PR: #6606
- #6577: Use CreateSemaphore api rather than hardcoded addresses in leg…
- PR: #6603
- #5733: fix multi-device to_host call
- PR: #6584
- #6472: reduce outstanding issue cmds
- PR: #6615
- #5917: Add test coverage for watcher kernel_id reporting
- PR: #6582
- Unet Concat Optimization
- PR: #6478
- #0: Properly declare the ttnn pybind dependency files for Make, as the previous one was trying to find them in the src directories, when they were really in the build
- PR: #6621
- Fast Dispatch on Idle Ethernet Core
- PR: #5919
- reduce timeout for post-commit pipelines to 45 minutes
- PR: #6624
- #6462: Upsample kernel opt
- PR: #6484
- #3766: Various fixes for Ubuntu 22.04 / Python 3.10
- PR: #6625
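🧪 Usage sketches
The sketches below illustrate a handful of the user-facing changes referenced above.

For #4003 (`ttnn.as_tensor` with torch-tensor caching): a minimal sketch, assuming the `cache_file_name` keyword is what enables the on-disk cache; the surrounding device calls follow the standard ttnn Python API.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

torch_weight = torch.randn(1024, 1024)

# On the first call the converted tensor is (assumed to be) serialized under
# the cache file name; later calls load it from disk instead of re-converting.
weight = ttnn.as_tensor(
    torch_weight,
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    cache_file_name="weights_cache/linear_weight",
)

ttnn.close_device(device)
```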
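For "Logit and logiteps backward support": the underlying math, as a plain torch reference rather than the ttnn kernel. Since logit(x) = log(x / (1 - x)), its derivative is 1 / (x * (1 - x)); for logiteps the input is first clamped to [eps, 1 - eps]. Zeroing the gradient outside the clamp range follows PyTorch's convention and is an assumption about the ttnn behavior.

```python
import torch

def logit_backward_reference(grad_output, x, eps=None):
    # d/dx log(x / (1 - x)) = 1 / (x * (1 - x))
    if eps is None:
        return grad_output / (x * (1.0 - x))
    # logiteps: clamp the input, and (assumed, per PyTorch's convention)
    # zero the gradient where the input fell outside [eps, 1 - eps].
    in_range = (x >= eps) & (x <= 1.0 - eps)
    xc = x.clamp(eps, 1.0 - eps)
    return torch.where(in_range, grad_output / (xc * (1.0 - xc)), torch.zeros_like(x))
```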
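For #6254 (program cache made per device): a sketch of the device-level API these notes migrate to; the method names `enable_program_cache` and `num_program_cache_entries` are assumptions based on that migration.

```python
import ttnn

device = ttnn.open_device(device_id=0)
device.enable_program_cache()  # caching is now scoped to this device only

# ... run ops here; repeated ops reuse compiled programs from this device's cache ...

print(device.num_program_cache_entries())  # inspect this device's cache
ttnn.close_device(device)
```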
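For "Softplus with beta and threshold param" (and the matching backward update in #6493): a reference sketch of the semantics, which match the standard torch.nn.functional.softplus definition; this is plain torch, not the ttnn kernels.

```python
import torch

def softplus_reference(x, beta=1.0, threshold=20.0):
    # softplus(x) = (1 / beta) * log(1 + exp(beta * x)); pass x through
    # unchanged where beta * x > threshold to avoid overflow in exp().
    return torch.where(x * beta > threshold, x, torch.log1p(torch.exp(beta * x)) / beta)

def softplus_backward_reference(grad_output, x, beta=1.0, threshold=20.0):
    # d/dx softplus = sigmoid(beta * x), and 1 in the linear region.
    grad = torch.where(x * beta > threshold, torch.ones_like(x), torch.sigmoid(beta * x))
    return grad_output * grad
```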
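For #4003 (`ttnn.to_torch` with non-tile-aligned bfloat8_b tensors): a round-trip sketch. bfloat8_b requires TILE_LAYOUT, so shapes that are not multiples of the 32x32 tile get padded on device; per the fix, `ttnn.to_torch` should strip that padding again.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

x = torch.randn(1, 1, 30, 62)  # deliberately not tile-aligned

tt_x = ttnn.from_torch(x, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)
y = ttnn.to_torch(tt_x)

assert y.shape == x.shape  # tile padding stripped on the way back to torch
ttnn.close_device(device)
```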
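For #6385 (`ttnn.create_sharded_memory_config` shard-shape fix): a sketch of what the corrected derivation implies; the argument names follow the ttnn Python API as commonly documented and should be treated as assumptions.

```python
import ttnn

mem_config = ttnn.create_sharded_memory_config(
    shape=(1, 1, 1024, 256),
    core_grid=ttnn.CoreGrid(y=8, x=4),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)
# HEIGHT sharding splits rows across all 8 * 4 = 32 cores, so each shard
# should come out as (1024 / 32) x 256 = 32 x 256 rather than a shape
# derived from a single grid dimension.
```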