v0.45.0
🚀 Features
- #6204: added support for num_users < 32 for update cache op.
- PR: #6213
- #6247 Llama2 Galaxy MLP implementation
- PR: #6265
📦 Uncategorized
- #4736: Add support for moreh_norm op
- PR: #4864
- Fix moreh_layernorm rstd
- PR: #5616
- #5508: Change test_moreh_layernorm.py for debugging
- PR: #5619
- #4686: add infra for sharing global struct among ops
- PR: #5456
- #5592: Fix pcc on Falcon 7b prefill by turning on l1 packer on MLP 4h-to-h matmul
- PR: #5686
- Fix layernorm beta data format reconfig
- PR: #5760
- Add linked support for in0 in1 mcast in matmul
- PR: #5759
- #4957: optimizing construct_2d_padded_tensor_list
- PR: #5614
- #4003: added ttnn.as_tensor and enabled support for caching torch tensors (see the sketch at the end of these notes)
- PR: #5809
- Revert "#0: Fix for fail in asinh backward"
- PR: #5886
- #5829: Use moreh_common.hpp for data movement kernels across moreh OPs
- PR: #5833
- Barsic/ttnn ops
- PR: #5892
- #6030: Update resnet performance metrics
- PR: #6030
- #5876: pytest & c++ test logging cleanup
- PR: #5987
- #0: Use both 2x2 and 2x4 machines on every scheduled run
- PR: #6091
- Add single core matmul benchmark
- PR: #5997
- #6079: Update FORCE_INLINE to be a no-op when watcher is enabled
- PR: #6092
- #5980: Fix a hard-coded bounds check in dprint
- PR: #6028
- #5389: merged ttl and ttnn tensor classes into one
- PR: #6051
- Initial Performance Model
- PR: #6025
- fix ci
- PR: #6089
- TTNN RN50 :: on the road to match perf with TTLIB version
- PR: #6046
- #4438: Optimized single-core fold op
- PR: #5999
- #5589: Add repeat-interleave and addcmul sweeps
- PR: #6102
- #6055: Add square backward support
- PR: #6071
- #6057: Add backward support for lgamma
- PR: #6059
- #6056: Add backward support for frac and trunc
- PR: #6065
- #6066: Add support for backward log sigmoid
- PR: #6069
- #6002: Add backward support for binary maximum
- PR: #6003
- Ngrujic/improve conversion to bfloat8b in sweeps
- PR: #6068
- #5829: Use moreh_common.hpp for compute kernels across moreh OPs
- PR: #6122
- #0: Remove post-commit label from multi device pipeline because it's not actually post commit
- PR: #6142
- Add pack l1 acc to resnet conv
- PR: #6054
- #6144: Skip 512x512 cross attn 2d upblock for now in nightly because it hangs
- PR: #6145
- #6061: Add tanhshrink, threshold, Unary EQ backward ops support
- PR: #6137
- Width Sharded Concat for Unet
- PR: #5776
- #5184: uncommenting various moreh test cases
- PR: #6143
- Fix compute kernel config arg for resnet50
- PR: #6147
- Nsmith/untilize unit test
- PR: #6105
- Revert "Revert "#5389: merged ttl and tensor classes into one""
- PR: #6158
- #4438: Do not use the new fold op in Resnet tests
- PR: #6153
- Remove corerangeset that does not work on wormhole
- PR: #6156
- #6129: Expose kernel config attrs and use 4 dst tiles for fp32 configs
- PR: #6134
- #5391: Add device perf
- PR: #5875
- #0: Use multiplier for wormhole b0 mulsi3
- PR: #6160
- #4003: removed ttnn.Tensor autoclass from tensor.rst
- PR: #6170
- TTNN MultiDevice Support
- PR: #6131
- build artifacts
- PR: #6111
- #4947: Add noc alignment checks to watcher
- PR: #5998
- Add ttnn multi-chip unit test for checking device shards
- PR: #6179
- Nsmith/fix unet
- PR: #6141
- #6043: Random program stress test of command queues
- PR: #6044
- Logit and logiteps backward support (see the reference sketch at the end of these notes)
- PR: #6016
- Backward support for log2
- PR: #6064
- Add missing ttnn tests and disable broken tests until issues are fixed
- PR: #6186
- Fix Events feature for FD1.3 (out-of-order event ids, events feature missing) #6093
- PR: #6181
- #5873: make top-level post commit workflow re-useable
- PR: #6188
- #5589: add groupnorm for ttnn sweeps
- PR: #6167
- Ngrujic/ttnn sweeps 4
- PR: #6135
- Add ethernet datamover (EDM) - a foundational ethernet transfer engine
- PR: #5718
- #6116: Add backward support for softshrink
- PR: #6118
- #0: Add verbose make logs to artifact and make nicer name on metal
- PR: #6199
- #0: Only use 2x4 setup for multi-card WH CI as 2x2 does not provide us good feedback
- PR: #6202
- #4809 dprint tensix regs
- PR: #6072
- #4003: fixed bloom perf test
- PR: #6208
- #6187: Conv bugfix
- PR: #6205
- #0: concat RM: support variable stick widths across inputs
- PR: #6207
- TTNN RN50 on WHB0
- PR: #6173
- #6084: Lower thresholds slightly after using proper configs for device resnet
- PR: #6214
- Fast dispatch 2.0 proof of concept
- PR: #6176
- #6218: add pytest for matmul 1d 2d
- PR: #6219
- #6177: use `is_tensor_storage_on_device` so it works for MultiDeviceStorage
- PR: #6178
- #6082: support workers + eth cores in one program
- PR: #6172
- #6215: Rename TensorToMeshMapper/MeshToTensorComposer
- PR: #6220
- #6164: Update test_noc_unicast_vs_multicast_to_single_core_latency to not use same cores for producer and consumer on WH
- PR: #6224
- #6117: Add backward support for softplus
- PR: #6128
- #6223: remove redundant call to context switch
- PR: #6225
- Integrate EDM with all-gather.
- PR: #6169
- #6136: Add backward support for unary LE and GE
- PR: #6138
- #5398: fix unicast binaries
- PR: #6231
- Barsic/ttnn ops 2
- PR: #6070
- #5380: Add wormhole_b0 model perf tests, only falcon7b in ttlib for now
- PR: #6216
- #5372: Updated README.md file for demo
- PR: #6060
- #4003: updated ttnn.concat to have a registered fallback
- PR: #6127
- Llama2 functional bringup
- PR: #6087
- #5589: Add working BFLOAT8_B sweeps to working folder
- PR: #6192
- FD2.0 rename HostQ->PrefetchQ, add multi-core capability, fix NOC coords
- PR: #6229
- #0: bugfix in ttnn resnet caught by nightly
- PR: #6251
- #0: fix tt_bisect build bug
- PR: #6256
- Watcher Asserts
- PR: #6175
- #6183: add unit test for sd matmul ops
- PR: #6246
- #6254: Make program cache per device (see the sketch at the end of these notes)
- PR: #6255
- #5394: Add functional version of Mamba architecture
- PR: #5948
- #6257: Add temporary convenience script for 800MHz / new eth reset dependent CI
- PR: #6258
- #5661: Enable gtests for fast dispatch + R chip
- PR: #6110
- Alex/metal/bmm large block untilize out
- PR: #6201
- #5389: made tensor attributes public and use ttnn::Shape instead of tt::tt_metal::Shape for storing shape
- PR: #6261
- Revert "#6183: add unit test for sd matmul ops"
- PR: #6278
- #4003: print all of the L1 buffers using ttnn.print_l1_buffer_state
- PR: #6268
- #4003: print all of the L1 buffers using ttnn.print_l1_buffers
- PR: #6279
- #4438: Implement sharded multi-core fold op for Resnet50
- PR: #6275
- #6149: disabled the check comparing the generated report with GOLDEN_L1_BUFFER_REPORT because on pipelines it looks different than when running locally
- PR: #6280
- FD2.0 fixes+mcast support for write and packed_write
- PR: #6263
- Shwetank tt/config
- PR: #5843
- #0: Change order of device and use_program_cache fixture in remaining pytests
- PR: #6269
- Softplus with beta and threshold param (see the reference sketch at the end of these notes)
- PR: #6239
- Build tests during artifact creation
- PR: #6286
- #6149: disabled test_print_l1_buffers_of_add_operation
- PR: #6299
- #4003: updated ttnn.to_torch to work with bfloat8_b tensors that are not a multiple of tile size without tile padding (see the sketch at the end of these notes)
- PR: #6277
- #0: add to/from L1 reshard test
- PR: #6309
- #0: Add back deleted shape assertions for interleaved concat
- PR: #6307
- test errors flagged by watcher
- PR: #6320
- #0: fix incremental build
- PR: #6103
- Merge xuncai/llama-attention-galaxy to main: First version of llama-attention galaxy on emulated chips
- PR: #6297
- #6329: Fixing a bug causing mismatch on indices
- PR: #6330
- #6321: Test which sweeps read/write buffer and just checks that the e…
- PR: #6322
- Support moreh_getitem forward
- PR: #6227
- #6125: Update in0_block_w to be full shard width for sharded 2D systolic matmul
- PR: #6262
- #6107: Add softsign, sign, unary ceil backward support
- PR: #6191
- #6226: Add backward support for div
- PR: #6235
- #6234: Add backward support for rdiv
- PR: #6238
- #6236: Add backward support for fmod and remainder
- PR: #6240
- #4003: added positional embeddings to bert and updated ttnn_sharded_optimized_bert to run with batch size of 12
- PR: #6327
- Indexed Fill
- PR: #6328
- #5589: remove dtype in gen function sweep tests where needed
- PR: #6249
- #6347: Print built-in defines once only
- PR: #6351
- #0: Add Mo as code owner on profiler code
- PR: #6352
- #0: Simplify tt_lib.scripts package by adding a specific tt_eager/scripts directory and putting the production scripts in there, whereas development scripts will stay in /scripts
- PR: #6324
- #0: Fixture reorder changes reverted for falcon_7b perf test
- PR: #6318
- #5424: remove metal_ckernel_sfpu
- PR: #5665
- #0: Update remaining tt_lib.program_cache calls to use device APIs
- PR: #6357
- #6183: add unit test for sd matmul ops
- PR: #6323
- #6289: fix dispatcher page calculation
- PR: #6340
- #5924: Enable unet on wormhole_b0 changes
- PR: #6198
- #6325: skip test_multi_device.py for grayskull arch
- PR: #6332
- Alex/metal/pack untilize no repack
- PR: #6371
- #6144: Not hanging on GS or WH with or without Watcher
- PR: #6373
- Agrebenisan/swq hwq cardinality cleanup
- PR: #6369
- #6146: Add backward support for conj
- PR: #6272
- #0: bug fix in UTWH: use div_up instead of div trunc when calculating CB sizes
- PR: #6367
- Fix To/From Sharded Bug
- PR: #6381
- #6206: Fix resharding page mapping
- PR: #6379
- #5733: ttnn/cpp: run_operation for multi-device
- PR: #6376
- #5589: TTNN - l1 loss sweep and unit tests
- PR: #6375
- Add Support to Allow Input Batch Offset for Update Cache when Users < 32
- PR: #6331
- Npetrovic/ttnn bin ops
- PR: #6045
- Use/dprint configuration registers
- PR: #6287
- #5629: Don't create new threads during `CompileProgram`, use tf to manage threadpool instead
- PR: #5860
- Revert "Npetrovic/ttnn bin ops"
- PR: #6399
- #6385: Update ttnn.create_sharded_memory_config to correctly determine shard shape for height/width sharding (see the sketch at the end of these notes)
- PR: #6386
- TestPrintEthCores fix
- PR: #6389
- #6266: Refactored Llama 2 MLP & attention
- PR: #6358
- Bteng/fdworkflow cleanup
- PR: #6337
- Initial perf model for WH
- PR: #6283
- #6363: Fix so remote does not try direct write to completion queue
- PR: #6398
- Add support for BFP4_b format
- PR: #6395
- #6378: Disable failing test for now
- PR: #6417
- fix alignment issue for indexed fill reading in batch_ids
- PR: #6409
- #4003: added register_pre_operation_hook and register_post_operation_hook
- PR: #6396
- #6349: Add missing asserts for concat op. Minor improvement to concat kernel setup code
- PR: #6401
- #0: remove printf
- PR: #6421
- add post-commit ttnn and model pipelines
- PR: #6413
- re-direct to same internal yaml from top-level fd, ttnn, or model workflows
- PR: #6431
- Bteng/ttnn model artifact dep
- PR: #6432
- #4003: remove inner ops from pre and post hooks
- PR: #6422
- #5163: Support optional output tensors in moreh groupnorm
- PR: #6407
- #6424: Split TestPrintEthCores into two kernels as workaround.
- PR: #6426
- Support moreh arange row major output
- PR: #6435
- #6284: Add backward support for imag and real
- PR: #6354
- #5163: Change are_needed_outputs -> are_required_outputs
- PR: #6438
- #5163: Update MorehGroupNormBackwardGammaBetaGrad
- PR: #6441
- Ngrujic/ttnn sweeps 1
- PR: #6393
- #0: fix clang build
- PR: #6408
- Update cache op optimizations
- PR: #6372
- #6281: Skip 2 Non-Deterministic failing Events tests for GS
- PR: #6455
- Asarje/ttnn rn50 wh bfp8
- PR: #6291
- #6453: Add watcher asserts to perform CB bounds checking
- PR: #6456
- #6313 Llama 2 Galaxy Decoder implementation
- PR: #6420
- #5733: ttnn multi-device cleanup memory management
- PR: #6434
- #6436: fix ttnn.to_layout() to correctly return RuntimeError
- PR: #6437
- #4957: split ttnn tests into 2 groups
- PR: #6457
- #4957: 3-way ttnn test split
- PR: #6461
- #6410: Encapsulate tensor attributes inside a shared_ptr
- PR: #6411
- #5589: TTNN mse loss sweeps
- PR: #6425
- #6363: observe max tensix slots in bidir tunneller
- PR: #6458
- #6075: add reshard support to the halo op
- PR: #6335
- updates to bring post-commit pipeline time to < 30 minutes
- PR: #6479
- #6123: Add support for backward mvlgamma
- PR: #6242
- #6390: L1 loss pcc issue
- PR: #6492
- #6040: enable bidirectional support for all-gather
- PR: #6416
- #6496: No longer gate upload release step on the frequent pipelines passing, and just let them run for convenience
- PR: #6497
- TTNN sweeps: binary ops and fixes
- PR: #6483
- #0: Tag name for eager - Package workflow, which is the impl of the main version, with appropriate qualifiers to avoid confusing people
- PR: #6498
- fix for WH
- PR: #6485
- #6414: Ensure we run single and multicore/multi device sfpu tests. Lo…
- PR: #6507
- FD2.0 CQ_DISPATCH_CMD_WRITE_PAGED initial implementation and tests
- PR: #6486
- #6510: Support enqueue write-only and read-only tests
- PR: #6511
- integrate fd multiqueue post commit into post commit
- PR: #6230
- #6513: move multi-device files under tt-metal/impl/device
- PR: #6514
- #0: ttnn-falcon: add packer_l1_acc to MLP module
- PR: #6515
- Add new frequent pipeline for multi nebula CI
- PR: #6512
- Non-zero indices op
- PR: #6473
- Add native repeat op and RM concat
- PR: #6490
- Add llama2_70b into multi-nebula frequent ci pipeline
- PR: #6521
- #6493: update backward softplus with beta and threshold param
- PR: #6380
- Jrock/falcon op tests
- PR: #6509
- Jrock/falcon40b utility test update
- PR: #6506
- Ngrujic/debug yaml based sweep tests
- PR: #6446
- #6241: Prefill on 8 chips
- PR: #6502
- #6503: Llama 2 Refactor All Test files, allow repro on any device
- PR: #6504
- #5480: Fix memory address hack in FD2 test
- PR: #6554
- #5592: Interleaved2ShardedPartialOp, Sharded2InterleavedPartialOp, Matmul1d height sharding + padding fixes
- PR: #6508
- #0: Modify Bert Large Perf test to delete intermediates at the end of each iteration
- PR: #6558
- Alex/metal/max pool dm perf
- PR: #6552
- #6524: clean up the to/from_device_mesh functions
- PR: #6525
- #5075: Watcher pause feature initial implementation
- PR: #6339
- #6562: Fix ttnn falcon7b by using arch-specific ComputeKernelConfig
- PR: #6564
- #6374: Fix to ensure that we never get an odd number of pages in our …
- PR: #6557
- Aliu/erisc launch msg
- PR: #6523
- #0: Remove temporary frequent pipeline api tests as that was meant to be a temporary stop gap for people wanting to add T3K tests until we got real CI for it
- PR: #6555
- #0: Delete llama_old models and their tests because we have no need for them anymore in light of WH-only T3K llama
- PR: #6561
- #4584: Demo file for functional whisper
- PR: #6489
- Ngrujic/ttnn sweeps
- PR: #6494
- Silu op for Sharded layout
- PR: #6459
- moreh getitem supports tilized input and row-major index
- PR: #6447
- #6568: Add lm-evaluation-harness support for Mamba reference model
- PR: #6569
- Barsic/ttnn ops 3
- PR: #6500
- Alex/metal/max pool remove init
- PR: #6600
- #0: Fix Falcon40B tests for CI
- PR: #6573
- FD2 test fixes
- PR: #6604
- #6450: compile fix for main
- PR: #6608
- #6377: Split perf models pipeline by arch and model collection type, as we need very specific ownership of models for Javelin
- PR: #6606
- #6577: Use CreateSemaphore api rather than hardcoded addresses in leg…
- PR: #6603
- #5733: fix multi-device to_host call
- PR: #6584
- #6472: reduce outstanding issue cmds
- PR: #6615
- #5917: Add test coverage for watcher kernel_id reporting
- PR: #6582
- Unet Concat Optimization
- PR: #6478
- #0: Properly declare the ttnn pybind dependency files for Make, as the previous one was trying to find them in the src directories, when they were really in the build
- PR: #6621
- Fast Dispatch on Idle Ethernet Core
- PR: #5919
- reduce timeout for post-commit pipelines to 45 minutes
- PR: #6624
- #6462: Upsample kernel opt
- PR: #6484
- #3766: Various fixes for Ubuntu 22.04 / Python 3.10
- PR: #6625
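🧪 Usage sketches
The sketches below illustrate a handful of the user-facing changes referenced above.

For #4003 (`ttnn.as_tensor` with torch-tensor caching): a minimal sketch, assuming the `cache_file_name` keyword is what enables the on-disk cache; the surrounding device calls follow the standard ttnn Python API.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

torch_weight = torch.randn(1024, 1024)

# On the first call the converted tensor is (assumed to be) serialized under
# the cache file name; later calls load it from disk instead of re-converting.
weight = ttnn.as_tensor(
    torch_weight,
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    cache_file_name="weights_cache/linear_weight",
)

ttnn.close_device(device)
```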
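For "Logit and logiteps backward support": the underlying math, as a plain torch reference rather than the ttnn kernel. Since logit(x) = log(x / (1 - x)), its derivative is 1 / (x * (1 - x)); for logiteps the input is first clamped to [eps, 1 - eps]. Zeroing the gradient outside the clamp range follows PyTorch's convention and is an assumption about the ttnn behavior.

```python
import torch

def logit_backward_reference(grad_output, x, eps=None):
    # d/dx log(x / (1 - x)) = 1 / (x * (1 - x))
    if eps is None:
        return grad_output / (x * (1.0 - x))
    # logiteps: clamp the input, and (assumed, per PyTorch's convention)
    # zero the gradient where the input fell outside [eps, 1 - eps].
    in_range = (x >= eps) & (x <= 1.0 - eps)
    xc = x.clamp(eps, 1.0 - eps)
    return torch.where(in_range, grad_output / (xc * (1.0 - xc)), torch.zeros_like(x))
```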
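For #6254 (program cache made per device): a sketch of the device-level API these notes migrate to; the method names `enable_program_cache` and `num_program_cache_entries` are assumptions based on that migration.

```python
import ttnn

device = ttnn.open_device(device_id=0)
device.enable_program_cache()  # caching is now scoped to this device only

# ... run ops here; repeated ops reuse compiled programs from this device's cache ...

print(device.num_program_cache_entries())  # inspect this device's cache
ttnn.close_device(device)
```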
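For "Softplus with beta and threshold param" (and the matching backward update in #6493): a reference sketch of the semantics, which match the standard torch.nn.functional.softplus definition; this is plain torch, not the ttnn kernels.

```python
import torch

def softplus_reference(x, beta=1.0, threshold=20.0):
    # softplus(x) = (1 / beta) * log(1 + exp(beta * x)); pass x through
    # unchanged where beta * x > threshold to avoid overflow in exp().
    return torch.where(x * beta > threshold, x, torch.log1p(torch.exp(beta * x)) / beta)

def softplus_backward_reference(grad_output, x, beta=1.0, threshold=20.0):
    # d/dx softplus = sigmoid(beta * x), and 1 in the linear region.
    grad = torch.where(x * beta > threshold, torch.ones_like(x), torch.sigmoid(beta * x))
    return grad_output * grad
```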
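For #4003 (`ttnn.to_torch` with non-tile-aligned bfloat8_b tensors): a round-trip sketch. bfloat8_b requires TILE_LAYOUT, so shapes that are not multiples of the 32x32 tile get padded on device; per the fix, `ttnn.to_torch` should strip that padding again.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

x = torch.randn(1, 1, 30, 62)  # deliberately not tile-aligned

tt_x = ttnn.from_torch(x, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)
y = ttnn.to_torch(tt_x)

assert y.shape == x.shape  # tile padding stripped on the way back to torch
ttnn.close_device(device)
```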
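For #6385 (`ttnn.create_sharded_memory_config` shard-shape fix): a sketch of what the corrected derivation implies; the argument names follow the ttnn Python API as commonly documented and should be treated as assumptions.

```python
import ttnn

mem_config = ttnn.create_sharded_memory_config(
    shape=(1, 1, 1024, 256),
    core_grid=ttnn.CoreGrid(y=8, x=4),
    strategy=ttnn.ShardStrategy.HEIGHT,
    orientation=ttnn.ShardOrientation.ROW_MAJOR,
)
# HEIGHT sharding splits rows across all 8 * 4 = 32 cores, so each shard
# should come out as (1024 / 32) x 256 = 32 x 256 rather than a shape
# derived from a single grid dimension.
```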