Releases: tenstorrent/tt-metal

v0.52.0

25 Sep 16:13
5448c47

Note

This is a verified, real release; however, the release notes are still under construction. Thank you for your understanding.

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the versions on the main branch. There may be differences between the latest main and the previous release.
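
If you have installed a release build (for example, from the Python wheel attached to the release assets), a quick way to confirm the installation is to open and close a device from Python. The snippet below is a minimal sketch, assuming the `ttnn` package from the release is installed and a Tenstorrent device is visible to the driver.

```python
# Minimal post-install smoke test (illustrative sketch).
# Assumes the ttnn package from the release wheel is installed and a device is attached.
import ttnn

device = ttnn.open_device(device_id=0)  # open the first available device
print("Opened device:", device)
ttnn.close_device(device)               # release the device when done
```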

The changelog follows, showing the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/11036234439

📦 Uncategorized

  • #12323: delete ctor of AllGatherFusedOpSignaler
  • #0: Revert "#0: Update to gcc-12.x (#12332)"
  • #12448: Update 1d matmul sweep test to use CoreRangeSet for core range parameters
  • #0: [skip ci] Fix demo invocation in Llama README
  • #0: Update device creations functions to use num_command_queues instead of num_hw_cqs to match mesh_device creation functions
  • #12273: Move full wheel build on GitHub runners and 22.04 to scheduled job and fix related issues
  • Update perf and latest features for llm models (Sept 11)
  • #12532: Change sweep new vector checking to use the serialized vector…
  • #12451: add negative ends support for slice with list splicing format
  • fix llama t3k demo invoke in CI
  • Yugao/doc
  • #10855: Add single-device perf measurements to sweep infra
  • #9340: Add optional output tensor support for assign
  • #0: Add ccl multichip stack overview
  • #12371: Migrate moreh_getitem operation from tt_eager to ttnn
  • #11651: Remove type_caster
  • #12375: Add qid and optional tensor output to ttnn.gelu_bw
  • #8865: Optimized ttnn.bcast dispatch times
  • #12196: Use split readers wherever possible in UNet Shallow
  • Replace exact output match with tight pcc check in post-commit
  • #12148: Add queue_id and optional output tensors to ttnn.mul_bw
  • Fix start_pos in get_rot_mat() in llama galaxy model
  • Yieldthought/llama31 8b/ttembed
  • #8865: Fix non working ops in dispatch profiling infra
  • #0: Remove myself from tt_lib/csrc codeowners
  • Update workload theoretical ethernet numbers
  • #12524: Update fmt and unify logging API
  • #0: Update fmt and unify logging API
  • #11133: Improve various things about the wheel, including removal of patchelf and linking runtime assets to cwd
  • Support for initializing with 0s for SUM reduction WHB0
  • #12376: Support for non-32 Height in Width Sharded Conv2d
  • #0: Optimize context switch decision
  • #0: Correct #!/bin script headers
  • #12538: Separate out wheel tests from build so that other wheel-dependent jobs aren't blocked by the wheel smoke tests
  • #0: Create Blackhole Bring-Up Programming Guide
  • #12552: Fix indentation in pybind files
  • #0: Add FD nightly single-card pipeline to data pipeline
  • #0: [skip_ci] Updating BH bring-up programming guide
  • Update owner of T3K ttnn unit tests
  • #0: change default reduce scatter num buffers per channel to 2
  • #12436: port moreh_sum from tt_dnn to ttnn
  • #12026: add permute sweep tests for trace
  • #12514: port moreh_mean and moreh_mean_backward from tt_dnn to ttnn
  • #12207: Port moreh_dot to ttnn
  • #12259: Move moreh dot backward
  • #12164: Add queue_id and optional output tensors to backward ops
  • #12439: Migrate moreh_nll_loss_bwd operations (reduced and unreduced) from tt_eager to ttnn
  • #12578: Update Mixtral t/s/u in README
  • #12373: Add queue_id and optional output tensors to rsqrt_bw op
  • remove todos from doc
  • add code language formatting to CclDeveloperGuide.md
  • #0: Update multi-chip Resnet perf numbers after dispatch optimizations
  • #0: Remove unused _init, _fini
  • #0: remove unused variable
  • Contiguous pages support in Reduce Scatter read/write
  • #12628: Resolve arithmetic error in test_multi_cq_multi_dev causing T3K multi-CQ tests to fail
  • #12619: Update matmul sweep timeout and core range set usage
  • Run custom dispatch commands on in-service runners only
  • #12544: support wide channels (> 256) in maxpool
  • #12605: Implement recommendations for Llama readme
  • #0: Point UMD back to main instead of metal-main
  • #0: ViT Trace+2CQ implementation
  • #0: Add BH to custom test dispatch workflow
  • Update ViT on GS perf
  • LLama selfout specific optimizations for fused all_gather_matmul op
  • #12520: Adding noc_async_writes_flushed between mcast writes and mcast semaphore sets for BH
  • #11144: Upgrade pip version to 21.2.4 to get around 22.04 import error
  • Remove duplicate from sfpu_split_includes.h
  • #12250: port moreh_matmul from tt_dnn to ttnn
  • #12297: Add queue_id and optional output tensors to add_bw op
  • #12392: Use shallow convolution in upblock3 of UNet Shallow
  • #0: Make CoreRangeSet thread safe
  • mm_sfence->tt_driver_atomics::sfence();
  • [New Op] Added dropout unary op
  • #12392: Shallow conv UNet unit tests
  • Pkeller/memmap profiler
  • #0: Set WH_ARCH_YAML only if we have a wormhole machine
  • All gather expose params
  • Generalize nlp create head decode
  • #0: Remove CCL stalls, since Fabric VC support is merged
  • #0: Remove incorrect norelax option
  • #12668: SWOC bugfix
  • Fix start pos in get_rot_mat
  • #0: Remove unused CRT_START label
  • #12701: Split nightly tests into specific models for better reading
  • #0: Relax host bound tg threshold for Resnet
  • Rename tt::tt_metal::Shape to LegacyShape to not conflict with TTNN
  • #12374: Add optional output tensor support for ttnn.full_like
  • YoloV4 pipeline update
  • #12425: Add queue_id and optional output tensors to zeros_like
  • #12497: ttnn.empty to use create_device_tensor
  • #12266: Cleanup ternary backward
  • #0: Use absolute addressing in startup
  • #12595: Run profiler gather after every sweep test regardless of status
  • #12730: bert slice support unit tests
  • Reduce scatter perf sweep
  • #12778: Speed up sweeps parameter generation
  • #0: DPrint bugfix for which dispatch cores are included in 'all'
  • #12730: bert slice support unit tests correction
  • #5783: Remove watcher dependency on generated headers
  • #0: Update GS Resnet perf thresholds. Seeing large variation in CI
  • Fix issue w/ CBs getting allocated on ETH cores
  • #12802: add tracy option to build_metal.sh
  • #12748: Cleanup clamp_bw op
  • #12224: Add optional output tensor support for lt_bw
  • #12387: Workaround to_layout for height sharded tensor
  • #12196: Use split_reader and act db
  • #12508: Skip failing test in CI
  • #11512: Add frac, ceil and trunc sweeps
  • #0: Don't overwrite CMake flags in build_metal.sh
  • Add subtract, subalpha and rsub sweeps, interleaved
  • Llama tg/sharded ccls
  • Update peak dram speed to 288GB/s
  • #11169: Watcher to report if eth link retraining occurred during teardown
  • #0: adding jaykru-tt as codeowner for data_movement operations
  • Mamba CI hanging on Untilize fix
  • #12749: Update Test files
  • #12799: Add handling for pytest errors, especially those at the beginning, and expose their messages
  • #12529: Update comment of dataflow api for mcast loopback functions
  • Fix failure in llama perf on CI
  • fix typo - mention higher level multichip API above CCL ops
  • Add Mamba unit tests to post-commit test suite
  • #12529: Add check for in0_mcast_num_cores=1 for noc_async_write_multicast_loopback_src
  • #0: Change all ops which support page_table to enable non-log2 shapes
  • #12198: Add 2CQ and trace support for UNet Shallow
  • Add support/examples for placing Reads and Writes on CQ1
  • #9370: Workaround: replace WRCFG with RMWCIB instructions in reduce_revert_delta
  • Remove UNet from landing page
  • #12750: Replace zeros_like with empty_like in backward ops
  • #12840: Add more handling more multiple attempts by restricting the space of github_job_ids we're looking to only the ones in the workflow run attempt in questi...
Read more

v0.51.0

27 Aug 15:10
e6f4c70

Note

If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the versions on the main branch. There may be differences between the latest main and the previous release.

The changelog follows, showing the changes since the last release.

This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/10580177689

Demo models and their metrics

Grayskull (GS) Models

| Model | Batch | Target throughput |
|-------|-------|-------------------|
| ResNet-50 (fps) | 20 | 10,000 |
| BERT-Large (sen/s) | 12 | 410 |
| Falcon7B-decode (t/s) | 32 | 140 |
| ViT (fps) | 8 | 2000 |
| T5 small (sen/s) | | |
| Bloom (sen/s) | | |
| U-Net | | coming soon |

[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.

[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.

Wormhole (WH) Models

Note

All model demos in this table function on both N150 and N300 Wormhole cards, unless otherwise stated.

Furthermore, all performance numbers here are measured on, or based on, an N300 Wormhole card.

| Model | Gen. Token [3] | Batch | Time to first token [4] | Target throughput |
|-------|----------------|-------|-------------------------|-------------------|
| Falcon7B | 129th | 32 | 0.08 s | 26 |
| Mistral-7B | 129th | 32 | coming soon | 25 |
| Mamba-2.8B | any | 32 | 0.04 s | 41 |
| LLaMA-3.1-8B | 129th | 8 | coming soon | 23 |
| BERT-Large (sen/s) [5] | - | 8 | - | 400 |
| Stable Diffusion 1.4 512x512 (sec/img) [6] | - | 1 | - | 3 |
| ResNet-50 (fps) | - | 16 | - | 7,000 |

[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.

[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.

[3] - Generating the i'th token in a sequence while the kv_cache is filled with i-1 rows.

[4] - Time to fill the kv_cache and generate the first output token (1st user).

[5] - This model demo does not work on N150. It does work on N300.

[6] - This model demo does not work on N300. It does work on N150.
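
Taken together, footnotes [3] and [4] let you estimate end-to-end generation latency for one user from the two reported quantities: the first token arrives after the prefill (time to first token), and each subsequent token streams at the token-to-token decode rate. The snippet below is a minimal sketch of that arithmetic only, using the Falcon7B numbers from the table above (0.08 s time to first token, 26 t/s/u target decode throughput).

```python
# Illustrative arithmetic only: estimate per-user wall-clock time to generate N tokens
# from time-to-first-token (footnote [4]) and token-to-token decode throughput.
TIME_TO_FIRST_TOKEN_S = 0.08   # Falcon7B on N300, from the table above
DECODE_THROUGHPUT_TSU = 26     # target tokens/s/user (token-to-token decode)

def estimated_generation_time_s(num_tokens: int) -> float:
    """First token comes from prefill; the remaining tokens stream at the decode rate."""
    return TIME_TO_FIRST_TOKEN_S + (num_tokens - 1) / DECODE_THROUGHPUT_TSU

# Roughly how long one user waits to reach the 129th token (the table's Gen. Token column).
print(f"~{estimated_generation_time_s(129):.2f} s")
```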

TT-QuietBox & TT-LoudBox (2x4 mesh of WHs) Models

| Model | Technique | Gen. Token [3] | Batch | Target throughput |
|-------|-----------|----------------|-------|-------------------|
| Falcon7B | Data Parallel | 129th | 256 | 26 t/s/u |
| LLaMA-2-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| LLaMA-3.1-70B | Tensor Parallel | 129th | 32 | 20 t/s/u |
| Falcon40B | Tensor Parallel | 129th | 32 | 36 t/s/u |
| Mixtral7Bx8 | Tensor Parallel | 129th | 32 | 33 t/s/u |
| ResNet-50 (fps) | Data Parallel | - | 128 | 56,000 |

Single Galaxy (8x4 mesh of WHs) Models

| Model | Last verified release | Technique | Gen. Token [3] | Batch | Time to first token [4] | End-to-end throughput [1] | Device throughput [2] | Target throughput |
|-------|-----------------------|-----------|----------------|-------|-------------------------|---------------------------|------------------------|-------------------|
| Falcon7B | v0.51.0-rc30 | Data Parallel | 129th | 1024 | 0.30 s | 4.0 t/s/u - 4096 t/s | 17.7 t/s/u - 18125 t/s | 26 t/s/u |
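
The per-user and aggregate figures in the row above are related by the batch size: with data parallelism, aggregate tokens/s is tokens/s/user multiplied by the number of concurrent users. The sketch below is illustrative only, reproducing the Falcon7B single-Galaxy numbers from the table.

```python
# Illustrative conversion between per-user and aggregate decode throughput,
# using the Falcon7B single-Galaxy row from the table above.
batch = 1024        # concurrent users (data parallel)
e2e_tsu = 4.0       # end-to-end tokens/s/user [1]
device_tsu = 17.7   # device-only tokens/s/user [2]

print(f"End-to-end aggregate:  {e2e_tsu * batch:.0f} t/s")     # 4096 t/s, matches the table
print(f"Device-only aggregate: {device_tsu * batch:.0f} t/s")  # ~18125 t/s, matches the table
```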

📦 Uncategorized

  • #10600: renamed execute_on_main_thread to operator()
  • #0: refactor ttnn device operation code and program cache
  • #11112: Add forward support for Relational Inplace Ops
  • Update Synchronize api to barrier for missing transactions
  • Move sliding_window to TTNN
  • Update CODEOWNERS of SD tests
  • #11283: Remove old Stable Diffusion implementation and its tests
  • #5383: [Falcon7b] Remove per-token printing in single-card ci demo tests
  • #11349: Add missing include in kernel_types.hpp
  • #10119: move fold op to ttnn infra
  • Bump up TRISC0 stack size
  • #11089: Fix ttnn.line_all_gather(..) to work with async
  • Add best practices for error messages
  • #0: updated mistral readme to reflect batching changes
  • #0: Target specific test file in model perf for ttnn resnet to avoid import conflicts
  • #11280: Enable sharded buffer l1 read/writes test on BH
  • Update CODEOWNERS
  • Add new items to best_practices.md
  • #9322: Remove lamb_optimizer op
  • #10550: Enable remote chip routing before profiler init
  • #11333: Resolve hang with Trace and R-Chip Event Synchronization
  • #10117: Migrate fast_reduce_nc op to ttnn
  • #11389: Add a cloud preset to allow easy connection to the tt-cloud elasticsearch instance
  • Update perf and latest features for llm models (Aug 12)
  • #0: Update watcher noc_sanitize to internally specify noc_id
  • Fix rounding in recip causing pcc issues in models
  • update build_metal.sh to trigger cmake test target
  • #9322: Remove unused bindings
  • Migrate Sharded Partial from TTL to TTNN
  • Move all NLP TMs into experimental/transformers, reorganize the folder, and delete the assorted ttlib bindings
  • #10360: Cut down on build time by targeting tests target directly
  • #11042: Overload complex fw ops
  • #0: remove decorate_as_composite
  • #11346: Replace tt_lib usage in eltwise backward
  • Add sweeps for complex bw_ops: polar, recip, add, mul
  • #0: Add initial t3000 nightly pipeline
  • #11038: Clean up more runner labels for single card
  • [CLEANUP] Remove old unused Mistral code inside models/experimental
  • #5424: GELU and GELU' API calls submodule LLKs
  • #5424: GELU and GELU' API calls submodule LLKs
  • #5424: GELU and GELU' API calls submodule LLKs
  • #10127: Move reduce op from tt_lib to ttnn part 1
  • #0: Recommend noc_async_write_flushed() on examples
  • #0: added llama3-tg nightly demo test
  • #0: re-add install step at end of build_metal.sh
  • #0: update fold call to new ttnn
  • #0: Fix watcher sanitization for NOC1
  • Implementing all_gather to datacopy signaling
  • #11322: Fix UNet functional and performance demo crash
  • #9992: Compute-engine add example DRAM NOC fix for WH n300
  • Ccl/revert datacopy
  • Fixed default arguments for repacking llama3
  • #11443: Updated Mistral7B reference
  • #11241: Replace tt_lib in models/demos/bert and falcon7b_common
  • #7494: Added unit tests to verify that values to semaphores and circular buffers are being correctly written out when core range sets are used
  • #10612: Unit tests for Galaxy cluster
  • #11469: Run ci/cd upload only on main if workflow_run
  • Fix elt_...
Read more

v0.50.0

10 Jul 22:04
f7c10a2

📦 Uncategorized

  • Fix issue with Mamba SSM A weight preprocessing
  • Make build key unique for mmio and remote devices with same harvest mask
  • #5337: Removed eth_dispatch yaml flag from mistral tests
  • New workflow for custom test dispatch on CI runners
  • #9312: Add single-header boost-ext/reflect library as dependency
  • Opt LayerNorm/RMSNorm with 2D reduce
  • Revert "#8630: support uint8 data type"
  • #0: Fix codeowners for metal bert
  • Revert "Revert "#8630: support uint8 data type""
  • #9642: fix matmul2d in1 sharded with batch>1
  • #0: add tile layout support for GN
  • FD2 packed binary commands
  • #9082: t3k demo with slack notifications for owners. split jobs
  • Rtawfik/issue 9142
  • #9688: Remove redundant left shift in DEBUG_SANITIZE_NOC_READ_TRANSACTION_FROM_STATE
  • #9500: Update eth_interface include in tt_cluster to not be hardcoded for WH
  • #9578: Add WITH_PYTHON_BINDINGS option to allow build w/o python
  • #9587: Update CB and worker Go signals to respect max sub cmd limit introduced by dispatch packed write local copy change
  • Add support for bfloat4 weights in Mamba
  • Use in-place binary operations in Mamba block
  • #5337: Relaxed Mistral expected compilation time in CI by 1 sec
  • Mo/9406 profiler build flags
  • Add support for single col/row/core output grid for matmul 2D
  • #9725: Set release candidate releases on GitHub to pre-release, not draft, to enable downstream users
  • add tagged docker image with releases
  • Rtawfik/issue 9164
  • #5562: resolve reduce scatter issues (nd hang and correctness)
  • Create benchmarking tools for saving run/measurement data (with Falcon7b example) and model-demo utilities for verifying tokens/perf
  • #0: Fix bug with var name in single-chip falcon7b demo tests
  • #9735: fix issues with including reflect library
  • #9527: Remove usage of bcast where multiply is used
  • Mchiou/9082 slack notification owners
  • #9681: set name attribute for ttnn operations when fast runtime m…
  • #9553: Add prefix scan op for Mamba prefill
  • #9628: Merge Binary backward ops from tt_eager to TTNN
  • Namhyeong kim/support fp32 dest acc in moreh adam
  • #0: Update t3k workflow timeouts (except freq pipeline)
  • Temporary update Mixtral perf times to pass CI
  • #9479: fix cpu core worker bug
  • #4858: add typecast fp32 <-> int32
  • #0: ViT demo fix
  • #9389: Add support for integer type in sum operation
  • Transfer llama2/3 from experimental to demo folder.
  • #9657: add topk multicore to support larger dimension sizes
  • #4858: add typecast bfp8_b
  • #9082: t3k model perf split tests with slack notifications, disabled cnn
  • #0: Add ttnn/cpp to packages to enable using ttnn kernels in tt_eager ops
  • #9741: Set stricter pytest timeouts
  • #9492: Change models matmul usage to ttnn
  • #9778: test prefetcher hanging with changes to test
  • #9490: TTNN eltwise/unary migration
  • Update timeout for falcon40b t3k demo test
  • #0: Remove extra t3k falcon40b matrix test group
  • #9044: Move dispatch core x y to be part of launch msg
  • Modify rot mat each iteration to avoid allocating 10k tensors upfront
  • Optimize bcast sharded op
  • Start using reflect library
  • #0: Properly delete source folders for wheel testing
  • #9479: Update Mixtral perf estimates
  • #0: Added github community issue workflow
  • #8729: Pytest multiprocess reset infrastructure
  • Enable switching between 1 and 2 cqs in the same process
  • Fixed failing tests for SD Conv tests for WH using new conv
  • #0: Switch org-membership check to an authenticated call
  • #0: Decrease num loops in trace stress tests
  • #9628: Support optional return tensor
  • #0: Use CV to wait for cq_reader in production mode. Remove enqueue_record_event for NB calls
  • #9628: Merge second set of binary backward op from tt_eager to TTNN
  • #0: Bump bert compile time threshold since it's been intermittently failing on ci
  • Mchiou/9792 t3k runner management
  • #0: Bump up Bert inference time due to instability on ci
  • #8865: For host dispatch time measuring, increase failing reference t…
  • #9484: Add output_tensor queue_id to dependency ops
  • Adding the new op: Flash Decode!
  • #0: Add missing permissions to issue notification job
  • #9275: Fix Falcon7b demo failing to run by default on a Grayskull e75
  • #9801: Account for 64B BH PCIe alignment in cq cmd sizing
  • #0: Make prefetcher early exit after fetching/reading exec_buf
  • #8683: Add Unary bitwise AND, OR
  • Ngrujic/profiling
  • #9628: Merge third set of binary backward op from tt_eager to TTNN
  • #4858: add typecast uint32
  • Migrate Pad Host Code, Bindings, C++ Usages from TT Eager to TTNN
  • Support longer sequence lengths in ssm_prefix_scan
  • #9709: Add optional transpose_a and transpose_b to ttnn matmul and linear
  • #0: Only run batch 12 bert for GS profiling and tighten some bert/resnet thresholds
  • Asarje/resnet highres 20240624
  • #9492: replace falcon specific matmul calls
  • Extend ssm_eltwise_mul for num_users > 32
  • Update documentation for adding new ttnn operation
  • Extend ssm_1d_reduce for the batch>32
  • #0: rn50 fix add api
  • #9123: Add support for optional output tensors to run in the worker t…
  • #9861: support check_tensor helper_function
  • Fix syntax issues in custom test dispatch workflow
  • Add Mixtral accuracy tests and cleanup its other tests (CI-friendly)
  • #9876: Increase timeout on falcon7b perplexity tests.
  • #9492: Remove bmm/resnet_matmul from models
  • #9410: enable fp32 precision unpacking for interm. CBs
  • #9903: Fix conditional statements and indexing of y values in CoreRange::diff
  • #9860: fix test create device apis
  • #0: delete unused code
  • #9719: fixed l1 clear issue on nlp create qkv heads decode test case
  • Fixing typo in llama demo readme
  • #9892: Device only op report
  • #8704: define consts for registers that hold x-y coordinates and amount to shift address to get x-y coord
  • CODEOWNERS update
  • Abhullar/bh misc fix
  • Auto-register C++ ttnn operations in python
  • #9788: Remove TopK from TTLib and replace all references with the TTNN api
  • #0: add owners for resnet demo
  • 7-way split of eager tests
  • #9910: Improve Softplus kernel accuracy
  • #9818: Add cache check to op info V2
  • #0: update noc test bound
  • Fix branching bug in softplus kernel
  • propagate error upwards for tests in falcon 40b suite
  • #0: Fix falcon40b softmax import failure
  • #9755: move ttnn.concat to match the new file structure
  • #9837: Assign workers after performing ref count cleanup in async mode
  • #0: Make event_synchronize API safer
  • #0: Update buffer asserts to account for trace buffers
  • Clean up ttnn operation registration on python side
  • #9164: [Blackhole bringup] Add fix for unpack untilize
  • Aliu/no l1 clear
  • Restructure ttnn::permute to match the new standard format
  • #9815: Update host to pass packed write max unicast sub cmds to cq dispatch
  • Distributed layernorm op
  • #9831: re-enable test
  • #8835: cleaned up ttnn operation registration on C++ side
  • #9941: update dram/l1 to noc xy header to do the appropriate shift
  • #9336: Refactoring moreh layernorm
  • #9745: move unpad to slice ttnn cpp references
  • #9980: Update falcon updated outputs
  • Fix Main after Pad Merge
  • Update eltwise bcast unary ops to use memory_config and fix PCC issue for interleaved output
  • Update FD cmds to be PCIe aligned
  • Fix N150 product name to nebula_x1 even if it's unharvested.
  • #0: add a second codeowner for conv
  • #0: Get tt-metal to compile with gcc-12
  • #9492: Change to ttnn matmul in tests and tt_eager
  • #9441: add typecast uint16->uint32
  • Move ttnn::embedding to match new pybind structure and replace C++ ttlib embeddings usage with it
Read more

v0.49.0

12 Jun 14:05

📦 Uncategorized

  • #5044: Add optional output to addalpha
  • #9059: Fix matmul for single core grid
  • readme update
  • #0: (MINOR) Update to v0.49.0
  • #7586: Move common models for single-card nightly to ln model
  • Update Mamba README
  • TTLIB interval to sharded sweeps
  • #0: Update dataflow api comments
  • #9196: Merge new op: Fast reduce nc into main
  • #0: New resnet50 test skipped on WH since its WIP
  • #9329: Restructure ttnn::argmax
  • #9323: Introduce template for new ttnn pull requests
  • #0: skip release build on GH runners, we already test it via build a…
  • Remove unused dependencies and fetch gtest via CPM
  • #8764: Part 3 of docs and model demos changes
  • Ngrujic/profiling
  • [Mistral-7B] Add flags for weight paths
  • Typecast int32->fp16b
  • #9258: Remove ARCH_NAME and TT_METAL_ENV from wheel testing
  • Implemented SD using new Conv API
  • #9258: Re-add wheel into release assets
  • #9361: Install Clang-17 and gdb 14.2
  • #7525: Re-skip demo batch 7 metal_BERT_large_11 on WH because it still hangs ND
  • #9206: add sfpu config reg init to llk sfpu inits
  • #9059: Avoid a couple of fatals in matmul
  • Add Galaxy support.

v0.48.0

10 Jun 18:09

📦 Uncategorized

  • #7744: Add support for non-4D tensor in moreh_sum, moreh_sum_backward
  • #5544: Add output tensors parameter to moreh_nll_loss op
  • #5544: Add output tensors parameter to moreh_sgd op
  • #5544: Fix package build error
  • #5544: Add output tensors parameter to moreh_linear op
  • #5544: Prevent eager unit test failures
  • #7997: Support non-4D tensor in moreh_softmax
  • #7816: Bump SD perf target
  • #8098: Remove temp buffer copying when reading from hugepage to host buffer
  • #0: Specify DEBUG_STATUS as a string literal instead of multiple chars
  • #8212: Fix uneven shards for interleaved_to_sharded op
  • #0: Refactor unpad tile to modify rt args in place and remove dynamic…
  • #7838: Add support for non-4D tensor in moreh_linear OPs
  • #0: Use split_work_for_tilize in both tilize and untilize
  • #8131: resnet-50 fix for b20.
  • Add support for multiple parameters in EltwiseUnary
  • #7625: Enable multicore for tilize with padding by default
  • Trace Support
  • #0: Switch set runtime args assertion for if kernel was placed on core to TT_ASSERT
  • #7179: enabling test case. The issue was not reproducible on 8.12 dri…
  • #4625: Multicore runs for untilize with unpadding on interleaved tensors
  • #0: Cache program cmds, convert cb configs from write linear to write packed
  • #0: Make skip and xfail optional in defining sweep tests
  • Shwetank tt/bcast op
  • #8364: Disable implicit fallback for ttnn.pad
  • #8513: Add slack notifications to several more pipelines
  • #0: Update common RT args to use no stride flag for packed cmd.
  • #0: Option to write compile_commands.json from CMake
  • #8718: eltwise testing for bfloat8
  • Add support for bfloat8 input tensors in Mamba SSM block custom kernels
  • #8460: Enable Clang-17
  • #0: Remove overhead in calling functions wrapped in tensor_impl_wrapper
  • #0: Updating the perf threshold to incorporate the "Merge back uneven reshard" commit.
  • #6365: Add ttnn host tests
  • #6365: Revert "#6365: Add ttnn host tests (#8210)"
  • #4382: fix GH reported vulnerabilities
  • #0: bump C++ timeout limit to 45 minutes
  • update unpad doc for slice generality
  • Convert Falcon7b tt_lib ops and tensors to ttnn.experimental
  • #6365: Fix ttnn host wheel tests
  • Add git bisect script
  • #0: Move falcon40b ci unit tests to different pipeline
  • #8437: remove default matmul program config
  • #0: Add myself to ttnn codeowners
  • #0: Update README.md to include mention of TTNN_CONFIG_OVERRIDES
  • #0: Fix typos and add TTNN_CONFIG_OVERRIDES parameter descriptions to readme
  • #0: Add basic sanity checks during matmul program config creation
  • #8907: Sweep tests for tilize/untilize
  • #8902: Fixed program caching bug in nlp load slice op and added additional test cases for the op
  • #8917: Add sweep test for the fold op
  • #0: Properly support trivial single core case for 1D matmuls
  • #6343: updated test_perf with test for bloom causal_lm
  • #6343: Add functional_bloom test_demo
  • Update README.md
  • Enable optimised attention by default in falcon prefill.
  • Replace FreeList shared_ptr with local_shared_ptr
  • Add dummy_weights mode for mixtral tests
  • Refactor operation calls: Replace operation::run() with operation::launch_op()
  • Use HiFi2 to bump Falcon7b prefill PCC
  • #8902: add input and attn_mask del
  • #8930: Disable llama perf test
  • #0: Add third codeowner to matmul path
  • #0: Add create_venv.sh as environment option in installation instructions
  • #7083: Composite conv fix for relu called after matmul
  • #7525: Skip batch 7 metal BERT on WH B0 because it still hangs too often
  • #8871: Add initial infra/support for dram sharding
  • #8531: delete all makefiles
  • #0: Delete dead code from work_split.hpp
  • #8853: Uplift SFPI to latest w/ BH support
  • #8725: Warn user if kernel cache is enabled
  • #0: Minor test_prefetcher fixes
  • #5389: Move ttnn.repeat to c++
  • #8131: temp fix for PCC issue on W0.
  • Optimize e2e perf Falcon40b modifying layernorm
  • #0: Relax Falcon7b perf target
  • #0: Resolve segfault in llama async mode
  • Resnet Optimizations
  • Create Falcon7b perplexity test and utility functions for text-gen datasets
  • Revert "#8131: temp fix for PCC issue on W0."
  • bmm dram sharded opt
  • #8943: Clean up profiler python_env build flow
  • #8904: Add slack notifications for T3000 unit-tests
  • Add unet shallow functional, performance and demo test files
  • #8932: Multi-Device Mixtral Argmax Support
  • #8264: Worker thread optimizations:
  • TTNN tests for bf8 with mk tiled scalar
  • Ihamer/7468 inject noc delays
  • Support changed csv row orderings in Mixtral's op_perf_results.py
  • Correct merge issue in op_perf_results.py
  • #0: Add kernel groups to test_pgm_dispatch
  • #0: Add docs requirements to python env cache key because it can change the environment as well
  • #0: Add helper function to create CBs
  • #8973: Remove TT_METAL_ENV because we don't need it anymore
  • #5773: Move SD model to demo folder
  • #6938: Implement softplus as a single kernel
  • Model team/rotary embeddings llama
  • #8735: Fix hw/inc/blackhole files for compilation
  • Improve Mixtral perf with ttlib
  • Update README.md
  • #3712: fix old version of GN test
  • #0: Don't error on unused functions in compiler call
  • Revert " #8904: Add slack notifications for T3000 unit-tests"
  • Rtawfik/bh llk api
  • #0: Added interactive demo
  • Move Falcon7b before Mixtral in demo pipeline to workaround issue
  • #8112: Add support for ND tensors to matmul
  • #0: fix dram read benchmark
  • Fix bug in utility_functions::Profiler
  • Remove 1x1 matmul fallback on convolution and generalize convo…
  • #5389: Remove ttnn.split
  • #8767: decouple build folder name from build.cpp
  • #8735: Update common flags for BH build after sfpi module update
  • #8895: Fix ttnn.as_tensor(..) method for placing tensors on-device
  • #8539: Add cq_id to run_operation function args
  • #8632: Support fp32 dest acc en in moreh_sum and moreh_sum_backward
  • #5044: Add optional output tensor and remove autoformat in eltwise binary ops
  • #8895: Fix failing regression test in dump_tensor(...) API
  • More Resnet Optimizations
  • #4858: add typecast fp32 to uint32 op
  • #8995: refactoring moreh arange
  • #0: Add ccache option to build_metal.sh
  • Update Mixtral perf figures
  • #8349: Use BFP4_B for attention mask in falcon7b optimised prefill.
  • #0: Add CODEOWNERS for build_metal.sh
  • Rtawfik/add binary reuse metal
  • Update watcher.rst - use double backticks
  • Falcon40b tt_lib to ttnn.experimental
  • #0: fix dram sharded program cache
  • #7083: New halo fix for enabled program cache
  • #9051: Enable Llama model perf test
  • #8764: Single card WH demo tests
  • #8764: Various docs fixes for WH release
  • #0: Correct script locations for nightly single card
  • #8764: Use new device_l1_small_size fixture for SD demo interactive test
  • #9059: Update matmul test pcc
  • #0: Ensure weka mount is active for demo tests otherwise it won't run
  • #0: remove reserve to avoid bad alloc
  • #8764: Separate n150/n300 demo tests to not run BERT 11 on N150
  • Remove unnecessary llk sfpu param files
  • #9059: Add fallback for getting matmul program config
  • Add grouped convolution support
  • #8282: Support non-4d tensor and fp32_dest_acc_en for moreh nllloss backward
  • #8976: moreh_getitem receive signed integer index tensors
  • #9049: fix moreh_sgd callback and add callback test
  • #0: Remove argmax multi-device test due to segfault
  • #7724: Add prototype for autonomous streams for use in tunneller
  • #9036: GS & BH --> Combine llk param files using variable args
  • #0: optimize allgather for small tensor sizes
    ...
Read more

v0.46.0

05 Apr 13:57

📦 Uncategorized

  • user-triggerable C++ post-commit suite
  • #6406: add missing position_ids/attention_mask to bert demo
  • #6282: Add AdamW
  • #6315: Fix dprint tests for T3000
  • FD2: prefetch stall, dispatch wait, linear read, delay and cleanup
  • #6609: update wording in demo section of main README.md
  • #6364: Autocomplete for pybinded types
  • Asarje/ttnn rn50 b20
  • FD2.0 Test - Fix l1 buffer not page-size aligned in after FD-on-eth changes to L1_UNRESERVED_BASE
  • #6593: Add resharding to Llama2 model when possible.
  • #6572: Fix ttnn.repeat_interleave example in documentation
  • #5780: Re-enable 100K enqueue program stress test on grayskull
  • Enable basic width sharding support in all-gather
  • Alex/metal/remove cb wait markers
  • #6657: Use sysmem manager cq size instead of recomputing it each time…
  • #0: (MINOR) Add Grayskull purchase link and update version to 0.46.0
  • #5063: add TopK API to metal
  • #5480: FD2.0 Test - Fix test_prefetcher for dram paged read test (-t 3) on whb0
  • Fix logit low pcc
  • Backward op - Fixed ldexp, hardsigmoid and asin
  • #6598: Fix softplus
  • Add support for BFP4_B tensor serialization
  • Eltwise mul for different batch size
  • #6575: Split docs into separate Metalium and nn docs
  • #0: Add two separate links for documentation (tt-metalium/ttnn) on README
  • #6361: Update ttnn repeat to use correct shapes when formatting output
  • #0: Sayonaraaaaaaa
  • FD2.0 Test fix test_prefetcher add_paged_dram_data_to_worker_data dropping start_page
  • #5785: Watcher ringbuffer implementation
  • Add FD 2.0 WriteHost Command
  • #0: Put back frequent api tests because I'm an idiot
  • Optimize All Gather Interleaved Worker send/receive
  • #0: changing all #include common/* to #include tt_metal/common/*
  • #6676: Fix issues related to unary lte and gte
  • #5817: Fix lerp
  • #6589: Fix for relu_bw
  • #6633: Backward test update
  • #0: Skip logit, logiteps test
  • #0: Testing CI fix
  • #5480: Update test_prefetcher to pass added hugepage args to dispatch kernel
  • Fix l1 acc, add whb0 optimized conv tests
  • Alignment fix for eth core kernels
  • Add data parallel (multi-chip) for Falcon7b (prefill/decode) model and corresponding tests
  • CQ_DISPATCH_CMD_WRITE_PAGED support in test_dispatcher and passing tests
  • #6647: disable failing ci cpp tests and reenable cpp pipeline on CI
  • Backward test updates
  • Ngrujic/check bugs
  • Add Llama matmul perf tests to main
  • TTLIB: removing working tests from broken
  • #6443: Update backward asin and addcdiv logic
  • #0: Fix output cb size calculation in reshard op for bfp8b
  • #0: use smart ptrs in allocator
  • Jvasilje docs 0322
  • DRAM based device profiler with Tracy support
  • #6553: Fix ttnn.reshape(..) handling for bfloat16, TILE_LAYOUT
  • PR: #6746
  • Add Llama2 demo to tt-metal docs
  • Mistral-7B WH demo
  • Revert "#0: Put back frequent api tests because I'm an idiot"
  • FP32 support
  • #0: Add back frequent api tests to run.sh
  • Bteng/watcher ci3
  • Remove cpuprof
  • logo update
  • #6184: sharded row major silu support.
  • #6443: Update div_bw and backward ops test file
  • #6705: Relax forcing of keyword argument in ttnn.open_device
  • Forward op tests
  • #6691: Allow blocking of inner dim within a core for sharded in0 for 2d and 1d systolic matmuls
  • #6662: Width Sharding support for eltwise OP
  • Stable diffusion python API level perf improvements
  • Add get_compute_kernel_config_args function
  • #0: Add fd-2/main triggers for pull_request and push for post-commit
  • #5480: FD2 refactor for pre/dis patch variants
  • #6654: Add perf tests for ttnn ResNet50
  • #5480: Fix fd gtest unit test test_write_host
  • #0: Set myself as setup.py owner
  • #6780: Add mistral7b to demos list in getting started
  • #4003: re-added TTNN_ENABLE_LOGGING as runtime flag
  • #0: Fix semaphore address gen bug
  • #6769: Disable program caching for failing Llama tests.
  • #5480: Fix zero sized write transaction request that could occur in write_linear_host
  • #6077: Fix unet pcc issues
  • Remove DstSync from llk api templates
  • FP32 Support
  • #6680: Reverting move op change
  • #6443: Update asinh and softsign backward
  • Backward tests with updated test modules
  • Ngrujic/check bugs 1
  • #6654: Moving init for self.compute_kernel_config
  • #6805: reproduce the bug with sharded split_query_key_value_and_split_heads
  • #6832: Account for tile-padding in softmax for mistral 7B
  • Enable support for uint32 format to be consumed by SFPU (issue #4624)
  • #4252: fix clang build error since std::log2 only constexpr in gcc
  • #4003: log, debug and add pre- and post- hooks only for top-level ttnn ops
  • #6823: Fix core count to not include dispatch cores in op report
  • #6197: Align pages for interleaved <-> sharded.
  • METALIUM_GUIDE
  • Bteng/watcher post commit
  • #6443: update backward test file for relational ops and concat op
  • Revert "Bteng/watcher post commit"
  • #6443: Update backward ops
  • Backward test updates
  • #0: Add the dim 0 support repeat backward
  • Update hard related test ops
  • #6757: Remove set_profiler_location
  • #6443: Update backward ops erfinv elu hypot cos sin
  • #6861: Enable Watcher/dprint tests on T3000 CI
  • Update Mistral perf regression for CI, until issue is resolved
  • Mamba/perf v1
  • #0: remove data movement ops related to silu in SD
  • #4003: added proper fallback for getitem of ttnn.Tensor. Slice the tensor only on the tile boundary but set the shape based on whatever user provided
  • #4003: added proper fallbacks for every op that falls back to torch
  • #6731: add fix to LN width sharding
  • #5797: add back sweep test for ln
  • Integrate GroupNorm V2 to SD model
  • METALIUM_GUIDE.md updates
  • [Falcon7b] Fix bugs with inference throughput measurements in demo
  • #0: shallow unet add perf_mode
  • #6154: 2d matmul in0 height, in1 width sharding
  • #5249: Various Falcon40b test and demo cleanup
  • #0: fix incremental build
  • #0: remove upsample spill to DRAM
  • [Llama2 Prefill] Model Functionality completed
  • Watcher alignment checking for PCIe/DRAM <-> L1
  • #6920: fixed the error in whisper
  • Update METALIUM_GUIDE.md
  • #6644: save l1 buffers to data base
  • Update usage.rst
  • #6804: fix ttnn falcon7b demo regression + add to CI regressions
  • #6285: Add backward support for floor round and div_no_nan
  • [skip ci] Update INSTALLING.md
  • #6873: Add more test combinations to tt_lib sweeps add, add_unary, su…
  • Ngrujic/check bugs 3
  • #6882: Updated Mistral-7b perf estimate
  • #6850: Update install links in Sphinx docs to point directly to INSTALLING.md
  • #6619: Fix per op profiler sum
  • #6644: sync before calling print l1 buffers
  • Barsic/ttlib ops check
  • Barsic/ttlib params fix
  • #6962: Move cd tt-metal earlier in the command list of INSTALLING.md
  • #6819: Add support for CreateKernel absolute file paths
  • #6356: Remove half-half grid logic for bmms
  • #4003: added a flag to disable ttnn fallbacks. Don't throw an error w…
  • #0: Correct FW versions, tt-smi versions, and add note about tt-topology
  • #0: Capitalize tt to TT consistently for marketing
  • #0: Add myself as CODEOWNER for INSTALLING.md
  • #6644: ttnn visualizer
  • #6847: Allow disabling individual watcher features
  • #6889: Support printing/padding/tilizing multi-device tensors
  • #4003: removed ttnn.print_l1_buffers and consolidated all ttnn flags into a CONFIG class
  • #6217: tt_lib async mode support (single chip tensors supported)
  • Reshard With Ranges
  • #4003: updated buffer report to show...
Read more

v0.45.0

22 Mar 18:03

🚀 Features

  • #6204: added support for num_users < 32 for update cache op.
  • #6247 Llama2 Galaxy MLP implementation

📦 Uncategorized

  • #4736: Add support for moreh_norm op
  • Fix moreh_layernorm rstd
  • #5508: Change test_moreh_layernorm.py for debugging
  • #4686: add infra for sharing global struct among ops
  • #5592: Fix pcc on Falcon 7b prefill by turning on l1 packer on MLP 4h-to-h matmul
  • Fix layernorm beta data format reconfig
  • Add linked support for in0 in1 mcast in matmul
  • #4957: optimizing construct_2d_padded_tensor_list
  • #4003: added ttnn.as_tensor and enabled support for caching torch tensor
  • Revert "#0: Fix for fail in asinh backward"
  • #5829: Use moreh_common.hpp for data movement kernels across moreh OPs
  • Barsic/ttnn ops
  • #6030: Update resnet performance metrics
  • #5876: pytest & c++ test logging cleanup
  • #0: Use both 2x2 and 2x4 machines on every scheduled run
  • Add single core matmul benchmark
  • #6079: Update FORCE_INLINE to be nop when watcher is enabled
  • #5980: Fix a hard-coded bounds check in dprint
  • #5389: merged ttl and ttnn tensor classes into one
  • Initial Performance Model
  • fix ci
  • TTNN RN50 :: on the road to match perf with TTLIB version
  • #4438: Optimized single-core fold op
  • #5589: Add repeat-interleave and addcmul sweeps
  • #6055: Add square backward support
  • #6057: Add backward support for lgamma
  • #6056: Add backward support for frac and trunc
  • #6066: Add support for backward log sigmoid
  • #6002: Add backward support for binary maximum
  • Ngrujic/improve conversion to bfloat8b in sweeps
  • #5829: Use moreh_common.hpp for compute kernels across moreh OPs
  • #0: Remove post-commit label from multi device pipeline because it's not actually post commit
  • Add pack l1 acc to resnet conv
  • #6144: Skip 512x512 cross attn 2d upblock for now in nightly because it hangs
  • #6061: Add tanhshrink, threshold, Unary EQ backward ops support
  • Width Sharded Concat for Unet
  • #5184: uncommenting various moreh test case.
  • Fix compute kernel config arg for resnet50
  • Nsmith/untilize unit test
  • Revert "Revert "#5389: merged ttl and tensor classes into one""
  • #4438: Do not use the new fold op in Resnet tests
  • Remove corerangeset that does not work on wormhole
  • #6129: Expose kernel config attrs and use 4 dst tiles for fp32 configs
  • #5391: Add device perf
  • #0: Use multiplier for wormhole b0 mulsi3
  • #4003: removed ttnn.Tensor autoclass from tensor.rst
  • TTNN MultiDevice Support
  • build artifacts
  • #4947: Add noc alignment checks to watcher
  • Add ttnn multi-chip unit test for checking device shards
  • Nsmith/fix unet
  • #6043: Random program stress test of command queues
  • Logit and logiteps backward support
  • Backward support for log2
  • Add missing ttnn tests and disable broken tests until issues are fixed
  • Fix Events feature for FD1.3 (out-of-order event ids, events feature missing) #6093
  • #5873: make top-level post commit workflow re-useable
  • #5589: add groupnorm for ttnn sweeps
  • Ngrujic/ttnn sweeps 4
  • Add ethernet datamover (EDM) - a foundational ethernet transfer engine
  • #6116: Add backward support for softshrink
  • #0: Add verbose make logs to artifact and make nicer name on metal
  • #0: Only use 2x4 setup for multi-card WH CI as 2x2 does not provide us good feedback
  • #4809 dprint tensix regs
  • #4003: fixed bloom perf test
  • #6187: Conv bugfix
  • #0: concat RM support variable stick widths across inputs
  • TTNN RN50 on WHB0
  • #6084: Lower thresholds slightly after using proper configs for device resnet
  • Fast dispatch 2.0 proof of concept
  • #6218: add pytest for matmul 1d 2d
  • #6177: use is_tensor_storage_on_device so it works for MultiDeviceStorage
  • #6082: support workers + eth cores in one program
  • #6215: Rename TensorToMeshMapper/MeshToTensorComposer
  • #6164: Update test_noc_unicast_vs_multicast_to_single_core_latency to not use same cores for producer and consumer on WH
  • #6117: Add backward support for softplus
  • #6223: remove redundant call to context switch
  • Integrate EDM with all-gather.
  • #6136: Add backward support for unary LE and GE
  • #5398: fix unicast binaries
  • Barsic/ttnn ops 2
  • #5380: Add wormhole_b0 model perf tests, only falcon7b in ttlib for now
  • #5372: Updated README.md file for demo
  • #4003: updated ttnn.concat to have a registered fallback
  • Llama2 functional bringup
  • #5589: Add working BFLOAT8_B sweeps to working folder
  • FD2.0 rename HostQ->PrefetchQ, add multi-core capability, fix NOC coords
  • #0: bugfix in ttnn resnet caught by nightly
  • #0: fix tt_bisect build bug
  • Watcher Asserts
  • #6183: add unit test for sd matmul ops
  • #6254: Make program cache per device:
  • #5394: Add functional version of Mamba architecture
  • #6257: Add temporary convenience script for 800MHz / new eth reset dependent CI
  • #5661: Enable gtests for fast dispatch + R chip
  • Alex/metal/bmm large block untilize out
  • #5389: made tensor attributes public and use ttnn::Shape instead of tt::tt_metal::Shape for storing shape
  • Revert "#6183: add unit test for sd matmul ops"
  • #4003: print all of the L1 buffers using ttnn.print_l1_buffer_state
  • #4003: print all of the L1 buffers using ttnn.print_l1_buffers
  • #4438: Implement sharded multi-core fold op for Resnet50
  • #6149: disabled the check for comparing generated report with GOLDEN_L1_BUFFER_REPORT because on pipelines it looks different than when running locally
  • FD2.0 fixes+mcast support for write and packed_write
  • Shwetank tt/config
  • #0: Change order of device and use_program_cache fixture in remaining pytests
  • Softplus with beta and threshold param
  • Build tests during artifact creation
  • #6149: disabled test_print_l1_buffers_of_add_operation
  • #4003: updated ttnn.to_torch to work with bfloat8_b tensors that are not multiple of tile size without tile padding
  • #0: add to/from L1 reshard test
  • #0: Add back deleted shape assertions for interleaved concat
  • test errors flagged by watcher
  • #0: fix incremental build
  • Merge xuncai/llama-attention-galaxy to main: First version of llama-attention galaxy on emulated chips
  • #6329: Fixing a bug causing mismatch on indices
  • #6321: Test which sweeps read/write buffer and just checks that the e…
  • Support moreh_getitem forward
  • #6125: Update in0_block_w to be full shard width for sharded 2D systolic matmul
  • #6107: Add softsign, sign, unary ceil backward support
  • #6226: Add backward support for div
  • #6234: Add backward support for rdiv
  • #6236: Add backward support for fmod and remainder
  • #4003: added positional embeddings to bert and updated ttnn_sharded_optimized_bert to run with batch size of 12
  • Indexed Fill
  • #5589: remove dtype in gen function sweep tests where needed
  • #6347: Print built-in defines once only
  • #0: Add Mo as code owner on profiler code
  • #0: Simplify tt_lib.scripts package by adding a specific tt_eager/scripts directory and putting the production scripts in there, whereas development scripts will stay in /scripts
  • #0: Fixture reorder changes reverted for falcon_7b perf test
  • #5424: remove metal_ckernel_sfpu
  • #0: Update remaining tt_lib.program_cache calls to use device APIs
  • #6183: add unit test for sd matmul ops
  • #6289: fix dispatcher page calculation
  • #5924: Enable unet on wormhole_b0 changes
  • #6325: skip test_multi_device.py for grayskull arch
  • Alex/metal/pack untilize no repack
  • #6144: Not hanging on GS or WH with or without Watcher
  • Agrebenisan/swq hwq cardinality cleanup
  • #6146: Add backward support for conj
  • #0: bug fix UTWH div_up instead of div trunc for calculating CB sizes
  • Fix To/From Sharded Bug
  • #6206: Fix resharding page mapp...
Read more

v0.44.0

27 Feb 15:57

📦 Uncategorized

  • Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr
  • #4794: Implement DownBlock2D using ttnn for stable_diffusion model
  • #4797: Implement BasicTransformerBlock sub-module using ttnn for stab…
  • #0: write cluster config for FD mode, non tunneling cores as well
  • Update bw test, change mulsi calls to use *
  • #3003: updated tt-lib documentation
  • #0: Update to v0.44.0
  • #4003: added ability to trace ttnn operations using torchtrail library
  • Support moreh logsoftmax
  • #4614: gitmodules: Use https URLs for submodules
  • #0: add reviewers to frequently touched ops docs file
  • backward ops - hypot and atan2
  • #4885: Move program device map to program
  • #4858: Add support for float to int typecast
  • Matmul_block on a smaller grid size
  • Revert "#0: Add support for typecast float to int"
  • Add dst ethernet router support and remote command processor to accept FD packets on remote chip
  • Falcon40B TT Implementation
  • #5198: Fix moreh softmax related bug
  • #0: skip MOREH Softmax tests from main
  • #3122: Use device grid size in falcon_attention to be generic...
  • #0: Add assertions for interleaved tensors for ops that don't support sharding
  • #5169: Add activation ops to ttnn
  • #3003: add duration to the ttnn operation nodes when TTNN_ENABLE_LOGGING=1 is used to compile the code
  • #5027: Optimize group attn matmul for Falcon40B decode
  • #0: add documentation about managing documentation
  • Adding docs for maxpool, avg pool and upsample
  • Revert "#0: skip MOREH Softmax tests from d5811b7
  • #5165: Add hyperbolic ops to ttnn
  • #4866: Add grayskull open source llk-library
  • #5002: simplified preprocessing of CNNs using preprocess_model
  • Create GroupNorm sharded in TTNN
  • #5097: Support for dedicated completion queue thread
  • upsample test calculate grid
  • fix for sharded allocater when num banks == num cores
  • MHA tutorial interactive notebook with diagrams
  • #4003: Adding a profile tutorial
  • #0: Added non-blocking read stress test
  • Revert "MHA tutorial interactive notebook with diagrams"
  • #0: Update all_gather to work for multi_link. Update falcon-40b to use 2 links for all gathers
  • #5142: Remove slow dispatch mode from working sweeps
  • #3003: fixed the input tensor documentation
  • #0: Temp slower resnet VM run
  • throw on fast dispatch for to_host_sharded as it's not supported
  • #5253: Fix kv_past_len being passed in to rotary embedding for falcon models
  • #5233: started adding ttnn_functional_resnet
  • #3003: updated ttnn documentation to explain what features it has over tt_lib. Added standalone examples of basic usage of ttnn
  • #0: Speedup incremental builds
  • #0: Change setup.py to be git worktree friendly
  • MHA tutorial interactive notebook with diagrams
  • #3003: disable tutorial 6 from running as the unit test
  • Agrebenisan/non blocking tensor reads
  • #5275: CODEOWNERS: update to include files relevant for ttnn team
  • Fix an intermittent launch message transfer error
  • Revert "MHA tutorial interactive notebook with diagrams"
  • #0: add parens in LLK doc
  • #3003: only unit test tutorials that work on pipelines
  • #5246: Add unary math ops to ttnn
  • Vignesh/stable diffusion ttnn basic transformer block fix
  • #4854: Implement attention and rms_norm sub-module using ttnn for mis…
  • #4795: Add upblock2d to functional stable diffusion model
  • #4796: Implement Transformer2DModel using ttnn for stable_diffusion m…
  • #0: Adding llk wormhole_b0 submodule
  • #4003: Adding pybind11 to ttnn
  • #5296: Fix broken link to host_api.hpp in README.md
  • #0: Fix bug with the way we were measuring bert inference time
  • #0: Change local tt_lib._C module install from symlink to copy
  • #5233: added ability to fold batch_norm2d into conv2d
  • #5222: replace hex8_to_hex32.py with cpp to shave off some compile time -temporary fix
  • Enable tests for WHB0
  • #5137: Cleanups for newer Linux distro / toolchains
  • #5233: implemented support for converting all Resnet-18 modules using preprocess_model function
  • #3003: fix model preprocessing bug
  • #4799: Implement CrossAttnDownBlock2D sub-module using ttnn for stabl…
  • #4800: Implement UNetMidBlock2DCrossAttn using ttnn for stable_diffus…
  • #4798: Add ttnn cross attn upblock2d in functional stable diffusion m…
  • #4801: Implement Unet 2D Condition model using ttnn for stable_diffus…
  • #4965: Rename Conv2D to Conv2d and MaxPool2D to MaxPool2d to match torch
  • #0: Remove departed team member from CODEOWNERS
  • #0: add to codeowners
  • #5314: Only stall on first scheduled read after commands with side effects
  • #4965: fix bad rebase
  • #0: Add more instructions for dispatching workflow actions and a note about skipping git hooks
  • Update optimized Bert to support WH grid sizes, add sharding support for RMSNorm
  • #4642: create gtest_smoke as a sanity test suit
  • #5341: context switch if eth txq is full
  • #5323: Convolutions of small size fail during parallelization calculations
  • Npetrovic/transformer softmax
  • Fix groupnorm for narrow channels
  • #4862: added more test for ttnn bloom. Update optimized ttnn bert to match the structure of non-optimized ttnn bert
  • #0: Add an envvar parser with value detection and default value setti…
  • #4732: Clean up compute kernel apis
  • #5318: Modify Falcon7B to use attn_matmul for wormhole
  • #0: make logLocationsRecord a static function
  • #5233: run convs with auto-format
  • #5377: Avoid segfault by checking buffer !null before getting device
  • Alex/metal/pack untilize b0
  • #4487: Support block sharding in upsample
  • #5359: update python package transformers + dependencies to include Falcon
  • #3708: Add support for LN having gamma/beta in bfp8
  • #4003: Skip sweep tests if not available
  • #4003: use faster TMs in optimized ttnn whisper
  • #4732: Clean up compute_kernel_api
  • More optimizations for group_attn_matmul
  • #5233: updated resnet18 to run residual connections
  • #3003: added more meaningful errors to ttnn. Updated getitem to run on device in the cases when it can
  • #5233: simplified the logic in tracer
  • #3003: include ttl operations and necessary types under ttnn.ttl
  • #0: Add note about no merge commits in main
  • #0: Add timeout in profiler regression workflow
  • codeowners update
  • #5365: Add device argument to determine grid size based on target
  • disable whisper until further investigation, see issue #5430
  • #3003: fixed ttnn convs
  • #3886: Fix build error for C++ tests in debug mode
  • #4954: Support depth 32 in maxpool writer
  • #0: Pass output cb to pack init functions
  • #0: skipping DeviceLoadBlankKernels on remote devices
  • #5359: transformers: update version and relax pcc asserts
  • #3003: guidelines for adding new op
  • Don't assume user has one entry in their $PYTHONPATH
  • FP32 tensor support for matmul
  • #3003: updated tutorial 001 to describe the tensor more comprehensively before showing the add
  • Onboard additional metal code owners
  • #5402: Add redesigned host-side sw command queue, it can be configured i…
  • #3003: fixed docs
  • Alex/metal/enable conv tests on b0
  • #5356: git bisect script to find broken commits
  • #0: Update data_format.cpp file
  • Add skip to full grid matmul whb0
  • #3003: simplified the logic in ttnn/operations/matmul.py. Added dataclasses instead of tuples for CoreGrid and ShardShape
  • #5204: adding moreh's test suit. removing an absolute assertion.
  • Npetrovic/lt gt ne fix
  • #0: Move device id attribute from tensor to DeviceStorage
  • #3003: fixed scheduled pipeline
  • Npetrovic/transformer concat sweeps ttnn
  • #3003: added support for running ttnn.matmul using 1D_systolic_array. Also, added support for passsing in the program config directly
Read more

v0.43.0

08 Feb 18:02

📦 Uncategorized

  • #4668: Yolov5 GS Demo Benchmarking
  • #0: uplift umd; pick up fix for n150 cluster
  • #3178: Fix for wormhole b0 reduce w
  • #4489: fixed bugs in the program caching of eltwise unary and eltwise binary. Updated bloom to use L1 memory config
  • #4821: Add cumsum op to tt_dnn
  • Dispatch/Bandwidth tests
  • #4003: fixed test_eltwise_unary_op
  • Argmax and Argmin Support
  • #3212: softmax works after reduce fix of max, sum, etc. for WHB0
  • #0: (MINOR) Update version to v0.43.0
  • #4761: Add call to ttl repeat_interleave and also provide script for …
  • #4003: fixed the bug with printing the compile-time attributes
  • Support moreh arange
  • Remove skip_for_wormhole_b0 for test_moreh_softmax and test_moreh_softmin
  • #4541: remove unpad start at 0 limitation
  • Agrebenisan/restart cmd fix
  • Support moreh SGD
  • #0: Use fetch-depth: 0 instead of fetch-tags because otherwise git complains of commit SHA/tag conflict
  • #0: Add code owners for primary operations api binding
  • #4547: Add 2x2 window unit tests to ttnn maxpool
  • #4003: restructure ttnn
  • #4889: Change TileSlice printing to only print tile data
  • #4836: Add support for blocking conv activation in 2d systolic conv v…
  • #0: Update unicast cycles lower bound
  • #4904: Add support for 1d width sharded LN
  • #4941: Convert command header to struct for easier maintainability
  • #4823: enable sum_0 operation fails with low PCC [Wormhole,Grayskull]
  • Fix sharded buffers for one core in fast dispatch
  • #4906: global reduce sum, mean, max, min operations added
  • Revert "#4823: enable sum_0 operation fails with low PCC [Wormhole,GS]
  • #0: Change codeowners from specific op binding files/dirs to all tt_lib bindings
  • #4003: split unary sweep into per op sweeps
  • #4232: added support for converting from numpy arrays to ttnn tensors. Borrow data whenever possible when converting from numpy/torch
  • Uplift AttnMatmul to support GroupAttnMatmul
  • Add watcher-specific CI tests
  • #4916: Add avg pool to ttnn
  • #0: Add a lock on DPRINT server raise/wait structures
  • #4967: added validation for input tensors
  • #4971: update documentation with a new doc hierarchy
  • #0: Leftover decorate_operation replacement for avg pool
  • #4899: fix the permute to operate on the intended shape
  • #4730: Add tt_lib.tensor.concat
  • Aliu/enqueue eth
  • #4003: Updating functional performance from changes in ttnn.permute w…
  • #4984: Remove dead OP_INFO and graph interpreter
  • #4878: initial commit to add Conv parameters to ttnn.preprocess_model_parameters
  • Update Program Hashes for Ops using Mem config
  • #4984: Remove unused dprint functionality
  • Aliu/ci fix
  • #4215: Add Argmax and Argmin Fallback
  • #4999: added input tensor validation to add, sub and mul operations.
  • Support for softmax rm major sharding and causal mask sharding
  • #0: provide API for where() to support scalar True/False branches
  • #5003: Update expected compile and runtimes for perf regression on VM
  • Revert "Update Program Hashes for Ops using Mem config"
  • #4931: add apis to get ethernet by socket ids
  • #4786: Add upsample_nearest2d functional stable diffusion
  • #4986: deploy docs only to main and enable devs to run docs build on different pages
  • Deploy ttnn sweeps results to docs
  • #4958: Move all python api unit tests to frequent in order to reduce SD pipeline length
  • #4999: Added input validation for ttnn.matmul and ttnn.linear. Add unit test for linear operation. Update input tensor validation in binary.py. Fix compute_output_shapes in bmm_op.cpp
  • #4620: Fix+improve bw test
  • #4852: Add unit tests for functional bloom
  • #5032: scalar argument versions for relops
  • #0: Add some README recommendations from MCW to clarify issue about access to internal workflows VM installation page
  • #4790: Implement GEGLU using ttnn for stable_diffusion model
  • #4999: Adding validation checks
  • #4791: Implement Feedforward sub-module using ttnn for stable_diffusi…
  • Npetrovic/bw ops sweeps
  • #4999: update documentation of ttnn operations to include the validation schema
  • #0: Remove model run from frequent_api_pipeline per @tt-rkim
  • Minor dprint/watcher cleanup
  • #4858: Add support for typecast
  • #0: Disable dprint tests because they're flaky at the moment
  • #4946: Add trig ops to ttnn
  • Nshanker/convs split by 2
  • #4946: Add inv trig ops to ttnn
  • #4003: fixed circular dependency in decorators
  • #5054: Removed asserts from conv op host code that are not required. …
  • #4003: fixed circular dependencies in ttnn
  • #4852: Fix CI pipeline by re-enabling functional bloom for causal LM
  • GroupNorm sharded support
  • #4972: is_sharded and memory_config is free from tensor
  • #0: eltwise ops/activate operator tracking for GS, and WHB0
  • Aliu/fd tunneling pr
  • #4642: Converted 14 old cpp tests to use gtest, with capabilities to switch btwn FD/SD when possible
  • #4852: Add tests for functional ttnn bloom implementation.
  • #4003: correctly convert all parameters of torch module to ttnn parameters
  • #5082: Pow gradient calculation method is different with pytorch
  • Argmax/Argmin support for channel, batch and all dim
  • #4420: switch to shared_ptr
  • #4420: return shared_future from taskflow async wrapper
  • Minor DPrint fixes
  • #0: Enable/disable clearing L1 from env var
  • #4003: started moving ttnn operation to C++
  • #4003: Add script to help with finding issues that we need approval for
  • #5044: Adding support for optional output tensors
  • #4003: Adding the open flag to show only open PRs
  • #5048: Add CreateDevices and CloseDevices api to detail
  • decouple ClearProgramCache from CommandQueue
  • Conv fixes for padding input channels. Shallow conv fixes. Conv input/output autoformatting. Cleanup
  • Asarje/mp unpack tilize fused
  • Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr
  • #5137: Cleanups for newer Linux distro / toolchains
  • Revert "#5137: Cleanups for newer Linux distro / toolchains"
  • Revert "Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr"
  • #4793: Implement ResnetBlock2D using ttnn for stable_diffusion model
  • #4788: Implement Downsample2D using ttnn for stable_diffusion model
  • #4792: Implement CrossAttention sub-module using ttnn for stable_diff…
  • #4747: Reduce amount of samples in bert sweeps
  • #4789: Add upsample2d to functional_stable_diffusion model
  • #0: Add fix for lamb optimizer
  • #5057: Add relational ops support to TTNN (see the sketch after this list)
  • skip eth test suite on GS
  • #4003: updated ttnn.Tensor to be derived from ttl.tensor.Tensor
  • Asarje/shwetank upsample
  • #5082: power gradient is erroneous when exponent is in range (0-1)
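
As an illustration of the relational-ops entry (#5057) above, here is a minimal sketch assuming the ttnn.gt / ttnn.eq names and the from_torch/to_torch helpers used by later releases; the exact spellings in this release may differ.

```python
# Hedged sketch of elementwise relational ops in ttnn.
# Outputs are 0/1-valued tensors with the same shape as the inputs.
import torch
import ttnn

device = ttnn.open_device(device_id=0)

x = ttnn.from_torch(torch.randn(1, 1, 32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
y = ttnn.from_torch(torch.randn(1, 1, 32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

gt_mask = ttnn.gt(x, y)   # 1.0 where x > y, else 0.0
eq_mask = ttnn.eq(x, y)   # elementwise equality

print(ttnn.to_torch(gt_mask).sum(), ttnn.to_torch(eq_mask).sum())
ttnn.close_device(device)
```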

v0.42.0

26 Jan 14:59

Choose a tag to compare

📦 Uncategorized

  • Syrmia/new sweeps
  • Update test sweeps for the system memory input buffer
  • #4181: Add bfloat8_b dtype fix for tests that should support bfloat8_b
  • #4343: Add new op sweeps for GS and WH
  • #0: (MINOR) Update to v0.42.0
  • #4311: Automate determining and scheduling RC generation
  • Jedi main
  • #0: Remove path appends from test files
  • #4003: Adding padding for whisper
  • #4632: Add dprint server support for eth cores
  • #4003: added ttnn.group_norm
  • #4003: added ttnn.silu
  • #3999: move fallback_ops.silu -> tt_lib.tensor.silu
  • #4683: Support tracing
  • #0: Patch for bad state reached when enqueuing trace
  • Nshanker/remove pow of 2 req for channels size
  • #4003: added ttnn.pad
  • #4730: Adding ttnn.concat as fallback (see the sketch after this list)
  • #4003: added ttnn.split
  • Syrmia/ttnn sweeps
  • #4347: Move VGG tensors to L1
  • #4670: Add end to end demo for functional roberta model
  • #4431: mnist gs_demo benchmark
  • #4623: lenet gs demo benchmarking [Pending CI]
  • #4720: Improve folder structure of broken sweep tests
  • Adding interface to assign dispatch kernels to dispatch functionality and adding kernel to service remote command queue
  • #4003: Fixing whisper pcc in last layer
  • #4003: updated ttnn unit tests to assert using higher PCC thresholds
  • #4761: Adding fallback for repeat_interleave
  • #4003: simplified the logic in to_layout
  • #4003: added ttnn.log
  • #4003: updated ttnn.to_layout and ttnn.pad to do the right thing with padded shape
  • #0: Fix reference to Python integration test in README
  • #0: As a quick fix for now, source /etc/rc.local to re-insert number of hugepages back in after starting weka service in perf pipelines
  • #4003: updated model names
  • #4617: Matmul went to 0.9998887677925289 with float comparison to torch
  • #0: Fix bad access to memconfig/device when input tensors are on host
  • #4503: Demo for functional bloom
  • #4611: Add end to end test for ViT model with ImageNet data
  • #4506: SSD gs demo benchmarking
  • #4504: Add end to end demo for functional t5 model
  • #4557: Uplift swin model to resolve errors in tests & Add test_perf_accuracy...
  • #4556: Roberta gs demo benchmarking
  • #3974: nanogpt uplift and move weights to weka path
  • #4610: EfficientNet gs demo benchmark
  • #4003: added more sweeps
  • #4231: Fine-tune the unary ops for add, sub, div, mul binops with one scalar constant arg
  • #516: Sanity check tracy artifact generation
  • #4003: fixed crashing sweep tests
  • #0: Update get_semaphore to return 16B aligned semaphore addresses
  • #0: Add tracy dependencies to github actions runner workflows
  • #4730: Add sweep test for ttnn.concat
  • Update ops for sharding used in falcon 40b
  • #4833: Create initial ttnn sweeps with csv artifact upload
  • #4003: debugging whisper
  • #4003: Setting __all__ = [] to block wildcard imports
  • TTNN Sharded tensor support
  • #3662: Impl moreh_clip_grad_norm
  • #4609: Deit gs demo benchmarking
  • #4741: Add sum op to tt_dnn
  • #4622: Yolov3 GS demo Benchmarking
  • #0: Add weka mount + force hugepage mount with /etc/rc.local in frequent pipelines
  • #0: Reduce timeout of multi queue single device FD post commit
  • #4003: Make ttnn sweep tests available from pytest
  • Add MaxPool2d to ttnn
  • Ttnn 4761 add sweep for repeat interleave
  • #0: Remove checkout secret
  • #4847: Error out when there are insufficient hugepages
  • simpler hugepage check
  • Revert "#4839: simpler hugepage check"
  • #4862: Disable test_moreh_clip_grad_norm_with_error_if_nonfinite
  • #4374: Benchmarking for bloom TT model
  • #4505: Add end to end demo for functional bert model
  • #4003: updated documentation
  • #4003: updated concat operation to raise an exception if the dimension is out of range
  • #0: Loosen models perf tolerance for GS
  • #0: Add more instructions on syseng assets installation + direct users to additional hugepages setup if needed for cloud VMs
  • #4815: New restart command which safely resets a command queue into a starting state
  • Revert "#4815: New restart command which safely resets a command queue into a starting state"
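
Several v0.42.0 entries above add new ttnn ops (ttnn.pad, ttnn.split, ttnn.concat, ...). As an illustration of the ttnn.concat fallback from #4730, here is a minimal sketch assuming the from_torch/to_torch helpers and a dim keyword as in later releases.

```python
# Minimal sketch of ttnn.concat (at this point in the changelog it is a
# torch-backed fallback; the `dim` keyword is an assumption from later releases).
import torch
import ttnn

device = ttnn.open_device(device_id=0)

a = ttnn.from_torch(torch.randn(1, 1, 32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.randn(1, 1, 32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

c = ttnn.concat([a, b], dim=3)  # concatenate along the last (width) dimension
print(ttnn.to_torch(c).shape)   # torch.Size([1, 1, 32, 64])
ttnn.close_device(device)
```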