Releases: tenstorrent/tt-metal
v0.41.0
Metal
API Changes
- `tt::tt_metal::detail::GLOBAL_CQ` replaced with `tt::tt_metal::detail::GetCommandQueue(Device *device)`
- New `num_hw_cqs` parameter to specify the underlying number of HW CQs for a given `Device` (see the sketch after this list): `Device *CreateDevice(chip_id_t device_id, const uint8_t num_hw_cqs = 1, const std::vector<uint32_t>& l1_bank_remap = {});`
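A minimal sketch of the two changes together (the detail header path and the `CommandQueue&` return type are assumptions; error handling omitted):

```cpp
#include "tt_metal/host_api.hpp"
#include "tt_metal/detail/tt_metal.hpp" // assumed location of the detail:: helpers

using namespace tt::tt_metal;

int main() {
    // New num_hw_cqs parameter: back this Device with 2 hardware command queues.
    Device *device = CreateDevice(/*device_id=*/0, /*num_hw_cqs=*/2);

    // GLOBAL_CQ is gone; fetch the device's command queue explicitly instead.
    CommandQueue &cq = detail::GetCommandQueue(device);
    (void)cq; // enqueue work here as usual, e.g. EnqueueProgram(cq, ...)

    CloseDevice(device);
    return 0;
}
```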
Tools
Profiler
- Integrated Tracy host-side CLI capture and CSV report generation with Metal's profiler infrastructure
- Added support for device profiling on ethernet cores for Wormhole systems.
ttNN
Infrastructure
- Updated ttnn documentation with visualizations and examples
- Added padded shape to ttnn
- Renamed `ttnn.nlp` to `ttnn.transformer`
- Updated `ttnn.transformer.split_query_key_value_and_split_heads` to handle most shapes, multi-head query, and cases where key_value_states are used to compute key and value
- Added `ttnn.rms_norm`
- Added `ttnn.Shape` and exposed support for padded shape; simplified broadcasting and reduction operations
- Moved `ttnn.Tensor` to C++
- Added debug decorator for ttnn operations
Operations
- Layer operators `layernorm`, `conv`, and `softmax` were optimized for multi-core computation; model-specific operators for Falcon7B were also added.
- The operator `normalize_global` was added to the `tt_lib.tensor` namespace; this transforms the tensor by normalizing elements to the mean and standard deviation of the entire tensor.
- The operator `lamb_optimizer` was added to the `tt_lib.tensor` namespace to help with computing the back-propagation algorithm and weight update for DNNs in the training loop.

The following backward operators, for use with the back-propagation training loop, have been added to the tt_dnn library; they are accessible with the suffix `_bw` in the `tt_lib.tensor` namespace.
1. abs
2. add
3. addalpha
4. addcdiv
5. addcmul
6. binary_assign
7. binary_le
8. clamp
9. clamp_max
10. clamp_min
11. div
12. exp
13. fill
14. fill_zero
15. gt
16. log
17. lt
18. max
19. min
20. mul
21. ne
22. neg
23. relu
24. rsqrt
25. rsub
26. sigmoid
27. sqrt
28. sub
29. tan
30. tanh
31. unary_add
32. unary_assign
33. unary_div
34. unary_mul
35. unary_pow
36. unary_sub
37. where
Models
- Added ttnn implementation for Roberta, Whisper, T5-small, and flan-T5-small
- Updated ttnn implementation of Bloom to work with L1 memory, and cleaned up ttnn implementation of BERT
- Updated Mistral implementation to use tilized tensors and operations
- Updated VGG model to load pre-tilized weight tensors and use tilized tensors
- Added benchmarking demo for DistilBert and T5 using SQuAD dataset for question answering
v0.40.0
📦 Uncategorized
- Opt LN_sharded and SMX_sharded
- PR: #4147
- #1919: Turn existing allocator tests into gtests
- PR: #4218
- Agrebenisan/fd perf opt
- PR: #4219
- #3932: Rename unary op args which were input_a -> input, binary ops from input, other -> input_a, input_b
- PR: #4194
- #3971: Fix TSLICE printing truncation when hitting MAX_COUNT
- PR: #4159
- #0: Fix undefined variable error when running with watcher
- PR: #4256
- #4141: Add GetPreferredNOCForDRAMRead, GetPreferredNOCForDRAMWrite and update all ops to use these apis
- PR: #4184
- #3420: fix eth core init L1 bug
- PR: #4262
- #0: Add ttnn founding engineers as CODEOWNERS of functional models
- PR: #4265
- #0: Commonize logic between E2E and device perf functions/scripts. Enable assertions for device perf scripts/ci
- PR: #4248
- Issue 4073: Fix for host-side hanging when an invalid DPRINT WAIT command is running on the device.
- PR: #4103
- #0: Add tt-rkim as CODEOWNERS for setup_hugepages.py
- PR: #4266
- #4003: implemented functional t5 model
- PR: #4241
- #3003: commonized variable names across ttnn tests. Removed ttnn.experimental. Added ttnn.unary and commonized the import of ttl unary ops
- PR: #4268
- #0: Delete extra text in first docs page about being added to repo
- PR: #4295
- write watcher log to built/ folder rather than kernel subfolder
- PR: #4291
- Add Batch>1 fix for matmul blocking API
- PR: #4296
- #4231: improve unary add, sub, mul and div implementation in SFPU. Add complex polar operator
- PR: #4257
- #3493: sharded tensor support
- PR: #3790
- REVERT #4231: Fine-tune the unary ops to improve performance
- PR: #4312
- #0: Move setup_hugepages.py to release assets
- PR: #4264
- #0: (MINOR) Update VERSION to 0.40.0
- PR: #4315
- #4301: Fix link to announcements in README
- PR: #4317
- #4301: Replace some more instances of Metal w/ Metalium in docs
- PR: #4320
- Llk refactor uplift
- PR: #3908
- #0: Fix TT-Metalium docs link in get_performance.rst
- PR: #4323
- #0: uplift in device code
- PR: #4299
- #4176: uplift umd plus tt_metal changes
- PR: #4333
- init fw once
- PR: #4335
- Merge v2 of untilize_with_halo, maxpool, and conv ops for Resnet-50
- PR: #4325
- Backward ops for Metalium - part-2
- PR: #4322
- #4211: Assert that hugepages number is greater than or equal to required, rather than equal to
- PR: #4381
- Update resnet readme
- PR: #4367
- Add Run Instructions for BERT_large sharded in readme
- PR: #4366
- Add batch 20 for resnet-50
- PR: #4371
- #4376: Support mixed precision for eltwise binary with prescaling
- PR: #4387
- Increase timeout of slow dispatch unit tests and switch to Y_M_D format for ops logs
- PR: #4397
- #0: point umd to main, cosmetic change
- PR: #4396
- New tilize and straightforward vec gen in matmul kernel examples
- PR: #4261
- #4216: Enable DPrint slow dispatch testing
- PR: #4326
- #4376: Call llk reconfig functions in compute kernel apis for WH
- PR: #4393
- #4336: #4386: Fix interleaved_to_sharded writer waiting on incorrect amount of data for uneven shards
- PR: #4402
- #1433: removed Device* and MemoryConfig from DeviceStorage
- PR: #4411
- #0: Increase fast dispatch post commit timeout and shorten full regressions because we no longer need that much time
- PR: #4412
- #4003: added ttnn.mean, ttnn.rsqrt and ttnn.pow and got rid of ttl use in ttnn_functional_t5. Updated ttnn.Tensor to store shape as ttnn.Shape
- PR: #4383
- Aliu/load base erisc
- PR: #4394
- #4399: add spell checker script for docs spellchecking
- PR: #4398
- #2134: Uplift UMD
- PR: #4400
- #0: fix memory leaks found in test_sfpu via valgrind
- PR: #4419
- Revert "#4399: add spell checker script spellcheck.sh should be read…
- PR: #4424
- #0: update llk.rst for minor ReST syntax
- PR: #4423
- #2934: Make one CommandQueue and one HW CommandQueue (SysmemWriter) per device
- PR: #4077
- #4003: convert ttl.tensor.Shape to tuple when using it in torch functions
- PR: #4426
- #4211: Fix HP targeting issues in main from cq-per-device changes
- PR: #4447
v0.39.0
📦 Uncategorized
- #0: Add extra sentence about use cases in somewhat vague terms
- PR: #3975
- #3824: cache weight tensors for mistral
- PR: #3973
- Npetrovic/power fp sweep
- PR: #3959
- #3918: Fix falcon7b perf profiling & add support to load weights from HF when weka is not mounted
- PR: #3863
- Rename KernelID -> KernelHandle and CircularBufferID -> CBHandle
- PR: #3939
- Aliu/erisc cleanup
- PR: #3989
- #3003: ttnn program logging
- PR: #3987
- Watcher output/doc tweaks
- PR: #3998
- #4014: added support for uint16 datatype
- PR: #4015
- #4000: Add links to demo folders in note in first 5 things
- PR: #4012
- #3751: Fix sfpu load/store of ints
- PR: #4016
- enable watcher for stress test actions
- PR: #4021
- #3058: Give first pass at flattening build by getting rid of tt-metal intermediate libs
- PR: #4011
- Revert "#3058: Give first pass at flattening build by getting rid of …
- PR: #4042
- #3219: Added host functions which tilize and untilize bfloat16 vectors
- PR: #4038
- stress test machine config update
- PR: #4025
- #0: update to use concat on device
- PR: #4010
- #3895: ttnn functional optimized Bert
- PR: #4020
- #4014: Fix bug with packing uint16 datatype
- PR: #4050
- #3824: move mistral embedding weights to weka
- PR: #4028
- #3978: Fix readme to instruct running pytest without warnings
- PR: #3984
- Dma/3467 dprint cleanup
- PR: #4018
- #0: identity operator for comparison of SFPU ops
- PR: #4019
- #3058: Add tracy back into build and test with ENABLE_TRACY=1
- PR: #4047
- #3979: Add support for ResNet for weka unmounted machines to download ImageNet
- PR: #4066
- #3990: Remove DPRINT SETW sticky bit
- PR: #4081
- #4041: Add moreh_layernorm op
- PR: #4045
- #4044: Add moreh_softmax, moreh_softmin ops
- PR: #4060
- #3103: profile the SFPU operators
- PR: #4075
- #0: function typo fix
- PR: #4100
- #3211: bug in WH B0 - sum along dim3
- PR: #4099
- Implementation for Bert Sharded Batch 12
- PR: #4093
- #4069: Avoid reading out of bounds in the hugepage
- PR: #4098
- #4014: Add testing for uint16 and uint32 on device
- PR: #4094
- #0: Disable TestPrintRaiseWait gtest until a fix for nondet issue is in
- PR: #4123
- Move hugepages section and refer to public syseng instructions for accelerator-level dependencies
- PR: #4124
- #4055: non-deterministic test_pow_fractional PCC error with watcher enabled
- PR: #4129
- #0: update test_sfpu and profiling conflict
- PR: #4128
- #4043: Add discord link to docs support page + README
- PR: #4134
- Noc on erisc
- PR: #4046
- #3894: backward ops for tt-metal
- PR: #4054
- #3972: Update tracy and device-side profiler docs
- PR: #4138
- #4085: update seed value and re-verify the reported bug
- PR: #4139
- #2860: Init one UMD per MMIO device ID and the remote devices it controls
- PR: #4080
- #4074: Add opened, reopened, synchronize pull_request triggers (default) for static checks pipeline
- PR: #4152
- #0: Ignore /device, not device/ in .gitignore
- PR: #4153
- #4074: Add wording to CONTRIBUTING.md to be open to future forks + to discourage clogging up pipelines with too many PRs
- PR: #4155
- #4053: Upgrade driver from 1.23 to 1.26 in release assets from syseng
- PR: #4133
- #4065: Update pinned python3.8-venv to 20.04.9 because 20.04.8 is gone
- PR: #4135
- #4096: Fix issue with DPRINT server closing too early for some WAITs
- PR: #4130
- #4053: Add chmod ugo+x step in ansible scripts for copying over script assets
- PR: #4167
- #4109: ttnn examples.rst needs update
- PR: #4149
- #4158: support full repeat interleave developed for Mistral
- PR: #4113
- #4076: Add instructions for execution for programming_examples and fix one typo
- PR: #4168
- #0: (MINOR) Bump minor to v0.39.0
- PR: #4175
- #4053: Get rid of FW labels for silicon runner targets
- PR: #4169
- #3752: update ttnn tutorials and make them more descriptive
- PR: #4178
- #3994: Add bfloat16 dtype to sweep tests
- PR: #4090
- #0: update ownership for SFPU ops profiler, and Backward ops code
- PR: #4179
- #3420: move init erisc info to clear l1 call
- PR: #4166
- #3918: Add falcon caching support
- PR: #4185
- #4125: Refactor tests for backward ops
- PR: #4180
- Perf bloom
- PR: #4095
- #4121: Unset TT_METAL_SLOW_DISPATCH_MODE when empty string in yaml. R…
- PR: #4182
- #4079: Remove dprints from op kernels
- PR: #4191
- #4176: uplift umd to include create-eth-map fixes
- PR: #4195
- #4017: Replace static device APIs to query num available devices and num available pcie devices with standalone host APIs
- PR: #4190
- Fixup some error messages
- PR: #4209
- Rework build system
- PR: #4192
- #4228: Revert umd change to see if seg faults go away
- PR: #4229
- #4003: use if-else instead of try-except in ttnn.reshape and ttnn.permute
- PR: #4235
- #4003: updated ttnn.model_preprocessing to keep the structure of the model weights
- PR: #4196
- #0: Changing name for major places from Metal to Metalium
- PR: #4239
- #4186: Move all assets except for setup_hugepages.py to internal workflows
- PR: #4189
- #4003: run test_performance_of_bloom_for_question_answering using L1 Config and assuming fused softmax
- PR: #4238
- #3003: updated ttnn tests
- PR: #4242
v0.38.0
📦 Uncategorized
- #3820: Trunc fallback op
- PR: #3822
- #3703: Support power with non integer exponent: tt_lib.tensor.power_fp
- PR: #3821
- #308: Add a new test for coverage of previous issue with dprinting float consts from ncrisc
- PR: #3818
- #0: Update UMD submodule and add cluster wrapper for get_pcie_base_addr_from_device
- PR: #3688
- ttnn - added Bert
- PR: #3660
- Remove asserts and enable lto for release builds
- PR: #3806
- #2220: Use new UMD apis to get PCIe address ranges
- PR: #3836
- #3814: Use UMD fast write path to update the CQ write pointer, clean up the names of the write/read core APIs so they do not reference DRAM
- PR: #3833
- #0: Fix the repeat interleave doc
- PR: #3817
- #3003: use log_debug instead of log_info for logging operations
- PR: #3845
- Revert "#2220: Use new UMD apis to get PCIe address ranges"
- PR: #3855
- Update get_started.rst
- PR: #3861
- #0: Remove kkwong from CODEOWNERS
- PR: #3864
- #0: Fix scatter op
- PR: #3802
- #3829: Add new void* enqueue apis
- PR: #3860
- #2516: Remove datacopy into uint32_t vector now that we have void* apis
- PR: #3866
- #3640: eltwise binary op perf optimization
- PR: #3871
- #0: Fix microbenchmark csv artifact path
- PR: #3837
- #3568: Move weights dtype from bfloat16 to bfp8 in mistral model
- PR: #3775
- Fix SPDX headers to be machine readable
- PR: #3865
- #3804: Split device perf job into separate workflow from E2E perf
- PR: #3879
- #0: Update untilizewithunpad to support some cases of unpadding width in width sharding
- PR: #3878
- #2498: Upload syseng assets as part of release
- PR: #3876
- #0: (MINOR) Update to v0.38.0
- PR: #3883
- #2498: Revert "#2498: REVERT ME - test out release pipeline without r…
- PR: #3884
- Update llama-2 version
- PR: #3840
- #3566: support mistral model for generic batch size
- PR: #3848
- #3718: Link multicasts that use the same path to avoid multiple path reservations in a row
- PR: #3842
- remove UpdateRuntimeArg
- PR: #3877
- #3704: Increase size of trisc1 code hole for now
- PR: #3858
- Doc update for EnqueueReadBuffer
- PR: #3912
- Env variable cleanup
- PR: #3906
- Documenting Compute Kernels API Sprint
- PR: #3653
- #3647: Add fix for test for polyval coeffs generation
- PR: #3923
- #0: mistral code refactor and reuse variables
- PR: #3916
- Codeowners update
- PR: #3907
- #3914: Apply scatter for mistral model
- PR: #3922
- Rewrote ttnn_optimized_multi_head_attention using only ttnn operations
- PR: #3911
- Update models' landing page
- PR: #3940
- #3904: First docs changes for Project Grayskull
- PR: #3919
- Adding compute kernel api docs for untilize, tilize, unpack, tile_move_copy and reg_api
- PR: #3941
- document compute_kernel_api/matmul.h, compute_kernel_api/pack.h, and compute_kernel_api/bcast.h
- PR: #3937
- #3887: repeat operator implementation
- PR: #3920
- restrict my ownership to host API docs only
- PR: #3944
- #0: update profiling for unary ops
- PR: #3956
- #2220: Redo use new UMD apis to get PCIe address ranges
- PR: #3925
- Merge latest resnet optimizations
- PR: #3935
- Add support for eth kernels full stack
- PR: #3773
- #0: Update docs on device side profiler
- PR: #3958
- #3913: Update mem config for the mistral modules
- PR: #3921
- #3003: updated links to steps 3 and 4 of getting started
- PR: #3964
- #3830: Fix CB failures in perf pipelines
- PR: #3938
- #0: enable test for wormhole, use eps from device
- PR: #3963
- #3003: Adding ttnn_functional_bloom
- PR: #3872
- #3926: refactored run_device_operation to commonize the logic of runn…
- PR: #3966
- #0: add --tile-factor, --use-L1, --use-DRAM, or --help options
- PR: #3967
- Moreh Matmul Op
- PR: #3851
v0.37.0
Metal
API Changes
- Top-level API to create a Program: `Program CreateProgram();`
- `GetRuntimeArgs` now returns a reference to the underlying runtime args to allow for in-place updates, which results in noticeably better performance for host-bound workloads (see the sketch after this list): `std::vector<uint32_t>& GetRuntimeArgs(const Program &program, KernelID kernel_id, const CoreCoord &logical_core);`
- Two other variants for updating runtime arguments that result in better host-side performance in certain situations: `void UpdateRuntimeArg(const Program &program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, size_t offset, uint32_t value);` and `void SetRuntimeArgs(const Program &program, KernelID kernel, const std::vector<CoreCoord> &core_spec, const std::vector<std::vector<uint32_t>> &runtime_args);` (NOTE: `UpdateRuntimeArg` is being removed by the next release, as its use has been superseded by the other functions)
- `GetCircularBufferConfig` now returns a const reference: `const CircularBufferConfig &GetCircularBufferConfig(Program &program, CircularBufferID cb_handle);`
- Updating circular buffer config parameters is done through 3 separate functions: `void UpdateCircularBufferTotalSize(Program &program, CircularBufferID cb_handle, uint32_t total_size);`, `void UpdateCircularBufferPageSize(Program &program, CircularBufferID cb_handle, uint8_t buffer_index, uint32_t page_size);`, and `void UpdateDynamicCircularBufferAddress(Program &program, CircularBufferID cb_handle, const Buffer &buffer);`
- Moved slow/host dispatch APIs to the detail namespace: `void LaunchProgram(Device *device, Program &program);`, `void ReadFromBuffer(const Buffer &buffer, std::vector<uint32_t> &host_buffer);`, `void WriteToBuffer(const Buffer &buffer, const std::vector<uint32_t> &host_buffer);`
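A minimal sketch of the in-place runtime-arg update and the relocated slow-dispatch launch (the kernel path and argument values are hypothetical):

```cpp
#include <vector>

#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

void relaunch_with_patched_args(Device *device) {
    Program program = CreateProgram(); // new top-level factory

    // Hypothetical data movement kernel on core (0, 0).
    KernelID kernel = CreateKernel(
        program, "kernels/dataflow/reader.cpp", CoreCoord{0, 0}, DataMovementConfig{});
    SetRuntimeArgs(program, kernel, CoreCoord{0, 0}, {/*src_addr=*/0x1000, /*n_tiles=*/8});

    detail::LaunchProgram(device, program); // slow dispatch now lives in detail::

    // GetRuntimeArgs now returns a mutable reference, so args can be patched
    // in place between launches instead of rebuilding the whole vector.
    std::vector<uint32_t> &args = GetRuntimeArgs(program, kernel, CoreCoord{0, 0});
    args[0] = 0x2000; // point the kernel at a new source address

    detail::LaunchProgram(device, program);
}
```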
Tools - Profiler
- Updated the path for all profiler artifacts to be under the generated/profiler folder
ttNN
Infrastructure
- Introduced `ttnn.embedding` to facilitate word embeddings
- Added `preprocess_parameters` for generic conversion of torch parameters with caching
- Added `ttnn.experimental.gelu`
- Added `ttnn.experimental.layer_norm`
- Updated program hash to be `std::size_t` and significantly sped up its computation
Operations
- Splitting a tensor in two now supports [W, Z, Y, X] tensors split along Y, in addition to the existing X.
- The trunc function has fallback support equivalent to torch.trunc
- Support for the power function with a non-integral exponent: `tt_lib.tensor.power_fp()`
- Support for the reshape operator on host for `ROW_MAJOR` layout
Models
Notes not available.
v0.36.1
Metal
Wormhole Bringup
- Added some APIs to query device ethernet connectivity.
- Added first phase of ethernet data movement support, basic unit tests passing on N300.
API Changes
Notes not available.
Tools - Profiler
- Device-only and host-only profiling options for the profile_this.py script
- Examples for fast dispatch device program profiling
Tools - Watcher
- Added kernel names/paths to watcher log file
Extra features
Notes not available.
Eager/ttNN
Infrastructure
- Added initial implementation of TTNN APIs
- Added functions to interface with torch: from_torch, to_torch
- Added functions to move tensor to/from device: to_device, from_device
- Added functions to change the layout of the tensor: to_layout
- Added matmul, add, sub, mul, reshape, permute and softmax operations
- Implemented Multi-Head-Attention using TTNN APIs
- Added 3 tutorials to showcase TTNN
- Updated Documentation to describe TTNN and its APIs
Operations
The following on-device operators were added to the `tt_lib.tensor` module:
- repeat interleave
- triu
- tril
- rmsnorm
- groupnorm
- silu (updated to be a first-class unary operator)
Models
- For BERT demo, added loading of cached pre-processed weights (stored as TT tensors) to avoid conversion from Torch to TT tensors.
- Added demo for ResNet that executes on TT hardware. Demo takes images from ImageNet and processes them in batches of 8.
v0.35.0
Metal
Wormhole Bringup
- Extended gtests to run on all available devices in Wormhole systems.
- Single device tests passing on remote chips.
API Changes
- These 2 functions:
  `uint32_t CreateSemaphore(Program &program, const CoreRange &core_range, uint32_t initial_value)`
  `uint32_t CreateSemaphore(Program &program, const CoreRangeSet &core_range_set, uint32_t initial_value)`
  have been replaced by:
  `uint32_t CreateSemaphore(Program &program, const std::variant<CoreRange, CoreRangeSet> &core_spec, uint32_t initial_value)`
- These 3 functions:
  `void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreCoord &logical_core, const std::vector<uint32_t> &runtime_args)`
  `void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreRange &core_range, const std::vector<uint32_t> &runtime_args)`
  `void SetRuntimeArgs(const Program &program, KernelID kernel, const CoreRangeSet &core_range_set, const std::vector<uint32_t> &runtime_args)`
  have been replaced by:
  `void SetRuntimeArgs(const Program &program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::vector<uint32_t> &runtime_args)`
- These 2 functions:
  `KernelID CreateDataMovementKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::optional<DataMovementConfig> &config = {})`
  `KernelID CreateComputeKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::optional<ComputeConfig> &config = {})`
  have been replaced by:
  `KernelID CreateKernel(Program &program, const std::string &file_name, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::variant<DataMovementConfig, ComputeConfig> &config)`
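A minimal sketch of the consolidated, `std::variant`-based calls (kernel paths and argument values are hypothetical):

```cpp
#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

void build_program() {
    Program program; // programs are constructed directly in this release

    // One CreateKernel covers both data movement and compute kernels; the core
    // spec may be a single CoreCoord, a CoreRange, or a CoreRangeSet.
    KernelID reader = CreateKernel(
        program, "kernels/dataflow/reader.cpp", CoreCoord{0, 0}, DataMovementConfig{});
    KernelID eltwise = CreateKernel(
        program, "kernels/compute/eltwise.cpp",
        CoreRange{CoreCoord{0, 0}, CoreCoord{1, 1}}, ComputeConfig{});
    (void)eltwise;

    // One CreateSemaphore accepts either a CoreRange or a CoreRangeSet.
    uint32_t sem = CreateSemaphore(
        program, CoreRange{CoreCoord{0, 0}, CoreCoord{1, 1}}, /*initial_value=*/0);

    // One SetRuntimeArgs accepts CoreCoord, CoreRange, or CoreRangeSet.
    SetRuntimeArgs(program, reader, CoreCoord{0, 0}, {/*src_addr=*/0x1000, sem});
}
```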
Tools - Profiler
- Improved the `profile_this.py` log management strategy to avoid conservative log folder checks from profiling
Extra features
- Runtime Compute Args: Arguments can be sent to Compute Kernels at runtime in the same way as Data Movement Kernels. The kernel uses the same `get_arg_val<type>(<index>)` to retrieve them, and the host uses the same `tt_metal::SetRuntimeArgs(Program program, KernelID kernel, const std::variant<CoreCoord, CoreRange, CoreRangeSet> &core_spec, const std::vector<uint32_t> &runtime_args)` that it has used to communicate with Data Movement Kernels. See the device-side sketch below.
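A minimal device-side sketch of a compute kernel reading those args (the argument layout is hypothetical):

```cpp
// Compute kernel body (device code) -- a sketch, not a complete kernel.
#include "compute_kernel_api/common.h"

namespace NAMESPACE {
void MAIN {
    // Runtime args are read exactly as in a data movement kernel:
    uint32_t n_tiles  = get_arg_val<uint32_t>(0); // hypothetical arg 0
    uint32_t per_core = get_arg_val<uint32_t>(1); // hypothetical arg 1
    // ... use n_tiles / per_core to drive the compute loop ...
    (void)n_tiles;
    (void)per_core;
}
}
```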
Eager (Ops)
There have been no notable changes to communicate in this release.
Models
- Moved code that implements and tests models from tests/models to top level models folder. In the models folder, models are separated into demos (working models with end2end demo code) and experimental (models that are under development).
- Added implementation of Falcon7B for GS and PyTorch demos for nanoGPT and T5
- Added BERT Large end2end demo on GS (set up for question answering)
v0.34.0
Metal
API Changes
- `CreateDevice`: device_id type has changed from `int` to `chip_id_t`
- `CreateCircularBuffer`: Three previous variants, which differed only by their CoreCoord, CoreRange, and CoreRangeSet function parameter, have been compressed into one user-facing `CreateCircularBuffer` function parameterized with `std::variant<CoreCoord, CoreRange, CoreRangeSet>`. It now accepts a `CircularBufferConfig` which specifies size, data format, and page size per buffer index. Return type updated from a `CircularBuffer` object to `CircularBufferID` (`uintptr_t`). See the sketch after this list.
- `GetCircularBufferConfig`: New function to retrieve a reference to the configuration of a `CircularBuffer`. This allows the `CircularBuffer` config to be updated; updates take effect on the next call to `LaunchProgram`.
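A minimal sketch of the consolidated call (the exact `CircularBufferConfig` construction below is an assumption based on the description above):

```cpp
#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

void add_input_cb(Program &program) {
    constexpr uint8_t cb_index = 0;                // buffer index within the core's CB space
    constexpr uint32_t page_size = 2048;           // e.g. one bfloat16 tile per page
    constexpr uint32_t total_size = 2 * page_size; // double-buffered

    // Size, data format, and page size per buffer index travel in one config object.
    CircularBufferConfig cb_config =
        CircularBufferConfig(total_size, {{cb_index, tt::DataFormat::Float16_b}})
            .set_page_size(cb_index, page_size);

    // One call covers CoreCoord, CoreRange, or CoreRangeSet, and now returns a
    // CircularBufferID (uintptr_t) instead of a CircularBuffer object.
    CircularBufferID cb = CreateCircularBuffer(program, CoreCoord{0, 0}, cb_config);
    (void)cb;
}
```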
Tools - Profiler
Tracy Python Support: Profile Python-side code with Tracy. As with cProfile, the standard Python profiler module, all Python function calls are picked up by Tracy. Additionally, TT's bound C++ calls are picked up automatically. The entire Python script, or just desired parts of it, can be profiled at either function or line level.
Extra features
Runtime Compute Args: Arguments can be sent to Compute Kernels at runtime. The kernel uses the same `get_arg_val<type>(<index>)` API to retrieve them, and the host uses the same `tt_metal::SetRuntimeArgs(<program>, <compute_kernel_id>, <Core, CoreRange>, <vector of u32 runtime args>)` call as for Data Movement Kernels.
Eager (Ops)
Notes not yet available.
Models
- metal_BERT_large_15: model implementation updated to use tt-DNN operation embedding that executes on GS device. Previously this model used PyTorch embedding operation executing on CPU.
- Falcon7b: added end to end demo that is running on GS device. The demo takes a text prompt and returns text generated by the model to complete the prompt. The demo works by pre-filling the cache with decoded input prompts and then running decode for all users in parallel.
v0.33.0
Metal
Wormhole
- Basic bringup and tests running on WH B0
- Harvesting functionality working on WH B0
- Basic fast dispatch functionality working on WH B0
Host API changes
- `void StartDebugPrintServer(Device *device, const std::vector<CoreCoord> &cores)` no longer callable
- `Device *CreateDevice` no longer requires the arch parameter
- New wrapper around the Buffer API so that users don't need to look inside buffer.hpp to figure out how to construct a buffer object (see the sketch after this list): `Buffer CreateBuffer(Device *device, std::uint64_t size, std::uint64_t page_size, const BufferType buffer_type)`
- `LaunchKernels` renamed to `LaunchProgram(Device *device, Program &program)` to match `EnqueueProgram`; removed the obsolete `stagger_start` parameter
- `void WriteRuntimeArgsToDevice(Device *device, const Program &program)` moved to the detail namespace
- `bool CompileProgram(Device *device, Program &program)` moved to the detail namespace
- `bool ConfigureDeviceWithProgram(Device *device, const Program &program)` moved to the detail namespace
- `bool InitializeDevice(Device *device)` removed
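A minimal sketch of the new buffer wrapper together with the renamed launch path (sizes and buffer type are illustrative):

```cpp
#include <vector>

#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

void roundtrip(Device *device, Program &program) {
    constexpr std::uint64_t page_size = 2048;      // e.g. one bfloat16 tile per page
    constexpr std::uint64_t size = 64 * page_size; // 64 pages total

    // One call replaces hand-assembling a Buffer from buffer.hpp internals.
    Buffer buffer = CreateBuffer(device, size, page_size, BufferType::DRAM);

    std::vector<uint32_t> host_data(size / sizeof(uint32_t), 0);
    WriteToBuffer(buffer, host_data);

    LaunchProgram(device, program); // formerly LaunchKernels; stagger_start is gone

    ReadFromBuffer(buffer, host_data);
}
```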
Profiler
- Bug fix on device side to support new FW init process in fast and slow dispatch.
- RISC FW cleanup to avoid unnecessary function wrappers.
Watcher
- Added more waypoints to watcher and added access methods to the SoC descriptor for, e.g., harvesting
- Added some NoC sanitization and checks
- Some bug fixes: don't read registers during kernel runs, don't include WH headers on GS, allow 0-length transactions
Feature: Runtime Compute Args
- Arguments can be sent to Compute Kernels at runtime in the same way as DataMovement Kernels.
- The kernel uses the same `get_arg_val<type>(<index>)` API to retrieve them.
- The host uses the same `tt_metal::SetRuntimeArgs(<program>, <compute_kernel_id>, <Core, CoreRange>, <vector of u32 runtime args>);` call that it uses to communicate with Data Movement Kernels.
Eager (Ops)
- Added support for overriding runtime args and circular buffers
- Added support for saving and loading tensors
- Added support for uint32 tensor
Models
- 5+% increase in BERT Large performance on bare metal machines.
- 15+% increase in LLaMA 7B performance on bare metal machines.