Add fused minimal matmul addcmul operation #36502

Open

nmauriceTT wants to merge 31 commits into main from nmaurice/35915-fused-minimal-matmul-addcmul

Conversation


@nmauriceTT commented Jan 26, 2026

Ticket

#35915

Problem description

For Wan2.2, we want to fuse minimal_matmul and addcmul, like the following pattern:

```python
spatial_ff_1BND = self.ff(spatial_normed_1BND, compute_kernel_config=self.ff_compute_kernel_config)
spatial_1BND = ttnn.addcmul(spatial_1BND, spatial_ff_1BND, c_gate_msa_1B1D)
```

What's changed

This adds a new dit_minimal_matmul_addcmul_fused operation that performs minimal_matmul followed by addcmul in a single device operation.
This operation is equivalent to:

```python
matmul_output = minimal_matmul(tensor_input0, tensor_input1, tensor_bias)
output = addcmul(ternary_a, matmul_output, ternary_b, scalar)  # ternary_a + scalar * matmul_output * ternary_b
```
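
For reference, the fused computation can be expressed as a host-side golden model (a minimal sketch in torch; names follow the pseudocode above and are not the actual ttnn API):

```python
import torch

def reference_fused(tensor_input0, tensor_input1, tensor_bias, ternary_a, ternary_b, scalar):
    # minimal_matmul part: matmul plus optional bias
    matmul_output = tensor_input0 @ tensor_input1
    if tensor_bias is not None:
        matmul_output = matmul_output + tensor_bias
    # addcmul part: ternary_a + scalar * matmul_output * ternary_b
    # (torch broadcasting handles row-broadcast gates such as c_gate_msa_1B1D)
    return ternary_a + scalar * matmul_output * ternary_b
```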

To make minimal_matmul more extensible, I've also merged its device operation with that of minimal_matmul_split, reducing code duplication.

The kernels of dit_minimal_matmul_addcmul_fused were implemented by modifying the minimal_matmul kernels, and the operation itself is defined by calling into minimal_matmul (whose device operation has been extended with new parameters).

Note: Ideally, we'd like to use the `addcmul_tile` LLK, but it seems that `unary_bcast<BroadcastType::ROW>` does not work with `fp32_acc_to_dst`. Instead, the row-broadcast of `ternary_a` is done through `add_bcast_tile` (FPU). The downside is that the output may be less accurate than with `addcmul`.
If accuracy turns out to be a problem, we can switch to another workaround (e.g. doing the broadcasting in the dataflow kernels).
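
If accuracy does become a concern, one way to quantify the error of the FPU broadcast path is a PCC check against an fp32 golden. A sketch of such a check (torch; tensor names are illustrative, not from the PR):

```python
import torch

def golden_fp32(tensor_input0, tensor_input1, tensor_bias, ternary_a, ternary_b, scalar):
    # Same math as the fused op, computed entirely in fp32 on host.
    mm = tensor_input0.float() @ tensor_input1.float() + tensor_bias.float()
    return ternary_a.float() + scalar * mm * ternary_b.float()

def pcc(a, b):
    # Pearson correlation coefficient between two flattened tensors.
    stacked = torch.stack([a.flatten().float(), b.flatten().float()])
    return torch.corrcoef(stacked)[0, 1].item()

# device_out = ttnn.to_torch(fused_output)  # output of the device run
# assert pcc(device_out, golden_fp32(...)) > 0.999
```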

Performance

wan2.2_14b-720p-glx: Single GLX
(M, K, N) = (9472, 1280, 5120)

| Name     | Duration [ms] | Std [ms] | Gains [ms] | Speed-up  |
| -------- | ------------- | -------- | ---------- | --------- |
| Separate | 2.377         | 0.016    | N/A        | N/A       |
| Fused    | 2.080         | 0.005    | -0.30      | 8% faster |

wan2.2_14b-720p: Quad GLX
(M, K, N) = (2368, 1280, 5120)

| Name     | Duration [ms] | Std [ms] | Gains [ms] | Speed-up  |
| -------- | ------------- | -------- | ---------- | --------- |
| Separate | 0.835         | 0.006    | N/A        | N/A       |
| Fused    | 0.776         | 0.003    | -0.059     | 7% faster |

As a reference, here's the execution time of addcmul.

| Config     | Name    | Duration [ms] |
| ---------- | ------- | ------------- |
| Single GLX | addcmul | 0.463         |
| Quad GLX   | addcmul | 0.124         |

Fusing the operations saves us ~50-65% of the execution time of addcmul.
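
As a quick arithmetic check on those figures (plain Python; values copied from the tables above):

```python
# Time saved by fusing, per configuration (ms)
single_gain = 2.377 - 2.080   # ~0.297 ms on Single GLX
quad_gain = 0.835 - 0.776     # ~0.059 ms on Quad GLX

# Fraction of the standalone addcmul time eliminated by fusion
print(single_gain / 0.463)    # ~0.64 -> ~65% of addcmul on Single GLX
print(quad_gain / 0.124)      # ~0.48 -> ~50% of addcmul on Quad GLX
```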

Checklist

Model tests

If your changes cover model-related code, you should run tests corresponding to affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers models-mandatory and models-extended presets.
The former includes a minimal set of tests, to be run always. The latter extends that with additional ones - use your best judgement in deciding which is the most appropriate for your PR.

@nmauriceTT self-assigned this Jan 26, 2026
@nmauriceTT linked an issue Jan 26, 2026 that may be closed by this pull request
@nmauriceTT force-pushed the nmaurice/35915-fused-minimal-matmul-addcmul branch from b5a7dad to c21de53 on January 29, 2026
@nmauriceTT requested a review from Copilot on January 29, 2026

Copilot AI left a comment


Pull request overview

This PR introduces a fused dit_minimal_matmul_addcmul_fused operation that combines minimal_matmul and addcmul for improved performance in DiT transformer blocks (specifically targeting Wan2.2). The PR also refactors the minimal_matmul device operations by unifying the previously separate minimal_matmul and minimal_matmul_split implementations into a single device operation that supports both single and chunked outputs.

Changes:

  • Added new fused operation dit_minimal_matmul_addcmul_fused that computes output = residual + scalar * matmul(input, weight) * gate
  • Unified minimal_matmul and minimal_matmul_split device operations, changing return types from Tensor to std::vector<Tensor>
  • Extended kernels to support fused ternary (addcmul) operations with new circular buffers and runtime parameters

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.

| File | Description |
| ---- | ----------- |
| dit_minimal_matmul_addcmul_fused/* | New operation implementation with nanobind bindings and comprehensive documentation |
| minimal_matmul_device_operation.* | Unified device operation supporting both single and split outputs, added ternary fusion parameters |
| minimal_matmul_program_factory.* | Added circular buffers for ternary inputs, extended runtime argument handling |
| minimal_matmul.cpp, minimal_matmul_split.cpp | Updated to use unified device operation returning vector of tensors |
| kernels/compute.cpp | Added add_bias_and_addcmul_block function implementing fused bias and addcmul logic |
| kernels/dm_in*.cpp | Extended dataflow kernels to read and process ternary input tensors |
| kernels/matmul_dataflow_common.hpp | Added read_ternary_blocks_sync helper for reading ternary tensors |
| minimal_matmul_split_* (deleted) | Removed duplicate device operation files now unified with minimal_matmul |
| CMakeLists.txt | Updated build configuration to remove split-specific files and add the new fused operation |
| test_dit_minimal_matmul_addcmul_fused.py | Comprehensive tests covering basic functionality, Wan2.2 shapes, and different scalar values |

@nmauriceTT force-pushed the nmaurice/35915-fused-minimal-matmul-addcmul branch from c21de53 to a9f6f42 on February 5, 2026
@nmauriceTT requested a review from Copilot on February 5, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 5 comments.

@nmauriceTT force-pushed the nmaurice/35915-fused-minimal-matmul-addcmul branch 2 times, most recently from fcfd4fa to 737d754 on February 9, 2026
@nmauriceTT marked this pull request as ready for review on February 9, 2026
@nmauriceTT

/codeowners ping


tenstorrent-github-bot commented Feb 9, 2026

CodeOwners Group Analysis

This PR requires approval from one member of each of the affected codeowner groups.

Summary: 1 pending group, 7 approved groups

Note: At least one approval from each group is sufficient.

@tenstorrent-github-bot

Hi Borys Bradel (@bbradelTT), Colman Glagovich (@cglagovichTT), Edwin Lee (@edwinleeTT), Izajasz Wrosz (@iwroszTT), Jonathan Su (@jonathansuTT), NSexton (@nsextonTT), this PR "Add fused minimal matmul addcmul operation" by Nathan Maurice (@nmauriceTT) needs your approval/review to merge.


@github-actions bot left a comment


⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

@nmauriceTT force-pushed the nmaurice/35915-fused-minimal-matmul-addcmul branch from b6040c5 to dd53245 on February 16, 2026


Development

Successfully merging this pull request may close this issue: Fused addcmul in epilog of minimal matmul
