Add fused minimal matmul addcmul operation #36502

Open

nmauriceTT wants to merge 31 commits into main from nmaurice/35915-fused-minimal-matmul-addcmul

Conversation


@nmauriceTT commented Jan 26, 2026

Ticket

#35915

Problem description

For Wan2.2, we want to fuse minimal_matmul and addcmul, like the following pattern:

```python
spatial_ff_1BND = self.ff(spatial_normed_1BND, compute_kernel_config=self.ff_compute_kernel_config)
spatial_1BND = ttnn.addcmul(spatial_1BND, spatial_ff_1BND, c_gate_msa_1B1D)
```

What's changed

This adds a new dit_minimal_matmul_addcmul_fused operation that performs minimal_matmul followed by addcmul in a single device operation.
This operation is equivalent to:

```python
matmul_output = minimal_matmul(tensor_input0, tensor_input1, tensor_bias)
output = addcmul(ternary_a, matmul_output, ternary_b, scalar)  # ternary_a + scalar * matmul_output * ternary_b
```
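
For reference, the fused computation can be expressed as a host-side golden model (a minimal sketch in torch; names follow the pseudocode above and are not the actual ttnn API):

```python
import torch

def reference_fused(tensor_input0, tensor_input1, tensor_bias, ternary_a, ternary_b, scalar):
    # minimal_matmul part: matmul plus optional bias
    matmul_output = tensor_input0 @ tensor_input1
    if tensor_bias is not None:
        matmul_output = matmul_output + tensor_bias
    # addcmul part: ternary_a + scalar * matmul_output * ternary_b
    # (torch broadcasting handles row-broadcast gates such as c_gate_msa_1B1D)
    return ternary_a + scalar * matmul_output * ternary_b
```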

To make minimal_matmul more extensible, I've also merged its device operation with that of minimal_matmul_split, reducing code duplication.

The kernels of dit_minimal_matmul_addcmul_fused were implemented by modifying the minimal_matmul kernels, and the operation itself is defined by calling into minimal_matmul (whose device operation has been extended with new parameters).

Note: Ideally, we'd like to use the `addcmul_tile` LLK, but it seems that `unary_bcast<BroadcastType::ROW>` does not work with `fp32_acc_to_dst`. Instead, the row-broadcast of `ternary_a` is done through `add_bcast_tile` (FPU). The downside is that the output may be less accurate than with `addcmul`.
If accuracy turns out to be a problem, we can switch to another workaround (e.g. doing the broadcasting in the dataflow kernels).
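
If accuracy does become a concern, one way to quantify the error of the FPU broadcast path is a PCC check against an fp32 golden. A sketch of such a check (torch; tensor names are illustrative, not from the PR):

```python
import torch

def golden_fp32(tensor_input0, tensor_input1, tensor_bias, ternary_a, ternary_b, scalar):
    # Same math as the fused op, computed entirely in fp32 on host.
    mm = tensor_input0.float() @ tensor_input1.float() + tensor_bias.float()
    return ternary_a.float() + scalar * mm * ternary_b.float()

def pcc(a, b):
    # Pearson correlation coefficient between two flattened tensors.
    stacked = torch.stack([a.flatten().float(), b.flatten().float()])
    return torch.corrcoef(stacked)[0, 1].item()

# device_out = ttnn.to_torch(fused_output)  # output of the device run
# assert pcc(device_out, golden_fp32(...)) > 0.999
```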

Performance

wan2.2_14b-720p-glx: Single GLX
(M, K, N) = (9472, 1280, 5120)

| Name     | Duration [ms] | Std [ms] | Gains [ms] | Speed-up  |
| -------- | ------------- | -------- | ---------- | --------- |
| Separate | 2.377         | 0.016    | N/A        | N/A       |
| Fused    | 2.080         | 0.005    | -0.30      | 8% faster |

wan2.2_14b-720p: Quad GLX
(M, K, N) = (2368, 1280, 5120)

| Name     | Duration [ms] | Std [ms] | Gains [ms] | Speed-up  |
| -------- | ------------- | -------- | ---------- | --------- |
| Separate | 0.835         | 0.006    | N/A        | N/A       |
| Fused    | 0.776         | 0.003    | -0.059     | 7% faster |

As a reference, here's the execution time of addcmul.

| Config     | Name    | Duration [ms] |
| ---------- | ------- | ------------- |
| Single GLX | addcmul | 0.463         |
| Quad GLX   | addcmul | 0.124         |

Fusing the operations saves us ~50-65% of the execution time of addcmul.
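
As a quick arithmetic check on those figures (plain Python; values copied from the tables above):

```python
# Time saved by fusing, per configuration (ms)
single_gain = 2.377 - 2.080   # ~0.297 ms on Single GLX
quad_gain = 0.835 - 0.776     # ~0.059 ms on Quad GLX

# Fraction of the standalone addcmul time eliminated by fusion
print(single_gain / 0.463)    # ~0.64 -> ~65% of addcmul on Single GLX
print(quad_gain / 0.124)      # ~0.48 -> ~50% of addcmul on Quad GLX
```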

Checklist

Model tests

If your changes cover model-related code, you should run tests corresponding to affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers models-mandatory and models-extended presets.
The former includes a minimal set of tests, to be run always. The latter extends that with additional ones - use your best judgement in deciding which is the most appropriate for your PR.

@nmauriceTT self-assigned this Jan 26, 2026
@nmauriceTT linked an issue Jan 26, 2026 that may be closed by this pull request
@nmauriceTT force-pushed the nmaurice/35915-fused-minimal-matmul-addcmul branch from b5a7dad to c21de53 on January 29, 2026
@nmauriceTT requested a review from Copilot on January 29, 2026

Copilot AI left a comment


Pull request overview

This PR introduces a fused dit_minimal_matmul_addcmul_fused operation that combines minimal_matmul and addcmul for improved performance in DiT transformer blocks (specifically targeting Wan2.2). The PR also refactors the minimal_matmul device operations by unifying the previously separate minimal_matmul and minimal_matmul_split implementations into a single device operation that supports both single and chunked outputs.

Changes:

  • Added new fused operation dit_minimal_matmul_addcmul_fused that computes output = residual + scalar * matmul(input, weight) * gate
  • Unified minimal_matmul and minimal_matmul_split device operations, changing return types from Tensor to std::vector<Tensor>
  • Extended kernels to support fused ternary (addcmul) operations with new circular buffers and runtime parameters

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.

| File | Description |
| ---- | ----------- |
| dit_minimal_matmul_addcmul_fused/* | New operation implementation with nanobind bindings and comprehensive documentation |
| minimal_matmul_device_operation.* | Unified device operation supporting both single and split outputs, added ternary fusion parameters |
| minimal_matmul_program_factory.* | Added circular buffers for ternary inputs, extended runtime argument handling |
| minimal_matmul.cpp, minimal_matmul_split.cpp | Updated to use unified device operation returning vector of tensors |
| kernels/compute.cpp | Added add_bias_and_addcmul_block function implementing fused bias and addcmul logic |
| kernels/dm_in*.cpp | Extended dataflow kernels to read and process ternary input tensors |
| kernels/matmul_dataflow_common.hpp | Added read_ternary_blocks_sync helper for reading ternary tensors |
| minimal_matmul_split_* (deleted) | Removed duplicate device operation files now unified with minimal_matmul |
| CMakeLists.txt | Updated build configuration to remove split-specific files and add the new fused operation |
| test_dit_minimal_matmul_addcmul_fused.py | Comprehensive tests covering basic functionality, Wan2.2 shapes, and different scalar values |

@nmauriceTT force-pushed the nmaurice/35915-fused-minimal-matmul-addcmul branch from c21de53 to a9f6f42 on February 5, 2026
@nmauriceTT requested a review from Copilot on February 5, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 5 comments.

@nmauriceTT force-pushed the nmaurice/35915-fused-minimal-matmul-addcmul branch 2 times, most recently from fcfd4fa to 737d754 on February 9, 2026
@nmauriceTT marked this pull request as ready for review on February 9, 2026
@nmauriceTT

/codeowners ping


tenstorrent-github-bot commented Feb 9, 2026

CodeOwners Group Analysis

This PR requires approval from one member of each of the affected codeowner groups.

Summary: 1 pending group, 7 approved groups

Note: At least one approval from each group is sufficient.

@tenstorrent-github-bot

Hi Borys Bradel (@bbradelTT), Colman Glagovich (@cglagovichTT), Edwin Lee (@edwinleeTT), Izajasz Wrosz (@iwroszTT), Jonathan Su (@jonathansuTT), NSexton (@nsextonTT), this PR "Add fused minimal matmul addcmul operation" by Nathan Maurice (@nmauriceTT) needs your approval/review to merge.


@github-actions bot left a comment


⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

@nmauriceTT force-pushed the nmaurice/35915-fused-minimal-matmul-addcmul branch from b6040c5 to dd53245 on February 16, 2026


Development

Successfully merging this pull request may close this issue: Fused addcmul in epilog of minimal matmul
