
MoE: Selective reduce combine #37432

Open

amorrisonTT wants to merge 21 commits into main from amorrison/moe-selective-reduce-combine-mux-rebase

Conversation

@amorrisonTT
Contributor

@amorrisonTT amorrisonTT commented Feb 9, 2026

Ticket

#33832
#33274 (partial)

Problem description

Part of the MoE inference pipeline. Takes the dense expert contribution output from compute, sparsifies it, and sends it back to the originating devices.

What's changed

New optimized all-to-all (a2a) combine op that takes dense input. Uses pre-computed metadata, inputs sharded in L1, and a fabric mux over arbitrary worker cores and all links.

Average-case perf is ~108 us, but it scales poorly (linearly with the number of experts selected per row) in the worst cases. The original op (with one link, at least) was ~3k us.
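
For illustration, a minimal sketch of how the new op might be called from Python, based on the ttnn.experimental.selective_reduce_combine binding and the device-op tensor arguments (dense input, metadata, token maps, token counts) discussed below; the keyword names, defaults, and helper variables here are assumptions, not the final signature:

import ttnn

# Hypothetical call shape -- argument names mirror the device-op fields
# (dense_input_tensor, dense_metadata_tensor, dense_token_maps_tensor,
# dense_token_counts_tensor); the actual nanobind signature may differ.
sparse_output = ttnn.experimental.selective_reduce_combine(
    dense_input,         # dense expert contributions produced by compute
    dense_metadata,      # pre-computed routing metadata
    dense_token_maps,    # token -> originating-device mapping
    dense_token_counts,  # per-expert token counts
    num_links=num_links,           # fabric links to use (all links supported)
    axis=cluster_axis,             # optional cluster axis to combine along
    memory_config=out_mem_config,  # placement of the sparse output
)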

Checklist

  • All post-commit tests
  • Blackhole Post commit
  • New/Existing tests provide coverage for changes

  • Model tests

Contributor

@github-actions github-actions bot left a comment


⚠️ Clang-Tidy found issue(s) with the introduced code (1/2)

Contributor

@github-actions github-actions bot left a comment


⚠️ Clang-Tidy found issue(s) with the introduced code (2/2)

@amorrisonTT
Contributor Author

/codeowners ping

Contributor

Copilot AI left a comment


Pull request overview

Adds a new experimental TTNN CCL op to support the MoE inference pipeline step that sparsifies dense expert contributions and returns tokens to their originating devices, using a fabric mux-based combine path.

Changes:

  • Introduces ttnn.experimental.selective_reduce_combine (C++ op + device op + dataflow reader/writer kernels) and binds it via nanobind.
  • Extends shared CCL kernel utilities for mux teardown and adds a bidirectional multicast atomic-inc helper for 1D ring.
  • Adds Galaxy MoE nightly tests and adjusts Galaxy e2e pipeline configuration to run MoE tests separately.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 16 comments.

File Description
ttnn/cpp/ttnn/operations/experimental/experimental_nanobind.cpp Minor cleanup in experimental nanobind module registration.
ttnn/cpp/ttnn/operations/experimental/ccl/moe/selective_reduce_combine/selective_reduce_combine_nanobind.hpp Declares nanobind binding entrypoint for the new op.
ttnn/cpp/ttnn/operations/experimental/ccl/moe/selective_reduce_combine/selective_reduce_combine_nanobind.cpp Adds Python binding + docstring for selective_reduce_combine.
ttnn/cpp/ttnn/operations/experimental/ccl/moe/selective_reduce_combine/selective_reduce_combine.hpp Declares/registers the TTNN operation.
ttnn/cpp/ttnn/operations/experimental/ccl/moe/selective_reduce_combine/selective_reduce_combine.cpp Implements host-side invoke forwarding into the prim/device op.
ttnn/cpp/ttnn/operations/experimental/ccl/moe/selective_reduce_combine/device/selective_reduce_combine_device_operation.hpp Defines the device operation interface and prim API.
ttnn/cpp/ttnn/operations/experimental/ccl/moe/selective_reduce_combine/device/selective_reduce_combine_device_operation.cpp Implements validation + output spec/tensor creation + prim launch.
ttnn/cpp/ttnn/operations/experimental/ccl/moe/selective_reduce_combine/device/selective_reduce_combine_program_factory.cpp Program factory building CBs, mux workers, and setting runtime args.
ttnn/cpp/ttnn/operations/experimental/ccl/moe/selective_reduce_combine/device/kernels/dataflow/reader.cpp Kernel to read token counts/maps and compute per-core work splits.
ttnn/cpp/ttnn/operations/experimental/ccl/moe/selective_reduce_combine/device/kernels/dataflow/writer.cpp Kernel to send token segments locally or over fabric mux and teardown.
ttnn/cpp/ttnn/operations/experimental/ccl/ccl_experimental_nanobind.cpp Registers the new op under the experimental CCL nanobind module.
ttnn/cpp/ttnn/operations/experimental/ccl/CMakeLists.txt Adds new sources/kernels to the experimental CCL build target.
ttnn/CMakeLists.txt Adds new nanobind source to TTNN build.
ttnn/cpp/ttnn/operations/ccl/common/kernels/moe_utils.hpp Extends mux helper arg parsing/teardown; adds bidirectional atomic-inc helper.
ttnn/cpp/ttnn/operations/ccl/common/kernels/minimal_ccl_common.hpp Extends perform_payload_send template to optionally skip flush.
tt_metal/hw/inc/api/debug/dprint_pages.h Adds print_u32_pages helper for debugging.
tt_metal/fabric/hw/inc/linear/addrgen_api.h Adds a ShardedAddrGen using-declaration and updates a helper signature.
tests/pipeline_reorg/galaxy_e2e_tests.yaml Splits Galaxy CCL vs MoE tests; adds MoE-specific environment and timeout.
tests/nightly/tg/ccl/moe/test_selective_combine_6U.py Adds correctness + perf/trace tests for selective reduce combine on Galaxy.
tests/nightly/t3000/ccl/test_all_to_all_combine.py Exposes cluster-dimension helpers reused by the new MoE test.



Generally expect num_data_parallel_dim=num_token_parallel_dim=4
This can be though of as a logical grid by a physical grid is not required as long as the ordering is

Copilot AI Feb 9, 2026


Docstring typo: “though of” should be “thought of”.

Suggested change
This can be though of as a logical grid by a physical grid is not required as long as the ordering is
This can be thought of as a logical grid by a physical grid is not required as long as the ordering is


logger.info(f"Capturing Warmup iterations")
trace_id_warmup = ttnn.begin_trace_capture(mesh_device, cq_id=0)
tt_out = op_func(max(1, num_iters // 4))

Copilot AI Feb 9, 2026


This assignment to 'tt_out' is unnecessary as it is redefined before this value is used.

Suggested change
tt_out = op_func(max(1, num_iters // 4))
op_func(max(1, num_iters // 4))

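
For context, a sketch of the warmup/trace-capture flow this test appears to use, with the unused assignment dropped as suggested above. The ttnn trace calls (begin_trace_capture, end_trace_capture, execute_trace, synchronize_device) are the public API; op_func, mesh_device, and num_iters come from the snippet, and the overall structure is an assumption about the rest of the test:

# Warmup capture: run a fraction of the iterations and discard the result.
trace_id_warmup = ttnn.begin_trace_capture(mesh_device, cq_id=0)
op_func(max(1, num_iters // 4))
ttnn.end_trace_capture(mesh_device, trace_id_warmup, cq_id=0)

# Main capture: this output is the one actually checked/profiled.
trace_id = ttnn.begin_trace_capture(mesh_device, cq_id=0)
tt_out = op_func(num_iters)
ttnn.end_trace_capture(mesh_device, trace_id, cq_id=0)

# Replay the warmup trace first, then the measured trace.
ttnn.execute_trace(mesh_device, trace_id_warmup, cq_id=0, blocking=False)
ttnn.execute_trace(mesh_device, trace_id, cq_id=0, blocking=False)
ttnn.synchronize_device(mesh_device)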
@tenstorrent-github-bot

CodeOwners Group Analysis

This PR requires approval from one member of each of the following groups:

Summary: 6 pending groups, 0 approved groups

Group Information:




  • tt_metal/fabric/ (Group) - Members: Abhishek Agarwal, Allan Liu, Austin Ho, Ridvan Song, Sean Nijjar, Umair Bilal Cheema, Yu Gao | Pending approval

    📁 Files owned by this group (1 files)

  • tt_metal/hw/inc/ (Group) - Members: Almeet Bhullar, Arik Yaacob, Ata Tuzuner, John Bauman, Kevin Stevens, Nathan Sidwell, Rui Zhang, Vuk Vukomanovic | Pending approval

    📁 Files owned by this group (1 files)

Note: At least one approval from each group is sufficient.

@amorrisonTT amorrisonTT force-pushed the amorrison/moe-selective-reduce-combine-mux-rebase branch from fa4e393 to b80fa57 on February 12, 2026 at 18:55

template <typename ShardingInfoType>
uint32_t get_page_size(const experimental::ShardedAddrGen<ShardingInfoType>& d) {
uint32_t get_page_size(const _ttnn_operations_experimental_ShardedAddrGen<ShardingInfoType>& d) {
Member


I see a new using added, but it's a weird using.
What's going on here? :)

}
}

inline void print_u32_pages(uint32_t l1_addr, uint32_t elts_per_page, uint32_t npages, uint32_t start = 0) {
Member


@jbaumanTT , @akerteszTT can you look at this addition to hw/inc/api please?

Comment on lines +197 to +198
// DPRINT << "OPENING MUX CORE: " << (uint32_t)args.fabric_mux_x << ", " << (uint32_t)args.fabric_mux_y
// << "\n";
Member


remove?

Comment on lines +281 to +282
// DPRINT << "CLOSING MUX CORE: " << (uint32_t)args.fabric_mux_x << ", " << (uint32_t)args.fabric_mux_y
// << "\n";
Member


remove?

class SenderType = WorkerToFabricEdmSender>
FORCE_INLINE void fabric_multicast_bidirectional_atomic_inc_ring_1d(
std::array<SenderType, 4>& fabric_connections,
volatile PACKET_HEADER_TYPE* packet_header_pos,
Member


I am a little out of the loop: what is PACKET_HEADER_TYPE? Why is it all caps? Is it a macro?

namespace ttnn {
namespace operations::experimental::ccl::moe {

struct ExecuteSelectiveReduceCombine {
Member


please remove the struct with invoke and register_operation
see what we did here
#36303

Comment on lines +23 to +38
const uint32_t hidden_size;
const uint32_t batch_size;
const uint32_t seq_size;
const uint32_t select_experts_k;
const uint32_t experts;
const uint32_t num_links;

const std::optional<uint32_t> axis;
tt::tt_fabric::Topology topology;

const uint32_t num_token_parallel_cores;
const uint32_t num_data_parallel_cores;
const CoreRangeSet worker_core_range_set;
const CoreRangeSet mux_core_range_set;
const ttnn::MemoryConfig output_memory_config;
const std::optional<GlobalSemaphore> optional_cross_device_semaphore;
Member


These params should not really be const; if we want const-ness, the whole struct is passed as const.

Comment on lines +72 to +78
struct tensor_args_t {
const ttnn::Tensor dense_input_tensor;
const ttnn::Tensor dense_metadata_tensor;
const ttnn::Tensor dense_token_maps_tensor;
const ttnn::Tensor dense_token_counts_tensor;
const std::optional<ttnn::Tensor> optional_output_tensor;
};
Member


same note for const-ness

Comment on lines +84 to +115
struct UnifiedSelectReduce {
// Shared variables are the variables that are shared between the create and override_runtime_arguments methods
struct shared_variables_t {
tt::tt_metal::KernelHandle reader_kernel_id;
tt::tt_metal::KernelHandle writer_kernel_id;
std::vector<CoreCoord> cores;
const GlobalSemaphore init_semaphore;
const GlobalSemaphore cross_device_semaphore;
};
using cached_mesh_workload_t = ttnn::device_operation::AdaptedCachedMeshWorkload<shared_variables_t>;

static cached_mesh_workload_t create_mesh_workload(
const operation_attributes_t& operation_attributes,
const ttnn::MeshCoordinateRangeSet& tensor_coords,
const tensor_args_t& tensor_args,
tensor_return_value_t& tensor_return_value);

static ttnn::device_operation::CachedProgram<shared_variables_t> create_at(
const operation_attributes_t& operation_attributes,
const ttnn::MeshCoordinate& mesh_coordinate,
const std::vector<ttnn::MeshCoordinate>& all_mesh_coordinates,
const tensor_args_t& tensor_args,
tensor_return_value_t& tensor_return_value,
const GlobalSemaphore& init_semaphore,
const GlobalSemaphore& cross_device_semaphore);

static void override_runtime_arguments(
cached_mesh_workload_t& cached_workload,
const operation_attributes_t& operation_attributes,
const tensor_args_t& tensor_args,
tensor_return_value_t& tensor_return_value);
};
Member


please move factory to its own .hpp/.cpp

// Mandatory methods

// Select the program factory based on the operation attributes and tensor args
static program_factory_t select_program_factory(const operation_attributes_t&, const tensor_args_t&);
Member


there is a PR from @dgomezTT that removes the need for this when there is a single program

static void validate_on_program_cache_miss(const operation_attributes_t&, const tensor_args_t&);

// Empty as there doesn't seem to be any complicated hashing requirement
static void validate_on_program_cache_hit(const operation_attributes_t&, const tensor_args_t&);
Member


there is a PR from @dgomezTT that removes the need for this when it simply calls cache_miss inside

const std::optional<ttnn::MemoryConfig>& memory_config,
const std::optional<ttnn::Tensor>& optional_output_tensor,
const std::optional<GlobalSemaphore>& optional_cross_device_semaphore) {
auto input_memory_config = memory_config.value_or(ttnn::DRAM_MEMORY_CONFIG);
Member


that should ideally happen in the prim function i think

Member


Then you can simply do a using of that prim function in the ttnn namespace and that's it.
