Make ops work when some ranks are inactive by p1-0tr · Pull Request #36091 · tenstorrent/tt-metal

p1-0tr · 2026-01-19T15:29:17Z

Ticket

Problem description

Currently, when a MeshDevice does not span all ranks in a distributed
set-up, running ops (or creating buffers) on it fails. In such cases,
the "inactive" ranks should instead execute no-ops and let the
"active" ranks carry out their work.

What's changed

This patch is the first step towards supporting ops on meshes and/or
sub-meshes which do not span all ranks in a distributed system. It aims
to unlock running basic ops, by short-circuiting their launch at the
ttnn level for "inactive" MeshDevice instances. A large portion of the
MeshDevice interface will remain broken for "inactive" meshes.
Eventually, the MeshDevice class should be reworked in a way that will
make it safe to use on "inactive" ranks, and allow ttnn not to care
about whether a MeshDevice is "active" or "inactive".
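
As a rough illustration of the short-circuit described above, here is a minimal Python sketch. It is not the tt-metal or ttnn API: `FakeMeshDevice`, `is_active`, `launch_op` and `describe` are hypothetical names, and the rule "a mesh is active on a rank if it owns at least one local device" is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FakeMeshDevice:
    """Hypothetical stand-in for a MeshDevice; not the tt-metal class."""

    local_device_ids: List[int] = field(default_factory=list)

    def is_active(self) -> bool:
        # Assumption: the mesh is "active" on this rank if it owns local devices.
        return len(self.local_device_ids) > 0


def launch_op(device: FakeMeshDevice, op: Callable[[List[int]], str]) -> Optional[str]:
    """Run `op` on an active mesh; return None without running it otherwise."""
    if not device.is_active():
        # Short-circuit: the inactive rank skips the launch instead of failing.
        return None
    return op(device.local_device_ids)


def describe(device_ids: List[int]) -> str:
    """Toy op that only reports which devices it was given."""
    return f"ran on devices {device_ids}"


active = FakeMeshDevice(local_device_ids=[0, 1])
inactive = FakeMeshDevice()

active_result = launch_op(active, describe)      # op runs on the local devices
inactive_result = launch_op(inactive, describe)  # op is skipped, result is None
```

The caller on every rank issues the same `launch_op` call; only the check inside decides whether any work happens, which is the property the patch aims for at the ttnn level.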

Checklist

New/Existing tests provide coverage for changes

Model tests

If your changes cover model-related code, you should run tests corresponding to affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers models-mandatory and models-extended presets.
The former includes a minimal set of tests, to be run always. The latter extends that with additional ones - use your best judgement in deciding which is the most appropriate for your PR.

ttnn/api/ttnn/device_operation.hpp

Copilot

Pull request overview

This pull request implements support for running operations on MeshDevices that don't span all ranks in a distributed setup. When a MeshDevice exists on only a subset of ranks, "inactive" ranks (those without local devices) now execute no-ops instead of failing.

Changes:

- Splits distributed context into active and inactive groups using MPI split() with different color values
- Adds DummyMeshCommandQueue class for inactive ranks that no-ops all operations
- Adds early-return checks in operation launch and buffer creation paths for inactive devices
- Passes active_distributed_context_ (instead of full distributed_context_) to mesh command queues for proper barrier synchronization among active ranks only
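
The first, second and fourth points can be pictured with a small pure-Python model. This is only a sketch under stated assumptions: it simulates the grouping that an MPI-style split performs (ranks with the same color land in the same group, ordered here by world rank) without calling MPI, and `ACTIVE_COLOR`, `INACTIVE_COLOR`, `split_by_color`, `RealCommandQueue` and `DummyCommandQueue` are hypothetical names rather than the classes added by the PR.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical color values; the summary only says the two groups use different ones.
ACTIVE_COLOR = 0
INACTIVE_COLOR = 1


def split_by_color(colors: Dict[int, int]) -> Dict[int, List[int]]:
    """Group world ranks by color, ordered by rank, as an MPI-style split would."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for rank in sorted(colors):
        groups[colors[rank]].append(rank)
    return dict(groups)


class RealCommandQueue:
    """Records work and synchronizes only with the ranks in its barrier group."""

    def __init__(self, barrier_group: List[int]) -> None:
        self.barrier_group = list(barrier_group)
        self.log: List[str] = []

    def enqueue(self, workload: str) -> None:
        self.log.append(workload)

    def barrier_participants(self) -> List[int]:
        return list(self.barrier_group)


class DummyCommandQueue:
    """Null-object queue for inactive ranks: same interface, every call is a no-op."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def enqueue(self, workload: str) -> None:
        pass  # intentionally does nothing

    def barrier_participants(self) -> List[int]:
        return []  # an inactive rank waits on nobody


# Four ranks; only ranks 0 and 1 own local devices for this mesh.
devices_per_rank = {0: 2, 1: 2, 2: 0, 3: 0}
colors = {
    rank: ACTIVE_COLOR if count > 0 else INACTIVE_COLOR
    for rank, count in devices_per_rank.items()
}
groups = split_by_color(colors)

queues = {
    rank: RealCommandQueue(groups[ACTIVE_COLOR])
    if colors[rank] == ACTIVE_COLOR
    else DummyCommandQueue()
    for rank in colors
}
for queue in queues.values():
    queue.enqueue("matmul")
```

Handing the active-only group to the real queues is what keeps a barrier from waiting on ranks that never enqueue anything.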

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `ttnn/api/ttnn/device_operation.hpp` | Adds short-circuit early return when launching ops on inactive MeshDevices |
| `tt_metal/distributed/mesh_device_impl.hpp` | Declares new `active_distributed_context_` field for active-rank synchronization |
| `tt_metal/distributed/mesh_device.cpp` | Implements distributed context splitting and DummyMeshCommandQueue instantiation for inactive ranks |
| `tt_metal/distributed/mesh_buffer.cpp` | Adds early return in buffer creation for inactive MeshDevices |
| `tt_metal/distributed/sd_mesh_command_queue.*` | Updates to accept and use active distributed context for barriers |
| `tt_metal/distributed/fd_mesh_command_queue.*` | Updates to accept and use active distributed context for barriers |
| `tt_metal/distributed/dummy_mesh_command_queue.*` | New no-op command queue implementation for inactive ranks |
| `tt_metal/distributed/distributed.cpp` | Adds short-circuit for workload enqueue on inactive devices |
| `tt_metal/distributed/dispatch_context.cpp` | Passes active context when creating command queues |
| `tt_metal/distributed/CMakeLists.txt` | Adds new dummy_mesh_command_queue.cpp to build |

github-actions

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

tt_metal/distributed/mesh_device.cpp

tt_metal/distributed/sd_mesh_command_queue.cpp

tt_metal/distributed/fd_mesh_command_queue.cpp

p1-0tr · 2026-02-06T12:42:06Z

/codeowners ping

tenstorrent-github-bot · 2026-02-06T12:43:49Z

CodeOwners Group Analysis

This PR requires approval from one member of each of the following groups:

Summary: 5 pending groups, 0 approved groups

Group Information:

⏳ tenstorrent/metalium-developers-infra (Team) - Members: Raymond Kim, Michael Chiou, Bryan Keith, Bryan Wilder Field Lozano, Andrew Fuller, William Ly, Kannika Kabilar, Anthony Kirby, Rose Li, Subin Lee, Evan Banerjee, NSexton, David Popov, Aditi Rajesh Shah, Jacek Jakub Lakis, Iris Wang, jessica yuan, Hasan Baig | Pending approval
📁 Files owned by this team (2 files)
- tests/scripts/t3000/run_t3000_unit_tests.sh
- tt_metal/distributed/CMakeLists.txt

⏳ tenstorrent/metalium-developers-metal-distributed (Team) - Members: Austin Ho, Brian Liu, Joseph Chu, Aditya Saigal, Allan Liu | Pending approval
📁 Files owned by this team (1 files)
- tests/tt_metal/distributed/multiprocess/test_sanity.cpp

⏳ tenstorrent/metalium-developers-ttnn-core (Team) - Members: Pavlo Hilei, Brian Liu, Joseph Chu, Artem Yerofieiev, Diego Gomez | Pending approval
📁 Files owned by this team (2 files)
- tests/ttnn/distributed/test_submesh_not_spanning_all_ranks_T3000.py
- ttnn/api/ttnn/device_operation.hpp

⏳ `tt_metal/distributed/**/CMakeLists.txt` (Group) - Members: Aditya Saigal, Allan Liu, John Bauman, Joseph Chu | Pending approval
📁 Files owned by this group (1 files)
- tt_metal/distributed/CMakeLists.txt

⏳ tt_metal/distributed/ (Group) - Members: Aditya Saigal, Allan Liu, John Bauman, Joseph Chu, Nigel Huang | Pending approval
📁 Files owned by this group (11 files)

Note: At least one approval from each group is sufficient.

tenstorrent-github-bot · 2026-02-06T12:43:59Z

Hi Allan Liu (@aliuTT), Joseph Chu (@cfjchu), Nigel Huang (@nhuang-tt), Aditya Saigal (@tt-asaigal), this PR Make ops work when some ranks are inactive by Piotr Stankiewicz (@p1-0tr) needs your approval/review to merge this.

p1-0tr · 2026-02-09T14:30:29Z

/codeowners ping

tenstorrent-github-bot · 2026-02-09T14:31:25Z

🔄 CodeOwners Summary Updated

✅ CodeOwners summary updated here

💡 Tip: Use /codeowners new to post a fresh summary comment instead of updating the existing one.

tenstorrent-github-bot · 2026-02-09T14:31:35Z

Hi Joseph Chu (@cfjchu), John Bauman (@jbaumanTT), Nigel Huang (@nhuang-tt), Aditya Saigal (@tt-asaigal), this PR Make ops work when some ranks are inactive by Piotr Stankiewicz (@p1-0tr) needs your approval/review to merge this.

p1-0tr · 2026-02-10T15:43:30Z

/codeowners ping

tenstorrent-github-bot · 2026-02-10T15:44:34Z

🔄 CodeOwners Summary Updated

✅ CodeOwners summary updated here

tenstorrent-github-bot · 2026-02-10T15:44:45Z

Hi Allan Liu (@aliuTT), Joseph Chu (@cfjchu), John Bauman (@jbaumanTT), Nigel Huang (@nhuang-tt), this PR Make ops work when some ranks are inactive by Piotr Stankiewicz (@p1-0tr) needs your approval/review to merge this.

p1-0tr · 2026-02-11T15:16:07Z

/codeowners ping

tenstorrent-github-bot · 2026-02-11T15:17:02Z

🔄 CodeOwners Summary Updated

✅ CodeOwners summary updated here

tenstorrent-github-bot · 2026-02-11T15:17:12Z

Hi Allan Liu (@aliuTT), John Bauman (@jbaumanTT), Aditya Saigal (@tt-asaigal), this PR Make ops work when some ranks are inactive by Piotr Stankiewicz (@p1-0tr) needs your approval/review to merge this.

Currently when a MeshDevice does not span all ranks in a distributed set-up running ops (or creating buffers) for it will fail. Instead, in such cases, the "inactive" ranks should execute nop's and let the "active" ranks carry out their work.

This patch is the first step towards supporting ops on meshes and/or sub-meshes which do not span all ranks in a distributed system. It aims to unlock running basic ops, by short-circuiting their launch at the ttnn level for "inactive" MeshDevice instances. A large portion of the MeshDevice interface will remain broken for "inactive" meshes. Eventually, the MeshDevice class should be reworked in a way that will make it safe to use on "inactive" ranks, and allow ttnn not to care about whether a MeshDevice is "active" or "inactive".

Signed-off-by: Piotr Stankiewicz <pstankiewicz@tenstorrent.com>

p1-0tr · 2026-02-16T14:09:55Z

/codeowners ping

tenstorrent-github-bot · 2026-02-16T14:10:44Z

🔄 CodeOwners Summary Updated

✅ CodeOwners summary updated here

p1-0tr · 2026-02-16T14:10:52Z

all post-commit [PASS] - https://github.com/tenstorrent/tt-metal/actions/runs/22054843705
t3k UT [FAIL - consistent with main] - https://github.com/tenstorrent/tt-metal/actions/runs/22054831227

tenstorrent-github-bot · 2026-02-16T14:10:54Z

Hi Brian Liu (@TT-BrianLiu), Allan Liu (@aliuTT), John Bauman (@jbaumanTT), Austin Ho (@tt-aho), Aditya Saigal (@tt-asaigal), this PR Make ops work when some ranks are inactive by Piotr Stankiewicz (@p1-0tr) needs your approval/review to merge this.

p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from 414f284 to 648e3b3 Compare January 26, 2026 11:16

p1-0tr commented Jan 26, 2026

ttnn/api/ttnn/device_operation.hpp (resolved)

p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from 648e3b3 to 043dc77 Compare February 6, 2026 11:31

p1-0tr marked this pull request as ready for review February 6, 2026 11:32

p1-0tr requested a review from a team as a code owner February 6, 2026 11:32

Copilot AI review requested due to automatic review settings February 6, 2026 11:32

p1-0tr requested review from a team, aliuTT, jbaumanTT, nhuang-tt and tt-asaigal as code owners February 6, 2026 11:32

Copilot started reviewing on behalf of p1-0tr February 6, 2026 11:32

Copilot AI reviewed Feb 6, 2026

github-actions bot reviewed Feb 6, 2026

p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from 043dc77 to a3eb6d7 Compare February 6, 2026 11:59

p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from a3eb6d7 to 432d845 Compare February 9, 2026 12:00

p1-0tr mentioned this pull request Feb 10, 2026

🚧 [v2] Make ops work when some ranks are inactive #36645

Closed

16 tasks

p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from 432d845 to 8ae31ff Compare February 13, 2026 15:05

p1-0tr requested a review from a team as a code owner February 13, 2026 15:05

p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from 8ae31ff to c292ce7 Compare February 16, 2026 08:06
