Make ops work when some ranks are inactive #36091

Open
p1-0tr wants to merge 1 commit into main from p1-0tr/33901-run-op-with-inactive-ranks

@p1-0tr p1-0tr commented Jan 19, 2026

Ticket

#33901

Problem description

Currently, when a MeshDevice does not span all ranks in a distributed
setup, running ops on it (or creating buffers for it) fails. Instead, in
such cases, the "inactive" ranks should execute no-ops and let the
"active" ranks carry out their work.

What's changed

This patch is the first step towards supporting ops on meshes and/or
sub-meshes that do not span all ranks in a distributed system. It aims
to unlock running basic ops by short-circuiting their launch at the
ttnn level for "inactive" MeshDevice instances. A large portion of the
MeshDevice interface will remain broken for "inactive" meshes.
Eventually, the MeshDevice class should be reworked so that it is
safe to use on "inactive" ranks and ttnn no longer needs to care
whether a MeshDevice is "active" or "inactive".
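
To illustrate the short-circuit idea described above, here is a minimal, self-contained sketch. It is not the code from this patch: `ToyMeshDevice`, `is_active()` and `launch_op()` are hypothetical names, and the assumption that a rank is "inactive" when it owns no local devices of the mesh is taken from the description rather than from the real MeshDevice API.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a mesh device; NOT the real tt-metal MeshDevice.
struct ToyMeshDevice {
    std::vector<int> local_device_ids;  // devices of this mesh owned by the current rank

    // Assumption: a rank is "inactive" for a mesh when it owns none of its devices.
    bool is_active() const { return !local_device_ids.empty(); }
};

// Hypothetical launch wrapper. On an active rank it runs `op` once per local
// device; on an inactive rank it returns immediately, which makes the call a no-op.
// Returns the number of device-level launches performed on this rank.
inline std::size_t launch_op(const ToyMeshDevice& mesh, const std::function<void(int)>& op) {
    if (!mesh.is_active()) {
        return 0;  // short-circuit: nothing to dispatch on this rank
    }
    for (int device_id : mesh.local_device_ids) {
        op(device_id);
    }
    return mesh.local_device_ids.size();
}
```

With this shape, the same calling code can run on every rank: active ranks dispatch work, while inactive ranks fall through without touching any device state.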

Checklist

  • All post-commit tests
  • Blackhole Post commit
  • cpp-unit-tests
  • New/Existing tests provide coverage for changes

Model tests

If your changes cover model-related code, you should run tests corresponding to affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers models-mandatory and models-extended presets.
The former includes a minimal set of tests, to be run always. The latter extends that with additional ones - use your best judgement in deciding which is the most appropriate for your PR.

@p1-0tr p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from 414f284 to 648e3b3 Compare January 26, 2026 11:16
@p1-0tr p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from 648e3b3 to 043dc77 Compare February 6, 2026 11:31
@p1-0tr p1-0tr marked this pull request as ready for review February 6, 2026 11:32
@p1-0tr p1-0tr requested a review from a team as a code owner February 6, 2026 11:32
Copilot AI review requested due to automatic review settings February 6, 2026 11:32

Copilot AI left a comment

Pull request overview

This pull request implements support for running operations on MeshDevices that don't span all ranks in a distributed setup. When a MeshDevice exists on only a subset of ranks, "inactive" ranks (those without local devices) now execute no-ops instead of failing.

Changes:

  • Splits distributed context into active and inactive groups using MPI split() with different color values
  • Adds DummyMeshCommandQueue class for inactive ranks that no-ops all operations
  • Adds early-return checks in operation launch and buffer creation paths for inactive devices
  • Passes active_distributed_context_ (instead of full distributed_context_) to mesh command queues for proper barrier synchronization among active ranks only
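
The color-based split mentioned in the list above can be pictured with a small model that needs no MPI installation. The sketch below only emulates the grouping rule of a color-based communicator split when all keys are equal: ranks with the same color end up in the same subgroup, ordered by their original rank. The names `SubgroupInfo`, `split_by_color`, `kActiveColor` and `kInactiveColor` are hypothetical, and the concrete color values are an assumption, not the ones used in the patch.

```cpp
#include <cstddef>
#include <vector>

// Result of the emulated split for one original rank.
struct SubgroupInfo {
    int color;  // which subgroup the rank landed in
    int rank;   // rank inside that subgroup
    int size;   // number of ranks in that subgroup
};

// Hypothetical color values; the real patch may use different ones.
constexpr int kActiveColor = 0;
constexpr int kInactiveColor = 1;

// Emulates a color-based split with equal keys: the index into
// `has_local_devices` is the original rank, and ranks sharing a color are
// numbered in ascending order of their original rank.
inline std::vector<SubgroupInfo> split_by_color(const std::vector<bool>& has_local_devices) {
    int counts[2] = {0, 0};
    std::vector<SubgroupInfo> result;
    result.reserve(has_local_devices.size());
    for (bool active : has_local_devices) {
        const int color = active ? kActiveColor : kInactiveColor;
        result.push_back({color, counts[color]++, 0});
    }
    for (SubgroupInfo& info : result) {
        info.size = counts[info.color];  // fill in the final subgroup size
    }
    return result;
}
```

A barrier issued on the active subgroup then involves only the ranks that own devices, which is why the command queues are handed the active context rather than the full one.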

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

File | Description
--- | ---
ttnn/api/ttnn/device_operation.hpp | Adds short-circuit early return when launching ops on inactive MeshDevices
tt_metal/distributed/mesh_device_impl.hpp | Declares new active_distributed_context_ field for active-rank synchronization
tt_metal/distributed/mesh_device.cpp | Implements distributed context splitting and DummyMeshCommandQueue instantiation for inactive ranks
tt_metal/distributed/mesh_buffer.cpp | Adds early return in buffer creation for inactive MeshDevices
tt_metal/distributed/sd_mesh_command_queue.* | Updates to accept and use active distributed context for barriers
tt_metal/distributed/fd_mesh_command_queue.* | Updates to accept and use active distributed context for barriers
tt_metal/distributed/dummy_mesh_command_queue.* | New no-op command queue implementation for inactive ranks
tt_metal/distributed/distributed.cpp | Adds short-circuit for workload enqueue on inactive devices
tt_metal/distributed/dispatch_context.cpp | Passes active context when creating command queues
tt_metal/distributed/CMakeLists.txt | Adds new dummy_mesh_command_queue.cpp to build
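
A no-op command queue for inactive ranks is an instance of the null-object pattern. The sketch below shows that pattern in isolation. The interface `ToyCommandQueue`, its methods, and the factory `make_queue` are all hypothetical and much smaller than the real mesh command queue interface; only the idea of substituting a do-nothing implementation on inactive ranks comes from the summary above.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, heavily reduced queue interface (not the tt-metal one).
class ToyCommandQueue {
public:
    virtual ~ToyCommandQueue() = default;
    virtual void enqueue(const std::string& workload) = 0;
    virtual std::size_t pending() const = 0;
    virtual void finish() = 0;
};

// Queue used on active ranks: records workloads until finish() drains them.
class RecordingToyQueue final : public ToyCommandQueue {
public:
    void enqueue(const std::string& workload) override { pending_.push_back(workload); }
    std::size_t pending() const override { return pending_.size(); }
    void finish() override { pending_.clear(); }

private:
    std::vector<std::string> pending_;
};

// Queue used on inactive ranks: every operation is accepted and ignored.
class DummyToyQueue final : public ToyCommandQueue {
public:
    void enqueue(const std::string&) override {}
    std::size_t pending() const override { return 0; }
    void finish() override {}
};

// Hypothetical factory: picks the implementation once, so callers never
// need to branch on whether the rank is active.
inline std::unique_ptr<ToyCommandQueue> make_queue(bool rank_is_active) {
    if (rank_is_active) {
        return std::make_unique<RecordingToyQueue>();
    }
    return std::make_unique<DummyToyQueue>();
}
```

Choosing the implementation at construction time keeps the activity check in one place instead of repeating it in every queue method.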


@github-actions github-actions bot left a comment

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

@p1-0tr p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from 043dc77 to a3eb6d7 Compare February 6, 2026 11:59
@p1-0tr

p1-0tr commented Feb 6, 2026

/codeowners ping

@tenstorrent-github-bot

tenstorrent-github-bot commented Feb 6, 2026

CodeOwners Group Analysis

This PR requires approval from one member of each of the following groups:

Summary: 5 pending groups, 0 approved groups

Group Information:

  • tenstorrent/metalium-developers-infra (Team) - Members: Raymond Kim, Michael Chiou, Bryan Keith, Bryan Wilder Field Lozano, Andrew Fuller, William Ly, Kannika Kabilar, Anthony Kirby, Rose Li, Subin Lee, Evan Banerjee, NSexton, David Popov, Aditi Rajesh Shah, Jacek Jakub Lakis, Iris Wang, jessica yuan, Hasan Baig | Pending approval

    📁 Files owned by this team (2 files)



  • tt_metal/distributed/**/CMakeLists.txt (Group) - Members: Aditya Saigal, Allan Liu, John Bauman, Joseph Chu | Pending approval

    📁 Files owned by this group (1 files)

Note: At least one approval from each group is sufficient.

@tenstorrent-github-bot

Hi Allan Liu (@aliuTT), Joseph Chu (@cfjchu), Nigel Huang (@nhuang-tt), Aditya Saigal (@tt-asaigal), this PR Make ops work when some ranks are inactive by Piotr Stankiewicz (@p1-0tr) needs your approval/review to merge this.

@p1-0tr p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from a3eb6d7 to 432d845 Compare February 9, 2026 12:00
@p1-0tr

p1-0tr commented Feb 9, 2026

/codeowners ping

@tenstorrent-github-bot

🔄 CodeOwners Summary Updated

CodeOwners summary updated here

💡 Tip: Use /codeowners new to post a fresh summary comment instead of updating the existing one.

@tenstorrent-github-bot

Hi Joseph Chu (@cfjchu), John Bauman (@jbaumanTT), Nigel Huang (@nhuang-tt), Aditya Saigal (@tt-asaigal), this PR Make ops work when some ranks are inactive by Piotr Stankiewicz (@p1-0tr) needs your approval/review to merge this.

@p1-0tr

p1-0tr commented Feb 10, 2026

/codeowners ping

@tenstorrent-github-bot

Hi Allan Liu (@aliuTT), Joseph Chu (@cfjchu), John Bauman (@jbaumanTT), Nigel Huang (@nhuang-tt), this PR Make ops work when some ranks are inactive by Piotr Stankiewicz (@p1-0tr) needs your approval/review to merge this.

@p1-0tr

p1-0tr commented Feb 11, 2026

/codeowners ping

@tenstorrent-github-bot

Hi Allan Liu (@aliuTT), John Bauman (@jbaumanTT), Aditya Saigal (@tt-asaigal), this PR Make ops work when some ranks are inactive by Piotr Stankiewicz (@p1-0tr) needs your approval/review to merge this.

@p1-0tr p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from 432d845 to 8ae31ff Compare February 13, 2026 15:05
@p1-0tr p1-0tr requested a review from a team as a code owner February 13, 2026 15:05
Signed-off-by: Piotr Stankiewicz <pstankiewicz@tenstorrent.com>
@p1-0tr p1-0tr force-pushed the p1-0tr/33901-run-op-with-inactive-ranks branch from 8ae31ff to c292ce7 Compare February 16, 2026 08:06
@p1-0tr

p1-0tr commented Feb 16, 2026

/codeowners ping

@p1-0tr

p1-0tr commented Feb 16, 2026

@tenstorrent-github-bot

Hi Brian Liu (@TT-BrianLiu), Allan Liu (@aliuTT), John Bauman (@jbaumanTT), Austin Ho (@tt-aho), Aditya Saigal (@tt-asaigal), this PR Make ops work when some ranks are inactive by Piotr Stankiewicz (@p1-0tr) needs your approval/review to merge this.
