Pull request overview
This pull request implements support for running operations on MeshDevices that don't span all ranks in a distributed setup. When a MeshDevice exists on only a subset of ranks, "inactive" ranks (those without local devices) now execute no-ops instead of failing.
Changes:
- Splits the distributed context into active and inactive groups using MPI `split()` with different color values
- Adds a `DummyMeshCommandQueue` class for inactive ranks that no-ops all operations
- Adds early-return checks in operation launch and buffer creation paths for inactive devices
- Passes `active_distributed_context_` (instead of the full `distributed_context_`) to mesh command queues for proper barrier synchronization among active ranks only
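The active/inactive split in the first bullet can be modeled without MPI. The sketch below is not the PR's code: `GroupColor` and `split_by_color` are hypothetical names, and the color values are assumptions (the summary only says the two groups use different colors). It mirrors the grouping semantics of `MPI_Comm_split`, where ranks passing the same color land in the same sub-communicator and, given equal keys, keep their original rank order.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical color values; the only requirement is that they differ.
enum class GroupColor : int { Inactive = 0, Active = 1 };

// Standalone model of a color-based communicator split. Each rank reports
// whether it owns local devices; ranks with the same color are grouped
// together in ascending order of their original rank.
std::map<GroupColor, std::vector<int>> split_by_color(const std::vector<bool>& has_local_devices) {
    std::map<GroupColor, std::vector<int>> groups;
    for (std::size_t rank = 0; rank < has_local_devices.size(); ++rank) {
        const GroupColor color = has_local_devices[rank] ? GroupColor::Active : GroupColor::Inactive;
        groups[color].push_back(static_cast<int>(rank));
    }
    return groups;
}
```

With this grouping, a barrier issued on the active group involves only ranks that actually hold devices, which is presumably why the command queues receive the active context rather than the full one.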
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| ttnn/api/ttnn/device_operation.hpp | Adds short-circuit early return when launching ops on inactive MeshDevices |
| tt_metal/distributed/mesh_device_impl.hpp | Declares new active_distributed_context_ field for active-rank synchronization |
| tt_metal/distributed/mesh_device.cpp | Implements distributed context splitting and DummyMeshCommandQueue instantiation for inactive ranks |
| tt_metal/distributed/mesh_buffer.cpp | Adds early return in buffer creation for inactive MeshDevices |
| tt_metal/distributed/sd_mesh_command_queue.* | Updates to accept and use active distributed context for barriers |
| tt_metal/distributed/fd_mesh_command_queue.* | Updates to accept and use active distributed context for barriers |
| tt_metal/distributed/dummy_mesh_command_queue.* | New no-op command queue implementation for inactive ranks |
| tt_metal/distributed/distributed.cpp | Adds short-circuit for workload enqueue on inactive devices |
| tt_metal/distributed/dispatch_context.cpp | Passes active context when creating command queues |
| tt_metal/distributed/CMakeLists.txt | Adds new dummy_mesh_command_queue.cpp to build |
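The no-op command queue listed in the table follows a familiar null-object pattern. The sketch below illustrates that pattern only; `CommandQueue`, `RecordingCommandQueue`, `DummyCommandQueue`, and `make_command_queue` are hypothetical, heavily reduced stand-ins and do not reflect the real mesh command queue interface.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical, minimal interface; the real command queue API is much larger.
class CommandQueue {
public:
    virtual ~CommandQueue() = default;
    virtual void enqueue_write(const std::vector<int>& data) = 0;
    virtual void finish() = 0;
    virtual std::size_t pending() const = 0;
};

// Stand-in for a queue on an active rank: it tracks outstanding writes.
class RecordingCommandQueue : public CommandQueue {
public:
    void enqueue_write(const std::vector<int>& data) override { writes_.push_back(data); }
    void finish() override { writes_.clear(); }
    std::size_t pending() const override { return writes_.size(); }

private:
    std::vector<std::vector<int>> writes_;
};

// Null-object queue for an inactive rank: every operation is a no-op.
class DummyCommandQueue : public CommandQueue {
public:
    void enqueue_write(const std::vector<int>&) override {}
    void finish() override {}
    std::size_t pending() const override { return 0; }
};

// Picks the implementation once, so callers never branch on rank activity.
std::unique_ptr<CommandQueue> make_command_queue(bool rank_is_active) {
    if (rank_is_active) {
        return std::make_unique<RecordingCommandQueue>();
    }
    return std::make_unique<DummyCommandQueue>();
}
```

The appeal of this design is that calling code can issue the same sequence of queue operations on every rank; only the factory needs to know whether the rank is active.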
/codeowners ping

CodeOwners Group Analysis

This PR requires approval from one member of each of the following groups.
Summary: 5 pending groups, 0 approved groups.
Note: At least one approval from each group is sufficient.

Hi Allan Liu (@aliuTT), Joseph Chu (@cfjchu), Nigel Huang (@nhuang-tt), Aditya Saigal (@tt-asaigal), this PR "Make ops work when some ranks are inactive" by Piotr Stankiewicz (@p1-0tr) needs your approval/review to merge this.
Currently, when a MeshDevice does not span all ranks in a distributed setup, running ops (or creating buffers) on it fails. Instead, in such cases, the "inactive" ranks should execute no-ops and let the "active" ranks carry out their work.

This patch is the first step towards supporting ops on meshes and/or sub-meshes which do not span all ranks in a distributed system. It aims to unlock running basic ops by short-circuiting their launch at the ttnn level for "inactive" MeshDevice instances. A large portion of the MeshDevice interface will remain broken for "inactive" meshes. Eventually, the MeshDevice class should be reworked in a way that makes it safe to use on "inactive" ranks and allows ttnn not to care about whether a MeshDevice is "active" or "inactive".

Signed-off-by: Piotr Stankiewicz <pstankiewicz@tenstorrent.com>
Ticket
#33901
Problem description
Currently, when a MeshDevice does not span all ranks in a distributed
setup, running ops (or creating buffers) on it fails. Instead, in
such cases, the "inactive" ranks should execute no-ops and let the
"active" ranks carry out their work.
What's changed
This patch is the first step towards supporting ops on meshes and/or
sub-meshes which do not span all ranks in a distributed system. It aims
to unlock running basic ops by short-circuiting their launch at the
ttnn level for "inactive" MeshDevice instances. A large portion of the
MeshDevice interface will remain broken for "inactive" meshes.
Eventually, the MeshDevice class should be reworked in a way that
makes it safe to use on "inactive" ranks and allows ttnn not to care
about whether a MeshDevice is "active" or "inactive".
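As a rough illustration of the short-circuit described here, the sketch below returns early when the mesh has no local devices. It is an assumption-laden model, not the ttnn code: `Mesh`, `has_local_devices`, and `launch_op` are hypothetical names, and the real launch path returns tensors rather than a string.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for a mesh handle; only the activity flag is modeled.
struct Mesh {
    bool has_local_devices;
};

// Launches an op only when this rank owns part of the mesh. On an inactive
// rank the function returns std::nullopt without touching any device state.
std::optional<std::string> launch_op(const Mesh& mesh, const std::string& op_name) {
    if (!mesh.has_local_devices) {
        return std::nullopt;  // inactive rank: nothing to do
    }
    return "launched " + op_name;
}
```

Placing the check at the launch boundary keeps the op implementations themselves unaware of rank activity, at the cost of every entry point needing the same guard until the MeshDevice class is reworked.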
Checklist
Model tests
If your changes cover model-related code, you should run tests corresponding to affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers `models-mandatory` and `models-extended` presets. The former includes a minimal set of tests, to be run always. The latter extends that with additional ones - use your best judgement in deciding which is the most appropriate for your PR.
- `models-mandatory` preset (runs: Device perf regressions and Frequent model and ttnn tests)
- `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)
- `models-mandatory` preset (runs: Unit tests)
- `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)
- `models-mandatory` preset (runs: Quick tests)
- `models-extended` preset (runs: the mandatory tests, plus Demo and Model perf tests)