Commit bede94c

Add multi-mesh tests to Galaxy CI (#36005)
### Ticket
[Link to GitHub Issue](#34551)

### Problem description
Existing pipelines currently lack multi-mesh stability tests.

### What's changed
Added multi-mesh stability tests for two arrangements of all-to-all mesh configurations, each with 3 meshes in total. Each added suite takes ~24 minutes to run.

https://github.com/tenstorrent/tt-metal/actions/runs/21050162315 <-- Passing run on CI

### Checklist
- [ ] [![All post-commit tests](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml/badge.svg?branch={{branch_name}})](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml?query=branch:{{branch_name}})
- [ ] [![Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml/badge.svg?branch={{branch_name}})](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml?query=branch:{{branch_name}})
- [ ] [![cpp-unit-tests](https://github.com/tenstorrent/tt-metal/actions/workflows/tt-metal-l2-nightly.yaml/badge.svg?branch={{branch_name}})](https://github.com/tenstorrent/tt-metal/actions/workflows/tt-metal-l2-nightly.yaml?query=branch:{{branch_name}})
- [ ] New/Existing tests provide coverage for changes

#### Model tests
If your changes cover model-related code, you should run tests corresponding to affected models and platforms (Single card, T3K, Galaxy). "Choose your pipeline" workflows facilitate running multiple kinds of tests in a single run. Each offers `models-mandatory` and `models-extended` presets. The former includes a minimal set of tests, to be run always. The latter extends that with additional ones - use your best judgement in deciding which is the most appropriate for your PR.

- [ ] [![(Single) Choose your pipeline](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select.yaml/badge.svg?branch={{branch_name}})](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select.yaml?query=branch:{{branch_name}})
  - [ ] `models-mandatory` preset (runs: [Device perf regressions](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml) and [Frequent model and ttnn tests](https://github.com/tenstorrent/tt-metal/actions/workflows/fast-dispatch-full-regressions-and-models.yaml))
  - [ ] `models-extended` preset (runs: the mandatory tests, plus [Demo](https://github.com/tenstorrent/tt-metal/actions/workflows/single-card-demo-tests.yaml) and [Model perf](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml) tests)
  - [ ] other selection - specify runs
- [ ] [![(T3K) Choose your pipeline](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select-t3k.yaml/badge.svg?branch={{branch_name}})](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select-t3k.yaml?query=branch:{{branch_name}})
  - [ ] `models-mandatory` preset (runs: [Unit tests](https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-unit-tests.yaml))
  - [ ] `models-extended` preset (runs: the mandatory tests, plus [Demo](https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-demo-tests.yaml) and [Model perf](https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-model-perf-tests.yaml) tests)
  - [ ] other selection - specify runs
- [ ] [![(Galaxy) Choose your pipeline](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select-galaxy.yaml/badge.svg?branch={{branch_name}})](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select-galaxy.yaml?query=branch:{{branch_name}})
  - [ ] `models-mandatory` preset (runs: [Quick tests](https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-quick.yaml))
  - [ ] `models-extended` preset (runs: the mandatory tests, plus [Demo](https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-demo-tests.yaml) and [Model perf](https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-model-perf-tests.yaml) tests)
  - [ ] other selection - specify runs

1 parent c2f0d81 commit bede94c

5 files changed, +198 -0 lines changed

.github/workflows/galaxy-nightly-tests-impl.yaml

Lines changed: 2 additions & 0 deletions
@@ -38,6 +38,8 @@ jobs:
 { name: "Galaxy CCL tests", arch: wormhole_b0, cmd: "pytest tests/nightly/tg/ccl", timeout: 40, owner_id: U05ACKAJTHS}, # Naif Tarafdar
 { name: "Galaxy Fabric unit tests", arch: wormhole_b0, cmd: build/test/tt_metal/tt_fabric/fabric_unit_tests --gtest_filter="NightlyFabric*Fixture.*", timeout: 25, owner_id: U08UBDUKH6Z}, # Neel Nyamagoudar
 { name: "Galaxy Fabric 2D Torus Nightly Tests", arch: wormhole_b0, cmd: ./build/test/tt_metal/perf_microbenchmark/routing/test_tt_fabric --test_config tests/tt_metal/tt_metal/perf_microbenchmark/routing/test_fabric_2d_torus_nightly.yaml, timeout: 45, owner_id: U08UBDUKH6Z}, # Neel Nyamagoudar
+{ name: "Galaxy Fabric Multi-Mesh 4x4 and 2x4s stability tests", arch: wormhole_b0, cmd: python tests/tt_metal/tt_fabric/utils/generate_rank_bindings.py && tt-run --rank-binding 4x4_2x4_3_mesh_rank_binding.yaml --mpi-args "--allow-run-as-root --tag-output" ./build/test/tt_metal/perf_microbenchmark/routing/test_tt_fabric --test_config tests/tt_metal/tt_metal/perf_microbenchmark/routing/test_fabric_multi_mesh_stability_short_running.yaml, timeout: 15, owner_id: U08UBDUKH6Z}, # Neel Nyamagoudar
+{ name: "Galaxy Fabric Multi-mesh 2x8 and 2x4s stability tests", arch: wormhole_b0, cmd: python tests/tt_metal/tt_fabric/utils/generate_rank_bindings.py && tt-run --rank-binding 2x8_2x4_3_mesh_rank_binding.yaml --mpi-args "--allow-run-as-root --tag-output" ./build/test/tt_metal/perf_microbenchmark/routing/test_tt_fabric --test_config tests/tt_metal/tt_metal/perf_microbenchmark/routing/test_fabric_multi_mesh_stability_short_running.yaml, timeout: 15, owner_id: U08UBDUKH6Z}, # Neel Nyamagoudar
 # {
 # name: "Llama Galaxy Accuracy Test",
 # arch: wormhole_b0,
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# --- Meshes ---------------------------------------------------------------
+
+mesh_descriptors {
+  name: "MESH4"
+  arch: WORMHOLE_B0
+  device_topology { dims: [ 2, 8 ] }
+  host_topology { dims: [ 1, 1 ] }
+  channels {
+    count: 4
+    policy: RELAXED
+  }
+}
+mesh_descriptors {
+  name: "MESH2"
+  arch: WORMHOLE_B0
+  device_topology { dims: [ 2, 4 ] }
+  host_topology { dims: [ 1, 1 ] }
+  channels {
+    count: 4
+    policy: RELAXED
+  }
+}
+graph_descriptors {
+  name: "G0"
+  type: "FABRIC"
+  instances { mesh { mesh_descriptor: "MESH2" mesh_id: 0 } }
+  instances { mesh { mesh_descriptor: "MESH2" mesh_id: 1 } }
+  instances { mesh { mesh_descriptor: "MESH4" mesh_id: 2 } }
+
+
+  connections {
+    nodes { mesh { mesh_descriptor: "MESH2" mesh_id: 0 } }
+    nodes { mesh { mesh_descriptor: "MESH2" mesh_id: 1 } }
+    channels { count: 4 policy: RELAXED }
+  }
+  connections{
+    nodes { mesh { mesh_descriptor: "MESH2" mesh_id: 0 } }
+    nodes { mesh { mesh_descriptor: "MESH4" mesh_id: 2 } }
+    channels { count: 4 policy: RELAXED }
+  }
+  connections{
+    nodes { mesh { mesh_descriptor: "MESH2" mesh_id: 1 } }
+    nodes { mesh { mesh_descriptor: "MESH4" mesh_id: 2 } }
+    channels { count: 4 policy: RELAXED }
+  }
+}
+# --- Instantiation ----------------------------------------------------------
+top_level_instance { graph { graph_descriptor: "G0" graph_id: 0 } }
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# --- Meshes ---------------------------------------------------------------
+
+mesh_descriptors {
+  name: "MESH4"
+  arch: WORMHOLE_B0
+  device_topology { dims: [ 4, 4 ] }
+  host_topology { dims: [ 1, 1 ] }
+  channels {
+    count: 4
+    policy: RELAXED
+  }
+}
+mesh_descriptors {
+  name: "MESH2"
+  arch: WORMHOLE_B0
+  device_topology { dims: [ 2, 4 ] }
+  host_topology { dims: [ 1, 1 ] }
+  channels {
+    count: 4
+    policy: RELAXED
+  }
+}
+graph_descriptors {
+  name: "G0"
+  type: "FABRIC"
+  instances { mesh { mesh_descriptor: "MESH4" mesh_id: 0 } }
+  instances { mesh { mesh_descriptor: "MESH2" mesh_id: 1 } }
+  instances { mesh { mesh_descriptor: "MESH2" mesh_id: 2 } }
+
+  connections {
+    nodes { mesh { mesh_descriptor: "MESH4" mesh_id: 0 } }
+    nodes { mesh { mesh_descriptor: "MESH2" mesh_id: 1 } }
+    channels { count: 4 policy: RELAXED }
+  }
+  connections{
+    nodes { mesh { mesh_descriptor: "MESH4" mesh_id: 0 } }
+    nodes { mesh { mesh_descriptor: "MESH2" mesh_id: 2 } }
+    channels { count: 4 policy: RELAXED }
+  }
+  connections{
+    nodes { mesh { mesh_descriptor: "MESH2" mesh_id: 1 } }
+    nodes { mesh { mesh_descriptor: "MESH2" mesh_id: 2 } }
+    channels { count: 4 policy: RELAXED }
+  }
+}
+
+# --- Instantiation ----------------------------------------------------------
+top_level_instance { graph { graph_descriptor: "G0" graph_id: 0 } }
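
Reviewer note: as a sanity check on the two mesh graph descriptors above, the minimal Python sketch below verifies that each three-mesh graph is fully connected (all-to-all) and that the meshes together account for 32 devices. The mesh shapes and `mesh_id` pairs are transcribed by hand from the textproto; the names `DESCRIPTORS`, `is_all_to_all`, and `total_devices` are hypothetical and not part of tt-metal, and the sketch does not parse the descriptor files themselves.

```python
from itertools import combinations

# Mesh shapes (rows, cols) keyed by mesh_id, and inter-mesh links as
# (mesh_id, mesh_id) pairs, transcribed by hand from the two descriptors.
DESCRIPTORS = {
    "2x8 + 2x4 + 2x4": {
        "meshes": {0: (2, 4), 1: (2, 4), 2: (2, 8)},
        "connections": [(0, 1), (0, 2), (1, 2)],
    },
    "4x4 + 2x4 + 2x4": {
        "meshes": {0: (4, 4), 1: (2, 4), 2: (2, 4)},
        "connections": [(0, 1), (0, 2), (1, 2)],
    },
}


def is_all_to_all(meshes, connections):
    """Return True if every unordered pair of mesh IDs has a connection."""
    linked = {frozenset(pair) for pair in connections}
    return all(frozenset(pair) in linked for pair in combinations(meshes, 2))


def total_devices(meshes):
    """Sum rows * cols over all meshes."""
    return sum(rows * cols for rows, cols in meshes.values())


for label, desc in DESCRIPTORS.items():
    print(label, is_all_to_all(desc["meshes"], desc["connections"]),
          total_devices(desc["meshes"]))
```

Both configurations use three connections for three meshes, which is exactly the number of unordered pairs, so removing any one connection would break the all-to-all property.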

tests/tt_metal/tt_fabric/utils/generate_rank_bindings.py

Lines changed: 64 additions & 0 deletions
@@ -112,6 +112,18 @@ def generate_supported_rank_bindings():
         2: [4],
         3: [2],
     }
+    # Process Rank ID To Tray ID Mapping for 4x4 + 2x4 + 2x4 (3 mesh) configuration
+    WH_GLX_4X4_2X4_3_MESH_RANK_TO_TRAY_MAPPING = {
+        0: [1, 2],  # 4x4 mesh needs 2 trays (16 devices)
+        1: [3],  # 2x4 mesh needs 1 tray (8 devices)
+        2: [4],  # 2x4 mesh needs 1 tray (8 devices)
+    }
+    # Process Rank ID To Tray ID Mapping for 2x4 + 2x4 + 2x8 (3 mesh) configuration
+    WH_GLX_2X8_2X4_3_MESH_RANK_TO_TRAY_MAPPING = {
+        0: [1],  # 2x4 mesh needs 1 tray (8 devices)
+        1: [3],  # 2x4 mesh needs 1 tray (8 devices)
+        2: [2, 4],  # 2x8 mesh needs 2 trays (16 devices)
+    }

     # Rank bindings for Dual Mesh Setup (1 process per mesh)
     DUAL_MESH_RANK_BINDINGS = [
@@ -197,6 +209,44 @@ def generate_supported_rank_bindings():
         },
     ]

+    # Rank bindings for Tri Mesh Setup: 4x4 + 2x4 + 2x4 (1 process per mesh, 3 meshes)
+    TRI_MESH_4X4_2X4_RANK_BINDINGS = [
+        {
+            "rank": 0,
+            "mesh_id": 0,
+            "mesh_host_rank": 0,
+        },
+        {
+            "rank": 1,
+            "mesh_id": 1,
+            "mesh_host_rank": 0,
+        },
+        {
+            "rank": 2,
+            "mesh_id": 2,
+            "mesh_host_rank": 0,
+        },
+    ]
+
+    # Rank bindings for Tri Mesh Setup: 2x4 + 2x4 + 2x8 (1 process per mesh, 3 meshes)
+    TRI_MESH_2X8_2X4_RANK_BINDINGS = [
+        {
+            "rank": 0,
+            "mesh_id": 0,
+            "mesh_host_rank": 0,
+        },
+        {
+            "rank": 1,
+            "mesh_id": 1,
+            "mesh_host_rank": 0,
+        },
+        {
+            "rank": 2,
+            "mesh_id": 2,
+            "mesh_host_rank": 0,
+        },
+    ]
+
     mapping_file = "tray_to_pcie_device_mapping.yaml"
     generate_tray_to_pcie_device_mapping(mapping_file)
     with open(mapping_file, "r") as f:
@@ -231,6 +281,20 @@ def generate_supported_rank_bindings():
         "tests/tt_metal/tt_fabric/custom_mesh_descriptors/wh_galaxy_2x4_mesh_graph_descriptor.textproto",
         "2x4_multi_mesh_cyclic_rank_binding.yaml",
     )
+    generate_rank_binding_yaml(
+        tray_to_pcie_device_mapping,
+        TRI_MESH_4X4_2X4_RANK_BINDINGS,
+        WH_GLX_4X4_2X4_3_MESH_RANK_TO_TRAY_MAPPING,
+        "tests/tt_metal/tt_fabric/custom_mesh_descriptors/wh_galaxy_split_4x4_2x4_3_mesh.textproto",
+        "4x4_2x4_3_mesh_rank_binding.yaml",
+    )
+    generate_rank_binding_yaml(
+        tray_to_pcie_device_mapping,
+        TRI_MESH_2X8_2X4_RANK_BINDINGS,
+        WH_GLX_2X8_2X4_3_MESH_RANK_TO_TRAY_MAPPING,
+        "tests/tt_metal/tt_fabric/custom_mesh_descriptors/wh_galaxy_split_2x8_2x4_3_mesh.textproto",
+        "2x8_2x4_3_mesh_rank_binding.yaml",
+    )


 if __name__ == "__main__":
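
Reviewer note: the rank-to-tray mappings added in this file can be cross-checked against the mesh shapes in the two textproto descriptors. The sketch below is a standalone illustration, not code from the commit. It assumes 8 devices per tray (taken from the inline comments in the diff) and relies on the fact that both tri-mesh rank bindings bind rank N to `mesh_id` N. `DEVICES_PER_TRAY`, `MESH_DIMS`, `RANK_TO_TRAY`, and `check_mapping` are hypothetical names.

```python
DEVICES_PER_TRAY = 8  # assumption, from the "1 tray (8 devices)" comments

# Mesh shape (rows, cols) per mesh_id, transcribed from the descriptors.
MESH_DIMS = {
    "4x4_2x4": {0: (4, 4), 1: (2, 4), 2: (2, 4)},
    "2x8_2x4": {0: (2, 4), 1: (2, 4), 2: (2, 8)},
}

# Rank-to-tray mappings copied from the diff. Rank N maps to mesh_id N in
# both tri-mesh bindings, so the rank is used directly as the mesh_id here.
RANK_TO_TRAY = {
    "4x4_2x4": {0: [1, 2], 1: [3], 2: [4]},
    "2x8_2x4": {0: [1], 1: [3], 2: [2, 4]},
}


def check_mapping(mesh_dims, rank_to_tray):
    """Return the sorted tray IDs in use, or raise ValueError on a mismatch."""
    used = []
    for rank, trays in rank_to_tray.items():
        rows, cols = mesh_dims[rank]
        if len(trays) * DEVICES_PER_TRAY != rows * cols:
            raise ValueError(
                f"rank {rank}: {len(trays)} tray(s) cannot hold a {rows}x{cols} mesh"
            )
        used.extend(trays)
    if len(used) != len(set(used)):
        raise ValueError("a tray is assigned to more than one rank")
    return sorted(used)


for config in MESH_DIMS:
    print(config, check_mapping(MESH_DIMS[config], RANK_TO_TRAY[config]))
```

Under these assumptions both configurations use trays 1 through 4 exactly once, with the 16-device mesh taking two trays in each case.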
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+Tests:
+  - name: "FlowControlMultiMesh"
+    enable_flow_control: true
+    fabric_setup:
+      topology: Mesh
+
+    parametrization_params:
+      ntype: [unicast_write, atomic_inc, fused_atomic_inc, unicast_scatter_write]
+      num_links: [1, 2, 3, 4]
+      size: [1024, 2048, 4096]
+
+    defaults:
+      ftype: unicast
+      num_packets: 10000
+
+    patterns:
+      - type: all_to_all
+
+  - name: "MeshLowLatencyLooped"
+    fabric_setup:
+      topology: Mesh
+
+    top_level_iterations: 5
+
+    parametrization_params:
+      ntype: [unicast_write, atomic_inc , fused_atomic_inc, unicast_scatter_write]
+      num_links: [1, 2, 3, 4]
+
+    defaults:
+      ftype: unicast
+      size: 32
+      num_packets: 10000
+
+    patterns:
+      - type: all_to_all
+        iterations: 2
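
Reviewer note: to get a feel for the size of this test config, the sketch below counts the parameter combinations in each test. It assumes that `parametrization_params` is expanded as a Cartesian product of the listed values; that expansion logic is not shown in this commit, so treat it as an assumption. The `expand` helper is hypothetical, and the sketch ignores `top_level_iterations` and the per-pattern `iterations` field.

```python
from itertools import product


def expand(params):
    """Return one dict per combination of the listed parameter values."""
    keys = list(params)
    return [dict(zip(keys, values))
            for values in product(*(params[key] for key in keys))]


NTYPES = ["unicast_write", "atomic_inc", "fused_atomic_inc", "unicast_scatter_write"]

# Values copied from the parametrization_params blocks in the YAML above.
flow_control_multi_mesh = {
    "ntype": NTYPES,
    "num_links": [1, 2, 3, 4],
    "size": [1024, 2048, 4096],
}
mesh_low_latency_looped = {
    "ntype": NTYPES,
    "num_links": [1, 2, 3, 4],
}

print(len(expand(flow_control_multi_mesh)))  # 4 * 4 * 3 = 48
print(len(expand(mesh_low_latency_looped)))  # 4 * 4 = 16
```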
