Fallback to standard mesh on neuron backend for incompatible multi-granule meshes #1146

apoorvtintin · 2025-05-01T22:40:47Z

Allow fallback to a standard mesh for incompatible multi-granule meshes on the neuron backend, as the standard mesh provides better performance on TRN2.

Added corresponding tests for fallback and mesh creation for TRN2.
Switch to a new mesh for neuron-(trn2|trn2n).48xlarge-64 on 70B-FujiV2 as it provides better scale-out performance.

apoorvtintin · 2025-05-13T22:40:54Z

Rebased the PR

ruomingp · 2025-05-21T13:27:16Z

axlearn/common/utils.py

    if (
-        device_platform == "gpu"
+        device_platform in ("gpu", "neuron")


Suggested change

-        device_platform in ("gpu", "neuron")
+        device_platform != "tpu"

apoorvtintin · 2025-05-28

Made this change, thank you!
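For readers comparing the two versions of the platform check, here is a minimal sketch of how they differ. The helper names are hypothetical and not part of axlearn, and this covers only the platform part of the condition, not the multi-granule compatibility test that accompanies it in `create_device_mesh`.

```python
# Illustrative sketch only: these helpers are hypothetical, not axlearn APIs.


def falls_back_original(device_platform: str) -> bool:
    """Platform check as first proposed in this PR."""
    return device_platform in ("gpu", "neuron")


def falls_back_suggested(device_platform: str) -> bool:
    """Platform check after the review suggestion."""
    return device_platform != "tpu"


# The two checks agree on tpu, gpu and neuron, and differ on any other
# platform string, such as "cpu".
for platform in ("tpu", "gpu", "neuron", "cpu"):
    print(platform, falls_back_original(platform), falls_back_suggested(platform))
```

The suggested form treats TPU as the only platform that is excluded from the fallback, so a backend added later would be covered without touching the condition again.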

ruomingp · 2025-05-21T13:27:52Z

axlearn/common/utils.py

@@ -1743,13 +1743,15 @@ def create_device_mesh(
    assert num_devices % num_granules == 0, "Number of devices should divide number of granules."
    num_devices_per_granule = num_devices // num_granules

-    # Fallback to a standard mesh if on GPU with incompatible multi-granule mesh.
+    # Fallback to a standard mesh if on GPU or neuron with incompatible multi-granule mesh.


Suggested change

-    # Fallback to a standard mesh if on GPU or neuron with incompatible multi-granule mesh.
+    # Fallback to a standard mesh with incompatible multi-granule mesh if not on TPU.

Made this change, thank you!
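The granule arithmetic and the fallback decision in this hunk can be sketched roughly as follows. This is not the axlearn implementation: both function names are hypothetical, the device counts are made-up numbers, and the compatibility criterion itself is not shown in this PR excerpt, so it is passed in as a plain boolean.

```python
# Illustrative sketch only: names and numbers here are assumptions,
# not taken from axlearn/common/utils.py.


def devices_per_granule(num_devices: int, num_granules: int) -> int:
    """Split devices evenly across granules, mirroring the assert in the hunk."""
    if num_devices % num_granules != 0:
        raise ValueError("Number of granules should divide number of devices.")
    return num_devices // num_granules


def use_standard_mesh(
    device_platform: str,
    num_granules: int,
    mesh_is_granule_compatible: bool,
) -> bool:
    """Fall back to a standard mesh off TPU for an incompatible multi-granule mesh.

    `mesh_is_granule_compatible` stands in for whatever compatibility test
    the real code applies; that test is an unknown here.
    """
    return (
        device_platform != "tpu"
        and num_granules > 1
        and not mesh_is_granule_compatible
    )


print(devices_per_granule(64, 4))
print(use_standard_mesh("neuron", 4, mesh_is_granule_compatible=False))
print(use_standard_mesh("tpu", 4, mesh_is_granule_compatible=False))
```

Keeping the compatibility test as an input separates the platform decision discussed in this review from the mesh-shape logic, which this excerpt does not show.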


apoorvtintin requested review from ruomingp, markblee and a team as code owners on May 1, 2025 22:40

apoorvtintin force-pushed the neuron_mesh branch from 4793e44 to 5b17410 on May 13, 2025 22:40

apoorvtintin force-pushed the neuron_mesh branch from 5b17410 to 368db63 on May 19, 2025 18:25

ruomingp approved these changes on May 21, 2025

Build standard mesh for neuron backend

b6f2181

- Switch to a new mesh for neuron-(trn2|trn2n).48xlarge-64 with better scale-out performance.

apoorvtintin force-pushed the neuron_mesh branch from 368db63 to b6f2181 on May 28, 2025 05:55
