Add support for tensor transfers in eager to allow for multi-device execution by sogartar · Pull Request #1157 · iree-org/iree-turbine

sogartar · 2025-09-19T21:53:51Z

Currently, we only issue device transfer ops when exporting. This change introduces a new export-device-affinity-to-torch-device configuration map that lets us perform actual transfers in eager mode.

```
t = torch.tensor([1, 2], device="cuda:2")
with IreeDeviceAffinityToTorchDevice({
    DeviceAffinity(0): torch.device("cuda:2"),
    DeviceAffinity(1): torch.device("cuda:3"),
}):
    t2 = transfer_to_logical_device("1", t)  # transfer to cuda:3
    t3 = transfer_to_logical_device("0", t2)  # transfer back to cuda:2
```

Signed-off-by: Boian Petkantchin <boian.petkantchin@amd.com>

sogartar · 2025-09-19T22:01:58Z

+################################################################################
+# IREE device affinity to torch device map
+################################################################################


I am not sure whether I should move this section somewhere else, for example to iree/turbine/runtime/device.py.

rsuderman

This looks like a solution in search of a problem. Device affinity is predominantly used when tracing and will essentially be disconnected from device-level tracing. E.g., tracing for 8 devices could be done on a single CPU torch instance, as devices are not required at tracing time.

Requiring that the device affinity and tensor placements be aligned and correct will likely just generate more upkeep for a feature that is not needed.

rsuderman · 2025-09-29T22:16:11Z

 ]


+@dataclass(frozen=True)


No, device affinity is not a dataclass. You should only use this annotation when the type is struct-like.

Why is DeviceTensorTrait below a dataclass?

@dataclass
class DeviceTensorTrait:

It is pretty much the same thing.
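The thread does not say why `frozen=True` was added. One plausible reason, offered here only as an assumption, is that the affinity object is used as a dictionary key in the new map. The sketch below shows standard Python behavior with a hypothetical `Affinity` class: a frozen dataclass gets a generated `__eq__` and `__hash__`, so equal instances find the same dictionary entry, and its fields cannot be reassigned.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Affinity:
    """Hypothetical value type; not the DeviceAffinity class from the PR."""

    ordinal: int


# Frozen dataclasses are hashable, so instances can serve as dict keys.
placement = {Affinity(0): "cuda:2", Affinity(1): "cuda:3"}

# A fresh instance with equal fields looks up the same entry.
looked_up = placement[Affinity(1)]

# Assigning to a field of a frozen instance raises FrozenInstanceError.
try:
    Affinity(0).ordinal = 5
    mutated = True
except FrozenInstanceError:
    mutated = False
```

A class with hand-written `__eq__` and `__hash__` methods would work as a key just as well, so hashability alone does not settle the reviewer's point about whether the type is struct-like.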

sogartar · 2025-09-30T14:46:06Z

This looks like a solution in search of a problem.

@rsuderman, it is true that the Llama 405b f4 model currently fits on a single Mi355 instance, but we previously had models that we wanted to run eagerly and could not. We may also want to run the f16 variant, which would not fit. It is likely that when the next big model arrives, we will not be able to fit it on a single GPU. This feature is not about tracing but about running eagerly.

How do you suggest we enable running our models on multiple devices eagerly?

sogartar requested review from Alex-Vasile, IanNod, dan-garvey and rsuderman September 19, 2025 22:02

Alex-Vasile mentioned this pull request Sep 26, 2025

[sharktank] Add tests and fixes for pipeline-parallel Llama 405B and Toy f4 vs non-parallel nod-ai/amd-shark-ai#2353

Merged

rsuderman requested changes Sep 29, 2025

sogartar requested a review from rsuderman September 30, 2025 14:46

sogartar wants to merge 1 commit into main from users/sogartar/eager-tensor-transfer

2 participants
