fix(torchx): better mps support #1652
Conversation
@josevalim, thoughts on the f64 issue mentioned in the PR description?
torchx/lib/torchx/backend.ex (Outdated)
    @impl true
    def optional(function_name, args, default_impl) do
      # For MPS device, some linear algebra operations are not supported
      # Delegate to default implementation which will fall back to BinaryBackend
Binary backend? Or do you mean the default implementation?
Yeah, this was the LLM getting confused. This falls back to elementary Nx operations, not the binary backend.
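For illustration, the delegation could look roughly like the fragment below. This is a sketch only: the op list, the `on_mps?/1` helper, and the dispatch details are assumptions, not the code in this PR.

```elixir
# Sketch: on MPS, skip native kernels the device does not provide and let
# Nx's default implementation compose the op from elementary operations.
@mps_unsupported_ops [:solve, :logsumexp]

@impl true
def optional(op, args, default_impl) do
  cond do
    op in @mps_unsupported_ops and on_mps?(args) ->
      # Falls back to elementary Nx operations, not the BinaryBackend.
      apply(default_impl, args)

    function_exported?(__MODULE__, op, length(args)) ->
      # Use this backend's own implementation when one exists.
      apply(__MODULE__, op, args)

    true ->
      apply(default_impl, args)
  end
end

# Hypothetical helper: checks whether any tensor argument lives on MPS.
defp on_mps?(args) do
  Enum.any?(args, fn
    %Nx.Tensor{data: %Torchx.Backend{ref: {device, _ref}}} -> device == :mps
    _other -> false
  end)
end
```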
torchx/lib/torchx.ex (Outdated)
    target_device_struct = torch_device!(user_device, index)

    tensor_to_move =
      if user_device == :mps do
Agreed, we should just let it raise on f64.
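A sketch of the "let it raise" alternative; `tensor_type/1` is a hypothetical helper standing in for however the Nx type is read at this point, and the surrounding names mirror the diff above:

```elixir
target_device_struct = torch_device!(user_device, index)

# Refuse to transfer types MPS cannot represent instead of silently
# downgrading them. `tensor_type/1` is a placeholder for the real lookup.
if user_device == :mps and tensor_type(tensor) in [{:f, 64}, {:c, 128}] do
  raise ArgumentError,
        "MPS does not support 64-bit floats or 128-bit complex numbers; " <>
          "cast the tensor to {:f, 32} or {:c, 64} before transferring"
end
```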
torchx/lib/torchx/backend.ex (Outdated)
    device = device_option(backend_options)
    torch_type = to_torch_type(type, device)

    # Handle type downgrading for MPS - need to convert binary data format
Just let it raise here too. Convert the tests to f32 if necessary (I think we changed the overall defaults to f32 a long time ago).
f32 was always the default. The vast majority of the tests we had were actually for-generated, so it was very easy to get them all green.
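For reference, the raise could live directly in the type mapping; a minimal sketch, assuming the two-arity to_torch_type(type, device) shown in the diff above and an existing one-arity clause for the general case:

```elixir
# Sketch: raise instead of downgrading unsupported types on MPS.
defp to_torch_type({:f, 64}, :mps) do
  raise ArgumentError, "f64 is not supported on the MPS device; use {:f, 32} instead"
end

defp to_torch_type({:c, 128}, :mps) do
  raise ArgumentError, "c128 is not supported on the MPS device; use {:c, 64} instead"
end

# Everything else maps exactly as before (assumed one-arity clause).
defp to_torch_type(type, _device), do: to_torch_type(type)
```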
torchx/test/torchx/nx_test.exs (Outdated)
    # TODO: MPS uses different rounding rules (half-to-even vs half-away-from-zero)
    # Need to investigate if this can be fixed or if tests need to account for it
    @tag :skip_on_mps
We shouldn't call it :skip_on_mps, but rather :round_up, :requires_f64, etc., and then exclude those when the device is :mps.
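To illustrate the suggestion (a sketch; the tag name and test body are just examples, and the exclusion side is sketched after the test_helper thread below):

```elixir
# The tag describes *why* the test cannot run on MPS, not the device itself.
@tag :requires_f64
test "keeps double precision through a dot product" do
  t = Nx.tensor([[1.0, 2.0], [3.0, 4.0]], type: {:f, 64})
  assert Nx.type(Nx.dot(t, t)) == {:f, 64}
end
```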
torchx/test/test_helper.exs (Outdated)
    # Tests must run synchronously to avoid GPU framework crashes
    mps_opts =
      if device_is_mps do
        [max_cases: 1]
@josevalim do you think we should try to add some device lock mechanism to torchx?
Ideally we would need to understand why it happens. Are the failures due to mutation or just the lack of queueing in the device itself? What does PyTorch do?
The errors I was getting were signaling something about "device in use", so I don't think it's mutation.
I'll see if I can find out what PyTorch does.
I'm not getting the errors anymore, so maybe we can kick this can down the road.
I did find that there are a few mechanisms we could use, but generally MPS assumes a single command queue per OS process.
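For context, the synchronous-test workaround from the diff above, combined with the tag exclusion suggested earlier, would boil down to something like this in test_helper.exs (a sketch; how the device is detected, e.g. via a TORCHX_DEVICE environment variable, is an assumption):

```elixir
# Sketch of test_helper.exs: run cases one at a time on MPS (single command
# queue per OS process) and exclude tests the device cannot satisfy.
device_is_mps = System.get_env("TORCHX_DEVICE") == "mps"

opts =
  if device_is_mps do
    [max_cases: 1, exclude: [:requires_f64, :round_up]]
  else
    []
  end

ExUnit.start(opts)
```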
closes #679
closes #1608
This PR currently masks f64 as f32; however, I think it would be better if we instead just raised whenever f64/c128 shows up.