
Garbled audio noise / NaNs on Apple Silicon (MPS) with default float16/SDPA #172

@mike-albano


Checks

  • This template is only for bug reports, usage problems go with 'Help Wanted'.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I am using English to submit this issue to facilitate community communication.

TL;DR: My Mac mini would not generate a usable voice clone until Antigravity applied the patch below. Opening this bug to see whether the problem is known or already fixed. Happy to proceed with a PR if not.

Environment Details

  • Hardware: Apple Silicon (Mac mini M4, Mac15,7)
  • OS: macOS 15.7.7
  • Python Version: 3.9.6
  • PyTorch Version: 2.8.0
  • Model: k2-fsa/OmniVoice

When launching OmniVoice on Apple Silicon using the MPS backend (device_map="mps"), the generated audio consists entirely of garbled, repetitive noise (often starting with a brief high-pitched tone followed by static or silence).

Cause
The issue is triggered by two factors when running on MPS:

  • Numerical stability in float16: Multi-head attention layers using float16 can experience numerical overflows/NaNs under MPS during softmax computation when processing fully padded/masked rows.
  • SDPA kernel limitations on MPS: PyTorch's Scaled Dot Product Attention (SDPA) backend on MPS doesn't gracefully handle the attention masks used in the model under float16 precision, causing output corruption.

Setting device_map="cpu" avoids this issue but is extremely slow (~2.5 minutes per generation, compared to ~30 seconds on MPS).
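
The fully masked row case can be illustrated without MPS at all. The following is a minimal pure-Python sketch of the arithmetic only, not the actual MPS or SDPA kernel; `masked_softmax` is a hypothetical helper written for this illustration. When every position in a row is masked to -inf, the max-subtraction step computes -inf - (-inf), which is NaN, and the NaN then spreads through the whole row.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over one attention row; positions with mask=False are set to -inf."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    row_max = max(masked)
    # For a fully masked row, row_max is -inf and (-inf) - (-inf) evaluates to NaN.
    exps = [math.exp(s - row_max) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# One masked position: the row is still a valid probability distribution.
partly_masked = masked_softmax([1.0, 2.0, 3.0], [True, True, False])

# Every position masked (e.g. a fully padded row): every entry becomes NaN.
fully_masked = masked_softmax([1.0, 2.0, 3.0], [False, False, False])

print(partly_masked)
print(fully_masked)
```

Whether this exact mechanism is what happens inside the MPS kernels is an assumption based on the diagnosis above; the sketch only shows why fully padded rows are a plausible source of NaNs.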

Proposed Fix
To resolve this, we can catch when device_map is configured for MPS and automatically:

  • Fall back to standard "eager" attention (bypassing the problematic SDPA kernels on MPS).
  • Use torch.float32 precision instead of float16 to prevent softmax overflows.

Here is the patch applied to omnivoice/models/omnivoice.py, inside OmniVoice.from_pretrained, that resolves the issue:

diff --git a/omnivoice/models/omnivoice.py b/omnivoice/models/omnivoice.py
index 4d78f8a..7f73ce7 100644
--- a/omnivoice/models/omnivoice.py
+++ b/omnivoice/models/omnivoice.py
@@ -256,6 +256,13 @@ class OmniVoice(PreTrainedModel):
             # Resolve to local path first; download only if not already cached
             resolved_path = _resolve_model_path(pretrained_model_name_or_path)
 
+            # Workaround for MPS garbled noise / NaNs with SDPA or float16
+            dev_map = str(kwargs.get("device_map", ""))
+            if dev_map.startswith("mps"):
+                kwargs.setdefault("attn_implementation", "eager")
+                # Force float32 on MPS, as float16 softmax overflows cause garbled noise
+                kwargs["dtype"] = torch.float32
+
             model = super().from_pretrained(resolved_path, *args, **kwargs)
 
             if not train_mode:
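
For anyone who wants to reason about the patch in isolation, the kwarg handling can be sketched as a standalone function. This is an illustration under assumptions, not the project's code: `apply_mps_workaround` is a hypothetical name, and the dtype is passed as a placeholder value (the string "float32") so the sketch runs without PyTorch; the actual patch sets `torch.float32`.

```python
def apply_mps_workaround(kwargs, float32_dtype="float32"):
    """Return a copy of from_pretrained kwargs adjusted for an MPS device_map.

    float32_dtype is a stand-in so this sketch has no torch dependency;
    the patch in this issue uses torch.float32 here.
    """
    patched = dict(kwargs)
    dev_map = str(patched.get("device_map", ""))
    if dev_map.startswith("mps"):
        # Respect an explicit attn_implementation, otherwise avoid SDPA.
        patched.setdefault("attn_implementation", "eager")
        # Always force float32, mirroring the patch.
        patched["dtype"] = float32_dtype
    return patched


print(apply_mps_workaround({"device_map": "mps"}))
print(apply_mps_workaround({"device_map": "cpu"}))
```

Note the asymmetry: `attn_implementation` is only a default, while the dtype is overridden unconditionally. Like the patch, this only recognizes a string `device_map`; a dict-valued `device_map` would not be detected.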

Steps to Reproduce

Run without --device cpu (so that MPS is used) to reproduce the garbled noise output:
omnivoice-demo --ip 0.0.0.0 --port 7860

Run with CPU to produce normal audio:
omnivoice-demo --device cpu --ip 0.0.0.0 --port 7860

With the patch above applied, MPS can be used, which dramatically increases generation speed.

✔️ Expected Behavior

Working voice clone audio.

❌ Actual Behavior

Garbled noise audio.
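
To tell NaN-corrupted output apart from audio that is merely poor quality, a check along these lines could be run on the generated samples. This is a small sketch; `count_nonfinite` is a hypothetical helper, and `samples` is assumed to be a flat sequence of Python floats (for example, a waveform converted to a list).

```python
import math


def count_nonfinite(samples):
    """Count NaN and +/-inf values in a flat sequence of audio samples."""
    return sum(1 for x in samples if not math.isfinite(x))


clean = [0.0, 0.25, -0.5, 0.125]
corrupted = [0.0, float("nan"), float("inf"), -0.5]

print(count_nonfinite(clean))
print(count_nonfinite(corrupted))
```

A non-zero count would point to NaN or overflow corruption in the waveform rather than an ordinary quality problem.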
