Garbled audio noise / NaNs on Apple Silicon (MPS) with default float16/SDPA #172

### Checks

- [x] This template is only for bug reports, usage problems go with 'Help Wanted'.
- [x] I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
- [x] I have searched for existing issues, including closed ones, and couldn't find a solution.
- [x] I am using English to submit this issue to facilitate community communication.

### Environment Details

TL;DR: My Mac mini would not generate usable voice-clone audio until Antigravity applied the patch below. Opening this bug to find out whether the problem is already known or fixed. Happy to submit a PR if not.

Environment Details
- Hardware: Apple Silicon (Mac mini M4, Mac15,7)
- OS: macOS 15.7.7
- Python Version: 3.9.6
- PyTorch Version: 2.8.0
- Model: k2-fsa/OmniVoice

When launching OmniVoice on Apple Silicon using the MPS backend (device_map="mps"), the generated audio consists entirely of garbled, repetitive noise (often starting with a brief high-pitched tone followed by static or silence).

Cause
The issue is triggered by two factors when running on MPS:

- Numerical stability in float16: Multi-head attention layers using float16 can experience numerical overflows/NaNs under MPS during softmax computation when processing fully padded/masked rows.
- SDPA kernel limitations on MPS: PyTorch's Scaled Dot Product Attention (SDPA) backend on MPS doesn't gracefully handle the attention masks used in the model under float16 precision, causing output corruption.

Setting `device_map="cpu"` avoids this issue but is extremely slow (~2.5 minutes for generation compared to ~30 seconds on MPS).
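
To illustrate the first factor, here is a minimal pure-Python sketch of how a softmax over a fully masked row produces NaNs. It shows generic IEEE floating-point behavior, not a trace of PyTorch's MPS kernels, so attributing the garbled audio to this mechanism remains my hypothesis. The function names `softmax` and `masked_softmax` are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax: subtract the row max before exponentiating."""
    m = max(row)
    # For a fully masked row, m is -inf and (-inf) - (-inf) evaluates to NaN.
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(row):
    """Hypothetical guard: return all zeros when every position is masked."""
    if all(x == -math.inf for x in row):
        return [0.0] * len(row)
    return softmax(row)


partially_masked = [0.0, -math.inf]      # one real token, one padded position
fully_masked = [-math.inf, -math.inf]    # every position padded

print(softmax(partially_masked))         # finite probabilities
print(softmax(fully_masked))             # every entry is NaN
print(masked_softmax(fully_masked))      # guarded variant stays finite
```

Once a single row is NaN, the value propagates through every later layer, which would be consistent with the output being noise from start to finish. float16 additionally has a much smaller finite range than float32 (its largest finite value is 65504), so overflow is easier to hit there.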

Proposed Fix
To resolve this, we can detect when `device_map` is configured for MPS and automatically:

- Fall back to standard "eager" attention (bypassing the problematic SDPA kernels on MPS).
- Use `torch.float32` precision instead of float16 to prevent softmax overflows.

Here is the patch applied to `omnivoice/models/omnivoice.py` inside `OmniVoice.from_pretrained` that resolves the issue:
```diff
diff --git a/omnivoice/models/omnivoice.py b/omnivoice/models/omnivoice.py
index 4d78f8a..7f73ce7 100644
--- a/omnivoice/models/omnivoice.py
+++ b/omnivoice/models/omnivoice.py
@@ -256,6 +256,13 @@ class OmniVoice(PreTrainedModel):
             # Resolve to local path first; download only if not already cached
             resolved_path = _resolve_model_path(pretrained_model_name_or_path)
 
+            # Workaround for MPS garbled noise / NaNs with SDPA or float16
+            dev_map = str(kwargs.get("device_map", ""))
+            if dev_map.startswith("mps") or dev_map == "mps":
+                kwargs.setdefault("attn_implementation", "eager")
+                # Force float32 on MPS, as float16 softmax overflows cause garbled noise
+                kwargs["dtype"] = torch.float32
+
             model = super().from_pretrained(resolved_path, *args, **kwargs)
 
             if not train_mode:
```
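
One possible gap in the check above: `str(kwargs.get("device_map", ""))` only matches string-like values, so a dict-style map such as `{"": "mps"}` (if the loader is given one) would not trigger the workaround. A rough sketch of a more tolerant check follows. `targets_mps` and `mps_safe_kwargs` are hypothetical helper names, not existing OmniVoice functions, and `float32_dtype` is a stand-in for `torch.float32` so the sketch runs without PyTorch installed.

```python
def targets_mps(device_map):
    """Return True if a device_map value appears to place anything on MPS."""
    if device_map is None:
        return False
    if isinstance(device_map, dict):
        # Dict form maps module names to devices; check every target.
        return any(targets_mps(v) for v in device_map.values())
    # Covers "mps", "mps:0", and device objects whose str() starts with "mps".
    return str(device_map).startswith("mps")


def mps_safe_kwargs(kwargs, float32_dtype):
    """Return a copy of kwargs with the MPS workaround from the patch applied."""
    out = dict(kwargs)
    if targets_mps(out.get("device_map")):
        # Respect an explicit caller choice of attention implementation.
        out.setdefault("attn_implementation", "eager")
        # Force full precision, mirroring the patch.
        out["dtype"] = float32_dtype
    return out


# Usage sketch: in the real code the second argument would be torch.float32.
print(mps_safe_kwargs({"device_map": {"": "mps"}}, "float32"))
print(mps_safe_kwargs({"device_map": "cpu"}, "float32"))
```

Like the patch, this overrides any dtype the caller passed when MPS is targeted. Whether that should be a hard override or a default is a design choice I'd leave to the maintainers.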


### Steps to Reproduce

Run on the default device (without `--device cpu`) to reproduce the garbled noise output:
`omnivoice-demo --ip 0.0.0.0 --port 7860`

Run with CPU to produce normal audio:
`omnivoice-demo --device cpu --ip 0.0.0.0 --port 7860`

With the above patch applied, MPS can be used, dramatically increasing render speed.

### ✔️ Expected Behavior

Working voice clone audio.

### ❌ Actual Behavior

Garbled noise audio.
