TL;DR: My Mac mini would not generate a usable voice clone until Antigravity applied the patch below. I'm opening this bug to see whether the problem is already known or fixed. Happy to follow up with a PR if not.
Environment Details
- Hardware: Apple Silicon (Mac mini M4, Mac15,7)
- OS: macOS 15.7.7
- Python Version: 3.9.6
- PyTorch Version: 2.8.0
- Model: k2-fsa/OmniVoice
When launching OmniVoice on Apple Silicon using the MPS backend (device_map="mps"), the generated audio consists entirely of garbled, repetitive noise (often starting with a brief high-pitched tone followed by static or silence).
Cause
The issue is triggered by two factors when running on MPS:
1. Numerical stability in float16: Multi-head attention layers using float16 can experience numerical overflows/NaNs under MPS during softmax computation when processing fully padded/masked rows.
2. SDPA kernel limitations on MPS: PyTorch's Scaled Dot Product Attention (SDPA) backend on MPS doesn't gracefully handle the attention masks used in the model under float16 precision, causing output corruption.
Setting device_map="cpu" avoids this issue but is extremely slow (~2.5 minutes for generation compared to ~30 seconds on MPS).
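To illustrate the first factor, here is a minimal pure-Python sketch of how a fully masked attention row turns into NaNs in softmax. It only demonstrates the general mechanism (a row where every score is -inf); it is not taken from the OmniVoice or PyTorch code, and I have not confirmed that this is exactly what happens inside the MPS kernel.

```python
import math

def softmax(row):
    """Numerically stabilised softmax over a list of floats."""
    peak = max(row)
    # Subtracting the row maximum is the usual stabilisation step.
    # When every entry is -inf, (-inf) - (-inf) evaluates to NaN.
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

NEG_INF = float("-inf")

# One real score plus masked positions: masked entries get zero weight.
partially_masked = softmax([0.5, NEG_INF, NEG_INF])

# Every position masked (e.g. a fully padded row): all weights become NaN.
fully_masked = softmax([NEG_INF, NEG_INF, NEG_INF])

print(partially_masked)
print(fully_masked)
```

Once a row of attention weights is NaN, the NaNs propagate through the following layers, which would be consistent with the garbled output described above.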
Proposed Fix
To resolve this, we can catch when device_map is configured for MPS and automatically:
1. Fall back to standard "eager" attention (bypassing the problematic SDPA kernels on MPS).
2. Use torch.float32 precision instead of float16 to prevent softmax overflows.
Here is the patch applied to omnivoice/models/omnivoice.py inside OmniVoice.from_pretrained that resolves the issue:
diff --git a/omnivoice/models/omnivoice.py b/omnivoice/models/omnivoice.py
index 4d78f8a..7f73ce7 100644
--- a/omnivoice/models/omnivoice.py
+++ b/omnivoice/models/omnivoice.py
@@ -256,6 +256,13 @@ class OmniVoice(PreTrainedModel):
         # Resolve to local path first; download only if not already cached
         resolved_path = _resolve_model_path(pretrained_model_name_or_path)
+        # Workaround for MPS garbled noise / NaNs with SDPA or float16
+        dev_map = str(kwargs.get("device_map", ""))
+        if dev_map.startswith("mps") or dev_map == "mps":
+            kwargs.setdefault("attn_implementation", "eager")
+            # Force float32 on MPS, as float16 softmax overflows cause garbled noise
+            kwargs["dtype"] = torch.float32
+
         model = super().from_pretrained(resolved_path, *args, **kwargs)
         if not train_mode:
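For discussion, the same logic can be sketched as a standalone helper that is easy to test without loading a model. The name apply_mps_workaround and the float32_dtype parameter are hypothetical (in the real code the dtype would be torch.float32). The handling of a dict-valued device_map is my own assumption and goes beyond the patch above, which only matches string values starting with "mps".

```python
def apply_mps_workaround(kwargs, float32_dtype):
    """Return a copy of kwargs adjusted for MPS (hypothetical helper).

    float32_dtype stands in for torch.float32 so this sketch runs
    without PyTorch installed.
    """
    adjusted = dict(kwargs)
    device_map = adjusted.get("device_map", "")

    # Assumption beyond the patch: device_map may also be a dict that
    # maps module names to devices, so check every target device.
    if isinstance(device_map, dict):
        targets = list(device_map.values())
    else:
        targets = [device_map]

    if any(str(target).startswith("mps") for target in targets):
        # Respect an attention implementation the caller chose explicitly.
        adjusted.setdefault("attn_implementation", "eager")
        # Always override the dtype, mirroring the patch.
        adjusted["dtype"] = float32_dtype
    return adjusted


mps_kwargs = apply_mps_workaround({"device_map": "mps"}, "float32")
cpu_kwargs = apply_mps_workaround({"device_map": "cpu"}, "float32")
print(mps_kwargs)
print(cpu_kwargs)
```

Note the asymmetry, which matches the patch: attn_implementation uses setdefault, so a caller can still opt back into SDPA, while dtype is overwritten unconditionally. Whether a user-supplied dtype should be respected on MPS is something a PR would need to decide.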
Steps to Reproduce
Run without the --device cpu flag to reproduce the garbled noise output:
omnivoice-demo --ip 0.0.0.0 --port 7860
Run with --device cpu to produce normal audio:
omnivoice-demo --device cpu --ip 0.0.0.0 --port 7860
With the patch above applied, MPS can be used, which dramatically increases render speed.
✔️ Expected Behavior
Working voice clone audio.
❌ Actual Behavior
Garbled noise audio.