Bug Report: Significant Audio Jitter and Low Throughput on Apple Silicon (MPS)

### Description
When running the Chroma-4B model on Apple Silicon (MPS backend), we observe significant audio stuttering caused by high inter-token latency (ITL) and inconsistent generation speed. Even with greedy search and optimized pre-loaded prompts, inference speed frequently drops to 0.2x to 0.3x real-time, which makes live conversation impossible.

### 💻 Environment
```yaml
Hardware: MacBook Pro (Mac14,5) - Apple M2 Max
OS: macOS 15.2 (26.2)
Backend: MPS (Metal Performance Shaders)
Versions:
  transformers: 5.0.0
  torch: 2.10.0
  torchaudio: 2.10.0
  torchcodec: 0.10.0
```

### ☁️ Cloud Context (Modal)

We also attempted to run the model in a cloud environment via Modal to rule out platform-specific bottlenecks.

```yaml
Hardware: NVIDIA A100 (40GB/80GB)
Memory: 32GB
Runtime: CUDA 12.6
```
Results: While throughput was higher than on MPS, we still observed inconsistent ITL, which causes audible jitter in a real-time speech-to-speech loop.

### 🔴 The Issue
When streaming Mimi tokens (80ms audio frames) for real-time speech-to-speech interaction, the model exhibits significant ITL spikes. Even with greedy search, cached speaker prompts, and `transformers==5.0.0`, throughput on MPS peaks at ~0.8x real-time and fluctuates frequently, so generation falls behind the required 80ms/frame cadence.

```
[Chroma Trace] JITTER DETECTED: Frame 2 delay=157.3ms (Tokens arriving late)
[Chroma Trace] JITTER DETECTED: Frame 3 delay=133.3ms (Tokens arriving late)
[Chroma Streamer] Frame   10: ITL= 88.8ms, Decode= 23.1ms, Speed=0.33x (SLOW)
[Chroma Streamer] Frame   40: ITL= 75.1ms, Decode= 18.4ms, Speed=0.63x (SLOW)
[Chroma Streamer] Frame   80: ITL= 79.7ms, Decode= 21.0ms, Speed=0.76x (SLOW)
[Chroma Streamer] Frame  100: ITL= 72.1ms, Decode= 17.9ms, Speed=0.80x (SLOW)
[Chroma Trace] JITTER DETECTED: Frame 113 delay=111.3ms (Tokens arriving late)
```
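For readers who want to reproduce these numbers, here is a minimal sketch of how per-frame ITL and a cumulative real-time factor could be derived from frame arrival timestamps. The streamer that produced the trace above is not shown in this report, so the function name `frame_stats`, the `FrameStat` fields, and the assumption that `Speed` is cumulative (audio produced divided by wall time since the request) are all hypothetical.

```python
from dataclasses import dataclass

FRAME_MS = 80.0  # one Mimi frame of audio, per the report


@dataclass
class FrameStat:
    index: int
    itl_ms: float  # gap since the previous frame arrived
    speed: float   # cumulative real-time factor (audio produced / wall time)
    late: bool     # True when the gap exceeds the frame budget


def frame_stats(arrivals_ms, start_ms=0.0, frame_ms=FRAME_MS):
    """Compute per-frame ITL and cumulative real-time factor.

    arrivals_ms: wall-clock arrival time of each frame, in milliseconds.
    start_ms: when generation was requested (so time to first frame counts).
    """
    stats = []
    prev = start_ms
    for i, t in enumerate(arrivals_ms, start=1):
        itl = t - prev
        elapsed = t - start_ms
        speed = (i * frame_ms) / elapsed if elapsed > 0 else float("inf")
        stats.append(FrameStat(i, itl, speed, itl > frame_ms))
        prev = t
    return stats


if __name__ == "__main__":
    # Hypothetical arrival times: a slow first frame, then a steady 75 ms cadence.
    arrivals = [400.0 + 75.0 * k for k in range(10)]
    for s in frame_stats(arrivals):
        print(f"Frame {s.index:3d}: ITL={s.itl_ms:6.1f}ms "
              f"speed={s.speed:.2f}x late={s.late}")
```

If `Speed` is indeed cumulative, a slow start alone can keep it well below 1.0x for many frames even when each individual ITL is under 80ms, which may be one reason the trace shows `Speed=0.33x` next to `ITL=88.8ms`. Per-frame ITL is therefore the more direct indicator of audible jitter.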

### 🔍 Key Findings & Audit Results
- Inference latency spikes: High-precision timing in the streamer shows that token generation time is not consistent, with spikes of 150ms or more between Mimi frames.
- MPS synchronization: There appears to be significant overhead when synchronizing between the backbone and the audio decoder layers on the Metal backend.
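One way to check the synchronization finding is to time each generation step in two parts: how long the call takes to return (dispatch) and how long the device then takes to drain its queue (wait). The sketch below is a generic helper, not code from the Chroma repository; `time_step`, `step_fn`, and `sync_fn` are hypothetical names, and the stand-in lambdas exist only so it runs without a GPU.

```python
import time


def time_step(step_fn, sync_fn):
    """Return (dispatch_ms, wait_ms, result) for one generation step.

    step_fn: runs one step and returns once the work has been enqueued.
    sync_fn: blocks until the device has finished all queued work.
    """
    t0 = time.perf_counter()
    result = step_fn()
    t1 = time.perf_counter()
    sync_fn()
    t2 = time.perf_counter()
    return (t1 - t0) * 1e3, (t2 - t1) * 1e3, result


if __name__ == "__main__":
    # Stand-ins: a fast "enqueue" and a slow "wait" of about 20 ms.
    dispatch_ms, wait_ms, _ = time_step(lambda: None, lambda: time.sleep(0.02))
    print(f"dispatch={dispatch_ms:.2f}ms wait={wait_ms:.2f}ms")
```

With PyTorch, `sync_fn` would be `torch.mps.synchronize` on Apple Silicon or `torch.cuda.synchronize` on the A100, and `step_fn` would wrap a single forward pass of the backbone or the audio decoder. A large dispatch time suggests the step itself already contains a blocking host-device sync, while a large wait time points at device-side compute. Forcing a sync on every step perturbs the timing, so this is only suitable for diagnosis.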

### ❓ Requested Guidance
- Are there specific MPS-optimized kernels recommended for the interleaved text-audio attention mechanism?
- Is there a way to reduce synchronization points in the `generate()` loop for `transformers==5.0.0` to achieve consistent <80ms ITL?

