Bug Report: Significant Audio Jitter and Low Throughput on Apple Silicon (MPS) #21

@psharrma

Description

When running the Chroma-4B model on Apple Silicon (MPS backend), we observe significant audio stuttering caused by high Inter-Token Latency (ITL) and inconsistent generation speed. Even with greedy search and optimized pre-loaded prompts, inference speed frequently drops to 0.2x to 0.3x real-time, making live conversation impossible.

💻 Environment

Hardware: MacBook Pro (Mac14,5) - Apple M2 Max
OS: macOS 15.2 (26.2)
Backend: MPS (Metal Performance Shaders)
Versions:
  transformers: 5.0.0
  torch: 2.10.0
  torchaudio: 2.10.0
  torchcodec: 0.10.0

☁️ Cloud Context (Modal)

We also attempted to run the model in a cloud environment via Modal to rule out platform-specific bottlenecks.

Hardware: NVIDIA A100 (40GB/80GB)
Memory: 32GB
Runtime: CUDA 12.6

Results: Throughput was higher than on MPS, but we still observed inconsistent inter-token latency (ITL), which causes audible jitter in a real-time speech-to-speech loop.
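
To illustrate why inconsistent ITL is audible even when average throughput looks sufficient, here is a minimal sketch of a playback loop that consumes one frame every 80 ms (the Mimi frame duration used in this report). The function name `count_underruns`, the `prebuffer_frames` parameter, and the ITL values are hypothetical and chosen for illustration; they are not measurements from either environment.

```python
def count_underruns(itl_ms, frame_ms=80.0, prebuffer_frames=0):
    """Count frames that have not arrived when the player needs them.

    itl_ms[i] is the gap (ms) between frame i-1 and frame i becoming ready.
    Playback starts once `prebuffer_frames + 1` frames have arrived and then
    consumes one frame every `frame_ms`. After an underrun the player stalls
    until the late frame arrives and resumes from that point.
    """
    # Convert inter-token latencies into absolute arrival times.
    arrivals = []
    t = 0.0
    for itl in itl_ms:
        t += itl
        arrivals.append(t)
    if not arrivals:
        return 0

    start_index = min(prebuffer_frames, len(arrivals) - 1)
    next_deadline = arrivals[start_index]  # moment playback begins
    underruns = 0
    for arrival in arrivals:
        if arrival > next_deadline:
            underruns += 1            # audible gap: frame missed its slot
            next_deadline = arrival   # stall until the frame shows up
        next_deadline += frame_ms     # the following frame is due one slot later
    return underruns


if __name__ == "__main__":
    # Illustrative values only: mean ITL is 76.5 ms, with a single 157 ms spike.
    itl = [65.0, 65.0, 157.0, 65.0, 65.0, 65.0, 65.0, 65.0]
    print(count_underruns(itl))                      # no prebuffer
    print(count_underruns(itl, prebuffer_frames=1))  # one frame held back
```

In this toy trace the mean ITL is under 80 ms, yet the single spike still produces a gap unless at least one frame is held back before playback starts. When every ITL exceeds 80 ms, no amount of prebuffering avoids gaps on a long enough stream.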

🔴 The Issue

When streaming Mimi tokens (80 ms audio frames) for real-time speech-to-speech interaction, the model exhibits significant ITL spikes. Even with greedy search, cached speaker prompts, and transformers==5.0.0, throughput on MPS peaks at ~0.8x real-time and fluctuates frequently, so generation falls behind the required 80 ms/frame cadence.

[Chroma Trace] JITTER DETECTED: Frame 2 delay=157.3ms (Tokens arriving late)
[Chroma Trace] JITTER DETECTED: Frame 3 delay=133.3ms (Tokens arriving late)
[Chroma Streamer] Frame   10: ITL= 88.8ms, Decode= 23.1ms, Speed=0.33x (SLOW)
[Chroma Streamer] Frame   40: ITL= 75.1ms, Decode= 18.4ms, Speed=0.63x (SLOW)
[Chroma Streamer] Frame   80: ITL= 79.7ms, Decode= 21.0ms, Speed=0.76x (SLOW)
[Chroma Streamer] Frame  100: ITL= 72.1ms, Decode= 17.9ms, Speed=0.80x (SLOW)
[Chroma Trace] JITTER DETECTED: Frame 113 delay=111.3ms (Tokens arriving late)
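
For anyone reproducing these numbers, the sketch below shows one way to derive per-frame ITL and a cumulative real-time factor from frame arrival timestamps. Note that the logged Speed values do not equal 80 ms / ITL (for Frame 10, 80 / 88.8 ≈ 0.90, not 0.33), so they presumably average over the whole stream including startup latency. That reading is an assumption, and `frame_stats` is a hypothetical helper, not the code that produced the trace above.

```python
FRAME_MS = 80.0  # one Mimi frame carries 80 ms of audio (from this report)


def frame_stats(arrival_ms, frame_ms=FRAME_MS):
    """Return per-frame ITL, cumulative speed, and a late flag.

    arrival_ms[i] is the wall-clock time in ms, measured from the start of
    the request, at which frame i became available. Times must be > 0.
    """
    stats = []
    prev = 0.0
    for i, t in enumerate(arrival_ms):
        itl = t - prev                     # gap since the previous frame
        audio_so_far = (i + 1) * frame_ms  # audio generated up to this frame
        speed = audio_so_far / t           # >= 1.0 means real time or faster
        stats.append({
            "frame": i,
            "itl_ms": itl,
            "speed": speed,
            "late": itl > frame_ms,        # missed the 80 ms cadence
        })
        prev = t
    return stats


if __name__ == "__main__":
    # Illustrative timestamps: 200 ms startup, then one 157.3 ms gap.
    for s in frame_stats([200.0, 280.0, 437.3, 510.0]):
        print(f"Frame {s['frame']}: ITL={s['itl_ms']:.1f}ms "
              f"Speed={s['speed']:.2f}x late={s['late']}")
```

Under this definition a long time-to-first-frame keeps the cumulative speed low for many frames even if later ITLs are close to 80 ms, which would be consistent with the Speed column climbing from 0.33x to 0.80x while ITL stays in the 72 to 89 ms range.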

🔍 Key Findings & Audit Results

Inference Latency Spikes: High-precision timing in the streamer shows that token generation time is inconsistent, with spikes of 150ms+ between Mimi frames.
MPS Synchronization: There appears to be significant overhead when synchronizing between the backbone and the audio decoder layers on the Metal backend.
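
Because MPS and CUDA execute kernels asynchronously, wall-clock timings around a single step can attribute cost to the wrong stage unless the device is synchronized before and after. The sketch below is a minimal, backend-agnostic timing helper; `timed_ms` and its `sync` parameter are hypothetical names introduced here, not part of Chroma or transformers.

```python
import time


def timed_ms(fn, sync=None):
    """Run fn() and return (result, elapsed_ms).

    sync is an optional zero-argument callable that blocks until all queued
    device work has finished. Without it, an asynchronous backend can return
    from fn() before its kernels have run, so the cost appears in a later step.
    """
    if sync is not None:
        sync()  # drain earlier work so it is not billed to fn
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for fn's own device work to complete
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


if __name__ == "__main__":
    # CPU-only demonstration; no device synchronization is needed here.
    _, ms = timed_ms(lambda: sum(range(100_000)))
    print(f"elapsed: {ms:.2f} ms")
```

On Apple Silicon the `sync` argument could be `torch.mps.synchronize`, and on the A100 `torch.cuda.synchronize`. Wrapping the backbone step and the audio decoder step separately would show which side the spikes come from. The forced synchronization itself adds stalls, so this is suited to profiling runs rather than the live streaming path.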

❓ Requested Guidance

Are there specific MPS-optimized kernels recommended for the interleaved text-audio attention mechanism?
Is there a way to reduce synchronization points in the generate() loop for transformers==5.0.0 to achieve consistent <80ms ITL?
