Summary
- Add support for passing an in-memory waveform to STT generate(…) in addition to the current filepath interface.
 - This removes the need to write temporary WAV files for each transcription chunk.
 
Current Behavior
- Parakeet STT models (e.g., mlx-community/parakeet-tdt_ctc-110m) are invoked via generate(path_to_audio).
 - This requires a disk-backed temp WAV per VAD segment.
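
For context, the per-segment workaround described above might look roughly like the sketch below. The helper name `write_temp_wav` is hypothetical and not part of mlx-audio; it assumes mono float32 input in [-1, 1] and uses only the standard-library `wave` module plus NumPy. The `model.generate(...)` call is left as a comment so the snippet runs without a model.

```python
import os
import tempfile
import wave

import numpy as np


def write_temp_wav(audio: np.ndarray, sample_rate: int) -> str:
    """Write a mono float32 waveform in [-1, 1] to a 16-bit PCM temp WAV.

    Hypothetical helper illustrating the disk round trip; the caller must
    delete the returned path.
    """
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767.0).astype("<i2")
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # wave.open reopens the file by path
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm16.tobytes())
    return path


segment = np.zeros(16000, dtype=np.float32)  # stand-in for one VAD segment
wav_path = write_temp_wav(segment, 16000)
try:
    # result = model.generate(wav_path)  # the existing path-based call
    print(os.path.getsize(wav_path), "bytes written for one segment")
finally:
    os.remove(wav_path)  # cleanup is needed after every segment
```

Every segment pays for a float-to-PCM conversion, a file write, a file read inside the model, and a delete, even though the samples were already in memory.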
 
Proposal
- Accept NumPy arrays directly:
  - Signature: result = model.generate(audio: np.ndarray, sample_rate: int, …)
  - Array format: mono or interleaved stereo, dtype=float32, range [-1.0, 1.0]
  - Optional alias: sr in place of sample_rate
 
 
Details
- Input validation:
  - Reject unsupported dtypes with a clear error (suggest converting to float32).
  - Require sample_rate when audio is passed as an array.

- Channel handling:
  - If 2D interleaved arrays are supported, document shape expectations.
  - If only mono is supported initially, downmix or raise a clear error.

- Return shape:
  - Keep current return type (AlignedResult or equivalent) unchanged.
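
To make the validation and channel-handling points concrete, here is a minimal sketch of what such a check could look like. `prepare_waveform` is a hypothetical helper, not an existing mlx-audio function, and the `(num_samples, num_channels)` layout for 2D input is an assumption (the layout question is still open below).

```python
from typing import Optional

import numpy as np


def prepare_waveform(audio: np.ndarray, sample_rate: Optional[int]) -> np.ndarray:
    """Validate an in-memory waveform and return it as mono float32.

    Hypothetical helper; names and error messages are illustrative only.
    """
    if sample_rate is None or sample_rate <= 0:
        raise ValueError("sample_rate (a positive int) is required for array input")
    if not isinstance(audio, np.ndarray):
        raise TypeError(f"expected np.ndarray, got {type(audio).__name__}")
    if audio.dtype != np.float32:
        raise TypeError(
            f"unsupported dtype {audio.dtype}; convert with audio.astype(np.float32)"
        )
    if audio.ndim == 2:
        # Assumed layout: (num_samples, num_channels); average channels to mono.
        audio = audio.mean(axis=1)
    elif audio.ndim != 1:
        raise ValueError(f"expected a 1D or 2D array, got shape {audio.shape}")
    if audio.size == 0:
        raise ValueError("audio is empty")
    return audio


stereo = np.ones((16000, 2), dtype=np.float32) * 0.1
mono = prepare_waveform(stereo, 16000)
print(mono.shape, mono.dtype)
```

Raising instead of silently casting keeps the contract explicit; a maintainer might reasonably prefer to cast integer PCM automatically, which this sketch does not attempt.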
 
 
Backwards Compatibility
- Preserve existing path-based signature: generate(path: str, …) continues to work.
 - Path-or-array dispatch: choose behavior based on argument type to avoid breaking downstream integrators.
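
The dispatch could be sketched along the following lines. `classify_input` and `resolve_sample_rate` are hypothetical names used only for illustration; the real generate(…) internals are not shown in this issue, so this only demonstrates the type-based branching and the `sr` alias handling.

```python
import os
from typing import Optional, Union

import numpy as np


def resolve_sample_rate(sample_rate: Optional[int], sr: Optional[int]) -> Optional[int]:
    """Merge `sample_rate` with its proposed `sr` alias (hypothetical helper)."""
    if sample_rate is not None and sr is not None and sample_rate != sr:
        raise ValueError("sample_rate and sr disagree; pass only one")
    return sample_rate if sample_rate is not None else sr


def classify_input(
    audio: Union[str, os.PathLike, np.ndarray],
    sample_rate: Optional[int] = None,
    sr: Optional[int] = None,
) -> tuple:
    """Return ("path", None) or ("array", rate) depending on the argument type."""
    rate = resolve_sample_rate(sample_rate, sr)
    if isinstance(audio, (str, os.PathLike)):
        # Existing behaviour: the sample rate comes from the file itself.
        return ("path", None)
    if isinstance(audio, np.ndarray):
        if rate is None:
            raise ValueError("sample_rate is required when audio is an array")
        return ("array", rate)
    raise TypeError(f"expected a path or np.ndarray, got {type(audio).__name__}")


print(classify_input("segment.wav"))
print(classify_input(np.zeros(16000, dtype=np.float32), sr=16000))
```

One caveat under this design: if downstream callers pass the path by keyword, the existing parameter name would also need to keep working, not just the positional form.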
 
Rationale
- Eliminates persistent or per-call temp files, reducing latency and both CPU and I/O overhead.
 - Avoids filesystem churn and potential sandbox/path issues.
 - Enables tighter integration with real-time VAD segmenters where audio is already in memory.
 
Example
import numpy as np
from mlx_audio.stt.generate import load_model
model = load_model("mlx-community/parakeet-tdt_ctc-110m", lazy=True)
# 16 kHz mono float32 in [-1, 1]
audio = np.random.randn(16000).astype(np.float32) * 0.03
sr = 16000
res = model.generate(audio=audio, sample_rate=sr)
print(getattr(res, "text", str(res)))
Questions
- Preferred shape for multichannel arrays: (N, C) or interleaved 1D?
 - Any constraints on max duration or chunked/streaming input plans we should be aware of?
 - Would you consider also supporting file-like objects (BytesIO) if ndarray is not feasible?
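
On the file-like option: a BytesIO input would still need decoding before it reaches the model. A rough sketch of that step, using only the standard-library `wave` module and NumPy, is below. `wav_bytes_to_float32` is a hypothetical helper that handles 16-bit PCM only; it is not how mlx-audio currently loads audio.

```python
import io
import wave

import numpy as np


def wav_bytes_to_float32(buffer: io.BytesIO):
    """Decode 16-bit PCM WAV data from a file-like object to mono float32.

    Hypothetical helper; returns (audio, sample_rate).
    """
    with wave.open(buffer, "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        frames = wav_file.readframes(wav_file.getnframes())
    # WAV stores channels interleaved; reshape to (num_samples, num_channels).
    pcm16 = np.frombuffer(frames, dtype="<i2").reshape(-1, channels)
    audio = (pcm16.astype(np.float32) / 32768.0).mean(axis=1)
    return audio, sample_rate


# Build a small in-memory WAV so the example needs no file on disk.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(16000)
    wav_file.writeframes(np.zeros(1600, dtype="<i2").tobytes())
buffer.seek(0)

decoded, decoded_sr = wav_bytes_to_float32(buffer)
print(decoded.shape, decoded.dtype, decoded_sr)
```

This avoids the filesystem but still pays for encode and decode, so the ndarray interface remains the more direct fit for VAD pipelines.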
 