
Feature: allow generate(audio=np.ndarray, sample_rate=…) for STT (in-memory input) #230

@joshwhiton

Summary

  • Add support for passing an in-memory waveform to STT generate(…) in addition to the current filepath interface.
  • This removes the need to write temporary WAV files for each transcription chunk.

Current Behavior

  • Parakeet STT models (e.g., mlx-community/parakeet-tdt_ctc-110m) are invoked via generate(path_to_audio).
  • This requires a disk-backed temp WAV per VAD segment.
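For reference, the per-segment workaround described above looks roughly like the sketch below. It uses only the standard-library `wave` module plus NumPy; `write_temp_wav` is a hypothetical helper name, and 16-bit PCM is an assumed encoding, not something the library is known to require.

```python
import os
import tempfile
import wave

import numpy as np


def write_temp_wav(segment: np.ndarray, sample_rate: int) -> str:
    """Write a mono float32 segment in [-1, 1] to a 16-bit PCM temp WAV and return its path."""
    # Scale to little-endian int16, which is what a PCM WAV file stores.
    pcm = (np.clip(segment, -1.0, 1.0) * 32767.0).astype("<i2")
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return path


segment = np.zeros(16000, dtype=np.float32)  # stand-in for one VAD segment
path = write_temp_wav(segment, 16000)
try:
    # result = model.generate(path)  # existing path-based call
    pass
finally:
    os.remove(path)  # cleanup is needed for every segment
```

Each segment pays for a float-to-PCM conversion, a file write, a file read, and a delete, all of which the in-memory interface would skip.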

Proposal

  • Accept NumPy arrays directly:
    • Signature: result = model.generate(audio: np.ndarray, sample_rate: int, …)
    • Array format: mono or interleaved stereo, dtype=float32, range [-1.0, 1.0]
    • Optional alias: sr for sample_rate

Details

  • Input validation:
    • Reject unsupported dtypes with a clear error (suggest converting to float32).
    • Require sample_rate when audio is provided.
  • Channel handling:
    • If 2D (multichannel) arrays are supported, document shape expectations.
    • If only mono is supported initially, downmix multichannel input to mono or raise a clear error.
  • Return type:
    • Keep current return type (AlignedResult or equivalent) unchanged.
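A minimal sketch of the validation and channel handling listed above. `prepare_audio` is a hypothetical helper, not part of the library, and the 2D branch assumes a (num_samples, num_channels) layout, which is still an open question in this issue.

```python
import numpy as np


def prepare_audio(audio, sample_rate=None):
    """Hypothetical helper: validate an in-memory waveform and return mono float32."""
    if not isinstance(audio, np.ndarray):
        raise TypeError(f"audio must be a numpy.ndarray, got {type(audio).__name__}")
    if sample_rate is None:
        raise ValueError("sample_rate is required when audio is an array")
    if audio.dtype != np.float32:
        raise TypeError(
            f"unsupported dtype {audio.dtype}; convert to float32 in [-1.0, 1.0]"
        )
    if audio.ndim == 1:
        return audio  # already mono
    if audio.ndim == 2:
        # Assumption: shape is (num_samples, num_channels); average to downmix.
        return audio.mean(axis=1, dtype=np.float32)
    raise ValueError(f"expected a 1D or 2D array, got shape {audio.shape}")
```

Raising instead of downmixing in the 2D branch would be the stricter option if only mono is supported at first.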

Backwards Compatibility

  • Preserve existing path-based signature: generate(path: str, …) continues to work.
  • Path-or-array dispatch: choose behavior based on argument type to avoid breaking downstream integrators.
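The type-based dispatch could be sketched as follows. `route_generate` is a hypothetical stand-in that only reports which branch a call would take; it does not reflect the library's actual internals.

```python
import os

import numpy as np


def route_generate(audio, sample_rate=None, *, sr=None):
    """Hypothetical dispatcher: decide which code path a generate() call would take."""
    if sample_rate is None:
        sample_rate = sr  # optional alias from the proposal
    if isinstance(audio, (str, os.PathLike)):
        # Existing behavior: the first argument is a file path.
        return ("path", os.fspath(audio), None)
    if isinstance(audio, np.ndarray):
        if sample_rate is None:
            raise ValueError("sample_rate is required when audio is an array")
        return ("array", audio, int(sample_rate))
    raise TypeError(f"expected a path or numpy.ndarray, got {type(audio).__name__}")
```

Existing positional calls such as generate("clip.wav") would keep taking the path branch. Callers that pass the path by keyword would also need the current parameter name preserved, depending on the real signature.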

Rationale

  • Eliminates per-call temp files, reducing latency and disk I/O overhead.
  • Avoids filesystem churn and potential sandbox/path issues.
  • Enables tighter integration with real-time VAD segmenters where audio is already in memory.

Example

import numpy as np
from mlx_audio.stt.generate import load_model

model = load_model("mlx-community/parakeet-tdt_ctc-110m", lazy=True)

# 16 kHz mono float32 in [-1, 1]
audio = np.random.randn(16000).astype(np.float32) * 0.03
sr = 16000

# Proposed call: pass the in-memory array instead of a file path.
res = model.generate(audio=audio, sample_rate=sr)
print(getattr(res, "text", str(res)))

Questions

  • Preferred shape for multichannel arrays: (N, C) or interleaved 1D?
  • Any constraints on max duration or chunked/streaming input plans we should be aware of?
  • Would you consider also supporting file-like objects (BytesIO) if ndarray is not feasible?
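To make the first question concrete, here is how the two multichannel layouts relate in NumPy. The sample values and the channel count are illustrative only.

```python
import numpy as np

# Two stereo frames stored interleaved: L0, R0, L1, R1.
interleaved = np.array([0.1, -0.1, 0.2, -0.2], dtype=np.float32)
num_channels = 2  # a 1D interleaved buffer cannot carry this itself

# Reshape to (N, C): a view over the same memory, no copy.
frames = interleaved.reshape(-1, num_channels)
left, right = frames[:, 0], frames[:, 1]
```

An (N, C) array is self-describing, while an interleaved 1D buffer looks identical to mono unless the channel count is passed alongside it, so the 1D form would likely need an extra parameter.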
