Feature: allow generate(audio=np.ndarray, sample_rate=…) for STT (in-memory input)

Summary

  - Add support for passing an in-memory waveform to STT generate(…) in addition to the current filepath interface.
  - This removes the need to write temporary WAV files for each transcription chunk.

  Current Behavior

  - Parakeet STT models (e.g., mlx-community/parakeet-tdt_ctc-110m) are invoked via generate(path_to_audio).
  - This requires a disk-backed temp WAV per VAD segment.
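
  For context, the per-segment workaround that callers need today might look roughly like the sketch below. It assumes mono float32 input and uses only the standard-library wave module plus NumPy; write_temp_wav and transcribe_segment are hypothetical caller-side helpers, not part of mlx_audio.

```python
import os
import tempfile
import wave

import numpy as np


def write_temp_wav(audio: np.ndarray, sample_rate: int) -> str:
    """Write mono float32 audio in [-1, 1] to a 16-bit PCM temp WAV; return its path."""
    # Clip to the valid range, then scale to little-endian signed 16-bit samples.
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767.0).astype("<i2")
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm16.tobytes())
    return path


def transcribe_segment(model, audio: np.ndarray, sample_rate: int):
    """Hypothetical wrapper: one temp file written and deleted per VAD segment."""
    path = write_temp_wav(audio, sample_rate)
    try:
        # Path-based call, as in the current interface described above.
        return model.generate(path)
    finally:
        os.remove(path)
```

  Every segment pays for a float-to-PCM conversion, a file write, a file read, and a delete, which is the overhead this request aims to remove.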

  Proposal

  - Accept NumPy arrays directly:
      - Signature: result = model.generate(audio: np.ndarray, sample_rate: int, …)
      - Array format: mono or interleaved stereo, dtype=float32, values in the range [-1.0, 1.0]
      - Optional alias: accept sr as a synonym for sample_rate

  Details

  - Input validation:
      - Reject unsupported dtypes with a clear error (suggest converting to float32).
      - Require sample_rate when audio is provided.
  - Channel handling:
      - If multichannel arrays are supported, document the expected shape, e.g. (N, C) or interleaved 1D.
      - If only mono is supported initially, either downmix multichannel input or raise a clear error.
  - Return shape:
      - Keep current return type (AlignedResult or equivalent) unchanged.
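
  The validation and channel-handling rules above could be sketched as follows. This is only an illustration under stated assumptions: prepare_audio is a hypothetical helper name, 2D input is assumed to be laid out as (num_samples, num_channels), and downmixing by averaging is one of the two options listed, not a settled choice.

```python
from typing import Optional

import numpy as np


def prepare_audio(audio: np.ndarray, sample_rate: Optional[int] = None) -> np.ndarray:
    """Hypothetical helper: validate the input and return mono float32 samples."""
    if not isinstance(audio, np.ndarray):
        raise TypeError(f"audio must be a numpy.ndarray, got {type(audio).__name__}")
    if sample_rate is None or sample_rate <= 0:
        raise ValueError("sample_rate is required and must be positive when audio is an array")
    if audio.dtype != np.float32:
        raise TypeError(
            f"unsupported dtype {audio.dtype}; convert with audio.astype(np.float32)"
        )
    if audio.ndim == 1:
        return audio  # already mono
    if audio.ndim == 2:
        # Assumed layout: (num_samples, num_channels). Downmix by averaging channels.
        return audio.mean(axis=1, dtype=np.float32)
    raise ValueError(f"expected a 1D or 2D array, got {audio.ndim} dimensions")
```

  Raising on non-float32 input rather than converting silently keeps scaling mistakes visible, for example int16 samples that were never divided by 32768.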

  Backwards Compatibility

  - Preserve existing path-based signature: generate(path: str, …) continues to work.
  - Path-or-array dispatch: choose behavior based on argument type to avoid breaking downstream integrators.
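
  A minimal sketch of the type-based dispatch, assuming the first argument may be either a path or an array. resolve_input is a hypothetical name, and how the library would actually route each case internally is not specified here.

```python
import os
from typing import Optional, Union

import numpy as np

AudioInput = Union[str, os.PathLike, np.ndarray]


def resolve_input(
    audio: AudioInput,
    sample_rate: Optional[int] = None,
    sr: Optional[int] = None,
):
    """Return ("path", path_str, None) or ("array", samples, rate)."""
    if isinstance(audio, (str, os.PathLike)):
        # Existing behavior: path-based callers are unaffected.
        return ("path", os.fspath(audio), None)
    if isinstance(audio, np.ndarray):
        if sample_rate is not None and sr is not None and sample_rate != sr:
            raise ValueError("sample_rate and sr were both given and disagree")
        rate = sample_rate if sample_rate is not None else sr
        if rate is None:
            raise ValueError("sample_rate is required when audio is an array")
        return ("array", audio, rate)
    raise TypeError(f"expected a path or numpy.ndarray, got {type(audio).__name__}")
```

  With this shape, generate("clip.wav") keeps working unchanged, while generate(audio=samples, sample_rate=16000) takes the new in-memory branch.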

  Rationale

  - Eliminates persistent or per-call temp files, reducing latency and disk I/O overhead.
  - Avoids filesystem churn and potential sandbox/path issues.
  - Enables tighter integration with real-time VAD segmenters where audio is already in memory.

  Example

  import numpy as np
  from mlx_audio.stt.generate import load_model

  model = load_model("mlx-community/parakeet-tdt_ctc-110m", lazy=True)

  # 16 kHz mono float32 in [-1, 1]
  audio = np.random.randn(16000).astype(np.float32) * 0.03
  sr = 16000

  res = model.generate(audio=audio, sample_rate=sr)
  print(getattr(res, "text", str(res)))

  Questions

  - Preferred shape for multichannel arrays: (N, C) or interleaved 1D?
  - Are there constraints on maximum duration, or plans for chunked/streaming input, that we should be aware of?
  - Would you consider also supporting file-like objects (BytesIO) if ndarray is not feasible?
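
  To make the first question concrete, the two multichannel layouts convert into each other with a reshape. The helpers below are hypothetical names used only for illustration. Note that an interleaved 1D array cannot be told apart from mono without a separate channel count.

```python
import numpy as np


def deinterleave(samples: np.ndarray, num_channels: int) -> np.ndarray:
    """Interleaved 1D [L0, R0, L1, R1, ...] -> array of shape (N, C)."""
    if samples.ndim != 1 or samples.size % num_channels != 0:
        raise ValueError("expected a 1D array whose length is a multiple of num_channels")
    return samples.reshape(-1, num_channels)


def interleave(frames: np.ndarray) -> np.ndarray:
    """Array of shape (N, C) -> interleaved 1D [L0, R0, L1, R1, ...]."""
    if frames.ndim != 2:
        raise ValueError("expected a 2D array of shape (N, C)")
    return frames.reshape(-1)
```

  An (N, C) array carries its channel count in its shape, whereas the interleaved form would need an extra argument to be interpreted correctly.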

Feature: allow generate(audio=np.ndarray, sample_rate=…) for STT (in-memory input) #230
