Resilient STT

The only Speech-To-Text pipeline you need

A provider-agnostic speech-to-text pipeline that plugs into any OpenAI-compatible ASR endpoint — OpenAI, OpenRouter, vLLM, faster-whisper, or the bundled Qwen worker — and returns rich, timestamped transcripts with speaker labels and validated metadata. Resilient STT handles everything around inference: audio preprocessing, voice-activity detection, intelligent chunking, pyannote diarization, and LLM-powered transcript repair. Tested on English, Hindi, and Hinglish but it is designed to support all the languages the ASR model of your choice supports.

Features

Universal ASR — Connect to any OpenAI-compatible endpoint, switching models easily by flag.
Automatic Discovery — Finds local or remote ASR, retries if needed, starts Qwen fallback automatically.
Seamless Audio Prep — Effortless ffmpeg normalization, optional enhancement.
Smart Silence Skipping — Uses Silero, webrtcvad, or RMS to focus only on speech.
Chunking & Stitching — Handles long audio with intelligent segmentation, accurate timestamps.
Speaker Diarization — Word-level speakers with pyannote.
LLM Repair — Optional transcript refinement via your preferred chat endpoint.
Versatile Export — Output to JSON, SRT, and VTT including rich metadata.
Lightweight — Orchestrator only; no model weights required.
CLI & Python API — Use via terminal or integrate in code.

Architecture & design decisions: docs/design.md

CLI reference: docs/cli.md

Quickstart

Prerequisites: Python 3.11 or 3.12, ffmpeg on PATH.

pip install "resilient-stt[full]"   # minimal: pip install resilient-stt (no pyannote/Silero)

resilient-stt \
  --audio /path/to/audio.wav \
  --output /path/to/output-dir \
  --language hi

# Output (under --output):
#   transcript.json   — segments, words, speakers, repair metadata
#   transcript.srt    — subtitles
#   transcript.vtt    — WebVTT
#
# Example transcript.json (truncated):
# {
#   "audio_file": "/path/to/audio.wav",
#   "duration": 142.5,
#   "language": "hi",
#   "asr_provider": "qwen-asr-fallback",
#   "asr_model": "Qwen/Qwen3-ASR-0.6B",
#   "segments": [{
#     "speaker": "SPEAKER_00",
#     "start": 0.12,
#     "end": 4.85,
#     "raw_text": "Namaste, aaj hum meeting shuru karte hain.",
#     "clean_text": "Namaste, aaj hum meeting shuru karte hain.",
#     "repair_status": "unchanged",
#     "words": [{ "word": "Namaste,", "start": 0.12, "end": 0.58, "speaker": "SPEAKER_00" }]
#   }]
# }

No ASR setup required — the CLI auto-detects local workers and hosted APIs, or starts a bundled Qwen worker. Quick smoke test without pyannote or repair: add --skip-diarization --repair false. For OpenAI or OpenRouter, set OPENAI_API_KEY or OPENROUTER_API_KEY in .env and pass --no-asr-fallback. Full flags: docs/cli.md. Repository layout: docs/design.md.

Note

🚧 Active development – not production-ready.
Expect changes, incomplete features, and ongoing improvements.

Architecture

Audio input
  -> ffmpeg normalize (mono 16 kHz WAV; optional --enhance-audio)
  -> VAD (Silero if installed, else webrtcvad / RMS; skip silent regions)
  -> chunk speech regions (fixed: 60s/2s overlap; or pause-aligned: ~120s at onsets)
  -> ASR microservice calls (OpenAI-compatible)
  -> normalize + stitch global timestamps
  -> pyannote diarization on the full normalized audio
  -> speaker assignment (word IoU + segment fallback)
  -> optional forced alignment
  -> optional LLM transcript repair (two passes, validated)
  -> export JSON / SRT / VTT

The orchestrator never embeds ASR model weights. Point --asr-endpoint at any service that implements POST /v1/audio/transcriptions per the OpenAI contract.

Supported models

The orchestrator is provider-agnostic: any service that implements POST /v1/audio/transcriptions (OpenAI multipart shape) or OpenRouter STT (JSON + base64) works when you set the right endpoint, model id, and API key. The lists below are examples, not an exhaustive allowlist — use whatever model id your provider documents.

Configure via --asr-endpoint, --model, and env vars (see docs/cli.md and ASR auto-detection).

ASR — OpenAI API

Setting	Value
Endpoint	`https://api.openai.com/v1` or omit + `OPENAI_API_KEY` for auto-detect
API key	`OPENAI_API_KEY` or `ASR_API_KEY`
Default `--model`	`whisper-1`

Example transcription models on OpenAI (Audio API):

Model	Timestamps
`whisper-1`	Word + segment (`verbose_json`)
`gpt-4o-transcribe`	Text only
`gpt-4o-mini-transcribe`	Text only
`gpt-4o-transcribe-diarize`	Speaker labels in response

New OpenAI transcription models generally work with --model as long as they use the same endpoint.

ASR — OpenRouter

Setting	Value
Endpoint	`https://openrouter.ai/api/v1` or omit + `OPENROUTER_API_KEY` for auto-detect
API key	`OPENROUTER_API_KEY` or `ASR_API_KEY`
Default `--model`	`openai/whisper-1`

Example STT models on OpenRouter (STT docs; browse speech-to-text models):

Model slug	Notes
`openai/whisper-1`	Default auto-detect
`openai/whisper-large-v3`	Whisper via OpenRouter
`google/chirp-3`	Google Chirp
`mistralai/voxtral-mini-transcribe`	Mistral Voxtral

Any OpenRouter model with transcription output modality works — pass its slug to --model. Responses are text-only (no segment timestamps); the pipeline sets weak_timestamps and may run optional alignment.

ASR — Local OpenAI-compatible workers

Point --asr-endpoint at any local or remote server that speaks the worker contract (multipart upload, verbose_json preferred). The orchestrator sends 16 kHz mono WAV chunks.

Worker	Endpoint (default)	Example `--model`	Status
qwen_transformers_service	`http://127.0.0.1:8002/v1`	`Qwen/Qwen3-ASR-0.6B`	Bundled — auto-started when no other ASR is reachable
qwen_vllm_service	`http://127.0.0.1:8001/v1`	`Qwen/Qwen3-ASR-1.7B`	Bundled bootstrap — Linux + NVIDIA GPU
Custom (faster-whisper, vLLM, hosted proxy, etc.)	Your URL	Your model id	Supported — implement or deploy separately
whisper_openai_service	—	—	Planned Roadmap
parakeet_openai_service	—	—	Planned Roadmap

For a third-party server, only the API shape must match; the model name is whatever that server expects.

Diarization (orchestrator-local)

Not routed through OpenAI-compatible ASR — runs inside the pipeline on the full normalized file.

Model	Configure
`pyannote/speaker-diarization-community-1` (default)	`HF_TOKEN` or `--diarization-model-path`; skip with `--skip-diarization`

LLM transcript repair

Any OpenAI-compatible /chat/completions endpoint. Set REPAIR_BASE_URL, REPAIR_MODEL, and REPAIR_API_KEY (auto-filled from OPENAI_API_KEY or OPENROUTER_API_KEY).

Provider	Default `REPAIR_MODEL`	Example alternatives
OpenAI API	`gpt-4o-mini`	`gpt-4o`, `gpt-4.1-mini`, …
OpenRouter	`openai/gpt-4o-mini`	Any chat model slug your key can access
Other	—	Any id your endpoint accepts

Forced alignment

Component	Status
Qwen forced aligner (`src/resilient_stt/alignment/qwen_aligner.py`)	Planned Roadmap
WhisperX	Planned Roadmap

Today, alignment runs only when --align is set or ASR returned weak timestamps; the default aligner is a no-op pass-through.

Install (PyPI)

# Minimal orchestrator (webrtcvad VAD; ASR via API or bundled qwen worker)
pip install resilient-stt

# Recommended on Apple Silicon / Linux (Silero VAD + pyannote diarization + torch)
pip install "resilient-stt[full]"

# Contributors
pip install "resilient-stt[full,dev]"

Verify the CLI:

resilient-stt --help

Usage after `pip install`

Run from any directory (creates data/work/ under the current working dir unless you pass --work-root):

resilient-stt \
  --audio /path/to/audio.wav \
  --output /path/to/output-dir \
  --language hi

With diarization and Silero VAD you need the [full] extra and usually HF_TOKEN in .env (see Environment variables). Quick smoke test without pyannote:

resilient-stt \
  --audio /path/to/audio.wav \
  --output /path/to/output-dir \
  --language hi \
  --skip-diarization \
  --repair false

Hosted ASR (no local qwen worker) — set OPENAI_API_KEY or OPENROUTER_API_KEY in .env:

resilient-stt \
  --audio /path/to/audio.wav \
  --output /path/to/output-dir \
  --no-asr-fallback \
  --skip-diarization

Bundled qwen-asr worker: On first auto-start, the CLI creates an isolated venv at ~/.cache/resilient-stt/qwen-transformers-worker/.venv (requires network to download qwen-asr and model weights). If inference fails on Apple Silicon, start the worker manually with python scripts/bootstrap_qwen_asr_fallback.py --no-aligner (from a git checkout) or see the bundled worker README.

Platform notes for [full] match Platform notes (torch / diarization) below (Intel Mac: base install + --skip-diarization).

Install from source (uv)

Prerequisites: uv, ffmpeg on PATH. ASR is optional to configure manually — see ASR auto-detection below.

cd resilient-stt

# 1) Create a project virtualenv (recommended; uv also creates one on sync if missing)
uv venv

# 2) Install dependencies into .venv
#    --extra full = Silero VAD + diarization + torch (recommended on Apple Silicon / Linux)
uv sync --extra full --extra dev

# 3) Configure secrets (optional)
cp .env.example .env
# edit .env — loaded automatically on each run

ASR-only install (no torch / pyannote):

uv venv
uv sync --extra dev

Activate the venv if you prefer plain python without uv run:

source .venv/bin/activate   # Windows: .venv\Scripts\activate

Extras: silero (Qwen-aligned VAD), diarization (pyannote), full (both + torch). Without silero, VAD falls back to webrtcvad.

Platform notes (torch / diarization)

pyannote.audio 4.x requires torch ≥ 2.8. On Apple Silicon (M1–M4), PyTorch uses the Metal MPS backend (Accelerated PyTorch on Mac). Use a native arm64 Python/venv — not Rosetta x86_64.

Platform	Install
Apple Silicon (M1–M4)	Native arm64 venv, then `uv sync --extra full --extra dev`. Optional GPU: `--diarization-device mps`
Linux or Windows	`uv sync --extra full --extra dev`
Intel Mac (x86_64)	Diarization extra is not supported (no compatible torch wheels). Use `uv sync --extra dev` and `--skip-diarization`.

If uv errors mention x86_64 on an M-series Mac, your Python/venv is Rosetta. Recreate it:

rm -rf .venv uv.lock
uv venv --python 3.12
python -c "import platform; print(platform.machine())"   # must print arm64
uv sync --extra full --extra dev

Verify MPS after install:

uv run python -c "import torch; print('mps', torch.backends.mps.is_available())"

Usage

From a git checkout, prefix commands with uv run (or activate .venv and use resilient-stt directly). After pip install, use resilient-stt only.

Minimal run (no ASR setup — auto-starts local qwen-asr on CPU/MPS when needed):

uv run resilient-stt \
  --audio data/input/meeting.mp3 \
  --output data/output/meeting \
  --language hi

With an external ASR service (vLLM, hosted API, etc.) — the endpoint must already be running and respond to GET {base}/v1/models:

uv run resilient-stt \
  --audio data/input/meeting.mp3 \
  --asr-endpoint http://localhost:8001/v1 \
  --model Qwen/Qwen3-ASR-1.7B \
  --language hi \
  --repair true \
  --output data/output/meeting

OpenAI API (whisper-1) — set OPENAI_API_KEY in .env, then either auto-detect (no local ASR on :8001/:8002) or point at the API explicitly:

# Auto-detect when no local ASR is running (--no-asr-fallback avoids starting qwen-asr)
uv run resilient-stt \
  --audio data/input/speech.wav \
  --output data/output/openai \
  --no-asr-fallback \
  --skip-diarization

# Explicit endpoint
uv run resilient-stt \
  --audio data/input/speech.wav \
  --output data/output/openai \
  --asr-endpoint https://api.openai.com/v1 \
  --model whisper-1 \
  --skip-diarization

OpenRouter (google/chirp-3) — set OPENROUTER_API_KEY in .env or pass the key via ASR_API_KEY:

uv run resilient-stt \
  --audio data/input/speech.wav \
  --output data/output/openrouter/chirp \
  --asr-endpoint https://openrouter.ai/api/v1 \
  --model google/chirp-3 \
  --skip-diarization \
  --repair false

For music or other non-speech audio, add --no-vad so VAD does not skip the file. Full flag reference: docs/cli.md.

ASR auto-detection

When --asr-endpoint is omitted (and ASR_BASE_URL / ASR_ENDPOINT are unset):

Probe vLLM at http://127.0.0.1:8001/v1 (optional bootstrap)
Probe an existing qwen-asr worker at http://127.0.0.1:8002/v1
OpenRouter when OPENROUTER_API_KEY is set (no --model)
OpenAI when OPENAI_API_KEY is set (no --model)
Otherwise start the local qwen-asr fallback (bundled worker) — slow on CPU/MPS but needs no NVIDIA GPU

Use --no-asr-fallback to require an explicit or already-running ASR service.

If you pass --asr-endpoint explicitly, that URL must respond to GET …/v1/models before the pipeline runs (otherwise the CLI exits with “endpoint configured but unreachable”).

Useful flags:

Flag	Effect
`--no-asr-fallback`	Require explicit/running ASR; do not auto-start qwen-asr on :8002
`--no-vad`	Disable VAD; transcribe the entire timeline
`--vad-backend`	`auto`, `silero`, `webrtcvad`, or `rms` (default `auto`)
`--chunk-mode`	`fixed` (60s/2s overlap) or `pause-aligned` (~120s at onsets, max 180s)
`--enhance-audio`	High-pass + denoise + loudnorm during normalize
`--skip-diarization`	Skip pyannote; export without speaker labels
`--diarization-model-path PATH`	Load a local `git clone` of the pyannote model (offline)
`--num-speakers` / `--min-speakers` / `--max-speakers`	Hint pyannote speaker count
`--align`	Force the optional alignment stage even when ASR returned timestamps
`--repair true`	Run the LLM repair stage (needs `REPAIR_BASE_URL`/`REPAIR_MODEL`)
`--resume`	Reuse existing intermediates under `data/work/<job_id>/`

Environment variables

Copy [.env.example](.env.example) to .env and fill in what you need. The CLI loads .env on startup; shell exports take precedence.

Variable	Purpose
`ASR_BASE_URL` / `ASR_ENDPOINT`	Optional fixed ASR base URL (same as `--asr-endpoint`)
`ASR_API_KEY`	Optional Bearer token for the ASR endpoint
`OPENROUTER_API_KEY`	OpenRouter key; enables hosted ASR/repair presets
`OPENAI_API_KEY`	OpenAI key; enables hosted ASR/repair presets
`REPAIR_BASE_URL`	OpenAI-compatible chat endpoint (e.g. `https://api.openai.com/v1`)
`REPAIR_MODEL`	Repair model id (e.g. `gpt-4o-mini`)
`REPAIR_API_KEY`	Bearer token for the repair endpoint
`HF_TOKEN`	Used only to download gated pyannote weights. Skip with `--skip-diarization` or use `--diarization-model-path` after a local clone.

The default diarization model is [pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1) (CC-BY-4.0). Accept the model card terms once on Hugging Face, then either provide HF_TOKEN for the first download or follow the model card's "Offline use" instructions to clone the repo and point --diarization-model-path at the local copy.

Privacy / telemetry

On startup the orchestrator disables optional usage metrics from dependencies (pyannote [PYANNOTE_METRICS_ENABLED=0](https://github.com/pyannote/pyannote-audio#telemetry), Hugging Face Hub HF_HUB_DISABLE_TELEMETRY=1). Shell exports and .env values take precedence — set PYANNOTE_METRICS_ENABLED=1 to opt back in.

Intermediate artifacts

Each run materializes everything under data/work/<job_id>/:

normalized.wav
speech_regions.json    # {regions, speech_onsets_samples}
chunks/                # per-chunk WAV slices
chunks.json
asr_raw/<chunk_id>.json
asr_normalized.json
diarization.json
speaker_segments_raw.json
speaker_segments_repaired.json   # only when --repair is on

Final exports land under --output: transcript.json, transcript.srt, transcript.vtt.

Constraints (what not to do)

Do not embed ASR models inside the orchestrator process.
Do not chunk inside the ASR microservice; the orchestrator owns chunking.
Do not diarize per chunk; pyannote runs on the full normalized audio.
Do not discard raw ASR responses — they live under asr_raw/.
Do not let LLM repair change timestamps, speaker labels, or segment count.

Tests

From source:

uv sync --extra dev
uv run pytest

After pip install "resilient-stt[dev]" from a checkout (with src/ on PYTHONPATH) or when developing in the repo, the same pytest command applies if src is configured in pyproject.toml (pythonpath = ["src"]).

The included tests use synthetic fixtures and mocked HTTP responses; they do not invoke ffmpeg, pyannote, or any LLM.

Publishing (maintainers)

Commits on main — use Conventional Commits in PR titles or squash messages (feat:, fix:, chore:, etc.).
release-please — manifest config in release-please-config.json (python strategy, package name). Last released version in .release-please-manifest.json. Opens a Release PR that bumps pyproject.toml, CHANGELOG.md, and the manifest.
Ship — merge the Release PR; release-please creates GitHub Release + tag resilient-stt-vX.Y.Z.
PyPI — runs on release: created when the tag matches version in pyproject.toml.

Do not set release-type on the GitHub Action — that disables manifest mode and can reset versions to 0.1.0. If a mistaken 0.1.0 release exists on GitHub, delete it and its tag before the next ship.

Retry a missed publish — re-run the failed Publish to PyPI workflow from Actions (the release event is preserved).

One-time setup:

PyPI — trusted publisher for workflow publish.yml on repo gitcommitshow/resilient-stt.
release-please — repo secret RELEASE_PLEASE_TOKEN: fine-grained PAT (or classic repo) with Contents and Pull requests read/write on this repo. Without it, releases are created with GITHUB_TOKEN and Publish to PyPI will not run.

In org/repo settings, allow GitHub Actions to create and approve pull requests if Release PRs do not appear.

See CHANGELOG.md for release history.

License

This project is licensed under the GNU General Public License v3.0 or later (GPL-3.0-or-later). See LICENSE for the full text.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Resilient STT

Features

Quickstart

Architecture

Supported models

ASR — OpenAI API

ASR — OpenRouter

ASR — Local OpenAI-compatible workers

Diarization (orchestrator-local)

LLM transcript repair

Forced alignment

Install (PyPI)

Usage after `pip install`

Install from source (uv)

Platform notes (torch / diarization)

Usage

ASR auto-detection

Environment variables

Privacy / telemetry

Intermediate artifacts

Constraints (what not to do)

Tests

Publishing (maintainers)

License

About

Uh oh!

Releases 4

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github/workflows		.github/workflows
data		data
docs		docs
scripts		scripts
src/resilient_stt		src/resilient_stt
tests		tests
workers		workers
.env.example		.env.example
.gitignore		.gitignore
.release-please-manifest.json		.release-please-manifest.json
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
banner.png		banner.png
conftest.py		conftest.py
pyproject.toml		pyproject.toml
release-please-config.json		release-please-config.json
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Resilient STT

Features

Quickstart

Architecture

Supported models

ASR — OpenAI API

ASR — OpenRouter

ASR — Local OpenAI-compatible workers

Diarization (orchestrator-local)

LLM transcript repair

Forced alignment

Install (PyPI)

Usage after pip install

Install from source (uv)

Platform notes (torch / diarization)

Usage

ASR auto-detection

Environment variables

Privacy / telemetry

Intermediate artifacts

Constraints (what not to do)

Tests

Publishing (maintainers)

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Usage after `pip install`

Packages