eval-transcript

Simple benchmarking service for audio transcription for the French administration.

Initial scope:

manage benchmark audio files under data/audio/
keep source-of-truth transcriptions under data/source_truth/
store model outputs under data/transcriptions/
compare transcription quality for target models:
- Whisper via WhisperX
- Voxtral
- Kyutai STT (kyutai/stt-1b-en_fr)
- Cohere Transcribe
- Scribe v2

Getting started

This project uses uv with a Python src/ layout.

Prerequisites

uv for dependency management and command execution.
Git for source control and pre-commit hooks.
The configured Gitleaks hook is installed by pre-commit in its managed environment. A standalone gitleaks binary is only needed if you want to run full repository scans or use the gitleaks CLI directly outside pre-commit.

Development setup

uv sync
uv run eval-transcript --help

Install the pre-commit hook to scan staged changes for secrets with Gitleaks:

uv run pre-commit install

You can also run the staged-changes secret scan manually:

uv run pre-commit run gitleaks

To scan the existing repository contents and history, install the standalone Gitleaks binary and run a direct repository scan:

gitleaks git --redact --verbose

The CLI loads a .env file from the current working directory before reading provider configuration. Explicit environment variables already set in the process take precedence, and --base-url / --api-key flags override both. For local development, copy .env.example to .env and fill in the secrets:

cp .env.example .env

oMLX provider

If a local oMLX server is running with its OpenAI-compatible API on http://localhost:8000/v1, set OMLX_API_KEY and list available models:

uv run eval-transcript omlx models

Transcribe one audio file through a model alias exposed by /v1/models:

uv run eval-transcript omlx transcribe data/audio/sample.wav \
  --model whisper-large-v3-asr-fp16 \
  --language fr

The transcribe command prints text only by default for quick visual comparison against source-of-truth transcripts. Use --json to print the raw response with segment metadata.

For Cohere Transcribe on Apple Silicon, use the original Cohere model with oMLX's mlx-audio STT loader:

CohereLabs/cohere-transcribe-03-2026

A smoke test on oMLX 0.3.12 loaded this model and successfully transcribed short English and French WAV files. oMLX exposes the downloaded model as:

cohere-transcribe-03-2026

The converted MLX 8-bit candidates are currently not reliable oMLX targets. The beshkenadze 8-bit conversion is discovered and loaded, but fails during transcription with a convolution shape mismatch:

beshkenadze/cohere-transcribe-03-2026-mlx-8bit

The mlx-community mirror is useful for mlx-speech, but is not currently a drop-in oMLX candidate:

mlx-community/cohere-transcribe-03-2026-mlx-8bit

It stores its runnable files under mlx-int8/, so the current oMLX discovery fails to recognize it as a downloaded model. Moving or symlinking those files to the repository root makes oMLX discover the alias, but a smoke test on oMLX 0.3.12 failed during transcription with the same convolution shape mismatch. Treat it as incompatible with oMLX until the upstream conversion or loader changes.

After downloading a candidate locally and restarting or refreshing oMLX model discovery, check the model alias exposed by the local server:

uv run eval-transcript omlx models

Then pass that exact alias to the transcription command. If oMLX exposes the repository name as the alias, the command is:

uv run eval-transcript omlx transcribe data/audio/sample.wav \
  --model cohere-transcribe-03-2026

If oMLX exposes a different alias, use the alias printed by omlx models instead. Cohere Transcribe supports French, but on oMLX 0.3.12 the --language fr option is currently broken because oMLX maps fr to french before calling the Cohere loader. Omit --language for now; the smoke test transcribed French correctly without a language hint.

To save the text output for later comparison, use --save. The file is written to data/transcriptions/<audio-stem>/omlx__<model>.txt:

uv run eval-transcript omlx transcribe data/audio/sample.wav \
  --model whisper-large-v3-asr-fp16 \
  --language fr \
  --save

Kyutai STT (local, via the oMLX provider)

Kyutai STT ships kyutai/stt-1b-en_fr (English/French, with built-in semantic VAD) and kyutai/stt-2.6b-en. It is local-only: there is no hosted or OpenAI-compatible HTTP endpoint. File transcription runs through the moshi (PyTorch) or moshi_mlx (Apple Silicon) packages, and the only server Kyutai ships is a Rust WebSocket streaming server. See kyutai-labs/delayed-streams-modeling for the inference scripts.

To benchmark Kyutai alongside the other models, run it behind a small local OpenAI-compatible server that wraps moshi/moshi_mlx and exposes GET /v1/models plus POST /v1/audio/transcriptions returning {"text": ...} on http://localhost:8000/v1, then transcribe through the generic oMLX provider:

uv run eval-transcript omlx transcribe data/audio/sample.mp3 \
  --model kyutai/stt-1b-en_fr \
  --language fr \
  --save

This reuses the existing oMLX OpenAI-compatible client, so no Kyutai-specific provider code is needed. Use kyutai/stt-1b-en_fr for French.

ElevenLabs provider

Set ELEVENLABS_API_KEY to use ElevenLabs Speech to Text with Scribe v2. The optional ELEVENLABS_BASE_URL can point to a regional ElevenLabs API base URL if needed.

List documented ElevenLabs speech-to-text models:

uv run eval-transcript elevenlabs models

Transcribe one audio or video file through Scribe v2:

uv run eval-transcript elevenlabs transcribe data/audio/sample.wav \
  --model scribe_v2 \
  --language fr

ElevenLabs accepts either ISO-639-1 or ISO-639-3 language hints, so --language fr and --language fra are both valid French hints. The transcribe command prints text only by default. Use --json to print the serialized SDK response with metadata such as words and timestamps, or --save to write data/transcriptions/<audio-stem>/elevenlabs__<model>.txt.

Optional Scribe v2 controls include --timestamps-granularity none|word|character, --diarize, --num-speakers, --temperature, --seed, --no-verbatim, and --no-tag-audio-events.

Albert API provider

Set ALBERT_API_KEY and ALBERT_BASE_URL (for example https://albert.api.etalab.gouv.fr/v1) to use Albert API's audio transcription endpoint. List available models:

uv run eval-transcript albert models

Transcribe one audio file with Albert's Whisper model:

uv run eval-transcript albert transcribe data/audio/sample.wav \
  --model openai/whisper-large-v3 \
  --language fr

The transcribe command prints text only by default. Use --json to print the raw response, or --save to write data/transcriptions/<audio-stem>/albert__<model>.txt.

Scaleway provider

Scaleway Generative APIs expose Voxtral through an OpenAI-compatible chat completions endpoint. Set SCW_SECRET_KEY and SCW_DEFAULT_PROJECT_ID; the CLI derives the project-scoped Generative APIs URL from SCW_DEFAULT_PROJECT_ID. Both scaleway models and scaleway transcribe also accept --api-key and --project-id to override these without a .env (useful in a worktree or CI).

List Voxtral models available through Scaleway Generative APIs:

uv run eval-transcript scaleway models

The models command queries the same Generative APIs endpoint used for transcription, so every listed ID can be passed directly to --model.

Transcribe one local MP3 or WAV file through Voxtral:

uv run eval-transcript scaleway transcribe data/audio/sample.mp3 \
  --model voxtral-small-24b-2507 \
  --language fr

Voxtral follows the language of its prompt, so the CLI sends a French prompt that explicitly forbids translation. Use --language to pin a different target language (it shapes the prompt), or --prompt to override the prompt entirely. The transcribe command prints text only by default. Use --json to print the raw chat completion response, or --save to write data/transcriptions/<audio-stem>/scaleway__<model>.txt.

Data layout

The repository tracks the directory structure only. Audio and generated transcript artifacts are gitignored by default.

data/
├── manifest.md        # benchmark index generated from local data files
├── audio/             # input audio files
├── source_truth/      # human/source-of-truth transcripts
└── transcriptions/    # model-generated transcripts

Generate or refresh the global benchmark manifest after adding local data files:

uv run eval-transcript manifest sync

Scoring transcripts

Score generated transcripts against source truth with the jiwer-backed scoring engine:

uv run eval-transcript score all

Score all generated outputs for one sample:

uv run eval-transcript score sample sample

The scorer matches data/source_truth/<sample-id>.md (or .txt) with data/transcriptions/<sample-id>/*.txt and reports WER, CER, substitution/deletion/insertion counts, and the reference token count. Aggregate WER is computed from total edit counts across all scored transcripts, not by averaging per-transcript WER values. Text, Markdown, and JSON outputs also include provider/model grouped WER summaries for model comparison.

Use --json for machine-readable output, or --normalization raw to score exact text after Unicode normalization only. The default standard normalization is conservative for French: it normalizes Unicode, casing, apostrophe variants, punctuation/symbols, and whitespace while preserving accents.

Use --normalization standard_numbers to additionally fold numbers to a canonical form so spelled-out and digit notations match (cinq/5, premier/1er, deux mille cinq cents/2 500). This avoids penalizing a model only for writing numbers differently than the reference; it is useful on number-heavy material (budgets, statistics).

Text output includes top substitutions, insertions, and deletions by default. Use --top-errors 0 to hide these summaries, or --align to append normalized REF / HYP / ERR alignment blocks for each scored transcript.

Use --format markdown or --format csv for report-friendly output, and --output PATH to write the rendered scoring report to a file. --json remains available as a shortcut for --format json.

data/manifest.md uses Markdown with YAML frontmatter to index samples, source-truth paths, generated outputs, and placeholder metadata such as language, duration, domain, runtime, and real-time factor.

Source-of-truth transcripts are matched to a sample by basename and may be either .txt or .md (for example data/source_truth/sample.txt for data/audio/sample.wav).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

eval-transcript

Getting started

Prerequisites

Development setup

oMLX provider

Kyutai STT (local, via the oMLX provider)

ElevenLabs provider

Albert API provider

Scaleway provider

Data layout

Scoring transcripts

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

eval-transcript

Getting started

Prerequisites

Development setup

oMLX provider

Kyutai STT (local, via the oMLX provider)

ElevenLabs provider

Albert API provider

Scaleway provider

Data layout

Scoring transcripts