This repository provides a full end-to-end pipeline for speaker diarization, speech-to-text transcription, optional translation / correction & paraphrasing (C&P), and subtitle generation.
It integrates:
- Pyannote for Voice Activity Detection (VAD) or Speaker Diarization
- NVIDIA NeMo for ASR with word-level timestamps
- MarianMT (HuggingFace) for translation / C&P
- Multiple output formats: `.rttm`, `.xml`, `.json`, `.txt`, `.vtt`, `.srt`, and `.png` waveform overlays
| Mode | Description | Outputs |
|---|---|---|
| `diar` | Only diarization | `.rttm`, `.png` |
| `all` | Diarization + ASR + segmentation + C&P + subtitles | `.xml`, `.rttm`, `.json`, `.txt`, `.srt`, `.vtt`, and optionally `_cp.*` |
| `cp` | Only text C&P via MarianMT | text result |
- Speaker diarization using Pyannote with configurable YAML pipeline.
- ASR with timestamps, confidence, word-level metadata.
- Smart segmentation: split long utterances by duration + character thresholds.
- GPU‑safe inference with recursive fallback on OOM errors.
- Subtitle‑ready post‑processing: padding + overlap avoidance.
- Multiple formats: XML, RTTM, JSONL, TXT, SRT, VTT.
- Plotting: waveform + speaker‑bar visualization.
- Compression: automatically packages results into ZIP.
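The smart-segmentation and padding features above can be sketched in a few lines. The helpers below (`split_segment`, `pad_segments`) are hypothetical illustrations of duration/character-threshold splitting and overlap avoidance, not the repository's actual `divide_segments()`/`add_padding()` implementations; thresholds and the `(start, end, text)` segment shape are assumptions.

```python
def split_segment(seg, max_dur=6.0, max_chars=84):
    """Recursively split a (start, end, text) segment that exceeds
    either the duration or the character threshold."""
    start, end, text = seg
    if end - start <= max_dur and len(text) <= max_chars:
        return [seg]
    words = text.split()
    if len(words) < 2:
        return [seg]  # a single word cannot be split further
    mid = len(words) // 2
    mid_t = start + (end - start) * mid / len(words)  # proportional timing
    left = (start, mid_t, " ".join(words[:mid]))
    right = (mid_t, end, " ".join(words[mid:]))
    return (split_segment(left, max_dur, max_chars)
            + split_segment(right, max_dur, max_chars))

def pad_segments(segs, pad=0.2):
    """Extend each segment by `pad` seconds on both sides, then clip
    neighbours at their midpoint so padded subtitles never overlap."""
    padded = [[max(0.0, s - pad), e + pad, t] for s, e, t in segs]
    for prev, cur in zip(padded, padded[1:]):
        if prev[1] > cur[0]:  # overlap introduced by the padding
            midpoint = (prev[1] + cur[0]) / 2
            prev[1] = cur[0] = midpoint
    return [tuple(p) for p in padded]
```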
```
conda create -n audio-pipeline python=3.10
conda activate audio-pipeline
```

Install PyTorch according to your CUDA setup:

```
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126
```

Install NeMo:

```
pip install nemo_toolkit[asr]
```

Install pyannote.audio:

```
pip install pyannote-audio
```

Install transformers:

```
pip install transformers
```

Other Python packages used: `pyyaml`, `matplotlib`, `tqdm`, `omegaconf` (plus the standard-library modules `json`, `shutil`, `time`, `os`, and `xml`).
GPU strongly recommended for Pyannote ≥3.x and NeMo ASR.
- Core functions:
  - `pyannote_seg()` — diarization or VAD
  - `nemo_asr()` / `nemo_inference()` — ASR with timestamps
  - `marianmt_cp()` — capitalization/punctuation
  - `divide_segments()` — split long utterances
  - `add_padding()` — avoid subtitle overlaps
  - `map_speaker_color()` — assign colors
- Writers: `to_rttm()`, `to_xml()`, `to_json()`, `to_vtt()`, `to_srt()`, `to_txt()`
- Runners: `run_diar()`, `run_cp()`, `run_all()`
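As an illustration of the writer layer, a minimal RTTM emitter might look like the following. This is a sketch of the standard RTTM `SPEAKER` line format, not the repository's `to_rttm()` itself; the `(start, end, speaker)` tuple shape is an assumption.

```python
def write_rttm_lines(segments, file_id):
    """Format (start, end, speaker) tuples as standard RTTM SPEAKER lines.

    RTTM fields: type, file id, channel, onset, duration,
    then NA placeholders around the speaker label.
    """
    lines = []
    for start, end, speaker in segments:
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines)
```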
- Formats supported by torchaudio: `.wav`, `.mp3`, `.flac`
- Auto-converted to mono 16 kHz (required by NeMo)

For C&P mode: a plain text string or file.
| Format | Description |
|---|---|
| `.png` | Diarization visualization |
| `.rttm` | Standard diarization file |
| `.xml` | Segment + word timing export |
| `.json` | JSONL with metadata |
| `.txt` | Transcript |
| `_cp.txt` | C&P / translated text |
| `.srt` | Subtitles |
| `_cp.srt` | C&P subtitles |
| `.vtt` | WebVTT |
| `_cp.vtt` | C&P WebVTT |
| `.zip` | Packaged output |
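Both `.srt` and `.vtt` need wall-clock timestamps; converting the pipeline's float seconds could look like this (a sketch of the two formats' timestamp syntax, not necessarily how the repository's writers do it):

```python
def srt_timestamp(seconds):
    """Convert float seconds to the SRT timestamp form HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def vtt_timestamp(seconds):
    """WebVTT uses a dot instead of a comma before the milliseconds."""
    return srt_timestamp(seconds).replace(",", ".")
```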
```
python script.py --run_mode <mode> [arguments...]
```

| Flag | Description |
|---|---|
| `--run_mode` | `all`, `diar`, or `cp` |
| `--audio_file` | Audio input |
| `--input_text` | Text input (C&P mode) |
| `--out_path` | Output directory |
| `--seg_model` | Pyannote checkpoint |
| `--seg_config_yml` | Pyannote pipeline YAML |
| `--seg_option` | `diar` or `vad` |
| `--stt_model` | NeMo ASR checkpoint |
| `--cp_model` | MarianMT model |
| `--device` | `cuda` or `cpu` |
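The flags above map naturally onto `argparse`. A minimal sketch of the CLI definition follows; flag names come from the table, while the defaults and help strings are assumptions:

```python
import argparse

def build_parser():
    """CLI mirroring the flags documented above (defaults are assumed)."""
    p = argparse.ArgumentParser(description="Diarization/ASR/C&P pipeline")
    p.add_argument("--run_mode", choices=["all", "diar", "cp"], required=True)
    p.add_argument("--audio_file", help="audio input (.wav/.mp3/.flac)")
    p.add_argument("--input_text", help="text input for cp mode")
    p.add_argument("--out_path", default="results", help="output directory")
    p.add_argument("--seg_model", help="Pyannote checkpoint")
    p.add_argument("--seg_config_yml", help="Pyannote pipeline YAML")
    p.add_argument("--seg_option", choices=["diar", "vad"], default="diar")
    p.add_argument("--stt_model", help="NeMo ASR checkpoint")
    p.add_argument("--cp_model", help="MarianMT model")
    p.add_argument("--device", choices=["cuda", "cpu"], default="cuda")
    return p
```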
Diarization only:

```
python script.py --run_mode diar --audio_file input.wav --seg_model /models/pyannote/model.ckpt --seg_config_yml config.yaml --out_path results --device cuda
```

Full pipeline:

```
python script.py --run_mode all --audio_file input.wav --seg_model /models/pyannote/model.ckpt --seg_config_yml config.yaml --seg_option diar --stt_model /models/nemo/stt_es.nemo --cp_model /models/mt/eu_norm-eu --out_path results --device cuda
```

C&P only:

```
python script.py --run_mode cp --input_text "kaixo mundua" --cp_model /models/mt/eu_norm-eu --device cuda
```

If the ASR model filename contains:

- `"eu"` → Basque
- `"es"` → Spanish
Results are stored in `<out_path>/<audio_filename>/`.

Generated ZIP: `result_<timestamp>.zip`
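The packaging step can be done with plain `shutil`; a sketch, where the timestamp format and output location are assumptions:

```python
import shutil
import time
from pathlib import Path

def package_results(out_dir):
    """Zip the per-audio output directory into result_<timestamp>.zip."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    archive = Path(out_dir).parent / f"result_{stamp}"
    # shutil.make_archive appends the .zip suffix itself
    return shutil.make_archive(str(archive), "zip", root_dir=out_dir)
```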
`plot_diarization_sample()` outputs `audio.png`.
- All models run on GPU when `--device cuda` is set.
- If ASR fails with an out-of-memory (OOM) error:
  - the segment is split,
  - each half is retried,
  - and the results are merged with confidence weighting.
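The OOM fallback described above amounts to recursive bisection. A schematic version follows, with `transcribe` standing in for the actual NeMo call and a simple concatenation in place of the repository's confidence-weighted merge:

```python
def transcribe_with_fallback(transcribe, segment, min_len=0.5):
    """Run `transcribe` on a (start, end) segment; on memory errors,
    split the segment in half, retry each half recursively, and
    concatenate the partial results."""
    start, end = segment
    try:
        return [transcribe(segment)]
    except (RuntimeError, MemoryError):
        if end - start <= min_len:
            raise  # too short to split further; give up
        mid = (start + end) / 2
        return (transcribe_with_fallback(transcribe, (start, mid), min_len)
                + transcribe_with_fallback(transcribe, (mid, end), min_len))
```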
This project is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
Full license text: https://creativecommons.org/licenses/by/4.0/