This repository provides tools for forced alignment, segmentation, and transcription of Pangloss XML-annotated audio data, using transformers and pyannote pipelines.
- Transcription: Transcribes audio or segments using a pretrained Wav2Vec2 model and diarization.
- Forced Alignment: Aligns words in Pangloss XML files with corresponding audio segments and writes word-level timestamps back to XML.
- Segmentation: Splits audio into speech segments using Speech Activity Detection (SAD).
- XML Parsing & Audio Chunk Extraction: Extracts sentences and corresponding audio chunks from Pangloss XML.
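The XML-parsing and chunk-extraction step can be sketched with the standard library alone. The element names below (`<S>`, `<AUDIO>`, `<FORM>`) follow Pangloss Collection conventions, but the helper functions are illustrative, not the package's actual API:

```python
# Sketch: extract sentence timestamps from a Pangloss-style XML string and
# cut the matching chunk out of a WAV file. Illustrative only; the real
# package handles this via its own modules.
import io
import wave
import xml.etree.ElementTree as ET

PANGLOSS_XML = """
<TEXT>
  <S id="s1">
    <AUDIO start="0.10" end="0.30"/>
    <FORM>first sentence</FORM>
  </S>
  <S id="s2">
    <AUDIO start="0.30" end="0.50"/>
    <FORM>second sentence</FORM>
  </S>
</TEXT>
"""

def parse_sentences(xml_text):
    """Return (id, start, end, form) tuples for each <S> element."""
    root = ET.fromstring(xml_text)
    out = []
    for s in root.iter("S"):
        audio = s.find("AUDIO")
        form = s.find("FORM")
        out.append((s.get("id"),
                    float(audio.get("start")),
                    float(audio.get("end")),
                    form.text if form is not None else ""))
    return out

def cut_chunk(wav_bytes, start, end):
    """Slice [start, end] seconds out of an in-memory WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        rate = src.getframerate()
        src.setpos(round(start * rate))
        frames = src.readframes(round((end - start) * rate))
        params = src.getparams()
    buf = io.BytesIO()
    with wave.open(buf, "wb") as dst:
        dst.setparams(params)  # header frame count is patched on close
        dst.writeframes(frames)
    return buf.getvalue()

# Half a second of silent 16 kHz mono audio stands in for a real recording.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 8000)
wav_bytes = buf.getvalue()

sentences = parse_sentences(PANGLOSS_XML)
chunk = cut_chunk(wav_bytes, sentences[0][1], sentences[0][2])
```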
To set up the environment and install the package, follow these steps:

1. Run the setup script:

   ```bash
   chmod +x setup.sh
   ./setup.sh
   ```

2. Accept the user conditions for pyannote/segmentation-3.0, pyannote/speaker-diarization-3.1, and pyannote/voice-activity-detection on Hugging Face.

3. Create an access token at hf.co/settings/tokens.

4. Set your Hugging Face token in the environment variable:

   ```powershell
   $env:HF_TOKEN="your_token_here"
   ```

   (on Linux/macOS: `export HF_TOKEN="your_token_here"`), or pass it with the `--token` argument in the CLI commands.

5. Install dependencies in a virtual environment using pixi:

   ```bash
   pixi install
   ```

6. Install the package. This allows you to use it as a CLI tool and import it in Python scripts:

   ```bash
   pixi run python -m ensurepip --upgrade
   pixi run python -m pip install --upgrade pip
   pixi run pip install -e .
   ```

7. Download or prepare WAV files (and Pangloss XML files) and place them in the `data/` directory (recommended).

8. Download or train a Wav2Vec2 model, place it in the `models/` directory (recommended), and pass it via the `--model` argument.
All main features are available as CLI commands thanks to the package structure and `[project.scripts]` entry points. You do not need to run `python ...` directly. Use the following commands from your project root (or with `pixi run ...`):
Uses diarization and ASR to create a TextGrid file, handling multiple speaker tiers and distinguishing between human voice and silence/noise:

```bash
pixi run transcribe --model models/Na_best_model --audio_path data/235213.wav --num_speakers 1
```

- Outputs a transcribed TextGrid file.
```bash
pixi run word_align --pangloss_xml data/235213.xml --wav data/235213.wav --model models/Na_best_model
```

- Outputs an aligned Pangloss XML file with word-level `<AUDIO start="..." end="..."/>` tags.
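For reference, an aligned word in the output XML looks roughly like this (the word form and timestamps are illustrative, not taken from the example file):

```xml
<W>
  <FORM>word</FORM>
  <AUDIO start="1.92" end="2.31"/>
</W>
```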
Simple transcription of short audio files using a pretrained Wav2Vec2 model:

```bash
pixi run simple_predict data/235213.wav --model models/Na_best_model
```

- Outputs `data/235213.txt` containing the transcription.
- Does not handle multiple speakers.
For more details, see the docstrings in each module or run:
```bash
pixi run transcribe --help
pixi run word_align --help
pixi run simple_predict --help
```

- All package code is in `src/nlp_pangloss/`.
- CLI commands are defined in the `[project.scripts]` section of `pyproject.toml`.
- For advanced usage or development, you can still run scripts directly, but this is not necessary for typical use:

  ```bash
  pixi run python src/nlp_pangloss/segment_and_transcribe.py ...
  ```