Tools for preparing, cleaning, analyzing, and evaluating speech corpora.

This repository provides utility scripts for processing audio-text
datasets, computing corpus statistics, normalizing text, and evaluating
ASR systems using Word Error Rate (WER).

The utilities are designed to be modular, language-aware (Spanish and
Basque), and compatible with JSONL manifests commonly used in speech
processing pipelines.

Corpus management:

- Read/write manifests in JSON Lines format
- Convert TSV datasets to structured manifest dictionaries
- Pair `.txt` transcript files with `.wav` audio files
- Compute hashes & deduplicate corpora
- Reduce corpora using reference datasets
- Compute duration statistics
- Export statistics or WER results to Excel (`.xlsx`)
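
As an illustration of the manifest format mentioned above, here is a minimal sketch of reading and writing JSON Lines with the standard library only. The helper names `read_jsonl` and `write_jsonl` are hypothetical and are not the repository's `read_manifest` / `write_manifest`; the entry keys shown in the usage note are likewise assumptions.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def write_jsonl(path, entries):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Each line of such a file would hold one utterance, for example `{"audio_filepath": "clips/a.wav", "text": "kaixo mundua", "duration": 1.5}` (assumed key names). Passing `ensure_ascii=False` keeps characters such as `ñ` readable in the output file.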

Text normalization:

- Language-specific normalization (Spanish `es`, Basque `eu`)
- Optional case/punctuation preservation
- Removal of diacritics, unwanted characters, acronyms
- Duration-based filtering
- Blacklist-based filtering
- Detailed logging of removed entries and character distributions
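
The normalization steps listed above could look roughly like the sketch below. It is not the repository's `TextNormalizer`: the function names, the `remove_diacritics` flag, and the choice to protect `ñ` when stripping diacritics are all assumptions made for illustration.

```python
import re
import unicodedata


def strip_diacritics(text, keep="ñÑ"):
    """Remove combining accents, leaving the characters in `keep` untouched."""
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def normalize_text(text, keep_cp=False, remove_diacritics=False):
    """Lowercase and strip punctuation unless keep_cp is set; collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    if not keep_cp:
        text = text.lower()
        # Replace anything that is neither a word character nor whitespace.
        text = re.sub(r"[^\w\s]", " ", text)
    if remove_diacritics:
        text = strip_diacritics(text)
    return re.sub(r"\s+", " ", text).strip()
```

With these assumptions, `normalize_text("¡Hola, Mundo!")` gives `"hola mundo"`, while `keep_cp=True` leaves case and punctuation in place.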

WER evaluation:

- Compute sentence-level and corpus-level WER
- Optional evaluation with case- and punctuation-preserving (C&P) text
- Uses the same normalization pipeline as training
- Outputs both cleaned manifests and WER summaries
- Compatible with JSONL ASR output manifests
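
To make the difference between sentence-level and corpus-level WER concrete, here is a dependency-free sketch. It is not the repository's `calculate_wer` (`jiwer` appears among this project's dependencies); the function names `edit_distance` and `wer_summary` are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer_summary(pairs):
    """Return (mean sentence-level WER, corpus-level WER) for (reference, hypothesis) pairs."""
    sentence_wers = []
    total_errors = 0
    total_words = 0
    for ref, hyp in pairs:
        ref_words, hyp_words = ref.split(), hyp.split()
        if not ref_words:
            continue  # WER is undefined for an empty reference
        errors = edit_distance(ref_words, hyp_words)
        sentence_wers.append(errors / len(ref_words))
        total_errors += errors
        total_words += len(ref_words)
    if not sentence_wers:
        raise ValueError("no non-empty references to score")
    return sum(sentence_wers) / len(sentence_wers), total_errors / total_words
```

The two figures differ whenever sentences have different lengths. For the pairs `("hola mundo", "hola mundo")` and `("buenos días a todos", "buenos día todos")`, the mean sentence-level WER is 0.25, while the corpus-level WER is 2 errors over 6 reference words, about 0.33.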

Recommended dependencies:

- pandas
- openpyxl
- soundfile
- tqdm
- jiwer

```
corpus_processing_utils/
│
├── corpus_utils.py
├── normalizer.py
├── wer_evaluator.py
└── README.md
```

```python
# Read a manifest and write it back out
import corpus_utils as cu

data = cu.read_manifest("data/train.json")
cu.write_manifest("out/train_clean.json", data)
```

```python
# Convert a TSV dataset to manifest entries
from corpus_utils import tsv2data

data = tsv2data(
    "dataset.tsv",
    clips_folder="clips/",
    audio_field="path",
    text_field="sentence",
    calculate_duration=True,
)
```

```python
# Normalize Spanish text
from normalizer import TextNormalizer

normalizer = TextNormalizer(
    lang="es",
    keep_cp=False,
    remove_acronyms=True,
)
clean_data = normalizer(data)
```

```python
# Evaluate ASR predictions with WER
from wer_evaluator import calculate_wer

clean_data, wer_stats = calculate_wer(
    "predictions.json",
    lang="es",
    cp_field=True,
    return_wer=True,
)
```

```python
# Compute duration statistics and export them to Excel
from corpus_utils import manifest_time_stats, stats2xlsx

stats = manifest_time_stats("data.json", return_stats=True)
stats2xlsx([stats], "stats.xlsx")
```

`corpus_utils.py`:

- Manifest reading/writing
- TSV conversion
- File pairing
- Hashing & deduplication
- Duration statistics
- Excel exporting
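
Hash-based deduplication, as listed above, can be sketched as follows. Which fields the repository actually hashes (text, audio content, or both) is not stated here, so the `fields` parameter, the `text` key, and all function names are assumptions. `reduce_by_reference` shows one possible reading of reducing a corpus against a reference dataset.

```python
import hashlib


def entry_hash(entry, fields=("text",)):
    """Hash the selected fields of a manifest entry with SHA-256."""
    joined = "\x1f".join(str(entry.get(f, "")) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def deduplicate(entries, fields=("text",)):
    """Keep the first entry for each distinct hash, preserving order."""
    seen = set()
    unique = []
    for entry in entries:
        h = entry_hash(entry, fields)
        if h not in seen:
            seen.add(h)
            unique.append(entry)
    return unique


def reduce_by_reference(entries, reference, fields=("text",)):
    """Drop entries whose hash also appears in the reference dataset."""
    ref_hashes = {entry_hash(e, fields) for e in reference}
    return [e for e in entries if entry_hash(e, fields) not in ref_hashes]
```

Hashing the selected fields rather than comparing whole entries lets two utterances with the same transcript but different file paths count as duplicates.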

`normalizer.py`:

- Language rules
- Case normalization
- Punctuation cleaning
- Duration filtering
- Acronym removal
- Verbose mode
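
Duration filtering, combined with the blacklist-based filtering and removal logging mentioned among the features, might be sketched like this. The function name, its parameters, and the `duration` / `text` keys are assumptions, not the normalizer's real interface.

```python
def filter_entries(entries, min_duration=0.0, max_duration=float("inf"), blacklist=()):
    """Split entries into kept ones and (entry, reason) pairs for removed ones."""
    banned = {word.lower() for word in blacklist}
    kept, removed = [], []
    for entry in entries:
        duration = entry.get("duration", 0.0)
        words = entry.get("text", "").lower().split()
        if not (min_duration <= duration <= max_duration):
            removed.append((entry, "duration"))
        elif any(word in banned for word in words):
            removed.append((entry, "blacklist"))
        else:
            kept.append(entry)
    return kept, removed
```

Returning the removed entries with a reason makes it straightforward to log what was dropped and why.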

`wer_evaluator.py`:

- WER calculation (mean & total)
- Optional C&P analysis
- Manifest cleaning
- Summary export

PRs and issues are welcome!