Betthupferl: A multi-dialectal dataset for German dialect ASR and dialect-to-standard speech translation.
This repository contains supplementary material for the paper:
Verena Blaschke, Miriam Winkler, Constantin Förster, Gabriele Wenger-Glemser, and Barbara Plank. A multi-dialectal dataset for German dialect ASR and dialect-to-standard speech translation. In Proc. Interspeech 2025, pp. 913–917. ISCA. https://www.isca-archive.org/interspeech_2025/blaschke25_interspeech.html
Please cite the paper if you use any of this data/code.
This repository does not contain any audio data.
Access to the audio data must be granted by Bayerischer Rundfunk on a case-by-case basis due to copyright restrictions (contact Gabriele Wenger-Glemser).
When you obtain the audio files, place them inside a folder called audio.
All subfolders containing transcriptions are in zip archives with the password MaiNLP so as to prevent potential inclusion in web-scraped datasets (cf. Jacovi et al., 2023). Unzip them to get the subfolders with the same name.
Please do not re-distribute the transcriptions.
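The unzipping step above could be scripted roughly as follows. This is only a sketch, not part of the repository: the function name `unzip_transcriptions` is hypothetical, and it assumes that every `.zip` file under the repository root is one of the transcription archives and that the archives use the legacy ZIP encryption that Python's `zipfile` module can decrypt. If they are AES-encrypted, an external tool such as 7-Zip would be needed instead.

```python
import zipfile
from pathlib import Path


def unzip_transcriptions(repo_root=".", password="MaiNLP"):
    """Extract every *.zip found under repo_root next to the archive itself.

    Hypothetical helper: it assumes all zip files below repo_root are
    transcription archives protected with the password given in this README.
    """
    extracted = []
    for archive in sorted(Path(repo_root).rglob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # zipfile handles only legacy ZipCrypto decryption; the password
            # is ignored for archive members that are not encrypted.
            zf.extractall(path=archive.parent, pwd=password.encode("utf-8"))
        extracted.append(archive)
    return extracted
```

Calling `unzip_transcriptions()` from the repository root would then return the list of archives it unpacked. The extracted folders fall under the same request not to re-distribute the transcriptions.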
The repository contains the following:
- Reference transcriptions:
  - transcriptions_dialect: Dialectal and Standard German references for dialectal audios (converted from FOLKER annotations)
  - transcriptions_standard_german: Standard German references for Standard German audios (converted from FOLKER annotations)
  - data_processed/transcriptions: Converted versions of the files in transcriptions_*: one TSV file per story & language variety combination, also contains the corresponding audio filepaths
  - data_processed/pure_text: Converted versions of the files in transcriptions_*: one file per language variety (or group thereof) and transcription type (dialectal_dial or standardized_deu), one sentence per line. The files with _gold_ in the name contain the original punctuation and capitalization, the files with _goldnorm_ are lowercased and without punctuation.
- Word-level and sentence-level annotations of the references and/or ASR hypotheses are in the analysis folder (more information in the readme file there).
- The code for ASR predictions and analyses is in code (more details in the readme file there). The scripts should be run from within that folder.
- The model predictions are in predictions (one subfolder per model). The original predictions are in the TSV files; the TXT files contain the pure text so as to be directly comparable with the files in data_processed/pure_text. Important: The TSV files contain older versions of the reference transcriptions. These older references are ignored when calculating the ASR scores.
- The ASR scores (WER, CER, BLEU) are in scores. The files ending with zeroshot.tsv contain the results tables (results averaged over sentences), and the files ending with zeroshot_detailed.tsv list the scores for each sentence.
- The annotation guidelines/descriptions and the data statement are in the PDF data-statement_annotation-guidelines.pdf.
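To illustrate how the normalized (_goldnorm_) text and the sentence-averaged WER/CER scores mentioned above relate to each other, here is a minimal, self-contained sketch. It is not the code from the code folder. The function names are hypothetical, the normalization strips only ASCII punctuation (so German quotation marks such as „ “ would survive), and "averaged over sentences" is read here as the plain mean of per-sentence scores, which is an assumption. The repository's own scripts may differ on all of these points.

```python
import string


def normalize(sentence):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(sentence.lower().translate(table).split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate of one sentence; ref must contain at least one word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate of one sentence; ref must be non-empty."""
    return edit_distance(ref, hyp) / len(ref)


def mean_sentence_score(refs, hyps, metric):
    """Mean of a per-sentence metric over normalized sentence pairs."""
    scores = [metric(normalize(r), normalize(h)) for r, h in zip(refs, hyps)]
    return sum(scores) / len(scores)
```

For example, `mean_sentence_score(refs, hyps, wer)` takes two equally long lists of sentences, such as the lines of a reference file in data_processed/pure_text and the lines of the matching TXT file in predictions. The per-sentence values inside it correspond conceptually to what the zeroshot_detailed.tsv files list.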
@inproceedings{blaschke-etal-2025-multi,
title = {A Multi-Dialectal Dataset for {German} Dialect {ASR} and Dialect-to-Standard Speech Translation},
author = {Blaschke, Verena and Winkler, Miriam and Förster, Constantin and Wenger-Glemser, Gabriele and Plank, Barbara},
booktitle = {Interspeech 2025},
pages = {913--917},
url = {https://www.isca-archive.org/interspeech_2025/blaschke25_interspeech.html},
doi = {10.21437/Interspeech.2025-318},
year = {2025},
month = aug,
}