# SwissGPC


This repository holds the official implementation of the SwissGPC (Swiss German Podcast Corpus) pipeline used to weakly label data collected from YouTube and the Swiss Broadcasting Corporation (SRG / SRF). As we do not hold any rights to the collected data, we cannot publish the annotated dataset itself. Instead, we publish the data pipeline that downloads, transcribes, and prepares the data for use in, for example, fine-tuning a voice-adaptation TTS model.

If you are interested in how we applied this data using the XTTSv2 architecture, check out our fork of the coqui-tts library here.

## Podcasts

The podcasts in this dataset are listed below, including links to their host websites and the raw and cleaned audio sizes in hours. As outlined above, we do not hold rights to or ownership of these podcasts, so any changes on their hosting platforms are out of our control: tracking changed hyperlinks, partial or complete removals, and similar updates does not fall within the scope of this repository. We will try to provide a general overview of availability but cannot guarantee to do so in real time. The podcasts were downloaded over a period spanning September 2024 to March 2025, so the listed durations may not reflect the audio currently available on the platforms.

| SRF Podcast Name | Raw (h) | Clean (h) | vSwissGPC |
| --- | ---: | ---: | --- |
| #SRFglobal | 36.97 | 33.63 | v1.0 |
| 100 Sekunden Wissen | 186.75 | 152.12 | v1.0 |
| BuchZeichen | 365.10 | 305.62 | v2.0 |
| Debriefing 404 | 243.15 | 195.29 | v1.0 |
| Digital Podcast | 434.56 | 396.59 | v1.0 |
| Dini Mundart | 39.28 | 34.84 | v1.0 |
| Einfach Politik | 40.69 | 38.07 | v2.0 |
| Espresso | 565.84 | 500.50 | v2.0 |
| Focus | 807.08 | 630.22 | v2.0 |
| Gast am Mittag | 34.07 | 30.43 | v1.0 |
| Geek-Sofa | 314.01 | 267.16 | v1.0 |
| Input | 714.13 | 602.91 | v2.0 |
| SRF-Wissen | 44.78 | 39.17 | v1.0 |
| Krimi | 240.80 | 176.05 | v2.0 |
| Kultur-Talk | 55.57 | 51.33 | v1.0 |
| Literaturclub - Zwei mit Buch | 31.65 | 28.04 | v1.0 |
| Medientalk | 68.77 | 62.16 | v1.0 |
| Persönlich | 763.15 | 637.87 | v2.0 |
| Pipifax | 9.04 | 7.66 | v1.0 |
| Podcast am Pistenrand | 18.16 | 15.37 | v1.0 |
| Ratgeber | 574.46 | 445.64 | v2.0 |
| Rehmann | 213.87 | 182.79 | v2.0 |
| Samstagsrundschau | 414.45 | 382.33 | v1.0 |
| Sternstunde Philosophie | 158.67 | 136.70 | v1.0 |
| Sternstunde Religion | 60.58 | 53.90 | v1.0 |
| Sykora Gisler | 149.49 | 125.80 | v1.0 |
| Tagesgespräch | 1688.26 | 1557.43 | v1.0 |
| Ufwärmrundi | 60.72 | 54.95 | v1.0 |
| Vetters Töne | 25.37 | 20.13 | v1.0 |
| Wetterfrage | 65.52 | 59.02 | v1.0 |
| Wirtschaftswoche | 126.23 | 115.31 | v1.0 |
| Wissenschaftsmagazin | 403.10 | 347.52 | v1.0 |
| Zivadiliring | 49.80 | 42.55 | v1.0 |
| Zytlupe | 45.66 | 36.61 | v1.0 |
| **Total** | **9041.28** | **7765.72** | |
| YouTube Podcast Name | Raw (h) | Clean (h) | vSwissGPC |
| --- | ---: | ---: | --- |
| Auf Bewährung - Leben mit Gefängnis | 3.00 | 2.70 | v1.0 |
| Berner Jugendtreff | 127.80 | 89.61 | v1.0 |
| Ein Buch Ein Tee | 3.73 | 3.26 | v1.0 |
| expectations - geplant und ungeplant kinderfrei | 16.84 | 14.80 | v1.0 |
| Fadegrad | 49.95 | 42.40 | v1.0 |
| Feel Good Podcast | 319.60 | 261.43 | v1.0 |
| Finanz Fabio | 58.44 | 49.29 | v1.0 |
| Scho ghört | 23.45 | 20.47 | v1.0 |
| Sexologie - Wissen macht Lust | 15.41 | 13.57 | v1.0 |
| SRF Dokumentationen | 398.73 | 284.01 | v2.0 |
| SRF Reportagen | 196.39 | 148.10 | v2.0 |
| Über den Bücherrand | 14.53 | 12.59 | v1.0 |
| Ungerwegs Daheim | 38.67 | 31.08 | v1.0 |
| Wir müssen reden - Public Eye spricht Klartext | 17.52 | 15.54 | v1.0 |
| **Total** | **1277.47** | **988.85** | |
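As a quick sanity check on the totals above, the clean-to-raw retention of the cleaning step can be computed directly from the two tables:

```python
# Raw/clean totals (hours) taken from the tables above
srf_raw, srf_clean = 9041.28, 7765.72
yt_raw, yt_clean = 1277.47, 988.85

srf_retention = round(srf_clean / srf_raw * 100, 1)  # ≈ 85.9 % retained
yt_retention = round(yt_clean / yt_raw * 100, 1)     # ≈ 77.4 % retained
print(srf_retention, yt_retention)
```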

## Data pipeline

The YouTube data is downloaded using pytubefix, while the SRF data is sourced via the official SRF API. For YouTube, the code expects a playlist rather than a single video link, so that all episodes can be downloaded at once; SRF podcasts only require the podcast name, without any additional information. The pipeline downloads and transcribes podcasts sequentially, one after another; adapting the code to run each step in batch should not take much effort. The pipeline is controlled via the config.yaml, in which you set which podcast should be downloaded from which source and which pipeline steps should run. See the table below for more information about the parameters. We used HDF5 files in our setup, so all data is written into HDF5 files at segmentation time; this can be changed to suit your own setup.
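To illustrate the HDF5-based storage, the sketch below shows one way segmented audio and its annotations could be written with h5py. The dataset layout and attribute names here are hypothetical, chosen for illustration only; they are not the pipeline's actual schema.

```python
import h5py
import numpy as np

def write_segment(h5_path, segment_id, audio, sample_rate, attrs=None):
    """Store one audio segment in a per-podcast HDF5 file.

    Annotation attributes (e.g. speaker label, transcript) are attached
    as HDF5 attributes, mirroring the write_attrs_to_hdf5 idea.
    """
    with h5py.File(h5_path, "a") as f:
        ds = f.create_dataset(segment_id, data=audio)
        ds.attrs["sample_rate"] = sample_rate
        for key, value in (attrs or {}).items():
            ds.attrs[key] = value

# One second of silence at 16 kHz as dummy segment audio
audio = np.zeros(16000, dtype=np.float32)
write_segment("podcast.h5", "episode_001/seg_0000", audio, 16000,
              attrs={"speaker": "SPK_0"})
```

Intermediate groups (here `episode_001`) are created automatically by h5py, so one file can hold all segments of a podcast.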

| Config parameter | Description | Example value for SRF | Example value for YT |
| --- | --- | --- | --- |
| `source` | Defines the source of the podcast (either YT or SRF) | `"srf"` | `"yt"` |
| `youtube_url` | YouTube link to a playlist containing the podcast episodes | `""` | `https://www.youtube.com/playlist?list=PLGJjtm2tSyhQXU-_N2YkfqCffXhY6UHNe` |
| `podcast_name` | Name of the podcast as provided by its authors | `"Zivadiliring"` | `"Finanz Fabio"` |
| `write_attrs_to_hdf5` | Whether attributes (i.e. annotated data) should be added to the HDF5 files | `false` | `false` |
| `steps/download` | Whether the download step should be executed | `true` | `true` |
| `steps/diarization` | Whether the diarization step should be executed | `true` | `true` |
| `steps/segmentation` | Whether the segmentation step should be executed | `true` | `true` |
| `steps/phon_transcription` | Whether the phoneme transcription step should be executed | `true` | `true` |
| `steps/ch_transcription` | Whether the dialect classification step should be executed | `false` | `false` |
| `steps/mel_spectogram` | Whether the mel spectrogram generation step should be executed | `false` | `false` |
| `steps/move_into_dialect_5` | Whether audio should be moved from the per-podcast HDF5 into the unified dialect HDF5 | `false` | `false` |
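Putting the parameters above together, a config.yaml entry for a YouTube podcast might look like the following. This is a sketch of a plausible layout, not the repository's authoritative config structure; consult the config.yaml shipped with the code for the exact format.

```yaml
# Hypothetical layout illustrating the parameters described above
source: "yt"
youtube_url: "https://www.youtube.com/playlist?list=PLGJjtm2tSyhQXU-_N2YkfqCffXhY6UHNe"
podcast_name: "Finanz Fabio"
write_attrs_to_hdf5: false
steps:
  download: true
  diarization: true
  segmentation: true
  phon_transcription: true
  ch_transcription: false
  mel_spectogram: false
  move_into_dialect_5: false
```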
