Open
Hi @bmilde,
first of all, thanks for all the work you've done here! This is a great resource!
Unfortunately, as @sikoried mentioned via e-mail, a large portion of the resources can no longer be accessed via the scripts provided in this repo. This is due to website updates and/or removal of content.
Currently, we are unable to retrieve the speech corpora from the following sources:
- Tagesschau: https://www.tagesschau.de/archiv/meldungsarchiv100~_date-.html
  - All requests return a 404 error
  - However, the archive can still be accessed via `https://www.tagesschau.de/archiv?datum=<DATE>`
  - Maybe updating the URL and some changes to the BeautifulSoup parser in `get_texts_tagesschau.py` would make the resource accessible again
- Subtitles: https://classic.ardmediathek.de/subtitle/1
  - All requests get a timeout
  - Unfortunately, we don't know whether the content has been moved to a different location or removed from the website entirely
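To make the Tagesschau suggestion more concrete, here is a minimal sketch of how the per-day archive URLs could be generated and checked. It is not taken from `get_texts_tagesschau.py`: the helper names `archive_urls` and `is_reachable` are hypothetical, and it assumes that `<DATE>` is an ISO date (`YYYY-MM-DD`), which would need to be confirmed against the live site. The BeautifulSoup parsing is left out on purpose, since we have not yet looked at the new page layout.

```python
from datetime import date, timedelta
from urllib.request import Request, urlopen

# New archive URL pattern from this issue; {day} stands in for <DATE>.
ARCHIVE_URL = "https://www.tagesschau.de/archiv?datum={day}"


def archive_urls(start, end):
    """Yield one archive URL per day from start to end (inclusive)."""
    day = start
    while day <= end:
        # Assumption: the site expects an ISO date (YYYY-MM-DD).
        yield ARCHIVE_URL.format(day=day.isoformat())
        day += timedelta(days=1)


def is_reachable(url, timeout=10.0):
    """Return True if the URL answers with HTTP 200, False otherwise."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # HTTP errors (e.g. 404), connection errors and socket timeouts
        # all derive from OSError.
        return False
```

Calling `is_reachable(url)` for each URL from `archive_urls(date(2020, 1, 1), date(2020, 1, 7))` should show quickly whether the new pattern works for a given date range. The same check could be pointed at the subtitle URL to see whether the timeouts persist.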
At the moment, only about 48 million of the 102 million sentences can be retrieved:
wc subs_norm1_filt de_wiki europarl tagesschau_news
0 0 0 subs_norm1_filt
46663267 803918603 5792695259 de_wiki
1887694 44184488 320753686 europarl
0 0 0 tagesschau_news
48550961 848103091 6113448945 total
Best,
Dominik