Tagesschau and Subtitle corpora are no longer available

Hi @bmilde, 

first of all, thanks for all the work you've done here! This is a great resource!
Unfortunately, as @sikoried mentioned via e-mail, a large portion of the resources can no longer be accessed via the scripts provided in this repo. This is due to website updates and/or removal of content. 

Currently, we are unable to retrieve the speech corpora from the following sources:

- Tagesschau: `https://www.tagesschau.de/archiv/meldungsarchiv100~_date-<DATE>.html`
   - All requests return a 404 error
   - However, the archive can still be accessed via `https://www.tagesschau.de/archiv?datum=<DATE>`
   - Updating the URL and adjusting the BeautifulSoup parser in `get_texts_tagesschau.py` to the new page layout might make the resource accessible again
- Subtitles: `https://classic.ardmediathek.de/subtitle/1`
   - All requests time out
   - Unfortunately, we don't know whether the content has been moved to a different location or removed from the website entirely
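
To make the suggested Tagesschau fix a bit more concrete, here is a minimal sketch of how the new archive URL could be built and probed. The helper names `build_archive_url` and `probe` are hypothetical and not part of this repo, and the `YYYY-MM-DD` date format for the `datum` parameter is an assumption that would need to be checked against the live site. The HTML parsing in `get_texts_tagesschau.py` would still have to be adapted separately.

```python
from datetime import date
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# New-style archive endpoint mentioned above.
ARCHIVE_BASE = "https://www.tagesschau.de/archiv"


def build_archive_url(day: date, date_format: str = "%Y-%m-%d") -> str:
    """Build the archive URL for one day using the `?datum=` query form.

    ASSUMPTION: the site expects dates as YYYY-MM-DD; pass another
    `date_format` if that turns out to be wrong.
    """
    return f"{ARCHIVE_BASE}?{urlencode({'datum': day.strftime(date_format)})}"


def probe(url: str, timeout: float = 10.0):
    """Return the HTTP status code for `url`, or None on timeout/network error."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # e.g. 404 for the old archive URLs
    except (URLError, TimeoutError):
        return None  # e.g. the timeouts seen for the subtitle endpoint
```

Calling `probe(build_archive_url(date(2020, 1, 5)))` for a few sample dates would be a quick way to confirm whether the new endpoint responds before touching the parser.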

At the moment, only about 48 million of the 102 million sentences can be retrieved:

```bash
wc subs_norm1_filt de_wiki europarl tagesschau_news
         0          0          0 subs_norm1_filt
  46663267  803918603 5792695259 de_wiki
   1887694   44184488  320753686 europarl
         0          0          0 tagesschau_news
  48550961  848103091 6113448945 total
```

Best,
Dominik 


