Tagesschau and Subtitle corpora are no longer available #7

Open
@dwgnr

Description

Hi @bmilde,

first of all, thanks for all the work you've done here! This is a great resource!
Unfortunately, as @sikoried mentioned via e-mail, a large portion of the resources can no longer be accessed via the scripts provided in this repo. This is due to website updates and/or removal of content.

Currently, we are unable to retrieve the speech corpora from the following sources:

  • Tagesschau: https://www.tagesschau.de/archiv/meldungsarchiv100~_date-.html
    • All requests return a 404 error
    • However, the archive can still be accessed via https://www.tagesschau.de/archiv?datum=<DATE>
    • Updating the URL and adjusting the BeautifulSoup parser in get_texts_tagesschau.py might make the resource accessible again
  • Subtitles: https://classic.ardmediathek.de/subtitle/1
    • All requests time out
    • Unfortunately, we don't know whether the content has been moved to a different location or removed from the website entirely
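
For the Tagesschau case, the URL change could look roughly like the sketch below. It only builds one archive URL per day for the new endpoint. The helper name `archive_urls` and the ISO date format (YYYY-MM-DD) for <DATE> are assumptions on our side, not something taken from get_texts_tagesschau.py, and the parser changes are not covered here.

```python
from datetime import date, timedelta

# New archive endpoint from the issue; {day} stands in for <DATE>.
ARCHIVE_URL = "https://www.tagesschau.de/archiv?datum={day}"


def archive_urls(start, end):
    """Yield one archive URL per day from start to end (inclusive)."""
    day = start
    while day <= end:
        # Assumption: <DATE> is ISO formatted (YYYY-MM-DD).
        yield ARCHIVE_URL.format(day=day.isoformat())
        day += timedelta(days=1)


urls = list(archive_urls(date(2023, 1, 30), date(2023, 2, 1)))
for url in urls:
    print(url)
```

Each URL would then be fetched and parsed as before, with the BeautifulSoup selectors adapted to the new page layout.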

At the moment, only about 48 million of the 102 million sentences can be retrieved:

wc subs_norm1_filt de_wiki europarl tagesschau_news
         0          0          0 subs_norm1_filt
  46663267  803918603 5792695259 de_wiki
   1887694   44184488  320753686 europarl
         0          0          0 tagesschau_news
  48550961  848103091 6113448945 total
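
To put a number on the gap, here is a small sketch that reads the per-corpus rows of the wc output above and compares the line total against the expected size. It assumes one sentence per line and takes "102 million" as exactly 102,000,000, which is only an approximation.

```python
# Per-corpus rows of the wc output (lines, words, bytes, name).
WC_OUTPUT = """\
         0          0          0 subs_norm1_filt
  46663267  803918603 5792695259 de_wiki
   1887694   44184488  320753686 europarl
         0          0          0 tagesschau_news
"""

# Assumption: the expected total is taken as exactly 102 million sentences.
EXPECTED_SENTENCES = 102_000_000

line_counts = {}
for row in WC_OUTPUT.splitlines():
    lines, _words, _bytes, name = row.split()
    line_counts[name] = int(lines)

retrieved = sum(line_counts.values())
share = retrieved / EXPECTED_SENTENCES
missing = [name for name, count in line_counts.items() if count == 0]

print(f"{retrieved} sentences retrieved ({share:.1%}); empty corpora: {missing}")
```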

Best,
Dominik
