
Bad behavior in wikiextractor #200

Open
@HarikalarKutusu

Description

Here is the original from this article:

Antik Yunanca Grekçe: matesis kelimesi matematik kelimesinin köküdür ve bilirim anlamına gelmektedir.

This is the related source:

Antik Yunanca ''{{dil|grc|matesis}}'' kelimesi matematik kelimesinin köküdür ve ''bilirim'' anlamına gelmektedir.

And this is what is extracted (from text/AA/wiki_00 file):

Antik Yunanca ' kelimesi matematik kelimesinin köküdür ve \"bilirim\" anlamına gelmektedir.

Somehow a stray ' is introduced and the Greek word is dropped. The sentence no longer makes sense, yet apart from the ' character it looks like a valid sentence.
Because the Greek word is removed as well, we cannot catch the sentence with a blacklist either.

I'm not sure how many such occurrences would end up in the random 3-sentence selection, but a fix would be good.
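One possible workaround, sketched here only as an illustration: expand the `{{dil|...}}` template to its inner text before the dump goes through wikiextractor, so the word survives template removal. The helper name `expand_dil_templates` and the regular expression are hypothetical and not part of wikiextractor or cv-sentence-extractor. The sketch assumes the template always has the form `{{dil|<language code>|<text>}}` with no nested templates, and it does not reproduce the language name ("Grekçe:") that the rendered article shows.

```python
import re

# Hypothetical pre-processing helper (not a wikiextractor API).
# Matches {{dil|<lang>|<text>}} and captures <text>; nested templates are not handled.
DIL_TEMPLATE = re.compile(r"\{\{\s*dil\s*\|[^|{}]*\|([^|{}]*)\}\}", re.IGNORECASE)


def expand_dil_templates(wikitext: str) -> str:
    """Replace each {{dil|lang|text}} template with its bare text."""
    return DIL_TEMPLATE.sub(lambda m: m.group(1).strip(), wikitext)


source = (
    "Antik Yunanca ''{{dil|grc|matesis}}'' kelimesi matematik kelimesinin "
    "köküdür ve ''bilirim'' anlamına gelmektedir."
)
print(expand_dil_templates(source))
```

With the template expanded, the italic markers wrap a real word again instead of an empty string, which is presumably what leaves the stray quote character behind.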

PS: I'm aware this is NOT a cv-sentence-extractor issue, but the workflow includes wikiextractor, so...
