ARABIC: Wrong language or dialect or script #1

@alielfilali01

Hey team,
As part of the Argilla FineWeb-C sprint, we are annotating Arabic and its dialects: Modern Standard Arabic (MSA), Moroccan Arabic (ARY), and Egyptian Arabic (ARZ).
The problem with all three is that samples are misclassified most of the time.
For example, annotators working on the Arabic (MSA) dataset report that most of the data is not MSA but dialectal Arabic, with a lot of Arabizi (Arabic written in Latin script). Annotators working on the dialect datasets report that most of their samples are in fact MSA!
This mismatch leads to most of the data being labeled as problematic.
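To make the script part of the problem concrete, here is a minimal sketch of how Arabizi samples could be flagged before they reach annotators. It is only an illustration, not part of the FineWeb-C pipeline: the function names `script_shares` and `looks_like_arabizi` and the 0.5 threshold are hypothetical, and the check relies solely on Unicode character names from Python's standard library. It catches Latin-script text only; it cannot tell MSA apart from a dialect written in Arabic script.

```python
import unicodedata


def script_shares(text: str) -> dict[str, float]:
    """Return the share of letters in `text` that are Arabic-script vs Latin-script."""
    # Only alphabetic characters count; digits, punctuation and diacritics are ignored.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return {"arabic": 0.0, "latin": 0.0}
    # Unicode character names start with the script name, e.g. "ARABIC LETTER BEH".
    names = [unicodedata.name(ch, "") for ch in letters]
    arabic = sum(name.startswith("ARABIC") for name in names)
    latin = sum(name.startswith("LATIN") for name in names)
    return {"arabic": arabic / len(letters), "latin": latin / len(letters)}


def looks_like_arabizi(text: str, threshold: float = 0.5) -> bool:
    """Flag a sample whose letters are mostly Latin (threshold is an assumed value)."""
    return script_shares(text)["latin"] > threshold
```

For instance, `looks_like_arabizi("salam 3likom")` is flagged, while a sample written entirely in Arabic script is not. Samples that pass this check would still need a proper dialect identifier to separate MSA from ARY or ARZ.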
cc: @nataliaElv