Hey team,
As part of the Argilla FineWeb-C sprint, we are annotating Arabic and its dialects: MSA, ARY, and ARZ.
The problem is that, in all three, samples are misclassified most of the time.
For example, annotators for Arabic report that most of the data is not Arabic but rather dialects, with a lot of Arabizi (Arabic written in Latin script). Annotators for the dialects, on the other hand, report that most of their samples are in fact MSA!
This mismatch leads to labeling most of the data as problematic.
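In case it helps with triage, here is a minimal sketch of how the Arabizi samples could be flagged by script alone. The function names and the 0.5 threshold are my own assumptions, not anything from the FineWeb-C pipeline. It only tells Arabic script apart from Latin script, so it cannot separate MSA from dialects written in Arabic script, and it will also flag plain English or French text.

```python
import unicodedata


def script_shares(text):
    """Return (arabic_share, latin_share) among the alphabetic characters of text."""
    arabic = latin = total = 0
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation
        total += 1
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            arabic += 1
        elif name.startswith("LATIN"):
            latin += 1
    if total == 0:
        return 0.0, 0.0
    return arabic / total, latin / total


def looks_like_arabizi(text, latin_threshold=0.5):
    """Flag text whose letters are mostly Latin script (threshold is an assumption)."""
    _, latin_share = script_shares(text)
    return latin_share >= latin_threshold


if __name__ == "__main__":
    for sample in ["salam 3alaykom", "السلام عليكم"]:
        print(sample, script_shares(sample), looks_like_arabizi(sample))
```

Something like this would at most be a first-pass filter before annotation, not a replacement for a proper language identifier.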
cc: @nataliaElv