Description
I ran the fineweb-2 pipeline on a Ray cluster using RayPipelineExecutor from datatrove, restricted to language = 'rus_Cyrl'.
The run completed successfully, but the final amount of data is much lower than I expected.
For example, for the Common Crawl dump CC-MAIN-2024-10, the dataset on Hugging Face has around 10 million rows of 'rus_Cyrl' data, whereas my own run of the fineweb-2 pipeline on the same dump produced only about 10 thousand rows, roughly 1000 times fewer.
While investigating the issue, I found that the first step of the job, where the raw Common Crawl dump is split into English and non-English data, separated only around 300 million rows as non-English. This seems too low: according to the Common Crawl metadata, dumps contain around 45% English and 55% non-English data.
Furthermore, the subsequent language-filtering steps discard most of the Russian data, leaving only about 10 thousand rows at the end.
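To make the filtering effect concrete, here is a toy sketch in plain Python (this is not datatrove code; the documents, confidence scores, and threshold values are invented for illustration) showing how a language-confidence threshold controls how many rows survive a filtering step:

```python
# Illustrative only: a simplified language-confidence filter.
# All documents, scores, and thresholds below are made up.

def filter_by_language(docs, lang, threshold):
    """Keep documents whose predicted language matches `lang`
    and whose confidence score is at least `threshold`."""
    return [d for d in docs if d["lang"] == lang and d["score"] >= threshold]

# Hypothetical documents with (language, confidence) predictions,
# in the style of a fastText language-identification model.
docs = [
    {"id": 1, "lang": "rus_Cyrl", "score": 0.95},
    {"id": 2, "lang": "rus_Cyrl", "score": 0.70},
    {"id": 3, "lang": "rus_Cyrl", "score": 0.40},
    {"id": 4, "lang": "eng_Latn", "score": 0.99},
]

kept_strict = filter_by_language(docs, "rus_Cyrl", threshold=0.90)
kept_loose = filter_by_language(docs, "rus_Cyrl", threshold=0.30)
print(len(kept_strict), len(kept_loose))  # prints: 1 3
```

A small change in the threshold (or a model that assigns systematically lower confidence to a language) can change the surviving row count by a large factor, which may be worth checking in the pipeline configuration.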
Was the dataset on Hugging Face produced by the same fineweb-2 pipeline provided in this repository, and how can I tackle this issue?