Description
I ran the fineweb-2 pipeline on a Ray cluster using RayPipelineExecutor from datatrove, restricted to language = 'rus_Cyrl'.
The run completed successfully, but the final amount of data is much lower than I expected.
For example, for the Common Crawl dump CC-MAIN-2024-10, the dataset on Hugging Face has around 10 million rows of 'rus_Cyrl' data, whereas my own run of the fineweb-2 pipeline on the same dump produced only about 10 thousand rows, roughly 1000 times fewer.
While investigating the issue, I found that the first step of the job, where the raw Common Crawl dump is split into English and non-English data, separated only around 300 million rows as non-English. This seems too low: according to the Common Crawl metadata, dumps contain around 45% English and 55% non-English data.
Furthermore, the subsequent language-filtering steps discard most of the Russian data, leaving only about 10 thousand rows at the end.
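To make the filtering effect concrete, here is a toy sketch in plain Python (this is not datatrove code; the documents, confidence scores, and threshold values are invented for illustration) showing how a language-confidence threshold controls how many rows survive a filtering step:

```python
# Illustrative only: a simplified language-confidence filter.
# All documents, scores, and thresholds below are made up.

def filter_by_language(docs, lang, threshold):
    """Keep documents whose predicted language matches `lang`
    and whose confidence score is at least `threshold`."""
    return [d for d in docs if d["lang"] == lang and d["score"] >= threshold]

# Hypothetical documents with (language, confidence) predictions,
# in the style of a fastText language-identification model.
docs = [
    {"id": 1, "lang": "rus_Cyrl", "score": 0.95},
    {"id": 2, "lang": "rus_Cyrl", "score": 0.70},
    {"id": 3, "lang": "rus_Cyrl", "score": 0.40},
    {"id": 4, "lang": "eng_Latn", "score": 0.99},
]

kept_strict = filter_by_language(docs, "rus_Cyrl", threshold=0.90)
kept_loose = filter_by_language(docs, "rus_Cyrl", threshold=0.30)
print(len(kept_strict), len(kept_loose))  # prints: 1 3
```

A small change in the threshold (or a model that assigns systematically lower confidence to a language) can change the surviving row count by a large factor, which may be worth checking in the pipeline configuration.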
Was the dataset on Hugging Face produced by the same fineweb-2 pipeline provided in this repository, and how can I tackle this issue?