
The final data from the fineweb-2 pipeline has a significantly low number of rows #10

@RKurbanov95

Description

I ran the fineweb-2 pipeline on a Ray cluster using RayPipelineExecutor from datatrove, only for language = 'rus_Cyrl'.
It ran successfully, but the final amount of data is much lower than I expected.
For example, the Common Crawl dump CC-MAIN-2024-10 has around 10 million rows of 'rus_Cyrl' data on Hugging Face, whereas my own run of the fineweb-2 pipeline on the same dump (CC-MAIN-2024-10) produced only 10 thousand rows, 1000 times fewer.
While investigating the issue, I found that the first step of the job, where the raw Common Crawl dump is split into English and non-English data, classified only around 300 million rows as non-English. This amount seems too low, since per the Common Crawl metadata, dumps contain around 45% English and 55% non-English data.
Furthermore, the subsequent language-filtering steps filter out most of the Russian data, leaving only 10 thousand rows at the end.
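To narrow down which stage is dropping the rows, a simple per-stage row count over the pipeline's intermediate JSONL output can help. This is a minimal diagnostic sketch; the stage names and the `output/` directory layout below are hypothetical and should be adapted to the actual output paths of the run:

```python
import gzip
import json
import os

def count_rows(stage_dir):
    """Count JSON documents across all .jsonl / .jsonl.gz shards under a stage's output dir."""
    total = 0
    for root, _dirs, files in os.walk(stage_dir):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(".jsonl.gz"):
                opener = gzip.open
            elif name.endswith(".jsonl"):
                opener = open
            else:
                continue  # skip stats files, logs, etc.
            with opener(path, "rt", encoding="utf-8") as f:
                total += sum(1 for line in f if line.strip())
    return total

if __name__ == "__main__":
    # Hypothetical stage output directories from a fineweb-2 style run:
    stages = ["lang_id", "lang_filter", "dedup", "final"]
    for stage in stages:
        n = count_rows(os.path.join("output", stage))
        print(f"{stage}: {n} rows")
```

Comparing the counts between consecutive stages should show whether the loss happens at the initial English/non-English split or in the later 'rus_Cyrl' filtering steps.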

Was the data on Hugging Face produced by the same fineweb-2 pipeline given in this repository, and how can I tackle this issue?
