Develop a notebook that creates a pipeline (recipe) for running new GneissWeb transforms in sequence on some data of your choosing. #983
Description
Search before asking
- I searched the issues and found no similar issues.
Component
Other
Feature
We now have the 4 new transforms that were used in creating GneissWeb as 3 PRs in the repo (one PR implements 2 of these transforms) These PRs are being tested currently, but they are all in reasonable shape and all have notebook examples that will be helpful in creating ONE notebook that sequentially runs the transforms (plus existing transforms such as Filter)The sequence is:
Read Data->Repetition Removal->Readability,Fasttext,DCLM etc Annotation->Extreme token Annotation-> Filter
The 3 PRs are:
#953 (rep removal) #965 (extreme tokenization and readability) and #974 (fasttext classification).
The relevant notebooks are all in the main transform directory, for example: https://github.com/swith005/data-prep-kit-outer/blob/rep_removal/transforms/universal/rep_removal/rep_removal.ipynb for rep removal.
@touma-I @yousafshah @Swanand-Kadhe @Hajar-Emami
Are you willing to submit a PR?
- Yes I am willing to submit a PR!