Add datakit normalize steps for nsf_awards and nemotron v1/v2
Extend the convention established by common_corpus and starcoder2-extras
(#4626): expose a normalize_<dataset>_step factory in each datakit
download module and wire experiments through download -> normalize ->
tokenize.
Since normalize now processes a single directory (#4886), datasets with
multiple sub-datasets (nemotron v1/v2) get one normalize step per split:
- nsf_awards: one step, id_field="awd_id", file_extensions=(".parquet",)
- nemotron_v1: one step per quality/kind split (7 splits defined in
  NEMOTRON_V1_SPLITS), file_extensions=(".jsonl.gz",)
- nemotron_v2: one step per (family, subset) pair,
  file_extensions=(".parquet",)
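
A minimal sketch of the one-step-per-split pattern described above. It is an illustration only, not the datakit API: NormalizeStep, normalize_steps_per_split, the "raw/..." paths, and the split names are all hypothetical stand-ins. The real factories are the normalize_<dataset>_step functions, and the real split names live in NEMOTRON_V1_SPLITS.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class NormalizeStep:
    """Hypothetical stand-in for a datakit normalize step (not the real class)."""

    name: str
    input_dir: str  # a step covers exactly one directory
    id_field: Optional[str]
    file_extensions: Tuple[str, ...]


def normalize_steps_per_split(
    dataset: str,
    splits: Sequence[str],
    file_extensions: Tuple[str, ...],
    id_field: Optional[str] = None,
) -> List[NormalizeStep]:
    """Build one step per split, because each step handles a single directory."""
    return [
        NormalizeStep(
            name=f"normalize/{dataset}/{split}",
            input_dir=f"raw/{dataset}/{split}",  # assumed layout, for illustration
            id_field=id_field,
            file_extensions=file_extensions,
        )
        for split in splits
    ]


# Placeholder split names; the actual list is NEMOTRON_V1_SPLITS.
example_splits = ["split_a", "split_b"]
v1_steps = normalize_steps_per_split("nemotron_v1", example_splits, (".jsonl.gz",))

# A single-directory dataset such as nsf_awards needs only one step.
nsf_step = NormalizeStep(
    name="normalize/nsf_awards",
    input_dir="raw/nsf_awards",
    id_field="awd_id",
    file_extensions=(".parquet",),
)
```

Each generated step then feeds its own tokenize step, which keeps the download -> normalize -> tokenize chain intact per split.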
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>