Add normalization and tokenization for starcoder2-extras #4626

Merged
ravwojdyla merged 6 commits into main from rav-norm-coder
Apr 10, 2026
Conversation

@ravwojdyla-agent
Contributor

Summary

  • Add download, normalize, and tokenize pipeline for all 6 starcoder2data-extras subsets (ir_cpp, ir_python, ir_rust, ir_low_resource, documentation, kaggle)
  • Add file_extensions filter to normalize's file discovery to skip non-data files (e.g. provenance.json)
  • Expose levanter_batch_size through the full write pipeline (writers → plan → Dataset → TokenizeConfig → default_tokenize) to control memory for large-document datasets
  • Expose resources and worker_resources on default_tokenize for per-subset memory tuning
  • Remove reshard_starcoder2_extras_step (superseded by normalize)
  • Documentation subset gets 32GB worker memory — contains a single 64MB OpenJDK record that peaks at ~9GB RSS during tokenization
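The `file_extensions` filter on normalize's file discovery could be sketched as a small helper like the one below; the function name and signature are illustrative, not the project's actual API:

```python
from pathlib import Path


def discover_data_files(root: str, file_extensions: tuple[str, ...]) -> list[str]:
    """Recursively list files under root, keeping only data files.

    Skips sidecar files such as provenance.json when file_extensions
    is e.g. (".parquet",) or (".jsonl.gz",).
    """
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.name.endswith(file_extensions)
    )
```

`str.endswith` accepts a tuple of suffixes, so multi-part extensions like `.jsonl.gz` match without extra handling.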

Test plan

  • All 6 subsets normalize and tokenize successfully on Iris in europe-west4
  • Benchmarked tokenization of 64MB doc: whole (8.8GB peak), paragraph-split (1.1GB peak)
  • Verified levanter_batch_size=128 prevents OOM on large-document shards

🤖 Generated with Claude Code

@ravwojdyla-agent added the agent-generated (Created by automation/agent) label Apr 10, 2026

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ccbcd762a0

Comment thread: lib/zephyr/src/zephyr/writers.py
Helw150 and others added 6 commits April 10, 2026 18:16
Adds download/reshard helpers for the bigcode/starcoder2data-extras
subsets (ir_cpp, ir_python, ir_rust, ir_low_resource, documentation,
kaggle) and an experiment script that tokenizes each subset with the
marin tokenizer. ir_low_resource is resharded to even out its parquet
files and given 80g worker RAM.
…starcoder2-extras

- Add normalize step between download and tokenize for starcoder2-extras,
  replacing the reshard step with the standard normalize pipeline
- Add max_record_size param to normalize to split oversized documents
  (documentation subset has records up to 64MB e.g. full OpenJDK docs)
- Add file_extensions filter to normalize's file discovery to skip
  non-data files like provenance.json
- Expose levanter_batch_size through the full write pipeline
  (writers → plan → Dataset → TokenizeConfig → default_tokenize)
  to control memory usage for large-document datasets
- Remove reshard_starcoder2_extras_step (superseded by normalize)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the record-splitting feature (max_record_size) from normalize —
reverting _make_normalize_fn back to a simple map, removing the parameter
from _build_pipeline, normalize_to_parquet, normalize_step, and the
starcoder2_extras experiment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…entation to 32GB

Add resources and worker_resources parameters to default_tokenize so
callers can override the Fray container and Zephyr worker memory limits.
Documentation subset gets 32GB for both — it contains a single 64MB
OpenJDK record that peaks at ~9GB RSS during tokenization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reject batch_size < 1 in write_levanter_cache to prevent silent data
loss (batch_size=0 would drop all records after the exemplar).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
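The guard described in this commit can be sketched as follows; `write_batches` is a hypothetical stand-in for `write_levanter_cache`'s batching loop, not the actual implementation:

```python
from itertools import islice
from typing import Iterable


def write_batches(records: Iterable, batch_size: int) -> int:
    """Consume records in batches; return the number of records written."""
    if batch_size < 1:
        # Without this guard, batch_size=0 makes islice(it, 0) yield an
        # empty batch immediately, the loop exits, and every record after
        # the exemplar is silently dropped.
        raise ValueError(f"batch_size must be >= 1, got {batch_size}")
    written = 0
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        written += len(batch)
    return written
```

Rejecting the value up front turns a silent-data-loss bug into an immediate, visible error.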
Member

@Helw150 left a comment

lgtm

@ravwojdyla merged commit 87174b2 into main Apr 10, 2026
40 checks passed
@ravwojdyla deleted the rav-norm-coder branch April 10, 2026 23:34
ravwojdyla added a commit that referenced this pull request Apr 18, 2026
Extend the convention established by common_corpus and starcoder2-extras
(#4626): expose a normalize_<dataset>_step factory in each datakit
download module and wire experiments through download -> normalize ->
tokenize.

Since normalize now processes a single directory (#4886), datasets with
multiple sub-datasets (nemotron v1/v2) get one normalize step per split:

- nsf_awards: one step, id_field="awd_id", file_extensions=(".parquet",)
- nemotron_v1: one step per quality/kind split (7 splits defined in
  NEMOTRON_V1_SPLITS), file_extensions=(".jsonl.gz",)
- nemotron_v2: one step per (family, subset) pair,
  file_extensions=(".parquet",)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
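The one-step-per-split convention could be sketched like this; `NormalizeStep`, the split values, and the `id_field` are placeholders, not the datakit's real identifiers:

```python
from dataclasses import dataclass

# Placeholder split names, standing in for the real NEMOTRON_V1_SPLITS.
NEMOTRON_V1_SPLITS = (
    "quality=high/kind=actual",
    "quality=medium-low/kind=actual",
)


@dataclass(frozen=True)
class NormalizeStep:
    """Hypothetical stand-in for the datakit step object."""
    input_path: str
    id_field: str
    file_extensions: tuple[str, ...]


def normalize_nemotron_v1_steps(download_root: str) -> list[NormalizeStep]:
    # Since normalize processes a single directory, emit one step per split.
    return [
        NormalizeStep(
            input_path=f"{download_root}/{split}",
            id_field="id",
            file_extensions=(".jsonl.gz",),
        )
        for split in NEMOTRON_V1_SPLITS
    ]
```

A single-directory dataset like nsf_awards would collapse to one step with its own `id_field` and extensions.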
ravwojdyla added a commit that referenced this pull request Apr 20, 2026
* add `normalize_<dataset>_step` factories for `nsf_awards`,
`nemotron_v1`, `nemotron_v2` — extends the convention from #4626
* since normalize now processes a single directory (#4886), multi-split
datasets get one normalize step per split
  * `nsf_awards`: one step, `id_field="awd_id"`, `.parquet`
* `nemotron_v1`: one step per `quality`/`kind` split (7 in
`NEMOTRON_V1_SPLITS`), `.jsonl.zst` [^1]
  * `nemotron_v2`: one step per `(family, subset)`, `.parquet`
* wire `nsf_awards` and `nemotron_v2` experiments through download →
normalize → tokenize
* `nemotron_v1` experiment wiring deferred — existing hardcoded-path
tokenize stays until the full normalize + dedup + consolidate chain is
validated
* validated `nemotron_v1` normalize end-to-end on
`quality=medium-low/kind=actual` (1.24B records, 6299 shards, peak 14.47
GB on 16 GB workers); `nsf_awards` normalize completed (42 parquet
files)

[^1]: downloader writes `.jsonl.zst` (rewrites `jsonl.zstd` →
`jsonl.zst` on write)

Co-authored-by: Rafal Wojdyla <ravwojdyla@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
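The footnote's suffix rewrite can be sketched as a tiny helper (hypothetical, not the downloader's actual code):

```python
def normalize_suffix(filename: str) -> str:
    # Rewrite the upstream ".jsonl.zstd" suffix to ".jsonl.zst" so the
    # normalize step's file_extensions filter matches it.
    if filename.endswith(".jsonl.zstd"):
        return filename[: -len("zstd")] + "zst"
    return filename
```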