Add normalization and tokenization for starcoder2-extras #4626

Merged
ravwojdyla merged 6 commits into main from rav-norm-coder
Apr 10, 2026
Conversation

@ravwojdyla-agent
Contributor

Summary

  • Add download, normalize, and tokenize pipeline for all 6 starcoder2data-extras subsets (ir_cpp, ir_python, ir_rust, ir_low_resource, documentation, kaggle)
  • Add file_extensions filter to normalize's file discovery to skip non-data files (e.g. provenance.json)
  • Expose levanter_batch_size through the full write pipeline (writers → plan → Dataset → TokenizeConfig → default_tokenize) to control memory for large-document datasets
  • Expose resources and worker_resources on default_tokenize for per-subset memory tuning
  • Remove reshard_starcoder2_extras_step (superseded by normalize)
  • Documentation subset gets 32GB worker memory — contains a single 64MB OpenJDK record that peaks at ~9GB RSS during tokenization
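The `file_extensions` filter on normalize's file discovery could be sketched as a small helper like the one below; the function name and signature are illustrative, not the project's actual API:

```python
from pathlib import Path


def discover_data_files(root: str, file_extensions: tuple[str, ...]) -> list[str]:
    """Recursively list files under root, keeping only data files.

    Skips sidecar files such as provenance.json when file_extensions
    is e.g. (".parquet",) or (".jsonl.gz",).
    """
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.name.endswith(file_extensions)
    )
```

`str.endswith` accepts a tuple of suffixes, so multi-part extensions like `.jsonl.gz` match without extra handling.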

Test plan

  • All 6 subsets normalize and tokenize successfully on Iris in europe-west4
  • Benchmarked tokenization of 64MB doc: whole (8.8GB peak), paragraph-split (1.1GB peak)
  • Verified levanter_batch_size=128 prevents OOM on large-document shards

🤖 Generated with Claude Code

@ravwojdyla-agent added the agent-generated (Created by automation/agent) label Apr 10, 2026

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ccbcd762a0

Comment thread: lib/zephyr/src/zephyr/writers.py
Helw150 and others added 6 commits April 10, 2026 18:16
Adds download/reshard helpers for the bigcode/starcoder2data-extras
subsets (ir_cpp, ir_python, ir_rust, ir_low_resource, documentation,
kaggle) and an experiment script that tokenizes each subset with the
marin tokenizer. ir_low_resource is resharded to even out its parquet
files and given 80g worker RAM.
…starcoder2-extras

- Add normalize step between download and tokenize for starcoder2-extras,
  replacing the reshard step with the standard normalize pipeline
- Add max_record_size param to normalize to split oversized documents
  (documentation subset has records up to 64MB e.g. full OpenJDK docs)
- Add file_extensions filter to normalize's file discovery to skip
  non-data files like provenance.json
- Expose levanter_batch_size through the full write pipeline
  (writers → plan → Dataset → TokenizeConfig → default_tokenize)
  to control memory usage for large-document datasets
- Remove reshard_starcoder2_extras_step (superseded by normalize)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the record-splitting feature (max_record_size) from normalize —
reverting _make_normalize_fn back to a simple map, removing the parameter
from _build_pipeline, normalize_to_parquet, normalize_step, and the
starcoder2_extras experiment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…entation to 32GB

Add resources and worker_resources parameters to default_tokenize so
callers can override the Fray container and Zephyr worker memory limits.
Documentation subset gets 32GB for both — it contains a single 64MB
OpenJDK record that peaks at ~9GB RSS during tokenization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reject batch_size < 1 in write_levanter_cache to prevent silent data
loss (batch_size=0 would drop all records after the exemplar).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
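The guard described in this commit can be sketched as follows; `write_batches` is a hypothetical stand-in for `write_levanter_cache`'s batching loop, not the actual implementation:

```python
from itertools import islice
from typing import Iterable


def write_batches(records: Iterable, batch_size: int) -> int:
    """Consume records in batches; return the number of records written."""
    if batch_size < 1:
        # Without this guard, batch_size=0 makes islice(it, 0) yield an
        # empty batch immediately, the loop exits, and every record after
        # the exemplar is silently dropped.
        raise ValueError(f"batch_size must be >= 1, got {batch_size}")
    written = 0
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        written += len(batch)
    return written
```

Rejecting the value up front turns a silent-data-loss bug into an immediate, visible error.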
Member

@Helw150 left a comment

lgtm

@ravwojdyla merged commit 87174b2 into main Apr 10, 2026
40 checks passed
@ravwojdyla deleted the rav-norm-coder branch April 10, 2026 23:34
ravwojdyla added a commit that referenced this pull request Apr 18, 2026
Extend the convention established by common_corpus and starcoder2-extras
(#4626): expose a normalize_<dataset>_step factory in each datakit
download module and wire experiments through download -> normalize ->
tokenize.

Since normalize now processes a single directory (#4886), datasets with
multiple sub-datasets (nemotron v1/v2) get one normalize step per split:

- nsf_awards: one step, id_field="awd_id", file_extensions=(".parquet",)
- nemotron_v1: one step per quality/kind split (7 splits defined in
  NEMOTRON_V1_SPLITS), file_extensions=(".jsonl.gz",)
- nemotron_v2: one step per (family, subset) pair,
  file_extensions=(".parquet",)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
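The one-step-per-split convention could be sketched like this; `NormalizeStep`, the split values, and the `id_field` are placeholders, not the datakit's real identifiers:

```python
from dataclasses import dataclass

# Placeholder split names, standing in for the real NEMOTRON_V1_SPLITS.
NEMOTRON_V1_SPLITS = (
    "quality=high/kind=actual",
    "quality=medium-low/kind=actual",
)


@dataclass(frozen=True)
class NormalizeStep:
    """Hypothetical stand-in for the datakit step object."""
    input_path: str
    id_field: str
    file_extensions: tuple[str, ...]


def normalize_nemotron_v1_steps(download_root: str) -> list[NormalizeStep]:
    # Since normalize processes a single directory, emit one step per split.
    return [
        NormalizeStep(
            input_path=f"{download_root}/{split}",
            id_field="id",
            file_extensions=(".jsonl.gz",),
        )
        for split in NEMOTRON_V1_SPLITS
    ]
```

A single-directory dataset like nsf_awards would collapse to one step with its own `id_field` and extensions.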
ravwojdyla added a commit that referenced this pull request Apr 20, 2026
* add `normalize_<dataset>_step` factories for `nsf_awards`,
`nemotron_v1`, `nemotron_v2` — extends the convention from #4626
* since normalize now processes a single directory (#4886), multi-split
datasets get one normalize step per split
  * `nsf_awards`: one step, `id_field="awd_id"`, `.parquet`
* `nemotron_v1`: one step per `quality`/`kind` split (7 in
`NEMOTRON_V1_SPLITS`), `.jsonl.zst` [^1]
  * `nemotron_v2`: one step per `(family, subset)`, `.parquet`
* wire `nsf_awards` and `nemotron_v2` experiments through download →
normalize → tokenize
* `nemotron_v1` experiment wiring deferred — existing hardcoded-path
tokenize stays until the full normalize + dedup + consolidate chain is
validated
* validated `nemotron_v1` normalize end-to-end on
`quality=medium-low/kind=actual` (1.24B records, 6299 shards, peak 14.47
GB on 16 GB workers); `nsf_awards` normalize completed (42 parquet
files)

[^1]: downloader writes `.jsonl.zst` (rewrites `jsonl.zstd` →
`jsonl.zst` on write)

Co-authored-by: Rafal Wojdyla <ravwojdyla@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
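The footnote's suffix rewrite can be sketched as a tiny helper (hypothetical, not the downloader's actual code):

```python
def normalize_suffix(filename: str) -> str:
    # Rewrite the upstream ".jsonl.zstd" suffix to ".jsonl.zst" so the
    # normalize step's file_extensions filter matches it.
    if filename.endswith(".jsonl.zstd"):
        return filename[: -len("zstd")] + "zst"
    return filename
```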