[datakit] Add full OpenMathInstruct-2 midtraining dataset by taivu1998 · Pull Request #6254 · marin-community/marin

taivu1998 · 2026-06-07T13:53:22Z

Register nvidia/OpenMathInstruct-2 as a Datakit source backed by the full train split rather than the 1M, 2M, or 5M subsets. The new transform downloads the pinned parquet shards, renders problem/generated_solution rows as tagged user/assistant transcripts, preserves problem source and answer metadata, and marks the corpus as synthetic and benchmark-adjacent for downstream mixture analysis.
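A minimal sketch of the row rendering described above, not the code in this PR. The tag strings (`<user>`, `<assistant>`), the output field names, and the helper names `render_row` and `_clean` are hypothetical; the `problem_source` and `expected_answer` column names are assumed from the public dataset card, and only `problem` and `generated_solution` are named in this description.

```python
from typing import Any, Optional


def _clean(value: Any) -> str:
    """Return a stripped string, or an empty string for missing/non-string values."""
    return value.strip() if isinstance(value, str) else ""


def render_row(row: dict) -> Optional[dict]:
    """Render one problem/solution row as a tagged transcript.

    Returns None for rows with an empty problem or solution so the caller
    can drop them. The tag format here is an illustrative assumption.
    """
    problem = _clean(row.get("problem"))
    solution = _clean(row.get("generated_solution"))
    if not problem or not solution:
        return None  # invalid row: dropped by the caller

    text = f"<user>\n{problem}\n</user>\n<assistant>\n{solution}\n</assistant>"
    return {
        "text": text,
        # Metadata carried through for downstream mixture analysis.
        "problem_source": row.get("problem_source"),
        "expected_answer": row.get("expected_answer"),
        # Corpus-level flags (hypothetical field names).
        "synthetic": True,
        "benchmark_adjacent": True,
    }
```

Returning `None` rather than raising keeps the drop decision with the caller, which makes the invalid-row count easy to report.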

Expose the normalized source in the Datakit registry and add a Llama 3 midtraining tokenization step so experiments can include the full synthetic math corpus explicitly. Focused tests cover row rendering, invalid-row drops, expected problem sources, full-train download selection, and parquet transform output.
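The full-train download selection could look roughly like the sketch below. The shard naming (`data/train-*.parquet` for the full split, `data/train_1M-*.parquet` and similar for the subsets) is an assumption about the Hugging Face repository layout, and `select_full_train_shards` is a hypothetical helper, not a function from this PR.

```python
import fnmatch
from typing import Iterable, List

# Assumed layout: full-split shards start with "train-", subset shards with
# "train_1M-", "train_2M-", or "train_5M-".
FULL_TRAIN_PATTERN = "data/train-*.parquet"


def select_full_train_shards(paths: Iterable[str]) -> List[str]:
    """Keep only parquet shards that belong to the full train split."""
    return sorted(p for p in paths if fnmatch.fnmatch(p, FULL_TRAIN_PATTERN))
```

Matching on the `train-` prefix excludes the subset splits because their names continue with an underscore rather than a hyphen.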

Register the full OpenMathInstruct-2 train split as a Datakit source and midtraining tokenization input. The transform preserves source metadata and renders problem-solution rows as tagged transcripts so the synthetic math corpus can be mixed intentionally.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 48b3f0d7c9


chatgpt-codex-connector · 2026-06-07T21:17:08Z

+openmathinstruct2_full = openmathinstruct2_normalize_steps()[-1].as_executor_step()
+openmathinstruct2_full_tokenized = default_tokenize(
+    name="openmathinstruct2_full",
+    dataset=openmathinstruct2_full,


Tokenize only normalized main shards

When this normalized StepSpec is passed as the dataset, default_tokenize treats it as a directory, and expand_tokenize_paths expands directories to recursive **/*.parquet globs. The normalize step writes parquet shards to both outputs/main and outputs/dups by default, so the recursive glob matches both. For any duplicate rows in OpenMathInstruct-2, this tokenization step will therefore also read the duplicate side-output and put data that normalization intentionally removed back into the training cache. Point the dataset at openmathinstruct2_full / "outputs/main/*.parquet" instead.
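The globbing behaviour can be reproduced with the standard library alone. This sketch only mimics the directory layout named in the comment (`outputs/main` and `outputs/dups`); it does not call `default_tokenize` or `expand_tokenize_paths`, and the shard file name and the helpers `list_shards` and `demo` are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def list_shards(root: Path, pattern: str) -> List[str]:
    """Return the paths under root matching pattern, relative to root."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))


def demo() -> Tuple[List[str], List[str]]:
    """Build a fake normalize output tree and compare the two glob patterns."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel in ("outputs/main/part-0.parquet", "outputs/dups/part-0.parquet"):
            shard = root / rel
            shard.parent.mkdir(parents=True, exist_ok=True)
            shard.touch()  # empty placeholder file, not a real parquet shard

        recursive = list_shards(root, "**/*.parquet")            # matches main and dups
        main_only = list_shards(root, "outputs/main/*.parquet")  # matches main only
    return recursive, main_only
```

Here the recursive pattern returns both shards, while the explicit `outputs/main/*.parquet` pattern returns only the deduplicated one.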


penfever · 2026-06-10T20:13:29Z

@taivu1998 thanks for your interest in contributing to Marin! I am leading our post-training efforts on the project.

A couple of questions, if you don't mind:

1. Are you already on the Marin Discord? If not, why not join and introduce yourself there! If you are, I recommend joining the relevant subchannels for your various PRs and posting about why you think they may be important, as this is more likely to get a response.
2. On the topic of midtraining and post-training data, we are currently not handling those through Datakit. Also, we are aware of most of the popular public HF repos, particularly for SFT :) I would recommend you DM me on Discord, as there are places we could use help on the data side, mostly curating brand-new open-source RL environments designed to challenge frontier models.

taivu1998 marked this pull request as ready for review June 7, 2026 21:14

chatgpt-codex-connector Bot reviewed Jun 7, 2026

taivu1998 wants to merge 1 commit into marin-community:main from taivu1998:tdv/openmathinstruct2-full-midtraining

2 participants
