Skip to content

[datakit] Add full OpenMathInstruct-2 midtraining dataset#6254

Open
taivu1998 wants to merge 1 commit into
marin-community:mainfrom
taivu1998:tdv/openmathinstruct2-full-midtraining
Open

[datakit] Add full OpenMathInstruct-2 midtraining dataset#6254
taivu1998 wants to merge 1 commit into
marin-community:mainfrom
taivu1998:tdv/openmathinstruct2-full-midtraining

Conversation

@taivu1998

Copy link
Copy Markdown
Contributor

Register nvidia/OpenMathInstruct-2 as a Datakit source backed by the full train split rather than the 1M, 2M, or 5M subsets. The new transform downloads the pinned parquet shards, renders problem/generated_solution rows as tagged user/assistant transcripts, preserves problem source and answer metadata, and marks the corpus as synthetic and benchmark-adjacent for downstream mixture analysis.

Expose the normalized source in the Datakit registry and add a Llama 3 midtraining tokenization step so experiments can include the full synthetic math corpus explicitly. Focused tests cover row rendering, invalid-row drops, expected problem sources, full-train download selection, and parquet transform output.

Register the full OpenMathInstruct-2 train split as a Datakit source and midtraining tokenization input. The transform preserves source metadata and renders problem-solution rows as tagged transcripts so the synthetic math corpus can be mixed intentionally.
@taivu1998 taivu1998 marked this pull request as ready for review June 7, 2026 21:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 48b3f0d7c9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

openmathinstruct2_full = openmathinstruct2_normalize_steps()[-1].as_executor_step()
openmathinstruct2_full_tokenized = default_tokenize(
name="openmathinstruct2_full",
dataset=openmathinstruct2_full,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Tokenize only normalized main shards

When this normalized StepSpec is passed as the dataset, default_tokenize treats it as a directory and expand_tokenize_paths expands directories to recursive **/*.parquet globs; the normalize step writes both outputs/main and outputs/dups parquet shards by default. For any OpenMathInstruct-2 duplicate rows, this tokenization step will read the duplicate side-output too and put data that normalization intentionally removed back into the training cache; point the dataset at openmathinstruct2_full / "outputs/main/*.parquet" instead.

Useful? React with 👍 / 👎.

@penfever

Copy link
Copy Markdown

@taivu1998 thanks for your interest in contributing to Marin! I am leading our post training efforts on the project.

A couple questions if you don't mind --

  1. Are you already on the Marin Discord? If not, why not join and introduce yourself there! If you are, I recommend joining the relevant subchannels for your various PRs and posting about why you think they may be important, as this is more likely to get a response.
  2. On the topic of midtraining and posttraining data, we are currently not handling those through datakit. Also, we are aware of most of the popular public HF repos, particularly for SFT :) Would recommend you DM me on Discord as there are places we could use help on the data side, mostly curating brand new open source RL environments designed to challenge frontier models.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants