
Conversation

Collaborator

@dgtm777 dgtm777 commented Oct 30, 2025

No description provided.

Signed-off-by: dgitman <[email protected]>
dgtm777 and others added 6 commits October 30, 2025 12:40
Signed-off-by: dgitman <[email protected]>
Signed-off-by: Rima Shahbazyan <[email protected]>
Signed-off-by: dgitman <[email protected]>
Signed-off-by: dgitman <[email protected]>
Signed-off-by: dgitman <[email protected]>
Collaborator

@Kipok Kipok left a comment


Consider adding Slurm tests for a simplified use case of this pipeline to ensure nothing is broken in the future.

- ["mmlu", "test"]
- ["mmlu-pro", "test"]
- ["gpqa", "diamond"]
model: /hf_models/Qwen2.5-32B-Instruct
Collaborator


Consider using Qwen/Qwen2.5-32B-Instruct here and everywhere else to avoid the extra step of manually downloading the model.

Collaborator Author


I am constantly getting a HuggingFace rate limit when using the HF model name instead of a local path.
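A common workaround for this is to pre-download the model once and point the configs at the local snapshot. A minimal sketch, assuming the `huggingface_hub` package; the `resolve_model` helper is invented for illustration and is not part of the PR:

```python
import os

def resolve_model(path_or_name: str) -> str:
    """Return a local model path, downloading the Hub snapshot once if needed."""
    if os.path.isdir(path_or_name):
        return path_or_name  # already a local path: no Hub traffic, no rate limit
    # Falls back to downloading into the local HF cache and returning that directory.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return snapshot_download(repo_id=path_or_name)

print(resolve_model("."))  # an existing directory is returned unchanged
```

With the snapshot cached locally, the Hub is only contacted on the first call, which sidesteps the rate limit on repeated pipeline runs.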

Signed-off-by: dgitman <[email protected]>
solution_key: ${output_key}
test_cases:
- { input: { generation: "1 + 2 + 3 + 4 = 10" }, output: { generation: "1 + 2 + 3 + 4 = 10" } }
# TODO: implement fractional arithmetic
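The test case above pairs an input with an unchanged output, which suggests the filter passes through lines whose arithmetic checks out. A minimal sketch of such an integer-only check (the function name and regex are illustrative, not the PR's actual code; fractional arithmetic is out of scope, per the TODO):

```python
import re

def check_integer_arithmetic(text: str) -> bool:
    """Return True if every simple 'a + b + ... = c' expression in the text
    evaluates correctly (integers only)."""
    pattern = re.compile(r"((?:\d+\s*\+\s*)+\d+)\s*=\s*(\d+)")
    for expr, total in pattern.findall(text):
        terms = [int(t) for t in re.split(r"\+", expr)]
        if sum(terms) != int(total):
            return False
    return True

print(check_integer_arithmetic("1 + 2 + 3 + 4 = 10"))  # True: the sum checks out
print(check_integer_arithmetic("2 + 2 = 5"))           # False: incorrect sum
```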
Collaborator


Is this needed?

Collaborator Author


This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.

## Config Layout
- **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
Collaborator


Could you link the boxed prompt? Will the pipeline handle updating from boxed to something like the HLE prompt as the default, i.e. no boxed answer, requiring a judge?

Collaborator Author


For difficulty estimation, it currently supports only boxed-like prompts. For solution generation, it should work (with proper modification of the config), but the predicted_answer for every sample will be empty, and the majority-voting step (used when expected_answer is not set) will consequently behave incorrectly.

- `python_enabled` — enable python-tool prompting and sandbox execution.
- `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
Collaborator


what does the "metadata-only flow" include?

Collaborator Author


Decontamination, topics, difficulty

Collaborator Author


Renamed and added a reference to the section with the description

- `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
- `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
Collaborator


Not entirely clear what you mean here. How about listing all possible steps/stages that the pipeline can do somewhere at the top? It might make defining these flags easier.

Collaborator Author


I moved the description of stages on top

- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
- `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
- `multiple_prompts` — allow the use of multiple prompts for generation.
Collaborator


Could you elaborate here? E.g.:

"enables the use of multiple prompts (including distinct preambles and varied output formats) during the generation process to address prompt sensitivity."

It's not clear what additional prompts/output formats are used here, or how to specify them.


Collaborator Author


Added the reference to the section

- **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
- **Settings overrides** (under [`configs/settings/`](configs/settings/)) layer small, reusable tweaks. Reference them with or without the `.yaml` suffix:
- `without_gt` — route the pipeline through solution generation + majority voting to estimate the ground-truth answer.
- `python_enabled` — enable python-tool prompting and sandbox execution.
Collaborator


Any specifics here? E.g., what is the python-tool prompt? Where is it defined?

Collaborator Author


Here is the link to all the parameters for each setting: configs/settings/ (it is in the README).
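For concreteness, a hypothetical settings override showing the general shape of such a file (the keys below are invented for this sketch; the real overrides under configs/settings/ define the actual parameters):

```yaml
# hypothetical contents of a settings override such as without_gt.yaml
generate_solutions:
  majority_voting: true   # estimate the ground-truth answer by vote
difficulty_estimation:
  enabled: true
```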

# OpenScienceReasoning Pipeline Quickstart
This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.

## Config Layout
Collaborator


Any assumptions on the incoming files, like .jsonl format? Any particular fields needed?


Collaborator Author


I have added the reference to the section in the filter_problem stage description

print(page.content[:500]) # First 500 characters of content

print(page.links[:10]) # First 10 linked pages

Collaborator


@jiacheng-xu - let's update this one to the final version you're working on
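Pending that final version, a self-contained sketch of what the snippet above exercises, with a stub standing in for the real page object (the `Page` class here is invented for illustration; the actual object comes from the PR's wiki tooling):

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """Stub for the page object used in the snippet above (illustrative only)."""
    content: str = ""
    links: list[str] = field(default_factory=list)

page = Page(content="Lorem ipsum " * 100,
            links=[f"Article_{i}" for i in range(25)])

print(page.content[:500])  # First 500 characters of content
print(page.links[:10])     # First 10 linked pages
```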

dgtm777 and others added 13 commits November 22, 2025 17:09
Signed-off-by: dgitman <[email protected]>
Signed-off-by: Jiacheng Xu <[email protected]>
Signed-off-by: Jiacheng Xu <[email protected]>
@dgtm777 dgtm777 requested a review from ekmb December 2, 2025 08:41
Collaborator

Kipok commented Jan 8, 2026

@dgtm777 should we merge this?

Collaborator

@jiacheng-xu jiacheng-xu left a comment


Looks good to me.

Contributor

greptile-apps bot commented Jan 17, 2026

Greptile Summary

This PR adds a comprehensive STEM Synthetic Data Generation (SDG) pipeline to the OpenScienceReasoning project. The pipeline automates the full lifecycle of generating, processing, and preparing high-quality training data for STEM reasoning tasks.

Key additions:

  • Complete pipeline orchestration system (run_pipeline.py) with 12 configurable stages supporting distributed execution on Slurm clusters
  • Modular configuration system with base pipeline and composable settings overrides for different scenarios (with/without ground truth, MCQ formats, Python tools, Qwen conversion)
  • 13+ processing scripts handling filtering, decontamination, topic labeling, solution generation, difficulty estimation, answer extraction, and format conversion
  • Comprehensive validation framework with automated health checks across all pipeline stages
  • Extensive test coverage with 10 pipeline variants testing different configurations
  • Well-documented README with stage references, usage examples, and detailed explanations of data flows

Pipeline capabilities:

  • Supports both open-ended questions and multiple-choice questions (4 or 10 options)
  • Handles datasets with or without ground truth answers (using majority voting when needed)
  • Enables Python code execution with sandbox support
  • Performs contamination checks against benchmark datasets
  • Labels topics and subtopics hierarchically
  • Estimates difficulty via model pass rates
  • Converts to multiple output formats (SFT, messages, Qwen-specific)
  • Applies sophisticated filtering for code errors, verification code, and quality thresholds

The implementation follows the existing nemo-skills architecture patterns and integrates cleanly with the existing pipeline CLI utilities.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The code is well-structured with comprehensive documentation, extensive test coverage (10 pipeline variants), proper error handling, and follows existing project patterns. All components are new additions with no modifications to existing functionality, minimizing regression risk. The validation framework ensures pipeline integrity.
  • No files require special attention

Important Files Changed

- recipes/opensciencereasoning/sdg_pipeline/run_pipeline.py: Main orchestrator that launches distributed jobs for the complete SDG pipeline lifecycle, including filtering, decontamination, topic labeling, solution generation, difficulty estimation, and validation
- recipes/opensciencereasoning/sdg_pipeline/README.md: Comprehensive documentation covering pipeline stages, the configuration system, usage examples, and the validation approach; well-structured with clear stage references and settings explanations
- recipes/opensciencereasoning/sdg_pipeline/configs/pipelines/base.yaml: Base pipeline configuration defining the full stage sequence with directory structure, stage-specific settings, and model parameters for the default open-question flow
- recipes/opensciencereasoning/sdg_pipeline/scripts/filter_problems.py: Entry point that normalizes field names, handles deduplication, removes image references, enforces MCQ option counts, and structures data for downstream stages
- recipes/opensciencereasoning/sdg_pipeline/scripts/convert_to_qwen.py: Converts message-formatted data to Qwen-style multi-turn format with parallel processing, tool metadata embedding for Python tools, and a custom chat template
- recipes/opensciencereasoning/sdg_pipeline/scripts/validate_pipeline.py: Automated validation checking artifact existence, record counts, and field presence, enforcing constraints between stages with soft assertions
- tests/slurm-tests/stem_sdg_pipeline/run_test.py: Comprehensive test suite launching 10 pipeline variants covering the base config, seed data modes, Python tools, Qwen conversion, MCQ options, and multiple-prompt handling
- nemo_skills/training/data_preparation_utils/config/stem_sft.yaml: SFT data preparation config with comprehensive filters for code errors, matplotlib, verification code, majority voting, arithmetic validation, and length-based filtering

Sequence Diagram

sequenceDiagram
    participant User
    participant Pipeline as run_pipeline.py
    participant Filter as filter_problems
    participant Decon as decontaminate
    participant Topics as topics_labeling
    participant GenSol as generate_solutions
    participant Diff as difficulty_estimation
    participant Agg as aggregate
    participant FiltSol as filter_solutions
    participant SFT as prepare_for_sft
    participant Msgs as convert_to_messages
    participant Bucket as bucket
    participant Qwen as convert_to_qwen
    participant Valid as validate

    User->>Pipeline: Launch with config + settings
    Pipeline->>Filter: Normalize fields, deduplicate, validate MCQ
    Filter-->>Pipeline: final_result.jsonl
    
    Pipeline->>Decon: Retrieve similar + check contamination
    Decon->>Decon: Model-based contamination check
    Decon-->>Pipeline: Cleaned final_result.jsonl
    
    Pipeline->>Topics: Multi-round topic labeling
    Topics->>Topics: Prepare inputs with few-shots
    Topics->>Topics: Generate topic labels
    Topics->>Topics: Generate subtopic labels
    Topics-->>Pipeline: final_result.jsonl with topics
    
    Pipeline->>GenSol: Generate solutions (multiple seeds)
    GenSol->>GenSol: Extract predictions via regex/boxed
    GenSol->>GenSol: Optional majority voting for GT
    GenSol->>GenSol: Judge correctness (optional)
    GenSol->>GenSol: Aggregate per-problem metrics
    GenSol-->>Pipeline: final_result.jsonl with solutions
    
    Pipeline->>Diff: Estimate difficulty
    Diff->>Diff: Generate boxed solutions
    Diff->>Diff: Judge solutions
    Diff->>Diff: Calculate pass rates
    Diff-->>Pipeline: final_result.jsonl with difficulty
    
    Pipeline->>Agg: Merge all metadata
    Agg-->>Pipeline: Consolidated final_result.jsonl
    
    Pipeline->>FiltSol: Apply quality filters
    FiltSol->>FiltSol: Filter by correctness/pass-rate/metadata
    FiltSol-->>Pipeline: Filtered final_result.jsonl
    
    Pipeline->>SFT: Format for supervised fine-tuning
    SFT->>SFT: Apply filters (code errors, matplotlib, etc)
    SFT-->>Pipeline: SFT-formatted JSONL
    
    Pipeline->>Msgs: Convert to messages format
    Msgs->>Msgs: Handle tool calls, code/think tags
    Msgs-->>Pipeline: Messages-formatted JSONL
    
    Pipeline->>Bucket: Calculate token lengths
    Bucket->>Bucket: Shard by token buckets
    Bucket-->>Pipeline: Bucketed outputs
    
    Pipeline->>Qwen: Convert to Qwen format (optional)
    Qwen->>Qwen: Apply Qwen chat template
    Qwen->>Qwen: Embed tool metadata
    Qwen-->>Pipeline: Qwen-formatted JSONL
    
    Pipeline->>Valid: Validate pipeline health
    Valid->>Valid: Check artifacts exist
    Valid->>Valid: Verify record counts
    Valid->>Valid: Check required fields
    Valid-->>Pipeline: Validation report
    
    Pipeline-->>User: Complete pipeline with deliverables

Contributor

greptile-apps bot commented Jan 17, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".
