add stem sdg pipeline #1010
Conversation
Signed-off-by: dgitman <[email protected]>
Signed-off-by: Rima Shahbazyan <[email protected]>
Kipok left a comment:
Consider adding Slurm tests for a simplified use case of this pipeline to ensure nothing is broken in the future.
nemo_skills/training/data_preparation_utils/config/stem_sft.yaml
```yaml
- ["mmlu", "test"]
- ["mmlu-pro", "test"]
- ["gpqa", "diamond"]
model: /hf_models/Qwen2.5-32B-Instruct
```
Consider using `Qwen/Qwen2.5-32B-Instruct` here and everywhere else to avoid the extra step of manually downloading the model.
I am constantly getting HuggingFace rate limits when using the HF model name instead of a local path.
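One way to reconcile the two suggestions is a small helper that prefers a pre-downloaded copy under `/hf_models` and falls back to the hub id only when no local copy exists. This is a hypothetical sketch (the helper name and directory layout are assumptions, not part of this PR):

```python
import os

def resolve_model(repo_id: str, local_root: str = "/hf_models") -> str:
    """Return a local model path if one exists, else the hub repo id.

    Avoids Hugging Face rate limits when the model is already downloaded,
    while keeping configs portable (hypothetical helper, assumed layout).
    """
    local_path = os.path.join(local_root, repo_id.split("/")[-1])
    return local_path if os.path.isdir(local_path) else repo_id
```

With this, configs could always reference `Qwen/Qwen2.5-32B-Instruct` and still use the local copy when available.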
recipes/opensciencereasoning/configs/SDG_pipeline/gpt-oss-seed-data_with_gt.yaml
recipes/opensciencereasoning/configs/SDG_pipeline/gpt-oss_with_gt_with_tool.yaml
Signed-off-by: Jiacheng Xu <[email protected]>
Force-pushed from 60aec91 to e737749
recipes/opensciencereasoning/sdg_pipeline/configs/pipelines/populate_configs.py
recipes/opensciencereasoning/sdg_pipeline/pipeline/sdg_pipeline.py
recipes/opensciencereasoning/sdg_pipeline/scripts/convert_to_qwen.py
```yaml
solution_key: ${output_key}
test_cases:
  - { input: { generation: "1 + 2 + 3 + 4 = 10" }, output: { generation: "1 + 2 + 3 + 4 = 10" } }
# TODO: implement fractional arithmetic
```
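Declarative test cases like the one above can be exercised with a tiny harness; the `process` function here is a hypothetical stand-in for the real solution processor (the identity transform the quoted case expects), not the PR's implementation:

```python
def process(sample):
    # Hypothetical stand-in: the quoted test case expects the generation
    # to pass through unchanged (identity transform).
    return {"generation": sample["generation"]}

test_cases = [
    {"input": {"generation": "1 + 2 + 3 + 4 = 10"},
     "output": {"generation": "1 + 2 + 3 + 4 = 10"}},
]

for case in test_cases:
    assert process(case["input"]) == case["output"]
```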
Is this needed?
> This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.
>
> ## Config Layout
> - **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
Could you link the boxed prompt? Will the pipeline handle updating from boxed to something like the HLE prompt as the default, i.e., no boxed answer and requiring a judge?
Difficulty estimation currently supports only boxed-like prompts. Solution generation should work (with proper modification of the config), but the predicted_answer for every sample will be empty, so the majority-voting step (used when expected_answer is not set) will consequently work incorrectly.
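The failure mode described above is easy to see in a minimal majority-voting sketch (the helper name is assumed for illustration, not taken from this PR): when every `predicted_answer` is empty, there is nothing to vote on.

```python
from collections import Counter

def majority_answer(predictions):
    """Return the most common non-empty predicted answer, or None if all are empty."""
    votes = Counter(p for p in predictions if p)  # empty answers cast no vote
    if not votes:
        return None  # every predicted_answer was empty -> voting degenerates
    return votes.most_common(1)[0][0]
```

With boxed prompts the extracted answers vote normally; with free-form prompts every prediction is empty and the function can only return None, which is why the non-boxed default would need a judge instead.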
> - `python_enabled` — enable python-tool prompting and sandbox execution.
> - `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
> - `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
> - `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
What does the "metadata-only flow" include?
Decontamination, topics, difficulty
Renamed and added a reference to the section with the description
> - `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
> - `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
> - `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
> - `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
Not entirely clear what you mean here. How about listing all possible steps/stages that the pipeline can run somewhere at the top? That might make these flags easier to define.
I moved the description of the stages to the top.
> - `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
> - `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
> - `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
> - `multiple_prompts` — allow the use of multiple prompts for generation.
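One plausible reading of `multiple_prompts` is round-robin assignment of prompt variants across samples to address prompt sensitivity. This sketch uses assumed field names and invented prompt texts; it is not the PR's actual implementation:

```python
from itertools import cycle

# Hypothetical prompt variants with distinct preambles/output formats
prompts = [
    "Solve the problem and put the final answer in \\boxed{}.",
    "Think step by step, then state the final answer on the last line.",
]

samples = [{"problem": f"question {i}"} for i in range(4)]
for sample, prompt in zip(samples, cycle(prompts)):
    sample["prompt_template"] = prompt  # assumed field name
```

Cycling the variants keeps the distribution of prompts balanced across the dataset regardless of its size.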
Could you elaborate here? E.g., "enables the use of multiple prompts (including distinct preambles and varied output formats) during generation to address prompt sensitivity." It is not clear what additional prompts/output formats are used here, or how to specify them.
This part of the readme specifies it at a high level; there is a detailed explanation below: https://github.com/NVIDIA-NeMo/Skills/blob/02d1aff80650b677add866c9e9b76ac110f532ac/recipes/opensciencereasoning/sdg_pipeline/README.md#using-the-multiple_prompts-setting
Added the reference to the section
> - **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
> - **Settings overrides** (under [`configs/settings/`](configs/settings/)) layer small, reusable tweaks. Reference them with or without the `.yaml` suffix:
>   - `without_gt` — route the pipeline through solution generation + majority voting to estimate the ground-truth answer.
>   - `python_enabled` — enable python-tool prompting and sandbox execution.
Any specifics here? E.g., what is the python-tool prompt? Where is it defined?
Here is the link to all the parameters for each setting: configs/settings/ (it is in the readme).
> # OpenScienceReasoning Pipeline Quickstart
>
> This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.
>
> ## Config Layout
Any assumptions about the incoming files, like the .jsonl format? Any particular fields needed?
Yes, it is described in the "How to use" section: https://github.com/NVIDIA-NeMo/Skills/blob/02d1aff80650b677add866c9e9b76ac110f532ac/recipes/opensciencereasoning/sdg_pipeline/README.md#how-to-use
I have added the reference to the section in the filter_problem stage description
```python
print(page.content[:500])  # First 500 characters of content
print(page.links[:10])  # First 10 linked pages
```
@jiacheng-xu - let's update this one to the final version you're working on
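For context, the snippet being discussed assumes a page object exposing `content` and `links` attributes. A minimal stand-in with that interface (hypothetical, not the final version Jiacheng is working on) could look like:

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """Hypothetical page object with the attributes the snippet uses."""
    content: str
    links: list[str] = field(default_factory=list)

page = Page(content="Example article text...", links=["Physics", "Chemistry"])
print(page.content[:500])  # first 500 characters of content
print(page.links[:10])     # first 10 linked pages
```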
@dgtm777 should we merge this?
jiacheng-xu left a comment:
Looks good to me.
recipes/opensciencereasoning/sdg_pipeline/configs/pipelines/base.yaml
## Greptile Summary

This PR adds a comprehensive STEM Synthetic Data Generation (SDG) pipeline to the OpenScienceReasoning project. The pipeline automates the full lifecycle of generating, processing, and preparing high-quality training data for STEM reasoning tasks.

Key additions:

Pipeline capabilities:

The implementation follows the existing nemo-skills architecture patterns and integrates cleanly with the existing pipeline CLI utilities. Confidence Score: 5/5
## Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Pipeline as run_pipeline.py
    participant Filter as filter_problems
    participant Decon as decontaminate
    participant Topics as topics_labeling
    participant GenSol as generate_solutions
    participant Diff as difficulty_estimation
    participant Agg as aggregate
    participant FiltSol as filter_solutions
    participant SFT as prepare_for_sft
    participant Msgs as convert_to_messages
    participant Bucket as bucket
    participant Qwen as convert_to_qwen
    participant Valid as validate

    User->>Pipeline: Launch with config + settings
    Pipeline->>Filter: Normalize fields, deduplicate, validate MCQ
    Filter-->>Pipeline: final_result.jsonl
    Pipeline->>Decon: Retrieve similar + check contamination
    Decon->>Decon: Model-based contamination check
    Decon-->>Pipeline: Cleaned final_result.jsonl
    Pipeline->>Topics: Multi-round topic labeling
    Topics->>Topics: Prepare inputs with few-shots
    Topics->>Topics: Generate topic labels
    Topics->>Topics: Generate subtopic labels
    Topics-->>Pipeline: final_result.jsonl with topics
    Pipeline->>GenSol: Generate solutions (multiple seeds)
    GenSol->>GenSol: Extract predictions via regex/boxed
    GenSol->>GenSol: Optional majority voting for GT
    GenSol->>GenSol: Judge correctness (optional)
    GenSol->>GenSol: Aggregate per-problem metrics
    GenSol-->>Pipeline: final_result.jsonl with solutions
    Pipeline->>Diff: Estimate difficulty
    Diff->>Diff: Generate boxed solutions
    Diff->>Diff: Judge solutions
    Diff->>Diff: Calculate pass rates
    Diff-->>Pipeline: final_result.jsonl with difficulty
    Pipeline->>Agg: Merge all metadata
    Agg-->>Pipeline: Consolidated final_result.jsonl
    Pipeline->>FiltSol: Apply quality filters
    FiltSol->>FiltSol: Filter by correctness/pass-rate/metadata
    FiltSol-->>Pipeline: Filtered final_result.jsonl
    Pipeline->>SFT: Format for supervised fine-tuning
    SFT->>SFT: Apply filters (code errors, matplotlib, etc)
    SFT-->>Pipeline: SFT-formatted JSONL
    Pipeline->>Msgs: Convert to messages format
    Msgs->>Msgs: Handle tool calls, code/think tags
    Msgs-->>Pipeline: Messages-formatted JSONL
    Pipeline->>Bucket: Calculate token lengths
    Bucket->>Bucket: Shard by token buckets
    Bucket-->>Pipeline: Bucketed outputs
    Pipeline->>Qwen: Convert to Qwen format (optional)
    Qwen->>Qwen: Apply Qwen chat template
    Qwen->>Qwen: Embed tool metadata
    Qwen-->>Pipeline: Qwen-formatted JSONL
    Pipeline->>Valid: Validate pipeline health
    Valid->>Valid: Check artifacts exist
    Valid->>Valid: Verify record counts
    Valid->>Valid: Check required fields
    Valid-->>Pipeline: Validation report
    Pipeline-->>User: Complete pipeline with deliverables
```
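The top-to-bottom stage flow in the diagram can be pictured as folding a list of stage functions over the dataset. This is an illustrative sketch only (stage names and record fields are assumed), not the PR's actual runner:

```python
def run_pipeline(records, stages):
    """Apply each stage in order, mirroring the diagram's sequential flow."""
    for stage in stages:
        records = stage(records)
    return records

def filter_problems(records):
    # Drop records without a problem statement (assumed field name)
    return [r for r in records if r.get("problem")]

def deduplicate(records):
    # Keep the first occurrence of each problem text
    seen, kept = set(), []
    for r in records:
        if r["problem"] not in seen:
            seen.add(r["problem"])
            kept.append(r)
    return kept

data = [{"problem": "p1"}, {"problem": "p1"}, {"problem": ""}]
cleaned = run_pipeline(data, [filter_problems, deduplicate])
```

Each real stage additionally reads and writes a `final_result.jsonl` artifact, which is what the validation step checks at the end.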
Greptile found no issues.
No description provided.