
Conversation

Collaborator

@dgtm777 dgtm777 commented Oct 30, 2025

No description provided.

Signed-off-by: dgitman <[email protected]>
dgtm777 and others added 6 commits October 30, 2025 12:40
Signed-off-by: dgitman <[email protected]>
Signed-off-by: Rima Shahbazyan <[email protected]>
Signed-off-by: dgitman <[email protected]>
Signed-off-by: dgitman <[email protected]>
Signed-off-by: dgitman <[email protected]>
Collaborator

@Kipok Kipok left a comment


Consider adding Slurm tests for a simplified use case of this pipeline to ensure nothing is broken in the future.

- ["mmlu", "test"]
- ["mmlu-pro", "test"]
- ["gpqa", "diamond"]
model: /hf_models/Qwen2.5-32B-Instruct
Collaborator


Consider using Qwen/Qwen2.5-32B-Instruct here and everywhere else to avoid the extra step of manually downloading the model.

Collaborator Author


I am constantly getting a HuggingFace rate limit when using the HF model name instead of a local path.
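A common workaround for this is to pre-download the model once and point the configs at the local snapshot. A minimal sketch, assuming the `huggingface_hub` package; the `resolve_model` helper is invented for illustration and is not part of the PR:

```python
import os

def resolve_model(path_or_name: str) -> str:
    """Return a local model path, downloading the Hub snapshot once if needed."""
    if os.path.isdir(path_or_name):
        return path_or_name  # already a local path: no Hub traffic, no rate limit
    # Falls back to downloading into the local HF cache and returning that directory.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return snapshot_download(repo_id=path_or_name)

print(resolve_model("."))  # an existing directory is returned unchanged
```

With the snapshot cached locally, the Hub is only contacted on the first call, which sidesteps the rate limit on repeated pipeline runs.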

Signed-off-by: dgitman <[email protected]>
solution_key: ${output_key}
test_cases:
- { input: { generation: "1 + 2 + 3 + 4 = 10" }, output: { generation: "1 + 2 + 3 + 4 = 10" } }
# TODO: implement fractional arithmetic
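The test case above pairs an input with an unchanged output, which suggests the filter passes through lines whose arithmetic checks out. A minimal sketch of such an integer-only check (the function name and regex are illustrative, not the PR's actual code; fractional arithmetic is out of scope, per the TODO):

```python
import re

def check_integer_arithmetic(text: str) -> bool:
    """Return True if every simple 'a + b + ... = c' expression in the text
    evaluates correctly (integers only)."""
    pattern = re.compile(r"((?:\d+\s*\+\s*)+\d+)\s*=\s*(\d+)")
    for expr, total in pattern.findall(text):
        terms = [int(t) for t in re.split(r"\+", expr)]
        if sum(terms) != int(total):
            return False
    return True

print(check_integer_arithmetic("1 + 2 + 3 + 4 = 10"))  # True: the sum checks out
print(check_integer_arithmetic("2 + 2 = 5"))           # False: incorrect sum
```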
Collaborator


Is this needed?

Collaborator Author


This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.

## Config Layout
- **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
Collaborator


Could you link the boxed prompt? Will the pipeline handle updating from boxed to something like the HLE prompt as the default, i.e. no boxed answer, requiring a judge?

Collaborator Author


For difficulty estimation, it currently supports only boxed-like prompts. For solution generation, it should work (with proper modification of the config), but the predicted_answer for every sample will be empty, and the majority-voting step (used when expected_answer is not set) will consequently behave incorrectly.

- `python_enabled` — enable python-tool prompting and sandbox execution.
- `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
Collaborator


what does the "metadata-only flow" include?

Collaborator Author


Decontamination, topics, difficulty

Collaborator Author


Renamed and added a reference to the section with the description

- `mcq_4_options` — switch to the [`eval/aai/mcq-4choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-4choices.yaml) prompt for generation.
- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
- `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
Collaborator


Not entirely clear what you mean here. How about listing all possible steps/stages that the pipeline can do somewhere at the top? It might make defining these flags easier.

Collaborator Author


I moved the description of stages on top

- `mcq_10_options` — switch to the [`eval/aai/mcq-10choices`](../../../../nemo_skills/prompt/config/eval/aai/mcq-10choices.yaml) prompt for generation.
- `seed_data` — trim the pipeline to the metadata-only flow used for seed datasets with GT answers.
- `seed_data_postprocess` — keep only the generation → filtering → SFT preparation stages for reasoning above existing seed data.
- `multiple_prompts` — allow the use of multiple prompts for generation.
Collaborator


Could you elaborate here? E.g.:

"enables the use of multiple prompts (including distinct preambles and varied output formats) during the generation process to address prompt sensitivity."

It's not clear what additional prompts/output formats are used here, or how to specify them.


Collaborator Author


Added the reference to the section

- **Base pipeline**: [`configs/pipelines/base.yaml`](configs/pipelines/base.yaml) describes the default open-question flow with ground-truth answers available, no tool usage, and the boxed prompt.
- **Settings overrides** (under [`configs/settings/`](configs/settings/)) layer small, reusable tweaks. Reference them with or without the `.yaml` suffix:
- `without_gt` — route the pipeline through solution generation + majority voting to estimate the ground-truth answer.
- `python_enabled` — enable python-tool prompting and sandbox execution.
Collaborator


Any specifics here? E.g., what is the python-tool prompt? Where is it defined?

Collaborator Author


Here is the link to all the parameters for each setting: configs/settings/ (it is in the README).
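For concreteness, a hypothetical settings override showing the general shape of such a file (the keys below are invented for this sketch; the real overrides under configs/settings/ define the actual parameters):

```yaml
# hypothetical contents of a settings override such as without_gt.yaml
generate_solutions:
  majority_voting: true   # estimate the ground-truth answer by vote
difficulty_estimation:
  enabled: true
```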

# OpenScienceReasoning Pipeline Quickstart
This folder provides templates, prompts, and scripts for the automated pipeline that powers the OpenScience data refresh. The pipeline launches distributed jobs through [`pipeline/sdg_pipeline.py`](pipeline/sdg_pipeline.py) and covers the full lifecycle: solution generation, ground-truth extraction, difficulty scoring, and topic labeling.

## Config Layout
Collaborator


Any assumptions on the incoming files, like .jsonl format? Any particular fields needed?


Collaborator Author


I have added the reference to the section in the filter_problem stage description

print(page.content[:500]) # First 500 characters of content

print(page.links[:10]) # First 10 linked pages

Collaborator


@jiacheng-xu - let's update this one to the final version you're working on
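Pending that final version, a self-contained sketch of what the snippet above exercises, with a stub standing in for the real page object (the `Page` class here is invented for illustration; the actual object comes from the PR's wiki tooling):

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """Stub for the page object used in the snippet above (illustrative only)."""
    content: str = ""
    links: list[str] = field(default_factory=list)

page = Page(content="Lorem ipsum " * 100,
            links=[f"Article_{i}" for i in range(25)])

print(page.content[:500])  # First 500 characters of content
print(page.links[:10])     # First 10 linked pages
```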

dgtm777 and others added 13 commits November 22, 2025 17:09
Signed-off-by: dgitman <[email protected]>
Signed-off-by: Jiacheng Xu <[email protected]>
Signed-off-by: Jiacheng Xu <[email protected]>
@dgtm777 dgtm777 requested a review from ekmb December 2, 2025 08:41
Collaborator

Kipok commented Jan 8, 2026

@dgtm777 should we merge this?

Collaborator

@jiacheng-xu jiacheng-xu left a comment


Looks good to me.

Contributor

greptile-apps bot commented Jan 17, 2026

Greptile Summary

This PR adds a comprehensive STEM Synthetic Data Generation (SDG) pipeline to the OpenScienceReasoning project. The pipeline automates the full lifecycle of generating, processing, and preparing high-quality training data for STEM reasoning tasks.

Key additions:

  • Complete pipeline orchestration system (run_pipeline.py) with 12 configurable stages supporting distributed execution on Slurm clusters
  • Modular configuration system with base pipeline and composable settings overrides for different scenarios (with/without ground truth, MCQ formats, Python tools, Qwen conversion)
  • 13+ processing scripts handling filtering, decontamination, topic labeling, solution generation, difficulty estimation, answer extraction, and format conversion
  • Comprehensive validation framework with automated health checks across all pipeline stages
  • Extensive test coverage with 10 pipeline variants testing different configurations
  • Well-documented README with stage references, usage examples, and detailed explanations of data flows

Pipeline capabilities:

  • Supports both open-ended questions and multiple-choice questions (4 or 10 options)
  • Handles datasets with or without ground truth answers (using majority voting when needed)
  • Enables Python code execution with sandbox support
  • Performs contamination checks against benchmark datasets
  • Labels topics and subtopics hierarchically
  • Estimates difficulty via model pass rates
  • Converts to multiple output formats (SFT, messages, Qwen-specific)
  • Applies sophisticated filtering for code errors, verification code, and quality thresholds

The implementation follows the existing nemo-skills architecture patterns and integrates cleanly with the existing pipeline CLI utilities.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The code is well-structured with comprehensive documentation, extensive test coverage (10 pipeline variants), proper error handling, and follows existing project patterns. All components are new additions with no modifications to existing functionality, minimizing regression risk. The validation framework ensures pipeline integrity.
  • No files require special attention

Important Files Changed

- recipes/opensciencereasoning/sdg_pipeline/run_pipeline.py: Main orchestrator that launches distributed jobs for the complete SDG pipeline lifecycle, including filtering, decontamination, topic labeling, solution generation, difficulty estimation, and validation
- recipes/opensciencereasoning/sdg_pipeline/README.md: Comprehensive documentation covering pipeline stages, the configuration system, usage examples, and the validation approach; well-structured with clear stage references and settings explanations
- recipes/opensciencereasoning/sdg_pipeline/configs/pipelines/base.yaml: Base pipeline configuration defining the full stage sequence with directory structure, stage-specific settings, and model parameters for the default open-question flow
- recipes/opensciencereasoning/sdg_pipeline/scripts/filter_problems.py: Entry point that normalizes field names, handles deduplication, removes image references, enforces MCQ option counts, and structures data for downstream stages
- recipes/opensciencereasoning/sdg_pipeline/scripts/convert_to_qwen.py: Converts message-formatted data to Qwen-style multi-turn format with parallel processing, tool metadata embedding for Python tools, and a custom chat template
- recipes/opensciencereasoning/sdg_pipeline/scripts/validate_pipeline.py: Automated validation checking artifact existence, record counts, and field presence, enforcing constraints between stages with soft assertions
- tests/slurm-tests/stem_sdg_pipeline/run_test.py: Comprehensive test suite launching 10 pipeline variants covering the base config, seed data modes, Python tools, Qwen conversion, MCQ options, and multiple-prompt handling
- nemo_skills/training/data_preparation_utils/config/stem_sft.yaml: SFT data preparation config with comprehensive filters for code errors, matplotlib, verification code, majority voting, arithmetic validation, and length-based filtering

Sequence Diagram

sequenceDiagram
    participant User
    participant Pipeline as run_pipeline.py
    participant Filter as filter_problems
    participant Decon as decontaminate
    participant Topics as topics_labeling
    participant GenSol as generate_solutions
    participant Diff as difficulty_estimation
    participant Agg as aggregate
    participant FiltSol as filter_solutions
    participant SFT as prepare_for_sft
    participant Msgs as convert_to_messages
    participant Bucket as bucket
    participant Qwen as convert_to_qwen
    participant Valid as validate

    User->>Pipeline: Launch with config + settings
    Pipeline->>Filter: Normalize fields, deduplicate, validate MCQ
    Filter-->>Pipeline: final_result.jsonl
    
    Pipeline->>Decon: Retrieve similar + check contamination
    Decon->>Decon: Model-based contamination check
    Decon-->>Pipeline: Cleaned final_result.jsonl
    
    Pipeline->>Topics: Multi-round topic labeling
    Topics->>Topics: Prepare inputs with few-shots
    Topics->>Topics: Generate topic labels
    Topics->>Topics: Generate subtopic labels
    Topics-->>Pipeline: final_result.jsonl with topics
    
    Pipeline->>GenSol: Generate solutions (multiple seeds)
    GenSol->>GenSol: Extract predictions via regex/boxed
    GenSol->>GenSol: Optional majority voting for GT
    GenSol->>GenSol: Judge correctness (optional)
    GenSol->>GenSol: Aggregate per-problem metrics
    GenSol-->>Pipeline: final_result.jsonl with solutions
    
    Pipeline->>Diff: Estimate difficulty
    Diff->>Diff: Generate boxed solutions
    Diff->>Diff: Judge solutions
    Diff->>Diff: Calculate pass rates
    Diff-->>Pipeline: final_result.jsonl with difficulty
    
    Pipeline->>Agg: Merge all metadata
    Agg-->>Pipeline: Consolidated final_result.jsonl
    
    Pipeline->>FiltSol: Apply quality filters
    FiltSol->>FiltSol: Filter by correctness/pass-rate/metadata
    FiltSol-->>Pipeline: Filtered final_result.jsonl
    
    Pipeline->>SFT: Format for supervised fine-tuning
    SFT->>SFT: Apply filters (code errors, matplotlib, etc)
    SFT-->>Pipeline: SFT-formatted JSONL
    
    Pipeline->>Msgs: Convert to messages format
    Msgs->>Msgs: Handle tool calls, code/think tags
    Msgs-->>Pipeline: Messages-formatted JSONL
    
    Pipeline->>Bucket: Calculate token lengths
    Bucket->>Bucket: Shard by token buckets
    Bucket-->>Pipeline: Bucketed outputs
    
    Pipeline->>Qwen: Convert to Qwen format (optional)
    Qwen->>Qwen: Apply Qwen chat template
    Qwen->>Qwen: Embed tool metadata
    Qwen-->>Pipeline: Qwen-formatted JSONL
    
    Pipeline->>Valid: Validate pipeline health
    Valid->>Valid: Check artifacts exist
    Valid->>Valid: Verify record counts
    Valid->>Valid: Check required fields
    Valid-->>Pipeline: Validation report
    
    Pipeline-->>User: Complete pipeline with deliverables

Contributor

greptile-apps bot commented Jan 17, 2026

Greptile found no issues!

From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".
