update simpleqa by jiacheng-xu · Pull Request #959 · NVIDIA-NeMo/Skills

jiacheng-xu · 2025-10-17T07:55:01Z

fix field name of simpleqa and evaluator type.

Summary by CodeRabbit

Chores
- Updated SimpleQA dataset evaluation configuration and processing pipeline
- Standardized field naming conventions across dataset preparation and prompt templates for consistency
- Modified default evaluation parameters for dataset handling

Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>

coderabbitai · 2025-10-17T07:55:11Z

Walkthrough

The SimpleQA dataset module is refactored to rename the primary data field from "problem" to "question" across preparation and evaluation components. The default evaluation type is also updated from "simpleqa" to "math" in configuration constants.

Changes

Cohort / File(s)	Summary
Configuration constants `nemo_skills/dataset/simpleqa/__init__.py`	`EVAL_ARGS` default value changed from `"++eval_type=simpleqa "` to `"++eval_type=math "`
Data field renaming `nemo_skills/dataset/simpleqa/prepare.py`	Both `format_entry()` and `format_entry_verified()` functions now return dictionaries with output key `"question"` instead of `"problem"` to align with new field naming convention
Prompt template updates `nemo_skills/prompt/config/judge/simpleqa.yaml`	Placeholder variable in example prompt changed from `{problem}` to `{question}` to match renamed data field

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

The changes follow a consistent field-renaming pattern across tightly-coupled components with straightforward, predictable edits. Configuration updates and template variable replacements are homogeneous and low-risk.

Poem

🐰 A question once wore the name of a problem so old,
Now "question" it rises, more elegant, bold!
From prepare to the judge, the prompt echoes through,
A gentle refactor makes all things anew. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title Check	❓ Inconclusive	The title "update simpleqa" is related to the changeset since all modifications are within the simpleqa module. However, the title is vague and non-descriptive, using a generic term ("update") that doesn't convey meaningful information about the specific changes being made. The actual changes include renaming dictionary keys from "problem" to "question" and updating the evaluator type from "simpleqa" to "math", but none of these specifics are evident from the title alone. While the description provides more detail ("fix field name of simpleqa and evaluator type"), the title fails to capture these key distinctions that would help reviewers quickly understand the primary intent of the changes.	Consider revising the title to be more specific and descriptive of the actual changes. A title such as "Fix simpleqa field names and evaluator type" or "Rename simpleqa problem field to question and update eval type" would better communicate the main changes and help teammates quickly understand the purpose of this pull request.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix-simpleqa

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

jiacheng-xu · 2025-10-20T17:41:03Z

Ready to review and merge.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd862d0 and ec76f69.

📒 Files selected for processing (3)

nemo_skills/dataset/simpleqa/__init__.py (1 hunks)
nemo_skills/dataset/simpleqa/prepare.py (2 hunks)
nemo_skills/prompt/config/judge/simpleqa.yaml (1 hunks)

🔇 Additional comments (3)

nemo_skills/dataset/simpleqa/prepare.py (2)

27-34: LGTM! Field rename correctly implemented.

The update correctly maps the source field entry["problem"] to the new output key "question", maintaining consistency with the template and configuration changes across the PR.

37-44: LGTM! Field rename correctly implemented.

The update correctly maps the source field entry["problem"] to the new output key "question", maintaining consistency with the template and configuration changes across the PR.

nemo_skills/prompt/config/judge/simpleqa.yaml (1)

73-73: LGTM! Change correctly aligns with SimpleQA data preparation.

The template variable update from {problem} to {question} is verified as correct. The data preparation in nemo_skills/dataset/simpleqa/prepare.py maps the incoming entry["problem"] field to a "question" key in the output dictionary, which the template now correctly references. There is only one SimpleQA template file (nemo_skills/prompt/config/judge/simpleqa.yaml), so no additional updates are needed.

coderabbitai · 2025-10-20T17:45:48Z

 DATASET_GROUP = "math"
 METRICS_TYPE = "simpleqa"
-EVAL_ARGS = "++eval_type=simpleqa "
+EVAL_ARGS = "++eval_type=math "


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the evaluation type change from "simpleqa" to "math".

The change from ++eval_type=simpleqa to ++eval_type=math appears semantically significant. Note that line 17 still has METRICS_TYPE = "simpleqa", creating a potential inconsistency. Please confirm:

Is this change intentional and correct for SimpleQA evaluation?

Should METRICS_TYPE also be updated to "math" for consistency?

Run the following script to check for any documentation or configuration that defines what eval_type=math means versus eval_type=simpleqa:

🏁 Script executed:

#!/bin/bash # Search for eval_type usage and definitions across the codebase rg -n -C3 'eval_type.*=.*(math|simpleqa)' --type=py # Search for METRICS_TYPE usage to understand the relationship rg -n -C3 'METRICS_TYPE' --type=py

Length of output: 43059

Fix inconsistent evaluation configuration: SimpleQA should not use math evaluation.

SimpleQA is a factuality benchmark measuring the ability for language models to answer short, fact-seeking questions, not mathematical problem-solving. Using ++eval_type=math is incorrect for this dataset type. Either:

Ensure a simpleqa evaluator exists and revert EVAL_ARGS to "++eval_type=simpleqa "; or

Update METRICS_TYPE to align with the actual evaluator being used (if math evaluation is intentional, clarify why with a comment explaining the mismatch like the hle dataset does).

Currently, METRICS_TYPE = "simpleqa" conflicts with EVAL_ARGS = "++eval_type=math ", creating an evaluation mismatch that will likely cause incorrect metrics calculation.

🤖 Prompt for AI Agents

In nemo_skills/dataset/simpleqa/__init__.py around line 18, EVAL_ARGS is incorrectly set to "++eval_type=math " while METRICS_TYPE = "simpleqa"; change EVAL_ARGS to "++eval_type=simpleqa " to match METRICS_TYPE and the dataset purpose; if math evaluation was intentional instead, update METRICS_TYPE to "math" and add a clarifying comment explaining why a math evaluator is being used for SimpleQA so the mismatch is explicit.

Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>

Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Signed-off-by: dgitman <dgitman@nvidia.com>

update simpleqa

ec76f69

Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>

ekmb approved these changes Oct 20, 2025

View reviewed changes

jiacheng-xu marked this pull request as ready for review October 20, 2025 17:40

jiacheng-xu requested a review from ekmb October 20, 2025 17:41

coderabbitai Bot reviewed Oct 20, 2025

View reviewed changes

ekmb approved these changes Oct 20, 2025

View reviewed changes

Merge branch 'main' into fix-simpleqa

c31ff4b

ekmb enabled auto-merge (squash) October 20, 2025 18:20

ekmb merged commit a33e05e into main Oct 20, 2025
7 checks passed

ekmb deleted the fix-simpleqa branch October 20, 2025 18:34

dgtm777 pushed a commit that referenced this pull request Oct 29, 2025

update simpleqa (#959)

1a4b48a

Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

update simpleqa#959

update simpleqa#959
ekmb merged 2 commits into
mainfrom
fix-simpleqa

jiacheng-xu commented Oct 17, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Oct 17, 2025 •

edited

Loading

Uh oh!

jiacheng-xu commented Oct 20, 2025

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Oct 20, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jiacheng-xu commented Oct 17, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

jiacheng-xu commented Oct 20, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 20, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jiacheng-xu commented Oct 17, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Oct 17, 2025 •

edited

Loading