Skip to content

update simpleqa#959

Merged
ekmb merged 2 commits into
mainfrom
fix-simpleqa
Oct 20, 2025
Merged

update simpleqa#959
ekmb merged 2 commits into
mainfrom
fix-simpleqa

Conversation

@jiacheng-xu

@jiacheng-xu jiacheng-xu commented Oct 17, 2025

Copy link
Copy Markdown
Collaborator

fix field name of simpleqa and evaluator type.

Summary by CodeRabbit

  • Chores
    • Updated SimpleQA dataset evaluation configuration and processing pipeline
    • Standardized field naming conventions across dataset preparation and prompt templates for consistency
    • Modified default evaluation parameters for dataset handling

Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
@coderabbitai

coderabbitai Bot commented Oct 17, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

The SimpleQA dataset module is refactored to rename the primary data field from "problem" to "question" across preparation and evaluation components. The default evaluation type is also updated from "simpleqa" to "math" in configuration constants.

Changes

Cohort / File(s) Summary
Configuration constants
nemo_skills/dataset/simpleqa/__init__.py
EVAL_ARGS default value changed from "++eval_type=simpleqa " to "++eval_type=math "
Data field renaming
nemo_skills/dataset/simpleqa/prepare.py
Both format_entry() and format_entry_verified() functions now return dictionaries with output key "question" instead of "problem" to align with new field naming convention
Prompt template updates
nemo_skills/prompt/config/judge/simpleqa.yaml
Placeholder variable in example prompt changed from {problem} to {question} to match renamed data field

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

The changes follow a consistent field-renaming pattern across tightly-coupled components with straightforward, predictable edits. Configuration updates and template variable replacements are homogeneous and low-risk.

Poem

🐰 A question once wore the name of a problem so old,
Now "question" it rises, more elegant, bold!
From prepare to the judge, the prompt echoes through,
A gentle refactor makes all things anew. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title Check ❓ Inconclusive The title "update simpleqa" is related to the changeset since all modifications are within the simpleqa module. However, the title is vague and non-descriptive, using a generic term ("update") that doesn't convey meaningful information about the specific changes being made. The actual changes include renaming dictionary keys from "problem" to "question" and updating the evaluator type from "simpleqa" to "math", but none of these specifics are evident from the title alone. While the description provides more detail ("fix field name of simpleqa and evaluator type"), the title fails to capture these key distinctions that would help reviewers quickly understand the primary intent of the changes. Consider revising the title to be more specific and descriptive of the actual changes. A title such as "Fix simpleqa field names and evaluator type" or "Rename simpleqa problem field to question and update eval type" would better communicate the main changes and help teammates quickly understand the purpose of this pull request.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-simpleqa

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jiacheng-xu jiacheng-xu marked this pull request as ready for review October 20, 2025 17:40
@jiacheng-xu

Copy link
Copy Markdown
Collaborator Author

Ready to review and merge.

@jiacheng-xu jiacheng-xu requested a review from ekmb October 20, 2025 17:41

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd862d0 and ec76f69.

📒 Files selected for processing (3)
  • nemo_skills/dataset/simpleqa/__init__.py (1 hunks)
  • nemo_skills/dataset/simpleqa/prepare.py (2 hunks)
  • nemo_skills/prompt/config/judge/simpleqa.yaml (1 hunks)
🔇 Additional comments (3)
nemo_skills/dataset/simpleqa/prepare.py (2)

27-34: LGTM! Field rename correctly implemented.

The update correctly maps the source field entry["problem"] to the new output key "question", maintaining consistency with the template and configuration changes across the PR.


37-44: LGTM! Field rename correctly implemented.

The update correctly maps the source field entry["problem"] to the new output key "question", maintaining consistency with the template and configuration changes across the PR.

nemo_skills/prompt/config/judge/simpleqa.yaml (1)

73-73: LGTM! Change correctly aligns with SimpleQA data preparation.

The template variable update from {problem} to {question} is verified as correct. The data preparation in nemo_skills/dataset/simpleqa/prepare.py maps the incoming entry["problem"] field to a "question" key in the output dictionary, which the template now correctly references. There is only one SimpleQA template file (nemo_skills/prompt/config/judge/simpleqa.yaml), so no additional updates are needed.

DATASET_GROUP = "math"
METRICS_TYPE = "simpleqa"
EVAL_ARGS = "++eval_type=simpleqa "
EVAL_ARGS = "++eval_type=math "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the evaluation type change from "simpleqa" to "math".

The change from ++eval_type=simpleqa to ++eval_type=math appears semantically significant. Note that line 17 still has METRICS_TYPE = "simpleqa", creating a potential inconsistency. Please confirm:

  1. Is this change intentional and correct for SimpleQA evaluation?
  2. Should METRICS_TYPE also be updated to "math" for consistency?

Run the following script to check for any documentation or configuration that defines what eval_type=math means versus eval_type=simpleqa:


🏁 Script executed:

#!/bin/bash
# Search for eval_type usage and definitions across the codebase
rg -n -C3 'eval_type.*=.*(math|simpleqa)' --type=py

# Search for METRICS_TYPE usage to understand the relationship
rg -n -C3 'METRICS_TYPE' --type=py

Length of output: 43059


Fix inconsistent evaluation configuration: SimpleQA should not use math evaluation.

SimpleQA is a factuality benchmark measuring the ability for language models to answer short, fact-seeking questions, not mathematical problem-solving. Using ++eval_type=math is incorrect for this dataset type. Either:

  1. Ensure a simpleqa evaluator exists and revert EVAL_ARGS to "++eval_type=simpleqa "; or
  2. Update METRICS_TYPE to align with the actual evaluator being used (if math evaluation is intentional, clarify why with a comment explaining the mismatch like the hle dataset does).

Currently, METRICS_TYPE = "simpleqa" conflicts with EVAL_ARGS = "++eval_type=math ", creating an evaluation mismatch that will likely cause incorrect metrics calculation.

🤖 Prompt for AI Agents
In nemo_skills/dataset/simpleqa/__init__.py around line 18, EVAL_ARGS is
incorrectly set to "++eval_type=math " while METRICS_TYPE = "simpleqa"; change
EVAL_ARGS to "++eval_type=simpleqa " to match METRICS_TYPE and the dataset
purpose; if math evaluation was intentional instead, update METRICS_TYPE to
"math" and add a clarifying comment explaining why a math evaluator is being
used for SimpleQA so the mismatch is explicit.

@ekmb ekmb enabled auto-merge (squash) October 20, 2025 18:20
@ekmb ekmb merged commit a33e05e into main Oct 20, 2025
7 checks passed
@ekmb ekmb deleted the fix-simpleqa branch October 20, 2025 18:34
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 22, 2025
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants