update simpleqa #959
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Walkthrough
The SimpleQA dataset module is refactored to rename the primary data field from "problem" to "question" across preparation and evaluation components. The default evaluation type is also updated from "simpleqa" to "math" in configuration constants.
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~10 minutes
The changes follow a consistent field-renaming pattern across tightly-coupled components with straightforward, predictable edits. Configuration updates and template variable replacements are homogeneous and low-risk.
Pre-merge checks and finishing touches
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Ready to review and merge.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- nemo_skills/dataset/simpleqa/__init__.py (1 hunks)
- nemo_skills/dataset/simpleqa/prepare.py (2 hunks)
- nemo_skills/prompt/config/judge/simpleqa.yaml (1 hunks)
🔇 Additional comments (3)
nemo_skills/dataset/simpleqa/prepare.py (2)
27-34: LGTM! Field rename correctly implemented.
The update correctly maps the source field entry["problem"] to the new output key "question", maintaining consistency with the template and configuration changes across the PR.
37-44: LGTM! Field rename correctly implemented.
The update correctly maps the source field entry["problem"] to the new output key "question", maintaining consistency with the template and configuration changes across the PR.
nemo_skills/prompt/config/judge/simpleqa.yaml (1)
73-73: LGTM! Change correctly aligns with SimpleQA data preparation.
The template variable update from {problem} to {question} is verified as correct. The data preparation in nemo_skills/dataset/simpleqa/prepare.py maps the incoming entry["problem"] field to a "question" key in the output dictionary, which the template now correctly references. There is only one SimpleQA template file (nemo_skills/prompt/config/judge/simpleqa.yaml), so no additional updates are needed.
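The rename described in the comments above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of prepare.py or simpleqa.yaml: the helper to_output_record, the JUDGE_TEMPLATE string, and the sample entry are hypothetical, and it assumes the template is filled with Python-style str.format placeholders.

```python
# Hypothetical sketch of the field rename; not the real prepare.py code.

def to_output_record(entry: dict) -> dict:
    """Map the source field 'problem' to the output key 'question'."""
    return {"question": entry["problem"]}


# Stand-in for the judge template line that now references {question}.
JUDGE_TEMPLATE = "Question: {question}"

record = to_output_record({"problem": "Who wrote Hamlet?"})
prompt = JUDGE_TEMPLATE.format(**record)

# A template that still used {problem} would fail against the renamed record.
try:
    "Question: {problem}".format(**record)
    stale_template_ok = True
except KeyError:
    stale_template_ok = False
```

The point of the sketch is that the output key and the template placeholder must change together; renaming only one of them leaves the placeholder unresolved.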
  DATASET_GROUP = "math"
  METRICS_TYPE = "simpleqa"
- EVAL_ARGS = "++eval_type=simpleqa "
+ EVAL_ARGS = "++eval_type=math "
🧩 Analysis chain
Verify the evaluation type change from "simpleqa" to "math".
The change from ++eval_type=simpleqa to ++eval_type=math appears semantically significant. Note that line 17 still has METRICS_TYPE = "simpleqa", creating a potential inconsistency. Please confirm:
- Is this change intentional and correct for SimpleQA evaluation?
- Should METRICS_TYPE also be updated to "math" for consistency?
Run the following script to check for any documentation or configuration that defines what eval_type=math means versus eval_type=simpleqa:
🏁 Script executed:
#!/bin/bash
# Search for eval_type usage and definitions across the codebase
rg -n -C3 'eval_type.*=.*(math|simpleqa)' --type=py
# Search for METRICS_TYPE usage to understand the relationship
rg -n -C3 'METRICS_TYPE' --type=py
Length of output: 43059
Fix inconsistent evaluation configuration: SimpleQA should not use math evaluation.
SimpleQA is a factuality benchmark measuring the ability for language models to answer short, fact-seeking questions, not mathematical problem-solving. Using ++eval_type=math is incorrect for this dataset type. Either:
- Ensure a simpleqa evaluator exists and revert EVAL_ARGS to "++eval_type=simpleqa "; or
- Update METRICS_TYPE to align with the actual evaluator being used (if math evaluation is intentional, clarify why with a comment explaining the mismatch, like the hle dataset does).
Currently, METRICS_TYPE = "simpleqa" conflicts with EVAL_ARGS = "++eval_type=math ", creating an evaluation mismatch that will likely cause incorrect metrics calculation.
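The mismatch flagged here could be surfaced with a small check along these lines. This is only a sketch under assumptions: the helpers eval_type_from_args and is_consistent are hypothetical and not part of nemo_skills, and whether the two values must actually agree is the reviewer's premise rather than a confirmed rule (the comment itself notes that the hle dataset has an intentional mismatch).

```python
import re

# Values as they appear in the diff under review.
METRICS_TYPE = "simpleqa"
EVAL_ARGS = "++eval_type=math "


def eval_type_from_args(eval_args: str):
    """Extract the value of ++eval_type=... from an argument string, or None."""
    match = re.search(r"\+\+eval_type=(\S+)", eval_args)
    return match.group(1) if match else None


def is_consistent(metrics_type: str, eval_args: str) -> bool:
    """Hypothetical rule: the eval type should equal the metrics type."""
    return eval_type_from_args(eval_args) == metrics_type


if not is_consistent(METRICS_TYPE, EVAL_ARGS):
    print(
        f"METRICS_TYPE={METRICS_TYPE!r} but eval_type="
        f"{eval_type_from_args(EVAL_ARGS)!r}; confirm this is intentional"
    )
```

A check like this only reports the difference; it does not decide which of the two constants is the correct one.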
🤖 Prompt for AI Agents
In nemo_skills/dataset/simpleqa/__init__.py around line 18, EVAL_ARGS is
incorrectly set to "++eval_type=math " while METRICS_TYPE = "simpleqa"; change
EVAL_ARGS to "++eval_type=simpleqa " to match METRICS_TYPE and the dataset
purpose; if math evaluation was intentional instead, update METRICS_TYPE to
"math" and add a clarifying comment explaining why a math evaluator is being
used for SimpleQA so the mismatch is explicit.
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Signed-off-by: dgitman <dgitman@nvidia.com>
fix field name of simpleqa and evaluator type.