Merged
Changes from 1 commit
2 changes: 1 addition & 1 deletion nemo_skills/dataset/simpleqa/__init__.py
@@ -15,7 +15,7 @@
# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
METRICS_TYPE = "simpleqa"
-EVAL_ARGS = "++eval_type=simpleqa "
+EVAL_ARGS = "++eval_type=math "

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the evaluation type change from "simpleqa" to "math".

The change from ++eval_type=simpleqa to ++eval_type=math appears semantically significant. Note that line 17 still has METRICS_TYPE = "simpleqa", creating a potential inconsistency. Please confirm:

  1. Is this change intentional and correct for SimpleQA evaluation?
  2. Should METRICS_TYPE also be updated to "math" for consistency?

Run the following script to check for any documentation or configuration that defines what eval_type=math means versus eval_type=simpleqa:


🏁 Script executed:

#!/bin/bash
# Search for eval_type usage and definitions across the codebase
rg -n -C3 'eval_type.*=.*(math|simpleqa)' --type=py

# Search for METRICS_TYPE usage to understand the relationship
rg -n -C3 'METRICS_TYPE' --type=py

Length of output: 43059


Fix inconsistent evaluation configuration: SimpleQA should not use math evaluation.

SimpleQA is a factuality benchmark that measures the ability of language models to answer short, fact-seeking questions, not mathematical problem-solving. Using ++eval_type=math is incorrect for this dataset type. Either:

  1. Ensure a simpleqa evaluator exists and revert EVAL_ARGS to "++eval_type=simpleqa "; or
  2. Update METRICS_TYPE to align with the evaluator actually being used (if math evaluation is intentional, add a comment explaining the mismatch, as the hle dataset does).

Currently, METRICS_TYPE = "simpleqa" conflicts with EVAL_ARGS = "++eval_type=math ", creating an evaluation mismatch that will likely cause incorrect metrics calculation.

🤖 Prompt for AI Agents
In nemo_skills/dataset/simpleqa/__init__.py around line 18, EVAL_ARGS is
incorrectly set to "++eval_type=math " while METRICS_TYPE = "simpleqa"; change
EVAL_ARGS to "++eval_type=simpleqa " to match METRICS_TYPE and the dataset
purpose; if math evaluation was intentional instead, update METRICS_TYPE to
"math" and add a clarifying comment explaining why a math evaluator is being
used for SimpleQA so the mismatch is explicit.
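
The check described above can be sketched as a small standalone script. This is only an illustration under stated assumptions: `eval_type_from_args` and `describe_mismatch` are hypothetical helpers, not part of nemo_skills, and a mismatch is treated as something to flag for review rather than as a definite error, since a dataset may pair a custom METRICS_TYPE with a different evaluator on purpose.

```python
import re
from typing import Optional


def eval_type_from_args(eval_args: str) -> Optional[str]:
    """Return the value of the ++eval_type=... override, or None if absent.

    Hypothetical helper: it only parses the string format seen in the diff.
    """
    match = re.search(r"\+\+eval_type=(\S+)", eval_args)
    return match.group(1) if match else None


def describe_mismatch(metrics_type: str, eval_args: str) -> Optional[str]:
    """Return a human-readable note when the two settings differ, else None.

    Assumption: differing values deserve a second look, not an automatic failure.
    """
    eval_type = eval_type_from_args(eval_args)
    if eval_type is None or eval_type == metrics_type:
        return None
    return (
        f"METRICS_TYPE={metrics_type!r} but eval_type={eval_type!r}; "
        "confirm this is intentional and document why"
    )


# Values taken from the diff under review.
note = describe_mismatch("simpleqa", "++eval_type=math ")
if note:
    print(note)
```

Run against the values in this diff, the script prints a note; with "++eval_type=simpleqa " it stays silent.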

GENERATION_ARGS = "++prompt_config=generic/default "
EVAL_SPLIT = "verified"

4 changes: 2 additions & 2 deletions nemo_skills/dataset/simpleqa/prepare.py
@@ -29,7 +29,7 @@ def format_entry(entry: dict, idx: int) -> dict:
     return {
         "id": entry.get("id", f"simpleqa_{idx}"),
         "metadata": eval(entry["metadata"]),
-        "problem": entry["problem"],
+        "question": entry["problem"],
         "expected_answer": entry["answer"],
     }

@@ -39,7 +39,7 @@ def format_entry_verified(entry: dict, idx: int) -> dict:
     return {
         "id": entry.get("original_index", f"simpleqa_{idx}"),
         "metadata": entry.to_dict(),
-        "problem": entry["problem"],
+        "question": entry["problem"],
         "expected_answer": entry["answer"],
     }

2 changes: 1 addition & 1 deletion nemo_skills/prompt/config/judge/simpleqa.yaml
@@ -70,7 +70,7 @@ user: |-

Here is a new example. Simply reply with either CORRECT, INCORRECT, NOT ATTEMPTED. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
```
-Question: {problem}
+Question: {question}
Gold target: {expected_answer}
Predicted answer: {predicted_answer}
```
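
The rename from `problem` to `question` has to land in both prepare.py and the judge prompt, because the prompt placeholder is filled from the prepared entry's keys. A minimal sketch of how that coupling could be checked, assuming the template is filled with Python-style `{name}` placeholders; `placeholders` and `missing_fields` are hypothetical helpers, and the sample entries are made up for illustration:

```python
import string


def placeholders(template: str) -> set:
    """Collect the field names used as {name} placeholders in a template."""
    return {
        field_name
        for _, field_name, _, _ in string.Formatter().parse(template)
        if field_name
    }


def missing_fields(template: str, entry: dict) -> set:
    """Return placeholders that the entry cannot fill."""
    return placeholders(template) - set(entry)


# Template lines as they read after this change.
judge_template = (
    "Question: {question}\n"
    "Gold target: {expected_answer}\n"
    "Predicted answer: {predicted_answer}\n"
)

# Illustrative entries only; real entries come from prepare.py plus generation.
new_entry = {"question": "q", "expected_answer": "a", "predicted_answer": "p"}
old_entry = {"problem": "q", "expected_answer": "a", "predicted_answer": "p"}

print(missing_fields(judge_template, new_entry))
print(missing_fields(judge_template, old_entry))
```

An entry that still uses the old `problem` key leaves `question` unfilled, which is the failure the paired edits in this PR avoid.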