24 changes: 24 additions & 0 deletions lmms_eval/tasks/medqa/medqa.yaml
@@ -0,0 +1,24 @@
dataset_path: lmms-lab/MEDQA
dataset_kwargs:
  token: True

task: "medqa"
test_split: test
doc_to_target: !function utils.medqa_doc_to_target
doc_to_visual: null
doc_to_text: !function utils.medqa_doc_to_text
doc_to_choice: !function utils.medqa_doc_to_choice

lmms_eval_specific_kwargs:
  default:
    pre_prompt: ""
    post_prompt: "\nAnswer with the option's letter from the given choices directly: "
metric_list:
  - metric: accuracy
    aggregation: mean
    higher_is_better: true

process_results: !function utils.medqa_process_results

metadata:
  version: 0.0
122 changes: 122 additions & 0 deletions lmms_eval/tasks/medqa/utils.py
@@ -0,0 +1,122 @@
import random
from typing import Any, Dict, List

import numpy as np


def medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str, Any]):

🛠️ Refactor suggestion

Add return type annotation for public API

Type hints are required. Annotate return type.

-def medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str, Any]):
+def medqa_doc_to_text(
+    doc: Dict[str, Any],
+    lmms_eval_specific_kwargs: Dict[str, Any],
+) -> str:
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around line 7, the function signature def
medqa_doc_to_text(doc: Dict[str, Any], lmms_eval_specific_kwargs: Dict[str,
Any]): is missing a return type annotation; update the signature to include an
explicit return type (e.g., -> str) that matches the function's actual return
value, and add any needed typing imports (from typing import Any, Dict,
Optional, Union) if you choose a more complex type.

"""
Build the MCQ prompt from MEDQA sample.

Expected doc fields (from `lmms-lab/MEDQA` parquet):
- "question": str
- "options": dict mapping letters to option strings (e.g., {"A": "...", "B": "..."})
- Some samples may also expose choices as list-like; we normalize to a lettered block.
- We do not use visuals for MEDQA.
"""
    question = doc.get("question", "").strip()

    # Normalize options into A..E style lines
    options = doc.get("options")
    if isinstance(options, dict):
        # Keep only A-E in sorted letter order if present
        ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
        options_block = "\n".join([f"{k}. {str(options[k]).strip()}" for k in ordered_keys])
    elif isinstance(options, list):
        letters = ["A", "B", "C", "D", "E"]
        options_block = "\n".join([f"{letters[i]}. {str(opt).strip()}" for i, opt in enumerate(options)])
    else:
        # Fallback: try to format if already string-like
        options_block = str(options) if options is not None else ""

Comment on lines +20 to +31

⚠️ Potential issue

Fix potential IndexError when options is a list; wrap long lines

Enumerating beyond 5 options will raise IndexError. Use zip with fixed letters. Also wrap long joins to respect line length.

-    # Normalize options into A..E style lines
+    # Normalize options into A..E style lines
     options = doc.get("options")
     if isinstance(options, dict):
         # Keep only A-E in sorted letter order if present
         ordered_keys = [k for k in ["A", "B", "C", "D", "E"] if k in options]
-        options_block = "\n".join([f"{k}. {str(options[k]).strip()}" for k in ordered_keys])
+        options_block = "\n".join(
+            f"{k}. {str(options[k]).strip()}" for k in ordered_keys
+        )
     elif isinstance(options, list):
         letters = ["A", "B", "C", "D", "E"]
-        options_block = "\n".join([f"{letters[i]}. {str(opt).strip()}" for i, opt in enumerate(options)])
+        options_block = "\n".join(
+            f"{letter}. {str(opt).strip()}" for letter, opt in zip(letters, options)
+        )
     else:
         # Fallback: try to format if already string-like
         options_block = str(options) if options is not None else ""
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 20 to 31, the branch that formats
options when options is a list can raise IndexError if there are more than five
options, and both join calls exceed the line-length limit; replace the
enumerate-based indexing with a zip between the fixed letters
["A","B","C","D","E"] and the options list so extra options are ignored without
raising, and pass a generator expression to each join, split across several
source lines, so the code stays within the line-length limit.
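
A minimal sketch of the zip-based formatting this comment suggests, assuming options arrive either as a letter-keyed dict or as a plain list. The helper name `format_options` and the `LETTERS` constant are hypothetical, introduced only for illustration; they are not part of the PR.

```python
from typing import Any

# Hypothetical module-level constant; the PR inlines this list instead.
LETTERS = ["A", "B", "C", "D", "E"]


def format_options(options: Any) -> str:
    """Render options as 'A. text' lines; list items beyond E are dropped."""
    if isinstance(options, dict):
        keys = [k for k in LETTERS if k in options]
        return "\n".join(f"{k}. {str(options[k]).strip()}" for k in keys)
    if isinstance(options, list):
        # zip stops at the shorter input, so a sixth option cannot raise.
        return "\n".join(
            f"{letter}. {str(opt).strip()}" for letter, opt in zip(LETTERS, options)
        )
    return str(options) if options is not None else ""
```

Silently dropping a sixth option is a trade-off: it avoids the crash but can hide the correct answer, so a stricter variant might raise a descriptive error instead.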

    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    prompt = f"{question}\n{options_block}"
    return f"{pre_prompt}{prompt}{post_prompt}"


def medqa_doc_to_target(doc: Dict[str, Any]):
    """
    Return the ground-truth answer letter.

    MEDQA on HF commonly provides either:
    - "answer_idx": a letter like "A"/"B"/... OR
    - "answer": a full string like "C" or the option text. We prioritize letter if available.
    """
    # Prefer explicit answer letter field when present
Comment on lines +38 to +46

🛠️ Refactor suggestion

Add return type; normalize answer letter casing

Guarantee uppercase to match choice letters consistently.

-def medqa_doc_to_target(doc: Dict[str, Any]):
+def medqa_doc_to_target(doc: Dict[str, Any]) -> str:
@@
-    if "answer_idx" in doc and isinstance(doc["answer_idx"], str) and len(doc["answer_idx"]) == 1:
-        return doc["answer_idx"].strip()
+    if "answer_idx" in doc and isinstance(doc["answer_idx"], str) and len(doc["answer_idx"]) == 1:
+        return doc["answer_idx"].strip().upper()
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 38-46, add an explicit return
type for medqa_doc_to_target (str) and ensure the returned answer letter is
always uppercase: prefer doc.get("answer_idx") and if present return its
.upper(); otherwise take doc.get("answer"), strip whitespace, extract the first
non-empty character (or first token) and return it uppercased; if neither exists
return an empty string. Also ensure you handle non-string inputs by converting
to str before stripping/uppercasing.
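
One way the normalization described above could look, as a sketch rather than the PR's implementation. The name `target_letter` is hypothetical, and returning an empty string for an unknown gold answer is an assumption taken from this prompt; the PR itself falls back to "A".

```python
from typing import Any, Dict

LETTERS = ["A", "B", "C", "D", "E"]


def target_letter(doc: Dict[str, Any]) -> str:
    """Return the gold letter in uppercase, or "" when it cannot be determined."""
    idx = doc.get("answer_idx")
    # Strip before validating so a padded value such as " c " is still accepted.
    if isinstance(idx, str) and idx.strip().upper() in LETTERS:
        return idx.strip().upper()

    ans = doc.get("answer")
    if isinstance(ans, str) and ans.strip().upper() in LETTERS:
        return ans.strip().upper()

    # Map a free-text answer back to its letter through the options dict.
    options = doc.get("options")
    if isinstance(options, dict) and isinstance(ans, str):
        for key, value in options.items():
            if isinstance(value, str) and value.strip() == ans.strip():
                return str(key).strip().upper()

    return ""
```

An empty string can never equal a predicted letter, so samples without a usable gold answer are always scored as incorrect; a fallback of "A" would instead credit any model that happens to answer "A".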

    if "answer_idx" in doc and isinstance(doc["answer_idx"], str) and len(doc["answer_idx"]) == 1:
        return doc["answer_idx"].strip()

    # Some variants store the letter in "answer" directly
    ans = doc.get("answer")
    if isinstance(ans, str) and len(ans.strip()) == 1 and ans.strip().upper() in ["A", "B", "C", "D", "E"]:
        return ans.strip().upper()

    # If answer is provided as text, try to map back to a letter via options
    options = doc.get("options")
    if isinstance(options, dict) and isinstance(ans, str):
        for k, v in options.items():
            if isinstance(v, str) and v.strip() == ans.strip():
                return k

    # Fallback: unknown -> choose a dummy; evaluation will mark as incorrect
    return "A"


def medqa_doc_to_choice(doc: Dict[str, Any]) -> List[str]:
    # Detect how many choices are present and return corresponding letters
    if isinstance(doc.get("options"), dict):
        present = [k for k in ["A", "B", "C", "D", "E"] if k in doc["options"]]
        if present:
            return present
    if isinstance(doc.get("options"), list):
        n = min(len(doc["options"]), 5)
        return ["A", "B", "C", "D", "E"][:n]
    # Default to 5-way if uncertain
    return ["A", "B", "C", "D", "E"]


def medqa_process_results(doc: Dict[str, Any], result: List[str]):
    """
    Parse model output and compute accuracy against the gold letter.
    We robustly extract a single letter from the response.
    """
    response = result[0].strip()
    all_choices = medqa_doc_to_choice(doc)
    pred = _parse_multi_choice_response(response, all_choices)
    gt_ans = medqa_doc_to_target(doc)
    score = 1.0 if pred == gt_ans else 0.0
    return {"accuracy": score}

Comment on lines +79 to +90

🛠️ Refactor suggestion

Type hints + guard empty result; ensure deterministic behavior

Add return type; handle empty result defensively.

-def medqa_process_results(doc: Dict[str, Any], result: List[str]):
+def medqa_process_results(doc: Dict[str, Any], result: List[str]) -> Dict[str, float]:
@@
-    response = result[0].strip()
+    response = str(result[0]).strip() if result else ""
🤖 Prompt for AI Agents
In lmms_eval/tasks/medqa/utils.py around lines 79-90, the function lacks a
return type and doesn’t guard against an empty result list; add a return type
annotation (-> Dict[str, float]) and make the function defensive by checking if
not result or not result[0]. If the result is empty, return a deterministic
{"accuracy": 0.0}; otherwise proceed to compute response = result[0].strip(),
parse prediction, compare to gold, and return {"accuracy": score}.
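
The guard described above might be sketched as follows. This is not the PR's parser: `extract_letter` is a hypothetical, simplified regex stand-in that returns None when no letter is found, so an unparseable response deterministically scores 0 rather than relying on a random guess. `score_response` is likewise an illustrative name.

```python
import re
from typing import Dict, List, Optional


def extract_letter(response: str, choices: List[str]) -> Optional[str]:
    """Return the last standalone choice letter in the response, or None."""
    pattern = r"\b(" + "|".join(re.escape(c) for c in choices) + r")\b"
    hits = re.findall(pattern, response)
    return hits[-1] if hits else None


def score_response(result: List[str], gold: str, choices: List[str]) -> Dict[str, float]:
    """Score one sample; empty or unparseable output yields 0.0."""
    response = str(result[0]).strip() if result else ""
    pred = extract_letter(response, choices)
    return {"accuracy": 1.0 if pred is not None and pred == gold else 0.0}
```

A known limitation of this stand-in is that the English article "A" counts as a standalone letter, so a production parser would need the stricter "(A)" and "A." checks the PR already performs.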


def _parse_multi_choice_response(response: str, all_choices: List[str]) -> str:
    # Clean punctuation around the response
    for ch in [",", ".", "!", "?", ";", ":", "'"]:
        response = response.strip(ch)
    response = " " + response + " "

    candidates = []
    # (A) style
    for c in all_choices:
        if f"({c})" in response:
            candidates.append(c)

    # plain letter surrounded by spaces
    if len(candidates) == 0:
        for c in all_choices:
            if f" {c} " in response:
                candidates.append(c)

    # A., B., etc.
    if len(candidates) == 0:
        for c in all_choices:
            if f"{c}." in response:
                candidates.append(c)

    if len(candidates) == 0:
        return random.choice(all_choices)
    if len(candidates) > 1:
        # choose the last occurrence to mitigate explanations mentioning multiple letters
        start_indexes = [response.rfind(f" {can} ") for can in candidates]
        return candidates[int(np.argmax(start_indexes))]
    return candidates[0]
24 changes: 24 additions & 0 deletions lmms_eval/tasks/scibench/scibench.yaml
@@ -0,0 +1,24 @@
dataset_path: lmms-lab/SciBench
dataset_kwargs:
  token: True
test_split: train
task: "scibench"

doc_to_text: !function utils.scibench_doc_to_text
doc_to_target: "answer_number"

lmms_eval_specific_kwargs:
  default:
    pre_prompt: |
      Please provide a clear and step-by-step solution for a scientific problem in the categories of Chemistry, Physics, or Mathematics. The problem will specify the unit of measurement, which should not be included in the answer. Express the final answer as a decimal number with three digits after the decimal point. Conclude the answer by stating "The answer is therefore \boxed{[ANSWER]}."
    post_prompt: "\nLet's think step by step."

metric_list:
  - metric: accuracy
    aggregation: mean
    higher_is_better: true

process_results: !function utils.scibench_process_results

metadata:
  version: 0.0
23 changes: 23 additions & 0 deletions lmms_eval/tasks/scibench/scibench_multishot.yaml
@@ -0,0 +1,23 @@
dataset_path: lmms-lab/SuperGPQA

⚠️ Potential issue

Wrong dataset referenced; points to SuperGPQA instead of SciBench.

This will evaluate the wrong corpus.

-dataset_path: lmms-lab/SuperGPQA
+dataset_path: lmms-lab/SciBench
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/scibench_multishot.yaml around line 1, the
dataset_path is set to lmms-lab/SuperGPQA which references the wrong corpus;
replace this value with the correct SciBench dataset path (e.g., the proper
repository/dataset identifier for SciBench used elsewhere in the project) so the
task evaluates the SciBench corpus instead of SuperGPQA, and verify the dataset
name matches other configs and is accessible.

dataset_kwargs:
  token: True
test_split: test
task: "scibench_multishot"

doc_to_text: !function utils.scibench_multishot_doc_to_text
doc_to_target: "answer_number"

lmms_eval_specific_kwargs:
  default:
    pre_prompt: ""
    post_prompt: ""

metric_list:
  - metric: accuracy
    aggregation: mean
    higher_is_better: true

process_results: !function utils.scibench_process_results

metadata:
  version: 0.0
131 changes: 131 additions & 0 deletions lmms_eval/tasks/scibench/utils.py
@@ -0,0 +1,131 @@
import re
from math import isclose
from typing import Dict, List, Tuple

Comment on lines +1 to +4

⚠️ Potential issue

Silence RUF001 or replace ambiguous unicode; add Optional for types

Ruff flags the Unicode ×/− in literals. Easiest: per‑file suppression. Also prep for Optional return types.

+ # ruff: noqa: RUF001  # allow ×/− in prompt examples
 import re
 from math import isclose
-from typing import Dict, List, Tuple
+from typing import Dict, List, Optional
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py lines 1-4, Ruff warns about ambiguous
Unicode characters (×/−) and the comment requests preparing return types as
Optional; to fix, add a per-file Ruff suppression directive at the top (e.g., a
single-line ruff noqa comment targeting RUF001) or replace any ambiguous Unicode
multiplication/division/minus characters in literals with ASCII equivalents (x,
*, -) throughout the file, and update any function return type hints that may be
nullable to use typing.Optional[...] (import Optional if not already) so
signatures reflect possible None returns.
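
If the ASCII-replacement route is chosen over the per-file suppression, a small translation table is one option. This is a sketch under that assumption; `to_ascii_math` is a hypothetical helper, and the mapping covers only the three symbols named above.

```python
# Unicode escapes keep the ambiguous characters out of the source itself.
_ASCII_MATH = str.maketrans(
    {
        "\u00d7": "*",  # multiplication sign
        "\u00f7": "/",  # division sign
        "\u2212": "-",  # minus sign
    }
)


def to_ascii_math(text: str) -> str:
    """Replace Unicode math operators with ASCII equivalents."""
    return text.translate(_ASCII_MATH)
```

Rewriting the few-shot prompt this way changes the text the model sees, so the suppression comment may be preferable when the prompt wording must stay identical to a published baseline.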

FEWSHOT_PROMPT = """Problem:
Suppose that $10.0 \mathrm{~mol} \mathrm{C}_2 \mathrm{H}_6(\mathrm{~g})$ is confined to $4.860 \mathrm{dm}^3$ at $27^{\circ} \mathrm{C}$. Predict the pressure exerted by the ethane from the perfect gas.

Solution:
To predict ethane's pressure using the ideal gas law (PV = nRT), I'll convert temperature from 27°C to Kelvin (300.15 K, rounded to 300 K for significant figures), then substitute the given values: 10.0 mol of ethane, volume of 4.860 L, gas constant R = 0.0821 L·atm/(mol·K), and T = 300 K. Rearranging the ideal gas equation to P = nRT/V and calculating: P = (10.0 mol × 0.0821 L·atm/(mol·K) × 300 K) ÷ 4.860 L = 246.3 L·atm ÷ 4.860 L ≈ 50.7 atm. Since temperature has two significant figures, the final pressure is \(\boxed{50.7}\) atm.
Final Answer: The final answer is \(\boxed{50.7}\). I hope it is correct.

Problem:
Assume that all gases are perfect and that data refer to 298.15 K unless otherwise stated. Calculate the change in chemical potential of a perfect gas when its pressure is increased isothermally from $1.8 \mathrm{~atm}$ to $29.5 \mathrm{~atm}$ at $40^{\circ} \mathrm{C}$.

Solution:
To determine the change in chemical potential (Δμ) of a perfect gas during pressure change, I begin with the fundamental relation μ = μ° + RT ln(P/P°), which yields Δμ = RT ln(Pf/Pi) for changes between two states. Converting the given temperature of 40°C to 313.15K and using R = 8.314 J/(mol·K), I calculate Δμ = (8.314 J/(mol·K))(313.15K)ln(29.5/1.8). The pressure ratio 29.5/1.8 ≈ 16.39 gives ln(16.39) ≈ 2.797, so Δμ = 8.314 × 313.15 × 2.797 ≈ 7274.5 J/mol, which rounds to 7.3 kJ/mol.
Final Answer: The final answer is \(\boxed{7.3}\). I hope it is correct.

Problem:
Show that the small angular deviation of $\epsilon$ of a plumb line from the true vertical (i.e., toward the center of Earth) at a point on Earth's surface at a latitude $\lambda$ is $\epsilon = \frac{R\omega^2sin\lambda cos\lambda}{g_0 - R\omega^2 cos^2\lambda}$ where R is the radius of Earth. What is the value (in seconds of arc) of the maximum deviation? Note that the entire denominator in the answer is actually the effective $g$, and $g_0$ denotes the pure gravitational component.

Solution:
To determine the small angular deviation (ε) of a plumb line from true vertical due to Earth's rotation, we analyze the balance of forces at latitude λ: gravitational force (Fg = mg0) toward Earth's center and centrifugal force (Fc = mRω²cosλ) perpendicular to the rotation axis. The centrifugal force resolves into a vertical component (Fc,v = mRω²cos²λ) that reduces effective gravity to g = g0 - Rω²cos²λ, and a horizontal component (Fc,h = mRω²sinλcosλ) pulling toward the equator. The angular deviation equals the ratio of horizontal force to effective gravity: ε = Rω²sinλcosλ/(g0 - Rω²cos²λ). To find maximum deviation, we differentiate with respect to λ and find it occurs at λ = 45°. Using Earth values (R = 6.371×10⁶ m, ω = 7.292×10⁻⁵ rad/s, g0 ≈ 9.81 m/s²), we calculate the numerator at 45° as 1.697×10⁻² m/s² and denominator as 9.793 m/s², yielding εmax = 1.733×10⁻³ rad or approximately 357 arcseconds (6 arcminutes).
Final Answer: The final answer is \(\boxed{6}\). I hope it is correct."""


def scibench_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    question = doc["problem_text"]
    if doc["unit"].strip():
        question = question + " The unit of the answer is " + doc["unit"] + "."
    return f"{pre_prompt}{question}{post_prompt}"


def extract_boxed_answers(text):
    # Find all boxed contents
    matches = re.findall(r"boxed{([^}]*)}", text)
    for m in matches:
        # Strip spaces
        candidate = m.strip()
        # Keep only the numeric ones (int or decimal, with optional sign)
        if re.fullmatch(r"[-+]?\d*\.?\d+", candidate):
            return candidate
    return None
Comment on lines +36 to +45

⚠️ Potential issue

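Fix boxed-answer extraction: anchor the regex on \boxed{ and accept scientific notation.

The pattern omits the backslash, so it matches any boxed{...} substring rather than only the LaTeX macro, and the numeric check rejects scientific notation; a valid output such as \boxed{1.2e-3} is scored 0.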

Apply this diff:

-def extract_boxed_answers(text):
-    # Find all boxed contents
-    matches = re.findall(r'boxed{([^}]*)}', text)
-    for m in matches:
-        # Strip spaces
-        candidate = m.strip()
-        # Keep only the numeric ones (int or decimal, with optional sign)
-        if re.fullmatch(r'[-+]?\d*\.?\d+', candidate):
-            return candidate
-    return None
+from typing import Optional
+
+def extract_boxed_answers(text: str) -> Optional[str]:
+    r"""Return the first numeric token inside a \boxed{...} span, if any."""
+    matches = re.findall(r'\\boxed\{([^}]*)\}', text)
+    for m in matches:
+        candidate = m.strip()
+        # Supports ints/decimals and optional scientific notation (e/E).
+        mnum = re.search(
+            r'[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?', candidate
+        )
+        if mnum:
+            return mnum.group(0)
+    return None
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 36 to 45, the regex used to
find boxed answers does not anchor on the backslash in `\boxed{}`, so it matches
any `boxed{...}` substring, and its numeric check rejects scientific notation;
update the extraction pattern to match the LaTeX macro (e.g. a raw pattern with
a literal backslash such as \\boxed\{([^}]*)\}) and replace the numeric
validation with a regex that accepts integers, decimals and scientific notation
(optional sign, digits with an optional decimal point or a leading decimal
point, and an optional exponent like [eE][+-]?\d+); keep trimming whitespace
from the captured group, return the numeric string when it matches the improved
pattern, and otherwise continue to the next match or return None as before.
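
The two patterns can be exercised in isolation. The sketch below follows the suggested diff; the names `BOXED`, `NUMBER`, and `first_boxed_number` are hypothetical and chosen only for this example.

```python
import re
from typing import Optional

# Literal backslash, then "boxed{", then everything up to the first "}".
BOXED = re.compile(r"\\boxed\{([^}]*)\}")
# Optional sign, integer or decimal mantissa, optional e/E exponent.
NUMBER = re.compile(r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?")


def first_boxed_number(text: str) -> Optional[str]:
    """Return the first numeric token found inside any \\boxed{...} span."""
    for inner in BOXED.findall(text):
        match = NUMBER.search(inner.strip())
        if match:
            return match.group(0)
    return None
```

Because `search` is used instead of `fullmatch`, a box holding `x^2` would yield `2`, and nested braces such as `\boxed{\frac{1}{2}}` are cut at the first closing brace. Both are limitations of this sketch rather than behavior the PR documents.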



def remove_not(x):
    match_number = re.compile("[\$]?\ *10\^[{]?\ *-?[0-9]+\ *[}]?\ *[\$]?")
    result = re.findall(match_number, x)
    if len(result) != 0:
        return re.split(match_number, x)[-1]
    return None


def cal_not(inputs):
    try:
        x, ab = list(inputs)
        match_number = re.compile("10\^[{]?\ *-?[0-9]+\ *[}]?")
        ab = re.findall(match_number, ab)[0]
        ab = ab[ab.find("^") + 1 :]
        if "{" in ab:
            ab = ab[ab.find("{") + 1 :]
        if "}" in ab:
            ab = ab[: ab.find("}")]
        x = x.strip()
        out = float(x) * 10 ** float(ab)
        # print(float(x)*10**float(ab))
        return str(out)
    except:
        print("error")
    return inputs


def parse_not(inputs):
    try:
        if not inputs:
            return "", ""
        if "\\times" in inputs:
            x, ab = inputs.split("\\times")
        elif "\times" in inputs:
            x, ab = inputs.split("\times")
        elif "*" in inputs:
            x, ab = inputs.split("*")
        else:
            return inputs
        return x, ab
    except:
        return "", ""


Comment on lines +48 to +91

🛠️ Refactor suggestion

🧩 Analysis chain

Remove or fix unused helpers and bare except blocks.

These helpers are unused here and contain bare except: clauses (E722). Prefer removal to reduce surface area.

Run to confirm they’re unused before removal:

If unused, delete them; otherwise, add type hints and narrow exceptions.


🏁 Script executed:

#!/bin/bash
rg -nP '\b(remove_not|cal_not|parse_not|equiv_with_unit)\b' -g '!**/test/**' -S -C2

Length of output: 148


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "rg version: $(rg --version 2>/dev/null || true)"
rg -nP '\b(remove_not|cal_not|parse_not|equiv_with_unit)\b' -S -C2 || true
echo "---- definitions ----"
rg -nP '^def\s+(remove_not|cal_not|parse_not)\s*\(' -n -S || true
echo "---- show lmms_eval/tasks/scibench/utils.py (if present) ----"
if [ -f "lmms_eval/tasks/scibench/utils.py" ]; then
  sed -n '1,240p' lmms_eval/tasks/scibench/utils.py || true
else
  echo "lmms_eval/tasks/scibench/utils.py not found"
fi

Length of output: 8084


Remove unused helpers and replace bare except blocks

remove_not, cal_not, parse_not are only defined in lmms_eval/tasks/scibench/utils.py (≈ lines 47–87) and contain bare except: handlers — delete them to reduce surface area. If you keep them, add type hints and replace bare except with specific exceptions (e.g., ValueError/IndexError) and return/raise explicit errors instead of printing.

🧰 Tools
🪛 Ruff (0.13.1)

68-68: Do not use bare except

(E722)


84-84: Consider moving this statement to an else block

(TRY300)


85-85: Do not use bare except

(E722)

🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 47–87 the helper functions
remove_not, cal_not, and parse_not are unused and contain bare except: handlers;
either delete these three functions to reduce surface area, or (if you intend to
keep them) add precise type hints for inputs/returns, replace bare except
clauses with specific exceptions (e.g., ValueError, IndexError, AttributeError)
and handle errors explicitly by returning well-typed sentinel values or raising
informative exceptions instead of printing, ensure regex lookups check for
non-empty matches before indexing, and remove any silent prints so behavior is
deterministic and testable.
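
Should the helpers be kept, the same parsing can be done without any exception handling by validating the whole string with one regex first. The following is a hedged sketch of that idea, not a drop-in replacement for `parse_not` and `cal_not`; `parse_latex_sci` and `_SCI` are hypothetical names.

```python
import re
from typing import Optional

# mantissa, then "\times" or "*", then 10^exp with optional braces.
_SCI = re.compile(
    r"^\s*([-+]?\d*\.?\d+)\s*(?:\\times|\*)\s*10\^\{?\s*([-+]?\d+)\s*\}?\s*$"
)


def parse_latex_sci(text: str) -> Optional[float]:
    """Convert strings like '3.2 \\times 10^{-4}' to a float, else None."""
    match = _SCI.match(text)
    if match is None:
        return None
    mantissa, exponent = match.groups()
    # Both groups are regex-validated, so float() and int() cannot raise here.
    return float(mantissa) * 10.0 ** int(exponent)
```

Returning None for unrecognized input gives callers a single, typed sentinel to check, in contrast to the printed "error" and mixed return types in the original helpers.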

def equiv_with_unit(model_output, answer, unit):
    model_output = model_output.replace(",", "")
    print("Model_output: ", model_output)
    try:
        ans = float(answer.strip())
        first = isclose(float(model_output.strip()), ans, rel_tol=0.05)
    except:
        first = False
    try:
        model = model_output.strip().split()[0]
        second = isclose(float(model.strip()), ans, rel_tol=0.05)
    except:
        second = False
    if first or second:
        return True
    return False


def clean_number_string(s):
    return s.replace(",", "").replace("−", "-").strip()


def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
    pred = result[0]
    pred = extract_boxed_answers(pred)
    if pred:
        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
        score = 1 if res_equiv else 0
    else:
        score = 0
    return {"accuracy": score}
Comment on lines +114 to +122

⚠️ Potential issue

Harden result processing: handle empty results and non‑string targets safely.

Avoids IndexError and type issues when answer_number is numeric.

-def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
-    pred = result[0]
-    pred = extract_boxed_answers(pred)
-    if pred:
-        res_equiv = isclose(float(clean_number_string(pred)), float(clean_number_string(doc["answer_number"])), rel_tol=0.05)
-        score = 1 if res_equiv else 0
-    else:
-        score = 0
-    return {"accuracy": score}
+def scibench_process_results(doc: Dict, result: List[str]) -> Dict[str, float]:
+    r"""Compute accuracy with 5% tolerance using the first \boxed{} number."""
+    pred_text = result[0] if result else ""
+    boxed = extract_boxed_answers(pred_text)
+    if boxed is None:
+        return {"accuracy": 0}
+    try:
+        gold = float(clean_number_string(str(doc["answer_number"])))
+        pred_val = float(clean_number_string(boxed))
+        return {"accuracy": int(isclose(pred_val, gold, rel_tol=0.05))}
+    except (TypeError, ValueError):
+        return {"accuracy": 0}
🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 108 to 116, the function
assumes result[0] exists and doc["answer_number"] is a string; update it to
first check that result is a non-empty list and that result[0] is a string
before processing, and coerce doc["answer_number"] to a string (or handle
numeric types) before cleaning/converting; wrap the float conversions in a
try/except (catch ValueError/TypeError) and return {"accuracy": 0} on any
parsing error so no IndexError or type errors can propagate.
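
The hardened flow described in this comment can be exercised in isolation. What follows is a minimal sketch under stated assumptions, not the repository's implementation: `_extract_boxed` and `_clean_number` are hypothetical stand-ins for `extract_boxed_answers` and `clean_number_string`, whose real behavior is not shown in this thread, and `process_results_sketch` is an illustrative name.

```python
import re
from math import isclose
from typing import Any, Dict, List, Optional


def _extract_boxed(text: str) -> Optional[str]:
    """Hypothetical stand-in: content of the first \\boxed{...}, or None."""
    match = re.search(r"\\boxed\{([^{}]*)\}", text)
    return match.group(1) if match else None


def _clean_number(text: str) -> str:
    """Hypothetical stand-in: strip whitespace and thousands separators."""
    return text.strip().replace(",", "")


def process_results_sketch(doc: Dict[str, Any], result: List[str]) -> Dict[str, int]:
    """Score 1 if the boxed prediction is within 5% of the target, else 0."""
    # An empty result list or a non-string first entry counts as "no prediction".
    pred_text = result[0] if result and isinstance(result[0], str) else ""
    boxed = _extract_boxed(pred_text)
    if boxed is None:
        return {"accuracy": 0}
    try:
        # str() lets numeric targets (int/float) pass through the string cleaner.
        gold = float(_clean_number(str(doc["answer_number"])))
        pred_val = float(_clean_number(boxed))
    except (TypeError, ValueError):
        # Unparseable numbers are scored as wrong rather than raising.
        return {"accuracy": 0}
    return {"accuracy": int(isclose(pred_val, gold, rel_tol=0.05))}
```

One caveat on the tolerance: `math.isclose` with only `rel_tol` set uses a default `abs_tol` of 0, so a target of exactly 0 is matched only by a prediction of exactly 0.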



def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    question = doc["problem_text"]
    if doc["unit"].strip():
        question = question + " The unit of the answer is " + doc["unit"] + "."
    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."
Comment on lines +125 to +131

⚠️ Potential issue

Use pre/post prompts; add docstring; avoid unused vars (F841)

Also keeps lines within 88 chars.

-def scibench_multishot_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Dict) -> str:
-    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
-    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
-    question = doc["problem_text"]
-    if doc["unit"].strip():
-        question = question + " The unit of the answer is " + doc["unit"] + "."
-    return FEWSHOT_PROMPT + "\n" + question + "\nAnswer: Let's think step by step."
+def scibench_multishot_doc_to_text(
+    doc: Dict, lmms_eval_specific_kwargs: Dict
+) -> str:
+    """Build the multi-shot prompt for SciBench."""
+    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
+    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
+    question = str(doc.get("problem_text", ""))
+    unit = str(doc.get("unit") or "").strip()
+    if unit:
+        question = f"{question} The unit of the answer is {unit}."
+    return (
+        f"{pre_prompt}{FEWSHOT_PROMPT}\n{question}\n"
+        "Answer: Let's think step by step."
+        f"{post_prompt}"
+    )
🧰 Tools
🪛 Ruff (0.13.1)

126-126: Local variable pre_prompt is assigned to but never used

Remove assignment to unused variable pre_prompt

(F841)


127-127: Local variable post_prompt is assigned to but never used

Remove assignment to unused variable post_prompt

(F841)

🤖 Prompt for AI Agents
In lmms_eval/tasks/scibench/utils.py around lines 125-131, the function
currently defines pre_prompt and post_prompt but never uses them, lacks a
docstring, and has long concatenated strings; update the function to include a
short docstring describing parameters and return value, use pre_prompt before
the question and post_prompt after the answer (combine with FEWSHOT_PROMPT),
replace string concatenation with f-strings to avoid unused variable lint
(F841), and reflow string construction so no line exceeds ~88 characters.
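
The prompt assembly described in this comment can be checked on its own. This is a minimal sketch under stated assumptions: the `FEWSHOT_PROMPT` value here is a hypothetical placeholder (the real constant is defined in the task's utils module and is not shown in this thread), and `build_multishot_prompt` is an illustrative name. It reads the prompts with `.get(..., "")`, which is more lenient than the indexed lookup in the suggestion.

```python
from typing import Any, Dict

# Hypothetical placeholder for the real few-shot examples.
FEWSHOT_PROMPT = "Problem: 1 + 1 = ?\nAnswer: \\boxed{2}"


def build_multishot_prompt(doc: Dict[str, Any], kwargs: Dict[str, Any]) -> str:
    """Assemble pre_prompt + few-shot examples + question + post_prompt."""
    pre_prompt = kwargs.get("pre_prompt", "")
    post_prompt = kwargs.get("post_prompt", "")
    question = str(doc.get("problem_text", ""))
    # Tolerates a missing, None, or whitespace-only unit.
    unit = str(doc.get("unit") or "").strip()
    if unit:
        question = f"{question} The unit of the answer is {unit}."
    return (
        f"{pre_prompt}{FEWSHOT_PROMPT}\n{question}\n"
        "Answer: Let's think step by step."
        f"{post_prompt}"
    )
```

Because both prompts are now part of the returned string, the F841 warnings no longer apply, and a task config with empty `pre_prompt` and `post_prompt` produces the same text as the original concatenation.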