Title
Benchmark answer extraction on climate finance QA datasets
Goals
Build a benchmarking tool that, given Climate Finance Bench–style QA data (and similar climate/ESG QA or extraction datasets) and a file of model predictions, computes standard answer-extraction metrics (Exact Match, token-level F1, numeric tolerance metrics).
Acceptance Criteria
Input
The tool takes:
- Ground truth file with at least: question_id, question, document/report, one or more gold_answers, and optional answer_type (extractive, numeric, logical, etc.).
- Predictions file with at least: question_id, predicted_answer, optional model_name / run_id.
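As a rough illustration of how the two inputs could be combined, the sketch below joins ground truth and predictions on question_id. The function name `join_predictions` and the choice to treat a missing prediction as an empty string (so that it scores 0) are assumptions for illustration, not part of the spec.

```python
from typing import Any, Dict, List


def join_predictions(
    ground_truth: List[Dict[str, Any]],
    predictions: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Pair each ground-truth row with its prediction, keyed on question_id."""
    pred_by_id = {p["question_id"]: p["predicted_answer"] for p in predictions}
    joined = []
    for row in ground_truth:
        joined.append(
            {
                "question_id": row["question_id"],
                "gold_answers": list(row["gold_answers"]),
                # answer_type is optional in the ground truth file.
                "answer_type": row.get("answer_type"),
                # Assumption: a missing prediction becomes "" rather than an error.
                "predicted_answer": pred_by_id.get(row["question_id"], ""),
            }
        )
    return joined
```

Iterating over the ground truth rather than the predictions keeps the denominator of every average fixed at the number of questions, whatever the model chose to answer.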
Supports configuration of:
- Dataset schema / column mapping (so different QA datasets can be plugged in).
- Text normalization (lowercase, strip punctuation, whitespace cleanup).
- Numeric tolerance settings (absolute and/or relative).
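The configuration options above might look something like the following. This is a minimal sketch: `NormalizationConfig`, `normalize_text`, and `remap_columns` are hypothetical names, and the defaults are assumptions.

```python
import string
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class NormalizationConfig:
    """Hypothetical switches for the text normalization steps."""
    lowercase: bool = True
    strip_punctuation: bool = True
    collapse_whitespace: bool = True


def normalize_text(text: str, cfg: NormalizationConfig = NormalizationConfig()) -> str:
    """Apply the enabled normalization steps in a fixed order."""
    if cfg.lowercase:
        text = text.lower()
    if cfg.strip_punctuation:
        # Removes ASCII punctuation only; this also drops decimal points.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if cfg.collapse_whitespace:
        text = " ".join(text.split())
    return text


def remap_columns(row: Dict[str, Any], column_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename dataset-specific columns to canonical names.

    column_map maps canonical name -> source column name.
    """
    return {canonical: row[source] for canonical, source in column_map.items()}
```

Because punctuation stripping removes decimal points, numeric answers would need to be parsed from the raw string before text normalization is applied.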
Metrics
For each question, compute:
- Exact Match (EM) – 1 if normalized prediction matches any gold answer; else 0.
- Token-level F1 – compute precision/recall/F1 over word tokens; final score is the maximum F1 over all gold answers.
- For numeric answers:
  - numeric_match@tol – 1 if predicted numeric value is within tolerance of a gold numeric value; else 0.
  - Optional: absolute and relative error.
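The per-question metrics could be sketched as below. The function names and the inline normalizer are illustrative assumptions. The tolerance rule is also an assumption, since the spec does not say how absolute and relative tolerances combine: here a value matches if it satisfies either tolerance, with the relative tolerance measured against the gold value.

```python
import string
from collections import Counter
from typing import Sequence


def _norm(text: str) -> str:
    """Minimal stand-in normalizer: lowercase, drop punctuation, squeeze spaces."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(pred: str, golds: Sequence[str]) -> int:
    """1 if the normalized prediction equals any normalized gold answer."""
    return int(any(_norm(pred) == _norm(g) for g in golds))


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a prediction and a single gold answer."""
    p_tokens, g_tokens = _norm(pred).split(), _norm(gold).split()
    if not p_tokens or not g_tokens:
        # Assumption: two empty answers count as a match, otherwise 0.
        return float(p_tokens == g_tokens)
    overlap = sum((Counter(p_tokens) & Counter(g_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_tokens)
    recall = overlap / len(g_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(pred: str, golds: Sequence[str]) -> float:
    """Maximum token-level F1 over all gold answers."""
    return max(token_f1(pred, g) for g in golds)


def numeric_match(pred: float, golds: Sequence[float],
                  abs_tol: float = 0.0, rel_tol: float = 0.0) -> int:
    """1 if pred is within abs_tol OR rel_tol (relative to gold) of any gold value."""
    return int(any(abs(pred - g) <= max(abs_tol, rel_tol * abs(g)) for g in golds))
```

For example, the prediction "green bonds issued" against the gold answer "green bonds" has precision 2/3 and recall 1, giving an F1 of 0.8 and an EM of 0.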
Aggregated metrics:
- Macro-average EM and F1 over all questions.
- Macro-average EM/F1 by answer_type (if present).
- Macro-average numeric_match@tol for numeric questions.
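One possible shape for the aggregation step is shown here. It assumes per-question rows are dictionaries with keys `em`, `f1`, an optional `answer_type`, and an optional `numeric_match`; these key names and the `aggregate` function are hypothetical.

```python
from collections import defaultdict
from statistics import fmean
from typing import Any, Dict, List


def aggregate(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Macro-average per-question scores overall, by answer_type, and for numerics."""
    result: Dict[str, Any] = {
        "em": fmean(r["em"] for r in rows),
        "f1": fmean(r["f1"] for r in rows),
    }

    # Group rows by answer_type, skipping rows where it is absent.
    by_type: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for r in rows:
        if r.get("answer_type") is not None:
            by_type[r["answer_type"]].append(r)
    result["by_answer_type"] = {
        t: {"em": fmean(r["em"] for r in g), "f1": fmean(r["f1"] for r in g)}
        for t, g in by_type.items()
    }

    # numeric_match is only averaged over questions that have a value for it.
    numeric = [r["numeric_match"] for r in rows if r.get("numeric_match") is not None]
    result["numeric_match"] = fmean(numeric) if numeric else None
    return result
```

Every question carries equal weight in each average, which is what makes these macro-averages over questions. `aggregate` expects at least one row, since `fmean` raises on empty input.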
Correctness
- If the predicted answer matches any gold answer after normalization → EM = 1 and F1 = 1.
- If the prediction shares no tokens with any gold answer → F1 = 0.
- For multiple gold answers → use the best-scoring gold answer.
- Numeric answers: if within tolerance → numeric_match@tol = 1.
Output
- Per-question metrics for EM, F1, numeric_match@tol.
- Macro averages across all questions.
- Optional breakdown by answer_type.
- Output in machine-readable JSON or CSV.
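A minimal sketch of the two output formats, using only the standard library. The function names, the JSON layout (`per_question` and `aggregates` keys), and the CSV column set are assumptions.

```python
import csv
import io
import json
from typing import Any, Dict, List

# Hypothetical fixed column order for the per-question CSV.
CSV_FIELDS = ["question_id", "em", "f1", "numeric_match"]


def to_json(per_question: List[Dict[str, Any]], aggregates: Dict[str, Any]) -> str:
    """Serialize per-question rows and aggregates into one JSON document."""
    return json.dumps(
        {"per_question": per_question, "aggregates": aggregates},
        indent=2,
        sort_keys=True,
    )


def per_question_to_csv(per_question: List[Dict[str, Any]]) -> str:
    """Serialize per-question rows as CSV; missing metrics become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=CSV_FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in per_question:
        writer.writerow({k: row.get(k, "") for k in CSV_FIELDS})
    return buf.getvalue()
```

Returning strings rather than writing files keeps both functions free of side effects, so the caller decides where the output goes.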
Implementation
- Base schema compatibility on Climate Finance Bench, but keep evaluation logic dataset-agnostic.
- Evaluation functions should be pure and stateless (no hidden global state or side effects).
- Reuse normalization utilities from the IR benchmark where possible.
- Include a small synthetic dataset with hand-calculated metrics and tests asserting equality.
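The synthetic dataset could be as small as the sketch below. The fixture rows and the `check_fixture` helper are hypothetical. The expected values were worked out by hand, assuming lowercase normalization and whitespace tokenization. Floats are compared within a small tolerance rather than with strict equality, to avoid spurious failures from rounding.

```python
from typing import Callable, List, Sequence

# Hypothetical fixture rows: (prediction, gold_answers, expected_em, expected_f1).
FIXTURE = [
    ("Green bonds", ["green bonds"], 1, 1.0),
    # precision 2/3, recall 1 -> F1 = 0.8
    ("green bonds issued", ["green bonds"], 0, 0.8),
    # no token overlap with either gold answer
    ("carbon tax", ["green bonds", "sustainability loans"], 0, 0.0),
    # best gold is the second: precision 1, recall 3/5 -> F1 = 0.75
    ("scope 1 emissions", ["scope 2 emissions", "scope 1 and 2 emissions"], 0, 0.75),
]
EXPECTED_MACRO_EM = 0.25    # (1 + 0 + 0 + 0) / 4
EXPECTED_MACRO_F1 = 0.6375  # (1.0 + 0.8 + 0.0 + 0.75) / 4


def check_fixture(
    em_fn: Callable[[str, Sequence[str]], int],
    f1_fn: Callable[[str, Sequence[str]], float],
    tol: float = 1e-9,
) -> List[str]:
    """Run the metric functions over FIXTURE and return a list of mismatches."""
    failures, ems, f1s = [], [], []
    for pred, golds, exp_em, exp_f1 in FIXTURE:
        em, f1 = em_fn(pred, golds), f1_fn(pred, golds)
        ems.append(em)
        f1s.append(f1)
        if em != exp_em:
            failures.append(f"EM mismatch for {pred!r}: got {em}, expected {exp_em}")
        if abs(f1 - exp_f1) > tol:
            failures.append(f"F1 mismatch for {pred!r}: got {f1}, expected {exp_f1}")
    if abs(sum(ems) / len(ems) - EXPECTED_MACRO_EM) > tol:
        failures.append("macro EM mismatch")
    if abs(sum(f1s) / len(f1s) - EXPECTED_MACRO_F1) > tol:
        failures.append("macro F1 mismatch")
    return failures
```

A test would pass the tool's own EM and best-over-gold F1 functions to `check_fixture` and assert that the returned list is empty.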