Build function that benchmarks against ClimateFinanceBench #9

@suung

Title

Benchmark answer extraction on climate finance QA datasets

Goals

Build a benchmarking tool that, given Climate Finance Bench–style QA data (and similar climate/ESG QA or extraction datasets) and a file of model predictions, computes standard answer-extraction metrics (Exact Match, token-level F1, numeric tolerance metrics).

Acceptance Criteria

Input

The tool takes:

  • Ground truth file with at least:
    question_id, question, document/report, one or more gold_answers, and optional answer_type (extractive, numeric, logical, etc.).
  • Predictions file with at least:
    question_id, predicted_answer, optional model_name / run_id.

Supports configuration of:

  • Dataset schema / column mapping (so different QA datasets can be plugged in).
  • Text normalization (lowercase, strip punctuation, whitespace cleanup).
  • Numeric tolerance settings (absolute and/or relative).
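
The configuration options above could be sketched roughly as follows. This is only an illustration: `NormalizationConfig`, `normalize_text`, and `remap_columns` are hypothetical names, not part of Climate Finance Bench or the existing IR benchmark utilities.

```python
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizationConfig:
    """Hypothetical settings object; field names are illustrative."""
    lowercase: bool = True
    strip_punctuation: bool = True
    collapse_whitespace: bool = True


# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(text: str, config: NormalizationConfig = NormalizationConfig()) -> str:
    """Apply the enabled normalization steps in a fixed order."""
    if config.lowercase:
        text = text.lower()
    if config.strip_punctuation:
        text = text.translate(_PUNCT_TABLE)
    if config.collapse_whitespace:
        text = " ".join(text.split())
    return text


def remap_columns(row: dict, column_map: dict) -> dict:
    """Rename dataset-specific keys to canonical names.

    column_map maps canonical name -> source column name (an assumed convention).
    """
    return {canonical: row[source] for canonical, source in column_map.items() if source in row}
```

One design point to watch: stripping punctuation also removes decimal points and minus signs (for example "1.5" becomes "15"), so numeric answers would likely need to be parsed before text normalization is applied.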

Metrics

For each question, compute:

  • Exact Match (EM) – 1 if the normalized prediction equals any normalized gold answer; else 0.
  • Token-level F1 – compute precision/recall/F1 over word tokens; final score is the maximum F1 over all gold answers.
  • For numeric answers:
    • numeric_match@tol – 1 if predicted numeric value is within tolerance of a gold numeric value; else 0.
    • Optional: absolute and relative error.
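
A minimal sketch of the per-question metrics listed above. It assumes a simplified normalization (lowercase plus whitespace split) and assumes that a numeric prediction counts as a match when it satisfies either the absolute or the relative tolerance, with the relative tolerance measured against the gold value. All function names are hypothetical.

```python
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Simplified normalization for this sketch: lowercase, split on whitespace.
    return text.lower().split()


def exact_match(prediction: str, gold_answers: list[str]) -> int:
    """1 if the normalized prediction equals any normalized gold answer."""
    pred = _tokens(prediction)
    return int(any(pred == _tokens(gold) for gold in gold_answers))


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a prediction and a single gold answer."""
    pred_tokens, gold_tokens = _tokens(prediction), _tokens(gold)
    if not pred_tokens or not gold_tokens:
        # Assumed convention: two empty answers match, otherwise score 0.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, gold_answers: list[str]) -> float:
    """Maximum F1 over all gold answers."""
    return max((token_f1(prediction, gold) for gold in gold_answers), default=0.0)


def numeric_match(predicted: float, gold_values: list[float],
                  abs_tol: float = 0.0, rel_tol: float = 0.0) -> int:
    """1 if |predicted - gold| <= max(abs_tol, rel_tol * |gold|) for any gold value."""
    return int(any(
        abs(predicted - gold) <= max(abs_tol, rel_tol * abs(gold))
        for gold in gold_values
    ))
```

For example, `best_f1("green bonds issued", ["green bonds"])` has precision 2/3 and recall 1, and `numeric_match(102.0, [100.0], rel_tol=0.05)` accepts a 2% deviation under a 5% relative tolerance.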

Aggregated metrics:

  • Macro-average EM and F1 over all questions.
  • Macro-average EM/F1 by answer_type (if present).
  • Macro-average numeric_match@tol for numeric questions.
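
The aggregation step might look like the sketch below. It assumes per-question results are plain dicts with the hypothetical keys `em`, `f1`, `numeric_match`, and `answer_type`, and that `numeric_match` is `None` for non-numeric questions so they are excluded from that average.

```python
from collections import defaultdict
from statistics import mean


def macro_average(rows: list[dict], metric: str):
    """Unweighted mean of one metric, skipping rows where it is missing or None."""
    values = [float(row[metric]) for row in rows if row.get(metric) is not None]
    return mean(values) if values else None


def aggregate(rows: list[dict]) -> dict:
    """Overall and per-answer_type macro averages."""
    by_type = defaultdict(list)
    for row in rows:
        if row.get("answer_type"):
            by_type[row["answer_type"]].append(row)
    return {
        "overall": {m: macro_average(rows, m) for m in ("em", "f1")},
        # Only rows that carry a numeric_match value contribute here.
        "numeric_match": macro_average(rows, "numeric_match"),
        "by_answer_type": {
            answer_type: {m: macro_average(group, m) for m in ("em", "f1")}
            for answer_type, group in sorted(by_type.items())
        },
    }


# Illustrative rows, not real benchmark results.
rows = [
    {"question_id": "q1", "answer_type": "extractive", "em": 1, "f1": 1.0, "numeric_match": None},
    {"question_id": "q2", "answer_type": "extractive", "em": 0, "f1": 0.5, "numeric_match": None},
    {"question_id": "q3", "answer_type": "numeric", "em": 0, "f1": 0.0, "numeric_match": 1},
]
summary = aggregate(rows)
```

Returning `None` for an empty group, rather than 0, keeps "no numeric questions" distinguishable from "all numeric questions wrong".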

Correctness

  • If the predicted answer matches any gold answer after normalization → EM = 1 and F1 = 1.
  • If the prediction shares no tokens with any gold answer → F1 = 0.
  • For multiple gold answers → use the best-scoring gold answer.
  • Numeric answers: if within tolerance → numeric_match@tol = 1.

Output

  • Per-question metrics for EM, F1, numeric_match@tol.
  • Macro averages across all questions.
  • Optional breakdown by answer_type.
  • Output in machine-readable JSON or CSV.
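
One possible shape for the output step, using only the standard library. The layout (a `per_question` list plus an `aggregates` object) and the function names are assumptions for illustration, not a format defined by the issue.

```python
import csv
import io
import json


def to_json(per_question: list[dict], aggregates: dict) -> str:
    """Serialize per-question rows and aggregate metrics into one JSON document."""
    payload = {"per_question": per_question, "aggregates": aggregates}
    return json.dumps(payload, indent=2, sort_keys=True)


def to_csv(per_question: list[dict]) -> str:
    """Serialize per-question rows as CSV; columns come from the first row."""
    if not per_question:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(per_question[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(per_question)
    return buffer.getvalue()
```

Both functions return strings instead of writing files, which keeps them free of side effects; the caller decides where the output goes.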

Implementation

  • Base schema compatibility on Climate Finance Bench, but keep evaluation logic dataset-agnostic.
  • Functions should be pure and stateless.
  • Reuse normalization utilities from the IR benchmark where possible.
  • Include a small synthetic dataset with hand-calculated metrics and tests asserting equality.
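
As an illustration of what a hand-calculated case in the synthetic dataset could look like (the example strings are invented, not taken from Climate Finance Bench): for the prediction "green bonds issued" against the single gold answer "green bonds", two of the three predicted tokens are correct and both gold tokens are recovered, so

$$
P = \frac{2}{3}, \qquad R = \frac{2}{2} = 1, \qquad F_1 = \frac{2PR}{P + R} = \frac{2 \cdot \frac{2}{3} \cdot 1}{\frac{2}{3} + 1} = \frac{4/3}{5/3} = 0.8
$$

The normalized strings differ, so EM = 0 for this question. A test would then assert EM = 0 and F1 = 0.8 (within floating-point tolerance) for this row.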
