Build function that benchmarks against ClimateFinanceBench

## Title
Benchmark answer extraction on climate finance QA datasets

## Goals
Build a benchmarking tool that, given Climate Finance Bench–style QA data (and similar climate/ESG QA or extraction datasets) and a file of model predictions, computes standard answer-extraction metrics (Exact Match, token-level F1, numeric tolerance metrics).

## Acceptance Criteria

### Input

The tool takes:

- **Ground truth file** with at least:  
  `question_id`, `question`, `document/report`, one or more `gold_answers`, and optional `answer_type` (extractive, numeric, logical, etc.).
- **Predictions file** with at least:  
  `question_id`, `predicted_answer`, optional `model_name` / `run_id`.

Supports configuration of:

- Dataset schema / column mapping (so different QA datasets can be plugged in).  
- Text normalization (lowercase, strip punctuation, whitespace cleanup).  
- Numeric tolerance settings (absolute and/or relative).
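A minimal sketch of how these options could be grouped, assuming a single immutable config object. `EvalConfig`, its field names, the default values, and `normalize_text` are all hypothetical and not part of any existing codebase.

```python
import string
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical configuration object; every field name is illustrative."""
    # Maps the tool's canonical field names to the dataset's own column names.
    column_map: dict = field(default_factory=lambda: {
        "question_id": "question_id",
        "gold_answers": "gold_answers",
        "answer_type": "answer_type",
        "predicted_answer": "predicted_answer",
    })
    lowercase: bool = True
    strip_punctuation: bool = True
    collapse_whitespace: bool = True
    abs_tol: float = 0.0   # absolute numeric tolerance (assumed default)
    rel_tol: float = 0.01  # relative numeric tolerance, 1% (assumed default)


def normalize_text(text: str, cfg: EvalConfig) -> str:
    """Apply the enabled normalization steps in a fixed order."""
    if cfg.lowercase:
        text = text.lower()
    if cfg.strip_punctuation:
        # Removes ASCII punctuation only; "net-zero" becomes "netzero".
        text = text.translate(str.maketrans("", "", string.punctuation))
    if cfg.collapse_whitespace:
        text = " ".join(text.split())
    return text
```

One thing to watch: stripping punctuation turns `1.5` into `15`, so numeric answers probably need to be parsed from the raw string rather than from the normalized one.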

### Metrics

For each **question**, compute:

- **Exact Match (EM)** – 1 if normalized prediction matches *any* gold answer; else 0.  
- **Token-level F1** – compute precision/recall/F1 over word tokens; final score is the **maximum** F1 over all gold answers.  
- For **numeric** answers:  
  - **numeric_match@tol** – 1 if predicted numeric value is within tolerance of a gold numeric value; else 0.  
  - Optional: absolute and relative error.
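The per-question metrics above could be sketched as follows. This assumes SQuAD-style bag-of-words F1, a fixed normalization (lowercase, strip ASCII punctuation, collapse whitespace), and that a numeric answer is the first number found in the raw string. All function names are hypothetical, and treating "within tolerance" as "within the absolute or the relative tolerance" is an assumption the issue does not settle.

```python
import re
import string
from collections import Counter

# First number in a string, allowing thousands separators and exponents.
_NUM = re.compile(r"[-+]?\d[\d,]*\.?\d*(?:[eE][-+]?\d+)?")


def normalize(text: str) -> str:
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(pred: str, golds: list[str]) -> int:
    """1 if the normalized prediction equals any normalized gold answer."""
    p = normalize(pred)
    return int(any(p == normalize(g) for g in golds))


def _f1(pred_tokens: list[str], gold_tokens: list[str]) -> float:
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match, so EM = 1 still implies F1 = 1.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def token_f1(pred: str, golds: list[str]) -> float:
    """Maximum token-level F1 over all gold answers."""
    p = normalize(pred).split()
    return max((_f1(p, normalize(g).split()) for g in golds), default=0.0)


def parse_number(text: str):
    """Return the first number in the raw text as a float, or None."""
    m = _NUM.search(text)
    return float(m.group().replace(",", "")) if m else None


def numeric_match(pred: str, golds: list[str],
                  abs_tol: float = 0.0, rel_tol: float = 0.01) -> int:
    """1 if the prediction is within either tolerance of any gold value."""
    p = parse_number(pred)
    if p is None:
        return 0
    for g in golds:
        gv = parse_number(g)
        # Relative tolerance is measured against the gold value (assumption).
        if gv is not None and abs(p - gv) <= max(abs_tol, rel_tol * abs(gv)):
            return 1
    return 0
```

The parser ignores units and scale words ("million", "%", "tCO2e"), so a real implementation would need to decide how to handle those before comparing values.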

Aggregated metrics:

- Macro-average EM and F1 over all questions.  
- Macro-average EM/F1 by `answer_type` (if present).  
- Macro-average numeric_match@tol for numeric questions.
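A possible shape for the aggregation step, assuming each per-question row is a dict with `em`, `f1`, an optional `answer_type`, and a `numeric_match` that is `None` for non-numeric questions. The row layout and the `aggregate` name are hypothetical.

```python
from collections import defaultdict


def _mean(values):
    """Arithmetic mean, or None when there is nothing to average."""
    values = list(values)
    return sum(values) / len(values) if values else None


def aggregate(rows: list[dict]) -> dict:
    """Average per-question scores overall and per answer_type."""
    by_type = defaultdict(list)
    for r in rows:
        if r.get("answer_type") is not None:
            by_type[r["answer_type"]].append(r)

    # Only rows that actually carry a numeric_match score are averaged.
    numeric = [r["numeric_match"] for r in rows
               if r.get("numeric_match") is not None]

    return {
        "n_questions": len(rows),
        "em": _mean(r["em"] for r in rows),
        "f1": _mean(r["f1"] for r in rows),
        "numeric_match": _mean(numeric),
        "by_answer_type": {
            t: {
                "n": len(rs),
                "em": _mean(r["em"] for r in rs),
                "f1": _mean(r["f1"] for r in rs),
            }
            for t, rs in sorted(by_type.items())
        },
    }
```

Returning `None` rather than `0.0` for an empty group keeps "no numeric questions" distinguishable from "all numeric questions wrong".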

### Correctness

- If the predicted answer matches any gold answer after normalization → `EM = 1` and `F1 = 1`.  
- If the prediction shares no tokens with any gold answer → `F1 = 0`.  
- For multiple gold answers → use the **best-scoring** gold answer.  
- For numeric answers: if the predicted value is within tolerance of any gold value → `numeric_match@tol = 1`; otherwise `0`.

### Output

- Per-question metrics for EM, F1, numeric_match@tol.  
- Macro averages across all questions.  
- Optional breakdown by answer_type.  
- Output in machine-readable JSON or CSV.
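As a rough illustration of the two output formats, using only the standard library. The function names, the JSON layout (`per_question` plus `aggregates`), and the CSV column order are assumptions.

```python
import csv
import io
import json

CSV_FIELDS = ["question_id", "em", "f1", "numeric_match"]


def to_json(per_question: list[dict], aggregates: dict) -> str:
    """Serialize per-question rows and aggregates into one JSON document."""
    return json.dumps(
        {"per_question": per_question, "aggregates": aggregates},
        indent=2,
        sort_keys=True,
    )


def per_question_to_csv(per_question: list[dict]) -> str:
    """Serialize per-question rows as CSV; None becomes an empty cell."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=CSV_FIELDS,
        extrasaction="ignore",  # silently drop keys that are not in CSV_FIELDS
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(per_question)
    return buf.getvalue()
```

Both functions return strings instead of writing files, which keeps them pure and leaves file I/O to the caller.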

## Implementation

- Base schema compatibility on Climate Finance Bench, but keep evaluation logic dataset-agnostic.  
- Functions should be **pure, functional, stateless**.  
- Reuse normalization utilities from the IR benchmark where possible.  
- Include a small synthetic dataset with hand-calculated metrics and tests asserting equality.
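One hand-calculated case that could go into the synthetic dataset, assuming whitespace tokenization after normalization: prediction `net zero 2050` against the single gold answer `net zero by 2050`. Three of the three predicted tokens appear in the gold answer, which has four tokens.

```math
P = \frac{3}{3} = 1, \qquad
R = \frac{3}{4}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2 \cdot 1 \cdot \tfrac{3}{4}}{1 + \tfrac{3}{4}} = \frac{6}{7} \approx 0.857
```

The strings differ after normalization, so `EM = 0` for this row. Because F1 is a float, the test would be safer comparing against `6/7` within a small tolerance than asserting strict equality.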

