Summary
I'd like to add jfinqa — a Japanese financial numerical reasoning QA benchmark — as a new task in lighteval.
About jfinqa
- 1,000 questions across 3 subtasks:
  - Numerical Reasoning (550): Calculate growth rates, margins, ratios from financial statements
  - Consistency Checking (200): Verify internal consistency of figures
  - Temporal Reasoning (250): Analyze year-over-year trends
- 68 companies from EDINET (Japan's securities filing system)
- Covers J-GAAP, IFRS, and US-GAAP accounting standards
- HuggingFace Dataset: ajtgjmdjp/jfinqa
- GitHub: ajtgjmdjp/jfinqa
Metrics
Two metrics per subtask:
- Exact Match — with Japanese financial normalisation (fullwidth→halfwidth, △→minus, comma removal, NFKC)
- Numerical Match — 1% relative tolerance, handles kanji multipliers (千/百万/億/兆) and unit suffixes (円/ドル/bps)
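For illustration, here is a minimal sketch of how the two metrics could behave. It is not the code from the PR: the function names (`normalize`, `exact_match`, `parse_number`, `numerical_match`) are hypothetical, and it assumes the 1% tolerance is measured relative to the gold answer and that a unit suffix appears only at the end of the string.

```python
import re
import unicodedata

# Multipliers and suffixes named in the issue; the numeric values are the
# standard meanings of the kanji.
KANJI_MULTIPLIERS = {
    "千": 1_000,
    "百万": 1_000_000,
    "億": 100_000_000,
    "兆": 1_000_000_000_000,
}
UNIT_SUFFIXES = ("円", "ドル", "bps")

# Optional sign, digits, optional decimal part, optional kanji multiplier.
_NUMBER_RE = re.compile(r"^(-?\d+(?:\.\d+)?)(千|百万|億|兆)?$")


def normalize(text: str) -> str:
    """NFKC (fullwidth to halfwidth), △ to minus, comma removal."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("△", "-")  # △ marks a negative figure
    text = text.replace(",", "")
    return text.strip()


def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)


def parse_number(text: str):
    """Return the numeric value as a float, or None if parsing fails."""
    s = normalize(text)
    for suffix in UNIT_SUFFIXES:
        if s.endswith(suffix):
            s = s[: -len(suffix)].strip()
            break
    m = _NUMBER_RE.match(s)
    if m is None:
        return None
    # m.group(2) is None when no multiplier is present, so default to 1.
    return float(m.group(1)) * KANJI_MULTIPLIERS.get(m.group(2), 1)


def numerical_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Assumption: tolerance is relative to the gold value."""
    p, g = parse_number(pred), parse_number(gold)
    if p is None or g is None:
        return False
    if g == 0:
        return p == 0
    return abs(p - g) <= rel_tol * abs(g)
```

Under these assumptions, `1.5億円` and `150,000,000円` match numerically but not exactly, which is the case the second metric is meant to cover.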
Prior Art
- lm-evaluation-harness PR #3570 (open, mergeable)
Baselines (zero-shot, temperature=0)
| Model | Overall | Numerical | Consistency | Temporal |
|---|---|---|---|---|
| GPT-4o | 87.0% | 80.2% | 90.5% | 99.2% |
| Gemini 2.0 Flash | 80.4% | 86.2% | 83.5% | 65.2% |
| GPT-4o-mini | 67.7% | 79.3% | 83.5% | 29.6% |
| Qwen2.5-3B | 39.6% | 46.4% | 51.0% | 15.6% |
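As a sanity check, the Overall column is consistent with a question-weighted average of the three subtask scores. The sketch below assumes that is how it was computed; the helper name `overall_accuracy` is hypothetical.

```python
# Subtask sizes taken from the "About jfinqa" list (550 + 200 + 250 = 1,000).
SUBTASK_SIZES = {"numerical": 550, "consistency": 200, "temporal": 250}


def overall_accuracy(scores: dict) -> float:
    """Weight each subtask score (in percent) by its number of questions."""
    total = sum(SUBTASK_SIZES.values())
    weighted = sum(scores[name] * size for name, size in SUBTASK_SIZES.items())
    return weighted / total


# Subtask scores copied from the baseline table above.
gpt4o = {"numerical": 80.2, "consistency": 90.5, "temporal": 99.2}
print(f"{overall_accuracy(gpt4o):.1f}")
```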
I have a PR ready — happy to adjust the implementation based on your feedback (e.g., inspect-ai format if preferred).