Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions) by ajtgjmdjp · Pull Request #3570 · EleutherAI/lm-evaluation-harness

ajtgjmdjp · 2026-02-08T16:09:56Z

Summary

- Add jfinqa, a benchmark of 1000 questions for evaluating LLMs on numerical reasoning over Japanese corporate financial statements
- Three subtasks: numerical reasoning (550), consistency checking (200), temporal reasoning (250)
- Questions require multi-step arithmetic (1–5 steps) over tables extracted from real EDINET filings, spanning 68 companies across J-GAAP, IFRS, and US-GAAP
- Zero-shot, `generate_until` format with numerical matching (1% tolerance)
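As a rough illustration of what "numerical matching (1% tolerance)" could mean in practice, here is a minimal sketch of a relative-tolerance check. The names `numerical_match` and `NUMERICAL_TOLERANCE` appear in this PR's commit messages, but the body below is an assumption for illustration, not the code in `utils.py`; in particular, the handling of a zero gold value is a guess.

```python
NUMERICAL_TOLERANCE = 0.01  # 1% relative tolerance, as stated in the summary


def numerical_match(
    predicted: float, gold: float, tolerance: float = NUMERICAL_TOLERANCE
) -> bool:
    """Return True when `predicted` lies within a relative tolerance of `gold`."""
    if gold == 0:
        # Relative error is undefined at zero, so fall back to an
        # absolute check (an assumption made for this sketch).
        return abs(predicted) <= tolerance
    return abs(predicted - gold) <= tolerance * abs(gold)
```

Under this sketch, a prediction of 100.5 against a gold value of 100 would count as correct, while 102 would not.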

Task structure

| Group | Tasks |
|---|---|
| `jfinqa` | `jfinqa_numerical`, `jfinqa_consistency`, `jfinqa_temporal` |

Baseline results (zero-shot, temperature=0, 1000 questions)

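| Model | Overall | NR (numerical) | CC (consistency) | TR (temporal) |
|---|---|---|---|---|
| GPT-4o | 86.8% | 79.6% | 93.5% | 97.2% |
| Gemini 2.0 Flash | 77.3% | 78.7% | 82.5% | 70.0% |
| GPT-4o-mini | 69.0% | 83.6% | 86.0% | 23.2% |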
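The Overall column appears to be the size-weighted mean of the three subtask scores, using the 550/200/250 split from the summary. The following sketch checks that arithmetic; the helper name `size_weighted_mean` is hypothetical, and the mapping of NR/CC/TR to the numerical, consistency, and temporal subtasks is inferred from the subtask names.

```python
# Subtask sizes from the PR summary (NR = numerical, CC = consistency, TR = temporal).
SUBTASK_SIZES = {"NR": 550, "CC": 200, "TR": 250}

# Per-subtask accuracies (percent) copied from the baseline table.
BASELINES = {
    "GPT-4o": {"NR": 79.6, "CC": 93.5, "TR": 97.2},
    "Gemini 2.0 Flash": {"NR": 78.7, "CC": 82.5, "TR": 70.0},
    "GPT-4o-mini": {"NR": 83.6, "CC": 86.0, "TR": 23.2},
}


def size_weighted_mean(scores: dict, sizes: dict) -> float:
    """Average subtask scores, weighting each by its number of questions."""
    total = sum(sizes.values())
    return sum(scores[name] * sizes[name] for name in sizes) / total


for model, scores in BASELINES.items():
    print(f"{model}: {size_weighted_mean(scores, SUBTASK_SIZES):.1f}%")
# GPT-4o: 86.8%
# Gemini 2.0 Flash: 77.3%
# GPT-4o-mini: 69.0%
```

An unweighted mean would give different numbers (for example 90.1% for GPT-4o), which is why weighting by subtask size matters when the subtasks are unequal.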

Links

- Dataset: https://huggingface.co/datasets/ajtgjmdjp/jfinqa
- Code & paper: https://github.com/ajtgjmdjp/jfinqa
- License: Apache-2.0

Checklist

- Is the task an existing benchmark in the literature?
- Have you verified the samples from the dataset are correct?
- Is the dataset publicly available?
- Does the task have a dedicated README.md?
- Have you cited the original paper?
- No external dependencies (`utils.py` uses only Python stdlib)

🤖 Generated with Claude Code

CLAassistant · 2026-02-08T16:10:03Z

All committers have signed the CLA.

927 questions evaluating LLM numerical reasoning over Japanese corporate financial statements (EDINET filings).

Three subtasks:
- jfinqa_numerical (550): ratio/growth calculations
- jfinqa_consistency (200): verify internal consistency
- jfinqa_temporal (177): year-over-year trend analysis

Dataset: https://huggingface.co/datasets/ajtgjmdjp/jfinqa
Paper: https://github.com/ajtgjmdjp/jfinqa

Co-Authored-By: Claude Opus 4.6 <[email protected]>

- `group:` field belongs only in `_jfinqa.yaml` (group config), not in individual task configs where it causes TaskConfig init error
- Apply ruff format to `utils.py` (no logic changes)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

- Add 51 unit tests covering all utility functions (`normalize`, `extract_answer`, `try_parse_number`, `numerical_match`, `doc_to_text`, `process_results`)
- Support Japanese `回答:` ("Answer:") prefix in answer extraction
- Extract `NUMERICAL_TOLERANCE` as documented module-level constant

Co-Authored-By: Claude Opus 4.6 <[email protected]>
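To make the answer-extraction step concrete, here is a minimal sketch of how a generated response might be reduced to a number, including the Japanese `回答:` prefix. The function names `extract_answer` and `try_parse_number` are borrowed from the commit message above, but the bodies are assumptions written for illustration and do not reproduce the PR's `utils.py`. The fallback to the last non-empty line and the use of NFKC normalization for full-width characters are design guesses.

```python
import re
import unicodedata

# Prefixes after which the final answer is expected (English and Japanese).
# Both the ASCII colon and the full-width colon are accepted.
_ANSWER_PREFIX = re.compile(r"(?:Answer|回答)\s*[::]\s*(.+)", re.IGNORECASE)


def extract_answer(text: str) -> str:
    """Return the text after the last answer prefix, else the last non-empty line."""
    matches = _ANSWER_PREFIX.findall(text)
    if matches:
        return matches[-1].strip()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def try_parse_number(text: str):
    """Parse a float from text, tolerating commas, a trailing % and full-width digits."""
    # NFKC folds full-width digits and punctuation into their ASCII forms.
    cleaned = unicodedata.normalize("NFKC", text).replace(",", "").strip()
    match = re.search(r"[-+]?\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else None
```

For example, `try_parse_number(extract_answer("計算すると...\n回答: 12.5%"))` would yield `12.5` under this sketch, and the parsed value could then be compared against the gold answer with a tolerance check.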

ajtgjmdjp · 2026-02-13T16:35:48Z

Hi — just a friendly ping. Happy to address any feedback.

Quick summary: this adds jfinqa, a Japanese financial numerical reasoning QA benchmark (1,000 questions from 68 companies). It includes 4 baselines (GPT-4o 87.0%, Gemini 2.0 Flash 80.4%, GPT-4o-mini 67.7%, Qwen2.5-3B 39.6%) and is published on both PyPI and HuggingFace.

ajtgjmdjp requested a review from baberabb as a code owner February 8, 2026 16:09

ajtgjmdjp force-pushed the add-jfinqa branch from 5c48d0e to 61337c1 February 8, 2026 16:16

ajtgjmdjp and others added 7 commits February 9, 2026 01:50

Add jfinqa entry to tasks README

a8cb576

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Fix citation: unify title and BibTeX format

ec5ffb2

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Fix alphabetical ordering of jfinqa in tasks README

4f5f881

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Add aggregate_metric_list for weighted group-level scores

7fb449c

Without this, running `--tasks jfinqa` shows blank group row. Uses `weight_by_size` for correct averaging across unequal subtasks (550/200/177 questions).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Update jfinqa to 1000 questions and add Gemini baseline

caa12a4

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Update jfinqa task configs to reference dataset v0.2.0

c217ce5

Co-Authored-By: Claude Opus 4.6 <[email protected]>

ajtgjmdjp changed the title ~~Add jfinqa: Japanese Financial Numerical Reasoning QA (927 questions)~~ Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions) Feb 11, 2026

ajtgjmdjp and others added 4 commits February 11, 2026 21:52

Update jfinqa task metadata version 0.2.0 → 0.3.0

cb47b7a

Align with jfinqa v0.3.0 release (1000 questions, DuPont decomposition, expanded tables) now published on PyPI and HuggingFace.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
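For readers unfamiliar with the term, DuPont decomposition splits return on equity (ROE) into net profit margin, asset turnover, and the equity multiplier, which makes it a natural source of multi-step arithmetic questions. The sketch below uses made-up figures purely for illustration; they are not taken from the dataset, and the function name `dupont_roe` is hypothetical.

```python
def dupont_roe(
    net_income: float, revenue: float, total_assets: float, equity: float
) -> dict:
    """Decompose ROE into its three DuPont factors."""
    margin = net_income / revenue          # net profit margin
    turnover = revenue / total_assets      # asset turnover
    multiplier = total_assets / equity     # equity multiplier (leverage)
    return {
        "margin": margin,
        "turnover": turnover,
        "multiplier": multiplier,
        "roe": margin * turnover * multiplier,
    }


# Illustrative figures only (not from any real filing).
result = dupont_roe(net_income=120, revenue=2400, total_assets=3000, equity=1200)
print(f"ROE = {result['roe']:.1%}")  # ROE = 10.0%
```

The product of the three factors equals net income divided by equity, so a model can be checked either on the individual steps or on the final ratio.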

Add sync comment referencing canonical jfinqa._metrics

4192675

The normalization logic in `utils.py` mirrors `jfinqa._metrics`. Added explicit cross-reference to keep both copies in sync.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

Add results/ to .gitignore

ba87256

Prevent local evaluation results from being tracked.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

This was referenced Feb 14, 2026

Add jfinqa: 日本語金融数値推論QAベンチマーク (Japanese financial numerical reasoning QA benchmark) llm-jp/awesome-japanese-llm#599

Merged

Add jfinqa: Japanese Financial Numerical Reasoning QA huggingface/lighteval#1168

Open

ajtgjmdjp wants to merge 12 commits into EleutherAI:main from ajtgjmdjp:add-jfinqa