
Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions) #3570

Open
ajtgjmdjp wants to merge 12 commits into EleutherAI:main from ajtgjmdjp:add-jfinqa


ajtgjmdjp commented Feb 8, 2026

Summary

  • Add jfinqa, a benchmark of 1000 questions for evaluating LLMs on numerical reasoning over Japanese corporate financial statements
  • Three subtasks: numerical reasoning (550), consistency checking (200), temporal reasoning (250)
  • Questions require multi-step arithmetic (1–5 steps) over tables extracted from real EDINET filings, spanning 68 companies across J-GAAP, IFRS, and US-GAAP
  • Zero-shot, generate_until format with numerical matching (1% tolerance)
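The 1% tolerance matching mentioned above can be illustrated with a minimal sketch. The names `try_parse_number`, `numerical_match`, and `NUMERICAL_TOLERANCE` appear in this PR's commit messages, but the bodies below are assumptions written for illustration and are not the PR's actual `utils.py`. In particular, the handling of commas, percent signs, a zero reference value, and the string fallback are guesses.

```python
from typing import Optional

# 1% relative tolerance, as stated in the PR summary.
NUMERICAL_TOLERANCE = 0.01


def try_parse_number(text: str) -> Optional[float]:
    """Parse a number, ignoring thousands separators and a trailing percent sign.

    Assumption: the real utility may normalize more (units, full-width digits).
    """
    cleaned = text.strip().replace(",", "").rstrip("%").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def numerical_match(
    predicted: str, gold: str, tolerance: float = NUMERICAL_TOLERANCE
) -> bool:
    """Return True when predicted is within a relative tolerance of gold."""
    pred = try_parse_number(predicted)
    ref = try_parse_number(gold)
    if pred is None or ref is None:
        # Assumed fallback: compare non-numeric answers as plain strings.
        return predicted.strip() == gold.strip()
    if ref == 0:
        # Relative error is undefined at zero; fall back to an absolute bound.
        return abs(pred) <= tolerance
    return abs(pred - ref) <= tolerance * abs(ref)


if __name__ == "__main__":
    print(numerical_match("1,234.5", "1234.5"))  # same value, different formatting
    print(numerical_match("100.5", "100"))       # 0.5% off
    print(numerical_match("102", "100"))         # 2% off
```

A relative tolerance keeps the check meaningful across the wide range of magnitudes found in financial statements, where an absolute threshold would be too strict for large totals and too loose for ratios.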

Task structure

| Group | Tasks |
| --- | --- |
| jfinqa | jfinqa_numerical, jfinqa_consistency, jfinqa_temporal |

Baseline results (zero-shot, temperature=0, 1000 questions)

| Model | Overall | NR | CC | TR |
| --- | --- | --- | --- | --- |
| GPT-4o | 86.8% | 79.6% | 93.5% | 97.2% |
| Gemini 2.0 Flash | 77.3% | 78.7% | 82.5% | 70.0% |
| GPT-4o-mini | 69.0% | 83.6% | 86.0% | 23.2% |
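The Overall column is consistent with a size-weighted mean of the three subtask accuracies, using the subtask sizes from the summary (550 numerical reasoning, 200 consistency checking, 250 temporal reasoning). The sketch below recomputes it from the reported figures. The helper name `size_weighted_accuracy` is hypothetical, and this is only an arithmetic check, not the harness's aggregation code.

```python
# Subtask sizes from the PR summary (NR, CC, TR).
SUBTASK_SIZES = {"NR": 550, "CC": 200, "TR": 250}

# Accuracies (percent) as reported in the baseline results.
REPORTED = {
    "GPT-4o": {"Overall": 86.8, "NR": 79.6, "CC": 93.5, "TR": 97.2},
    "Gemini 2.0 Flash": {"Overall": 77.3, "NR": 78.7, "CC": 82.5, "TR": 70.0},
    "GPT-4o-mini": {"Overall": 69.0, "NR": 83.6, "CC": 86.0, "TR": 23.2},
}


def size_weighted_accuracy(scores: dict, sizes: dict) -> float:
    """Average subtask accuracies, weighting each by its question count."""
    total = sum(sizes.values())
    return sum(scores[name] * count for name, count in sizes.items()) / total


if __name__ == "__main__":
    for model, row in REPORTED.items():
        weighted = size_weighted_accuracy(row, SUBTASK_SIZES)
        print(f"{model}: weighted={weighted:.2f}  reported={row['Overall']}")
```

Because the subtasks differ in size, an unweighted mean of the three columns would give different numbers; for GPT-4o-mini, for example, the low TR score would pull an unweighted mean further down than the reported 69.0%.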

Links

Checklist

  • Is the task an existing benchmark in the literature?
  • Have you verified the samples from the dataset are correct?
  • Is the dataset publicly available?
  • Does the task have a dedicated README.md?
  • Have you cited the original paper?
  • No external dependencies (utils.py uses only Python stdlib)

🤖 Generated with Claude Code

@ajtgjmdjp ajtgjmdjp requested a review from baberabb as a code owner February 8, 2026 16:09

CLAassistant commented Feb 8, 2026

CLA assistant check
All committers have signed the CLA.

927 questions evaluating LLM numerical reasoning over Japanese
corporate financial statements (EDINET filings).

Three subtasks:
- jfinqa_numerical (550): ratio/growth calculations
- jfinqa_consistency (200): verify internal consistency
- jfinqa_temporal (177): year-over-year trend analysis

Dataset: https://huggingface.co/datasets/ajtgjmdjp/jfinqa
Paper: https://github.com/ajtgjmdjp/jfinqa

Co-Authored-By: Claude Opus 4.6 <[email protected]>
ajtgjmdjp and others added 7 commits February 9, 2026 01:50
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- The group: field belongs only in _jfinqa.yaml (the group config), not in
  individual task configs, where it causes a TaskConfig init error
- Apply ruff format to utils.py (no logic changes)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Without this, running --tasks jfinqa shows a blank group row.
Uses weight_by_size for correct averaging across unequal subtasks
(550/200/177 questions).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@ajtgjmdjp ajtgjmdjp changed the title Add jfinqa: Japanese Financial Numerical Reasoning QA (927 questions) Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions) Feb 11, 2026
ajtgjmdjp and others added 4 commits February 11, 2026 21:52
- Add 51 unit tests covering all utility functions (normalize,
  extract_answer, try_parse_number, numerical_match, doc_to_text,
  process_results)
- Support Japanese 回答: prefix in answer extraction
- Extract NUMERICAL_TOLERANCE as documented module-level constant
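
As an illustration of prefix-based answer extraction with the Japanese 回答: marker, here is a minimal sketch. The function name `extract_answer` comes from the commit message, but the regular expression, the English `Answer:` prefix, the full-width colon handling, and the last-line fallback are all assumptions rather than the PR's actual logic.

```python
import re

# Assumed pattern: an English or Japanese answer marker, followed by an
# ASCII or full-width colon, then the rest of that line.
_ANSWER_PREFIX = re.compile(r"(?:Answer|回答)\s*[::]\s*(.+)", re.IGNORECASE)


def extract_answer(generation: str) -> str:
    """Return the text after the last answer marker in a model generation.

    Falls back to the last non-empty line when no marker is present
    (an assumed behavior, chosen here for illustration).
    """
    matches = _ANSWER_PREFIX.findall(generation)
    if matches:
        return matches[-1].strip()
    lines = [line.strip() for line in generation.splitlines() if line.strip()]
    return lines[-1] if lines else ""


if __name__ == "__main__":
    print(extract_answer("売上高の増加率を計算します。\n回答: 12.5%"))
    print(extract_answer("Step 1: divide.\nAnswer: 42"))
```

Taking the last match rather than the first lets a model restate intermediate values before committing to a final answer.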

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Align with jfinqa v0.3.0 release (1000 questions, DuPont decomposition,
expanded tables) now published on PyPI and HuggingFace.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The normalization logic in utils.py mirrors jfinqa._metrics.
Added explicit cross-reference to keep both copies in sync.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Prevent local evaluation results from being tracked.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
ajtgjmdjp (Author) commented

Hi, just a friendly ping. Happy to address any feedback.

Quick summary: this adds jfinqa, a Japanese financial numerical reasoning QA benchmark (1,000 questions from 68 companies). It includes 4 baselines (GPT-4o 87.0%, Gemini 2.0 Flash 80.4%, GPT-4o-mini 67.7%, Qwen2.5-3B 39.6%) and is published on both PyPI and HuggingFace.
