Skip to content

Latest commit

 

History

History
63 lines (45 loc) · 2.31 KB

File metadata and controls

63 lines (45 loc) · 2.31 KB

jfinqa

Paper

Title: jfinqa: Japanese Financial Numerical Reasoning QA Benchmark

Homepage: https://github.com/ajtgjmdjp/jfinqa

Citation

@misc{ogawa2025jfinqa,
  title={jfinqa: Japanese Financial Numerical Reasoning QA Benchmark},
  author={Ogawa, Saichi},
  year={2025},
  url={https://github.com/ajtgjmdjp/jfinqa},
}

Description

jfinqa is a benchmark of 1000 questions for evaluating LLMs on numerical reasoning over Japanese corporate financial statements. Each question requires multi-step arithmetic (1-5 steps) over tables extracted from real EDINET filings, spanning 68 companies across J-GAAP, IFRS, and US-GAAP.

Three subtasks:

Subtask Questions Description
Numerical Reasoning 550 Calculate financial metrics (growth rates, margins, ratios)
Consistency Checking 200 Verify internal consistency of reported figures
Temporal Reasoning 250 Determine direction of year-over-year changes

Baseline results

Model Overall Numerical Reasoning Consistency Checking Temporal Reasoning
GPT-4o 86.8% 79.6% 93.5% 97.2%
Gemini 2.0 Flash 77.3% 78.7% 82.5% 70.0%
GPT-4o-mini 69.0% 83.6% 86.0% 23.2%

1000 questions, zero-shot, temperature=0, numerical matching with 1% tolerance

Groups and Tasks

  • Group: jfinqa
  • Tasks:
    • jfinqa_numerical — Numerical reasoning (550 questions)
    • jfinqa_consistency — Consistency checking (200 questions)
    • jfinqa_temporal — Temporal reasoning (250 questions)

Dataset

Checklist

  • Is the task an existing benchmark in the literature? — Yes, published with dataset, evaluation code, and baseline results.
  • Have you verified the samples from the dataset are correct? — Yes, all 1000 questions have DSL-verified gold answers.
  • Is the dataset publicly available? — Yes, on HuggingFace Hub and PyPI.
  • Does the task have a dedicated README.md? — This file.
  • Have you cited the original paper? — Yes (see Citation above).