Add jfinqa: Japanese Financial Numerical Reasoning QA (1000 questions) #3570
Open
ajtgjmdjp wants to merge 12 commits into EleutherAI:main from
Conversation
927 questions evaluating LLM numerical reasoning over Japanese corporate financial statements (EDINET filings).

Three subtasks:
- jfinqa_numerical (550): ratio/growth calculations
- jfinqa_consistency (200): verify internal consistency
- jfinqa_temporal (177): year-over-year trend analysis

Dataset: https://huggingface.co/datasets/ajtgjmdjp/jfinqa
Paper: https://github.com/ajtgjmdjp/jfinqa

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- The group: field belongs only in _jfinqa.yaml (the group config), not in the individual task configs, where it causes a TaskConfig init error
- Apply ruff format to utils.py (no logic changes)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Without this, running --tasks jfinqa shows a blank group row. Uses weight_by_size so the group score is averaged correctly across subtasks of unequal size (550/200/177 questions).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
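To illustrate why size weighting matters for subtasks of unequal size, here is a minimal sketch. Only the subtask sizes (550/200/177) come from the commit message above; the per-subtask accuracies are made-up placeholders, and the helper names are hypothetical rather than part of lm-evaluation-harness.

```python
# Subtask sizes are taken from the commit message.
SIZES = {
    "jfinqa_numerical": 550,
    "jfinqa_consistency": 200,
    "jfinqa_temporal": 177,
}

# Hypothetical accuracies, for illustration only (not results from this PR).
HYPOTHETICAL_ACC = {
    "jfinqa_numerical": 0.80,
    "jfinqa_consistency": 0.60,
    "jfinqa_temporal": 0.70,
}


def unweighted_mean(scores):
    """Plain average of subtask scores, ignoring question counts."""
    return sum(scores.values()) / len(scores)


def size_weighted_mean(scores, sizes):
    """Average of subtask scores weighted by the number of questions in each."""
    total = sum(sizes[name] for name in scores)
    return sum(scores[name] * sizes[name] for name in scores) / total


if __name__ == "__main__":
    print(f"unweighted: {unweighted_mean(HYPOTHETICAL_ACC):.4f}")
    print(f"weighted:   {size_weighted_mean(HYPOTHETICAL_ACC, SIZES):.4f}")
```

With these placeholder numbers the weighted score is pulled toward the largest subtask, which is the same as computing accuracy over all questions pooled together.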
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add 51 unit tests covering all utility functions (normalize, extract_answer, try_parse_number, numerical_match, doc_to_text, process_results)
- Support Japanese 回答: prefix in answer extraction
- Extract NUMERICAL_TOLERANCE as documented module-level constant

Co-Authored-By: Claude Opus 4.6 <[email protected]>
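As a rough illustration of what answer extraction with a 回答: prefix and tolerance-based numeric matching could look like, here is a minimal sketch. The function names are borrowed from the commit message, but the bodies, signatures, and the tolerance value of 1% are assumptions made for this example and are not taken from the PR's utils.py.

```python
import re

# Assumed relative tolerance, for illustration; the PR's actual value is not stated here.
NUMERICAL_TOLERANCE = 0.01

# Matches "Answer:" or "回答:" (ASCII or full-width colon) and captures the rest of the line.
_ANSWER_PREFIX = re.compile(r"(?:Answer|回答)\s*[:：]\s*(.+)", re.IGNORECASE)


def extract_answer(text):
    """Return the text after the last answer prefix, else the last non-empty line."""
    matches = _ANSWER_PREFIX.findall(text)
    if matches:
        return matches[-1].strip()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def try_parse_number(text):
    """Parse a float, ignoring thousands separators and percent signs; None on failure."""
    cleaned = text.replace(",", "").replace("%", "").replace("％", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def numerical_match(predicted, gold, tolerance=NUMERICAL_TOLERANCE):
    """True if both values parse and differ by at most `tolerance` relative to gold."""
    p, g = try_parse_number(predicted), try_parse_number(gold)
    if p is None or g is None:
        return False
    if g == 0:
        return abs(p) <= tolerance
    return abs(p - g) / abs(g) <= tolerance
```

A relative tolerance is one possible design choice: it lets a rounded prediction such as 12.5 match a gold value of 12.45 while still rejecting answers that are off by a meaningful margin.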
Align with jfinqa v0.3.0 release (1000 questions, DuPont decomposition, expanded tables) now published on PyPI and HuggingFace. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The normalization logic in utils.py mirrors jfinqa._metrics. Added explicit cross-reference to keep both copies in sync. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Prevent local evaluation results from being tracked. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Author
Hi, just a friendly ping. Happy to address any feedback. Quick summary: this adds jfinqa, a Japanese financial numerical reasoning QA benchmark (1,000 questions from 68 companies). It includes 4 baselines (GPT-4o 87.0%, Gemini 2.0 Flash 80.4%, GPT-4o-mini 67.7%, Qwen2.5-3B 39.6%) and is published on both PyPI and HuggingFace.
This was referenced Feb 14, 2026
Summary
Task structure
jfinqa: jfinqa_numerical, jfinqa_consistency, jfinqa_temporal

Baseline results (zero-shot, temperature=0, 1000 questions)
Links
Checklist
🤖 Generated with Claude Code