Span-cited English investor memos from Japanese 有価証券報告書, produced by a 14B nekomata-qfin fine-tune on a single AMD Instinct MI300X.
A Japanese annual securities report (有価証券報告書, "Yuho") is a dense, cross-referential disclosure document. The going-concern note in section 2 hedges the earnings call you would otherwise read off the P&L. Accrual-quality stress only shows up when DSO drifts up while revenue drifts down and operating cash flow drifts up — three numbers in three different sections, connected only by whoever is reading. YuhoLens reads the Japanese, writes an English investor memo, and cites every material claim back to a verbatim Japanese source span — or refuses to make the claim. Abstention is a feature, not a failure.
The filer flags that prolonged yen weakness materially compresses operating
margin in the electronic-components segment (ref: '事業等のリスクとして、
急激な為替変動は営業利益率に重大な影響を及ぼす可能性がある' p.23).
| Gate | Target | Best-of-5 composer (shipping) |
|---|---|---|
| Citation presence rate | ≥ 0.70 | 1.000 |
| Section coverage (7 sections) | ≥ 0.60 | 0.994 |
| Judge coherence (gpt-5-mini Likert) | ≥ 3.80 | 3.88 |
Coherence distribution on the 50-prompt KG-2 test set: 0/2/7/36/5 (counts at
1/2/3/4/5), median 4.0, std 0.621. Verdict: PASS.
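
The summary statistics follow directly from the histogram. A minimal sketch that recomputes them, assuming the reported standard deviation is the population form (dividing by n rather than n - 1); the variable names are illustrative:

```python
import math

# Judge-score histogram from the text: counts at Likert scores 1..5.
counts = {1: 0, 2: 2, 3: 7, 4: 36, 5: 5}

n = sum(counts.values())
mean = sum(score * c for score, c in counts.items()) / n

# Population standard deviation (assumption: divide by n, not n - 1).
variance = sum(c * (score - mean) ** 2 for score, c in counts.items()) / n
std = math.sqrt(variance)

# Median: expand the histogram into a sorted list and take the middle.
scores = sorted(s for s, c in counts.items() for _ in range(c))
mid = n // 2
median = (scores[mid - 1] + scores[mid]) / 2 if n % 2 == 0 else float(scores[mid])

print(n, round(mean, 2), median, round(std, 3))  # 50 3.88 4.0 0.621
```

The sample form (dividing by n - 1) would give roughly 0.627 instead, so the population form is the one that reproduces the reported 0.621.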
| stage | mean coherence | verdict |
|---|---|---|
| v5 single-shot | 3.56 | SOFT |
| best-of-2 (mixed decoder) | 3.72 | SOFT |
| best-of-3 (same-decoder seeds) | 3.64 | SOFT |
| best-of-5 (mixed decoder + seeds) | 3.88 | PASS |
| best-of-9 (bo5 + 4 new decoder profiles) | 4.04 | PASS |
The lift is inference-time, not training-time. The same SFT checkpoint, sampled
at five different decoder profiles per prompt, lets the coherence judge pick the
per-prompt argument-unity peak. In this small-N setting, decoder diversity appears to
contribute more lift than seed diversity (the bo2-vs-bo3 difference is 0.08 at n=50,
within a one-SE noise band — we report this as a trend, not a significance claim);
the observation is baked into src/yuholens/agents/decoder_profiles.py
and the memo_critic LangGraph node.
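
A rough check of the noise-band statement, assuming the bo5 standard deviation (0.621) is a fair proxy for the spread of the bo2 and bo3 runs, whose own deviations are not reported here:

```python
import math

# Assumption: reuse the bo5 spread for the bo2 / bo3 runs.
std, n = 0.621, 50
bo2_mean, bo3_mean = 3.72, 3.64  # mean coherence from the stage table

se = std / math.sqrt(n)          # standard error of a single mean
diff = bo2_mean - bo3_mean

print(round(se, 3), round(diff, 2), abs(diff) < se)  # 0.088 0.08 True
```

Under that assumption the 0.08 gap sits inside one standard error of a single mean. The standard error of a difference between two independent means would be larger still (by a factor of √2), which only reinforces reading the gap as a trend.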
Mermaid source (renders on GitHub)

    flowchart LR
        A[Yuho PDF / EDINET row] --> B[Ingestor<br/><sub>regex section split</sub>]
        B --> C[Pass-1 Detector<br/><sub>per-section JSON</sub>]
        C --> D[MemoCriticAgent<br/><sub>best-of-N composer</sub>]
        D --> E[Citation Grounder<br/><sub>span verification</sub>]
        E --> F["English memo<br/><sub>or [evidence insufficient]</sub>"]
        style D fill:#1f6feb,stroke:#fff,color:#fff
        style E fill:#238636,stroke:#fff,color:#fff

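
The Ingestor stage of the flowchart can be pictured as a marker-based split. The sketch below is an illustration, not the project's `ingestor.py`: the function name `split_sections` and the two-entry marker list are assumptions, and it treats the first occurrence of each marker as the start of that section.

```python
import re

# Assumption: section headings appear verbatim in the extracted text.
SECTION_MARKERS = ["事業等のリスク", "経営者による財政状態の分析"]
_MARKER_RE = re.compile("|".join(re.escape(m) for m in SECTION_MARKERS))


def split_sections(text: str) -> dict[str, str]:
    """Map each marker to the text between it and the next marker."""
    matches = list(_MARKER_RE.finditer(text))
    sections: dict[str, str] = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # setdefault keeps the first occurrence if a marker repeats.
        sections.setdefault(match.group(0), text[match.end():end].strip())
    return sections


sample = "前文 事業等のリスク 為替変動。 経営者による財政状態の分析 売上高は減少。"
sections = split_sections(sample)
print(sections)
```

A real filing also mentions these headings in running text and in its table of contents, so a production splitter would likely need anchoring on line starts or numbering; that refinement is left out here.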
- Ingestor. Regex over the Yuho's section markers (`事業等のリスク`, `経営者による財政状態の分析`, …). Sub-second, deterministic.
- Pass-1 Detector. One vLLM call per Japanese section. Emits red flags, numerical claims, and `japanese_span` citations as structured JSON.
- MemoCriticAgent (best-of-N). Fans out N Pass-2 composer calls across the `DEFAULT_PROFILES` decoder catalogue, scores each candidate via the gpt-5-mini coherence judge (or a no-API heuristic), and returns the highest-scoring memo.
- Citation Grounder. Every inline `(ref: '<span>' p.N)` is verified against the union of Pass-1 spans. Sentences with all-ungrounded citations are replaced with `[evidence insufficient]`. No span, no claim.
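
The grounding rule can be sketched in a few lines. This is a hedged illustration rather than the project's `citation_grounder.py`: the helper names are hypothetical, sentences are assumed to arrive already split, spans are compared by exact match after whitespace removal, and sentences without any citation are passed through unchanged (the text does not say how those are handled).

```python
import re

# Matches inline citations of the form (ref: '<span>' p.N).
CITATION_RE = re.compile(r"\(ref:\s*'([^']*)'\s*p\.(\d+)\)")
PLACEHOLDER = "[evidence insufficient]"


def _normalize(span: str) -> str:
    """Drop whitespace so line-wrapped Japanese spans still compare equal."""
    return re.sub(r"\s+", "", span)


def ground_memo(sentences: list[str], pass1_spans: set[str]) -> list[str]:
    """Replace sentences whose citations are all ungrounded."""
    known = {_normalize(s) for s in pass1_spans}
    grounded = []
    for sentence in sentences:
        cited = [_normalize(span) for span, _page in CITATION_RE.findall(sentence)]
        if cited and not any(span in known for span in cited):
            grounded.append(PLACEHOLDER)
        else:
            grounded.append(sentence)
    return grounded


spans = {"急激な為替変動は営業利益率に重大な影響を及ぼす可能性がある"}
memo = [
    "Yen weakness compresses margin (ref: '急激な為替変動は営業利益率に重大な影響を及ぼす可能性がある' p.23).",
    # Made-up span ("non-existent citation") used only to trigger the fallback.
    "Inventory doubled year on year (ref: '存在しない引用' p.41).",
]
result = ground_memo(memo, spans)
print(result)
```

The first sentence survives because its span is in the Pass-1 set; the second is replaced by the placeholder.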
    git clone https://github.com/javierdejesusda/YuhoLens.git
    cd YuhoLens
    pip install -e .

    # Run the 4-agent composer end-to-end on a shipped sample.
    #
    # Pre-req: an OpenAI-compatible inference endpoint reachable at
    # http://localhost:8000/v1 (vLLM-ROCm, llama.cpp server, or any drop-in
    # replacement) serving the YuhoLens checkpoint or its GGUF quant. The
    # heuristic judge avoids OpenAI calls for the candidate scorer, but
    # Pass-1 / Pass-2 still need a real inference backend.
    #
    # --yuho-path supplies the Japanese source text; --yuho-row supplies the
    # company metadata and pre-extracted BS/PL/CF tables.
    python -m yuholens.agents \
        --yuho-path data/sample/sample_yuho.txt \
        --yuho-row data/sample/sample_yuho.jsonl --row-index 0 \
        --best-of-n --n-candidates 5 --judge-mode heuristic

    # Reproduce a best-of-N pick offline (no OpenAI calls, heuristic only).
    # Replace the inputs with your own candidate memo JSONL files.
    python scripts/run_bestofn_offline.py \
        --memos path/to/candidates_a.jsonl path/to/candidates_b.jsonl \
        --picked-memos /tmp/picked.jsonl \
        --picked-scores /tmp/picked.json

Run the test suite (laptop, no GPU, no API key required):

    PYTHONPATH=src python -m pytest tests/ -q

The two supported decoding modes are fully documented in `docs/model-card.md`. The short version:

| Mode | Use when | How |
|---|---|---|
| Single-shot | Latency-bound. Hits 3.56 mean coherence. | temperature=0.1, top_p=0.9, repetition_penalty=1.15 |
| Best-of-5 | Quality-bound. Hits 3.88 PASS. | MemoCriticAgent over DEFAULT_PROFILES |
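
The difference between the two modes comes down to a generate-score-pick loop. A minimal sketch, assuming a catalogue of decoder profiles and pluggable `generate` / `judge` callables; the class, the profile names, and every parameter value except the single-shot row are illustrative, not the contents of `DEFAULT_PROFILES`:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DecoderProfile:
    """Hypothetical stand-in for one entry of a decoder catalogue."""
    name: str
    temperature: float
    top_p: float
    repetition_penalty: float


# Only the first row mirrors the single-shot settings; the rest are made up.
PROFILES = [
    DecoderProfile("single_shot", 0.1, 0.9, 1.15),
    DecoderProfile("warmer", 0.5, 0.9, 1.15),
    DecoderProfile("wider", 0.3, 0.95, 1.10),
]


def best_of_n(
    prompt: str,
    profiles: list[DecoderProfile],
    generate: Callable[[str, DecoderProfile], str],
    judge: Callable[[str], float],
) -> tuple[str, float]:
    """Generate one candidate per profile and return the top-scoring pair."""
    scored = [(memo, judge(memo)) for memo in (generate(prompt, p) for p in profiles)]
    return max(scored, key=lambda pair: pair[1])


# Toy stand-ins for the model call and the coherence judge.
toy_scores = {"single_shot": 3.0, "warmer": 4.0, "wider": 3.5}


def fake_generate(prompt: str, profile: DecoderProfile) -> str:
    return f"[{profile.name}] memo for {prompt}"


def fake_judge(memo: str) -> float:
    return next(score for name, score in toy_scores.items() if name in memo)


memo, score = best_of_n("sample filing", PROFILES, fake_generate, fake_judge)
print(memo, score)
```

Single-shot is the same loop with a one-element profile list, which is why it is cheaper but cannot benefit from the judge's selection.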
The LangGraph composer is constructed via:

    from yuholens.agents.graph import build_pipeline

    app = build_pipeline(
        best_of_n=True,
        n_candidates=5,
        judge_mode="auto",  # "auto" probes the OpenAI key; falls back to heuristic
    )
    # initial_state is the pipeline's input state; its construction is not shown here.
    result = app.invoke(initial_state)
    print(result["grounded_memo"])

`judge_mode="auto"` performs a real auth probe before generating; transient outages or invalid-but-present keys silently fall back to the heuristic, so candidate generation work is never wasted.

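
The probe-then-fall-back behaviour can be illustrated with a small resolver. This is a sketch under assumptions: the function name `resolve_judge_mode` and the `"llm"` label are hypothetical, and the probe is injected as a callable so that no real API is contacted.

```python
from typing import Callable


def resolve_judge_mode(mode: str, probe: Callable[[], bool]) -> str:
    """Resolve "auto" to a concrete judge before any candidates are generated."""
    if mode != "auto":
        return mode  # an explicit choice is respected as-is
    try:
        return "llm" if probe() else "heuristic"
    except Exception:
        # Outage or rejected key: fall back quietly instead of failing later.
        return "heuristic"


def failing_probe() -> bool:
    raise ConnectionError("simulated outage")


print(resolve_judge_mode("auto", failing_probe))       # heuristic
print(resolve_judge_mode("auto", lambda: True))        # llm
print(resolve_judge_mode("heuristic", failing_probe))  # heuristic
```

Resolving the judge up front means a bad key is discovered before N candidates have been generated, not after.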
| Artefact | Where | Status |
|---|---|---|
| BF16 reference weights | javierdejesusda/yuholens-14b on HuggingFace | shipped: 28.3 GB (26.4 GiB) across 6 safetensors shards |
| GGUF release (Q3_K_M / Q4_K_M / Q5_K_M / Q6_K / Q8_0) | javierdejesusda/yuholens-14b-GGUF on HuggingFace | shipped: 51.4 GiB total; Q3_K_M smoke-verified on RTX 4070 Laptop at 10.06 gen tok/s, 109.05 prompt tok/s |
| Pre-release sanity check | scripts/check_release_set.py | built |
| Hub upload helper | scripts/hf_upload.py | built (patches generation_config.json to v5 defaults before push) |
| Model card | docs/model-card.md | shipped |
| Blog post | docs/blog_post.md | shipped |
| Demo script (90s video + 5-min walkthrough) | docs/demo_script.md | shipped |
- Training (one-shot). Single AMD Instinct MI300X (192 GB HBM3) on ROCm 7.0. Full-parameter SFT of a 14B Qwen1 model at sequence length 8192 does not fit on 80 GB-class hardware; the MI300X is not optional for the training path.
- Consumer inference. The Q3_K_M GGUF (7.18 GiB on disk) is the recommended 8 GB-class target (RTX 4070 Laptop, RTX 3060 Ti, etc.) and fits fully in VRAM at Pass-1 4-6K context. The Q4_K_M GGUF (8.81 GiB on disk) is shipped for 12-16 GB cards (RTX 4060 Ti 16 GB, RTX 3080); on 8 GB cards Q4_K_M still runs via partial GPU offload (`llama-cli -ngl 25`). Larger Q5_K_M / Q6_K / Q8_0 quants ship for prosumer / dual-GPU rigs.
- Demo / research inference. Any ROCm or CUDA host with the BF16 checkpoint; the `MemoCriticAgent` is pure orchestration and adds zero VRAM cost beyond the base model.
| Line item | Spend |
|---|---|
| MI300X SFT (~38 GPU-hours @ $1.99) | ~$75.62 |
| OpenAI teacher bootstrap (Batch API) | ~$4.84 |
| KG-2 judge passes (gpt-5-mini) | ~$2.00 |
| GPU best-of-3 generation run | ~$6.60 |
| Total | ~$89 (inside the $100 AMD Developer Cloud envelope) |
23 days end-to-end, including two 24-hour async batch waits and one 10-12 hour SFT job.
    YuhoLens/
    ├── src/yuholens/
    │   ├── ingestor.py                 # Yuho text → labelled Japanese sections
    │   ├── agents/
    │   │   ├── graph.py                # build_pipeline() + 4-node LangGraph
    │   │   ├── memo_critic.py          # best-of-N composer + judge + heuristic
    │   │   ├── decoder_profiles.py     # the 5-profile catalogue from KG-2 PASS
    │   │   ├── citation_grounder.py
    │   │   └── cli.py                  # python -m yuholens.agents
    │   ├── prompts/                    # pass1, pass2 system + user templates
    │   ├── eval/
    │   │   ├── metrics.py              # citation_rate, section_coverage, judge
    │   │   └── run_kg2.py              # KG-2 evaluation harness
    │   └── training/                   # sft.py, orpo.py, teacher.py (MI300X-only)
    ├── scripts/
    │   ├── run_bestofn_offline.py      # offline heuristic picker (no API)
    │   ├── bestofn_pick.py             # cached-judge picker
    │   ├── bestofn_judge.py            # fresh judge over candidate sets
    │   ├── build_gguf.sh               # llama.cpp Q4/Q5/Q6/Q8 release set
    │   ├── hf_upload.py                # patches generation_config + pushes to Hub
    │   └── check_release_set.py        # pre-release sanity check
    ├── tests/                          # 88 pytest tests, all laptop-runnable
    ├── configs/                        # sft.yaml, orpo.yaml
    └── docs/                           # model-card, blog_post, demo_script, sessions

GitHub renders a "Cite this repository" button from `CITATION.cff`. For BibTeX:

    @misc{dejesus2026yuholens,
      author       = {De Jesus, Javier},
      title        = {YuhoLens-14B: A Japanese-Finance Fine-Tune for
                      Span-Grounded Investor Memo Generation},
      year         = {2026},
      howpublished = {Hugging Face model repository},
      url          = {https://huggingface.co/javierdejesusda/yuholens-14b},
      note         = {AMD Developer Hackathon, lablab.ai, May 2026}
    }

- Bug reports and pull requests welcome; see `CONTRIBUTING.md`.
- Security policy: `SECURITY.md`.
- Common installation and runtime gotchas: `docs/troubleshooting.md`.
- Changelog: `docs/CHANGELOG.md`.

The shipping verdict above (PASS at 3.88) uses gpt-5-mini as the
coherence judge, the same autorater the project was designed against
end-to-end. As a secondary calibration check, we also re-judged the
50-memo bo5 and bo9 picked sets blinded under claude-opus-4-7
(judge engine selectable via --judge-engine anthropic). Opus
applies a stricter rubric — it reserves score 5 for "senior-PM-grade,
unedited" memos and produces no 5s on any LLM-generated text in this
domain — and reports a lower absolute mean. We treat this as rubric
calibration rather than a contradicting verdict; the
gpt-5-mini PASS remains the released configuration.
- AMD Developer Program: MI300X cloud credits that made full-parameter 14B training feasible inside a hackathon budget.
- Preferred Networks: `nekomata-14b-pfn-qfin` continual pre-training on Japanese financial text.
- rinna Co., Ltd.: base `nekomata-14b` Japanese-adapted Qwen1 checkpoint.
- Alibaba Cloud / Qwen: original `Qwen-14B` base weights.
- Sakana AI: `EDINET-Bench` annotated Yuho corpus.
- OpenAI: gpt-5-mini Batch API for the teacher-bootstrap step and the coherence judge.
- lablab.ai: AMD Developer Hackathon platform.

MIT covers the wrapper code (LangGraph pipeline, training scripts, evaluation
harness, prompt modules). Model weights are released under the Tongyi Qianwen
license inherited from Qwen/Qwen-14B via rinna/nekomata-14b and
pfnet/nekomata-14b-pfn-qfin. Downstream users must comply with the Tongyi Qianwen
terms in addition to MIT.
⚠️ Outputs are model-generated text and may contain factual errors. Verify any material claim against the underlying Yuho before relying on it for any decision. The authors disclaim all liability for investment outcomes derived from this model.

