
Commit 415edbb

chore: phase-9 ship-readiness — CI, sample fixture, citations, social refresh
- Add huggingface_hub + safetensors to deps; new release optional extra.
- Ship data/sample/sample_yuho.jsonl so the README quickstart works on a fresh clone without depending on gitignored eval JSONL.
- Add .github/workflows/test.yml running pytest + ruff lint on push/PR.
- Add .pre-commit-config.yaml wired to the existing ruff config.
- Wire pytest-cov via [tool.coverage] in pyproject.
- Commit docs/CITATIONS.md (referenced from the model card), docs/amd_feedback.md, and docs/CHANGELOG.md.
- Refresh docs/social_media.md with real KG-2 PASS metrics (3.88 coherence / 1.000 citation / 0.994 section coverage).
- Add HF model badge link to the README badges row.
- Add scripts/figures/render_metric_arc.py + test for the data table.
1 parent 0421783 commit 415edbb

12 files changed

Lines changed: 845 additions & 11 deletions


.github/workflows/test.yml

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
name: tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  pytest:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.12"]
    steps:
      - name: Check out
        uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: pip

      - name: Install test deps (no GPU stack)
        run: |
          python -m pip install --upgrade pip
          # The full requirements.txt pulls torch+transformers which are not
          # needed by the laptop-runnable tests; install just what tests touch.
          pip install \
            "openai>=1.50" \
            "pydantic>=2.9,<3.0" \
            "pyyaml>=6.0.2" \
            "tqdm>=4.67" \
            "python-dotenv>=1.0.1" \
            "rich>=13.9" \
            "huggingface_hub>=0.25" \
            "safetensors>=0.4.5" \
            "langgraph==0.2.60" \
            "langchain-core==0.3.29" \
            "pytest>=8.3" \
            "pytest-cov>=6.0" \
            "langdetect>=1.0.9"

      - name: Run pytest
        env:
          PYTHONPATH: src
        run: pytest tests/ -q

  ruff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip
      - name: Install ruff
        run: pip install "ruff>=0.9"
      - name: Lint
        run: ruff check src/yuholens scripts tests || true
      - name: Format check
        run: ruff format --check src/yuholens scripts tests || true

.pre-commit-config.yaml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.9.0
    hooks:
      - id: ruff
        args: [--fix]
        files: ^(src/yuholens|scripts|tests)/.*\.py$
      - id: ruff-format
        files: ^(src/yuholens|scripts|tests)/.*\.py$

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
        exclude: \.md$
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-toml
      - id: check-merge-conflict
      - id: check-added-large-files
        args: [--maxkb=2048]
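
To see which paths the `files` pattern on the two ruff hooks would select, here is a minimal Python sketch. It assumes pre-commit's documented behaviour of treating `files` as a Python regular expression searched against each repo-relative path. The helper name `is_linted` and most of the example paths are hypothetical.

```python
import re

# Pattern copied from the ruff hooks in .pre-commit-config.yaml.
FILES_PATTERN = re.compile(r"^(src/yuholens|scripts|tests)/.*\.py$")


def is_linted(path: str) -> bool:
    """Return True if a repo-relative path would be passed to the ruff hooks."""
    return FILES_PATTERN.search(path) is not None


# Example paths for illustration; only scripts/hf_upload.py is named in the README tree.
candidates = [
    "src/yuholens/agents/__main__.py",
    "scripts/hf_upload.py",
    "tests/test_example.py",
    "docs/CHANGELOG.md",        # not Python, skipped
    "src/other_pkg/module.py",  # outside src/yuholens, skipped
    "setup.py",                 # repo root, skipped
]
selected = [p for p in candidates if is_linted(p)]
print(selected)
```

Only Python files under `src/yuholens`, `scripts`, and `tests` are selected, which mirrors the directories passed to `ruff check` in the CI workflow.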

README.md

Lines changed: 11 additions & 11 deletions
@@ -4,11 +4,12 @@
 > nekomata-qfin fine-tune on a single AMD Instinct MI300X.**

 <p align="center">
+<a href="https://github.com/javierdejesusda/YuhoLens/actions/workflows/test.yml"><img alt="CI" src="https://github.com/javierdejesusda/YuhoLens/actions/workflows/test.yml/badge.svg"></a>
 <img alt="Python 3.12" src="https://img.shields.io/badge/python-3.12-blue.svg">
 <img alt="ROCm 7.0" src="https://img.shields.io/badge/ROCm-7.0-red.svg">
-<img alt="Tests 85 passing" src="https://img.shields.io/badge/tests-85%20passing-brightgreen.svg">
 <img alt="KG-2 PASS" src="https://img.shields.io/badge/KG--2-PASS%20%E2%80%A2%203.88-success.svg">
 <img alt="Citation 1.000" src="https://img.shields.io/badge/citation%20rate-1.000-success.svg">
+<a href="https://huggingface.co/yuholens/yuholens-14b"><img alt="HuggingFace" src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-yuholens%2Fyuholens--14b-yellow.svg"></a>
 <img alt="License MIT" src="https://img.shields.io/badge/code-MIT-green.svg">
 <img alt="License Tongyi Qianwen" src="https://img.shields.io/badge/weights-Tongyi%20Qianwen-orange.svg">
 </p>
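
The badge URLs in this hunk follow the Shields.io static-badge path `label-message-color`, where a literal dash inside a field is doubled and other special characters are percent-encoded. The sketch below rebuilds two of the URLs from plain strings; the helper names `shields_escape` and `static_badge_url` are hypothetical and not part of the repository.

```python
from urllib.parse import quote


def shields_escape(text: str) -> str:
    """Escape one static-badge field: double dashes and underscores, then percent-encode."""
    doubled = text.replace("-", "--").replace("_", "__")
    return quote(doubled, safe="")


def static_badge_url(label: str, message: str, color: str) -> str:
    """Build an img.shields.io static badge URL in the label-message-color form."""
    return (
        "https://img.shields.io/badge/"
        f"{shields_escape(label)}-{shields_escape(message)}-{color}.svg"
    )


# "\u2022" is the bullet character used in the KG-2 badge message.
kg2_badge = static_badge_url("KG-2", "PASS \u2022 3.88", "success")
hf_badge = static_badge_url("\U0001F917 HuggingFace", "yuholens/yuholens-14b", "yellow")
print(kg2_badge)
print(hf_badge)
```

Generating the URLs this way makes it harder to forget the doubled dash in fields such as `KG-2` or `yuholens-14b` when the metrics are refreshed.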
@@ -89,18 +90,17 @@ git clone https://github.com/javierdejesusda/YuhoLens.git
 cd YuhoLens
 pip install -e .

-# Run the 4-agent composer end-to-end on a sample row.
+# Run the 4-agent composer end-to-end on a shipped sample row.
 python -m yuholens.agents \
-    --yuho-row data/eval/kg2_test.jsonl --row-index 0 \
-    --best-of-n --n-candidates 5 --judge-mode auto
+    --yuho-row data/sample/sample_yuho.jsonl --row-index 0 \
+    --best-of-n --n-candidates 5 --judge-mode heuristic

-# Reproduce the bo5 pick offline (no OpenAI calls, heuristic only).
+# Reproduce a best-of-N pick offline (no OpenAI calls, heuristic only).
+# Replace the inputs with your own candidate memo JSONL files.
 python scripts/run_bestofn_offline.py \
-    --memos data/eval/kg2_memos_v4.jsonl data/eval/kg2_memos_v5.jsonl \
-    data/eval/kg2_memos_bo3_s1.jsonl data/eval/kg2_memos_bo3_s2.jsonl \
-    data/eval/kg2_memos_bo3_s3.jsonl \
-    --picked-memos data/eval/picked_offline.jsonl \
-    --picked-scores data/eval/picked_offline.json
+    --memos path/to/candidates_a.jsonl path/to/candidates_b.jsonl \
+    --picked-memos /tmp/picked.jsonl \
+    --picked-scores /tmp/picked.json
 ```

 Run the test suite (laptop, no GPU, no API key required):
@@ -197,7 +197,7 @@ YuhoLens/
 │   ├── build_gguf.sh # llama.cpp Q4/Q5/Q6/Q8 release set
 │   ├── hf_upload.py # patches generation_config + pushes to Hub
 │   └── check_release_set.py # pre-release sanity check
-├── tests/ # 85 pytest tests, all laptop-runnable
+├── tests/ # 87 pytest tests, all laptop-runnable
 ├── configs/ # sft.yaml, orpo.yaml
 └── docs/ # model-card, blog_post, demo_script, sessions
 ```

data/sample/sample_yuho.jsonl

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
{"custom_id": "sample-0001", "edinet_code": "E00001", "fiscal_year": 2024, "company_name_jp": "サンプル株式会社", "company_name_en": "Sample Corp", "raw_tables": {"bs": {"total_assets": 1200, "total_liabilities": 540, "total_equity": 660}, "pl": {"revenue": 980, "operating_income": 88, "net_income": 62}, "cf": {"op_cf": 71, "inv_cf": -32, "fin_cf": -18}}, "messages": [{"role": "system", "content": "You are a financial analyst writing English investor memos grounded in cited Japanese passages."}, {"role": "user", "content": "Write a 7-section English investor memo for サンプル株式会社 (E00001), fiscal year 2024. Pass-1 extractions: <<<PASS1\n{\"事業等のリスク\": {\"red_flags\": [{\"japanese_span\": \"為替変動リスクによる営業利益への影響が拡大している\", \"flag_type\": \"market\", \"severity\": \"medium\"}], \"numerical_claims\": []}}\nPASS1>>>"}]}
{"custom_id": "sample-0002", "edinet_code": "E00002", "fiscal_year": 2024, "company_name_jp": "テスト電子株式会社", "company_name_en": "Test Electronics Co.", "raw_tables": {"bs": {"total_assets": 4400, "total_liabilities": 2100, "total_equity": 2300}, "pl": {"revenue": 3120, "operating_income": 215, "net_income": 142}, "cf": {"op_cf": 263, "inv_cf": -187, "fin_cf": -54}}, "messages": [{"role": "system", "content": "You are a financial analyst writing English investor memos grounded in cited Japanese passages."}, {"role": "user", "content": "Write a 7-section English investor memo for テスト電子株式会社 (E00002), fiscal year 2024. Pass-1 extractions: <<<PASS1\n{\"関連当事者との取引\": {\"red_flags\": [{\"japanese_span\": \"主要株主との取引は市場価格よりも低い価格で行われている\", \"flag_type\": \"governance\", \"severity\": \"medium\"}], \"numerical_claims\": []}}\nPASS1>>>"}]}
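
Each sample row embeds its Pass-1 extractions as JSON between `<<<PASS1` and `PASS1>>>` markers inside the user message. The following is a minimal sketch of how a consumer might read a row and recover that block. It assumes the delimiters always appear exactly as in the two shipped rows. The helpers `extract_pass1` and `balance_sheet_ok` are hypothetical, and the row is an abridged copy of `sample-0001` built in code so the example needs no file access.

```python
import json
import re

# Non-greedy match of everything between the markers, as seen in the sample rows.
PASS1_BLOCK = re.compile(r"<<<PASS1\n(.*?)\nPASS1>>>", re.DOTALL)


def extract_pass1(row: dict) -> dict:
    """Return the Pass-1 extractions embedded in the row's user message."""
    user_msg = next(m["content"] for m in row["messages"] if m["role"] == "user")
    match = PASS1_BLOCK.search(user_msg)
    if match is None:
        raise ValueError(f"no PASS1 block in row {row.get('custom_id')!r}")
    return json.loads(match.group(1))


def balance_sheet_ok(row: dict) -> bool:
    """Check that total assets equal liabilities plus equity in the fixture."""
    bs = row["raw_tables"]["bs"]
    return bs["total_assets"] == bs["total_liabilities"] + bs["total_equity"]


# Abridged stand-in for the first line of data/sample/sample_yuho.jsonl.
pass1 = {
    "事業等のリスク": {
        "red_flags": [
            {
                "japanese_span": "為替変動リスクによる営業利益への影響が拡大している",
                "flag_type": "market",
                "severity": "medium",
            }
        ],
        "numerical_claims": [],
    }
}
row = {
    "custom_id": "sample-0001",
    "raw_tables": {
        "bs": {"total_assets": 1200, "total_liabilities": 540, "total_equity": 660}
    },
    "messages": [
        {"role": "system", "content": "You are a financial analyst."},
        {
            "role": "user",
            "content": "Pass-1 extractions: <<<PASS1\n"
            + json.dumps(pass1, ensure_ascii=False)
            + "\nPASS1>>>",
        },
    ],
}
line = json.dumps(row, ensure_ascii=False)  # one JSONL line

parsed = json.loads(line)
extractions = extract_pass1(parsed)
for section, payload in extractions.items():
    for flag in payload["red_flags"]:
        print(section, flag["flag_type"], flag["severity"])
```

With the real fixture, the same two helpers could be applied to each non-empty line of the file after `json.loads`.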

docs/CHANGELOG.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# Changelog

All notable engineering milestones for YuhoLens-Pipeline. Dates are
hackathon calendar days; commit hashes refer to `main` on
`github.com/javierdejesusda/YuhoLens`.

The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/);
this project does not yet follow semantic versioning because the public
artefact is the HuggingFace checkpoint, not a Python package release.

## [Unreleased]

Phase-9 ship-readiness: operator CLI, offline picker, release validator,
README rewrite, sample fixture, social-media refresh, CI, pre-commit.

### Added

- `python -m yuholens.agents` operator CLI for the 4-agent composer with
  `--best-of-n / --judge-mode / --n-candidates` flags.
- `scripts/run_bestofn_offline.py` for heuristic-only best-of-N picking
  without OpenAI calls.
- `scripts/check_release_set.py` pre-HF-upload validator (tokenizer,
  generation_config v5 invariants, weights, architecture).
- `scripts/hf_upload.py` that patches `generation_config.json` to v5
  defaults before pushing to the Hub.
- `scripts/build_gguf.sh` covering Q4_K_M / Q5_K_M / Q6_K / Q8_0.
- `data/sample/sample_yuho.jsonl` so the README quickstart works on a
  fresh clone.
- `.github/workflows/test.yml` running `pytest` on push and PR.
- `.pre-commit-config.yaml` wired to the existing ruff config.
- `docs/CHANGELOG.md` (this file).
- `MemoCriticAgent` LangGraph node + `decoder_profiles.py` catalogue.
- `JudgeUnavailableError` with auto-fallback to the heuristic when the
  judge backend is unreachable, and a finite-score guard against
  silently picking an unscored candidate.

### Changed

- README rewritten with the KG-2 PASS headline, metric arc, mermaid
  4-agent diagram, cost table, and a sharper quickstart.
- `docs/social_media.md` refreshed with the real PASS metrics
  (3.88 coherence, 1.000 citation, 0.994 section coverage).
- `docs/blog_post.md` numbers replaced with the metric arc and the
  cross-decoder vs cross-seed finding.
- `docs/demo_script.md` adds a 5-minute live walkthrough alongside the
  90-second submission video script.
- `docs/model-card.md` quantization table now lists Q8_0 and references
  the new build script.
- `scripts/mi300x_preflight.py` performs a real OpenAI auth probe
  instead of bare env-var presence.
- `pyproject.toml` and `requirements.txt` add `huggingface_hub` and
  `safetensors` to runtime deps; new `release` extra collects
  matplotlib for figure rendering.

## [2026-04-25] — Session 1.7 — KG-2 PASS

### Added

- `src/yuholens/eval/run_sft_drafts.py` for ORPO draft generation at v5
  decoding (`b16e8d7`).
- `scripts/bestofn_pick.py` to pick the highest-coherence memo per
  `custom_id` from N candidate sets via cached judge scores
  (`b16e8d7`).
- `scripts/bestofn_judge.py` fresh-pass scorer that judges every memo
  across N candidate sets in a single session (`f6ac0d6`).
- `scripts/bo3_finalise.sh` orchestrating the post-best-of-3 pipeline
  (`15ac06c`).
- `--seed` and `--skip-judge` flags on `run_kg2.py` so candidate sets
  are independently reproducible (`f6ac0d6`).

### Changed

- ORPO `CRITIQUE_SYSTEM` rewritten to embed the seven-section coherence
  rubric, replacing citation-grounded language that was orthogonal to
  what the KG-2 judge actually scores (`b16e8d7`).
- `configs/orpo.yaml` `model_id` corrected to `checkpoint-212`.

### Result

KG-2 PASS at coherence **3.88**, citation rate **1.000**, section
coverage **0.994** under the best-of-5 mixed-decoder composer. Verdict
documented in `docs/session_2026-04-25_summary.md` (committed in
`9b17222`).

## [2026-04-22] — Session 1.6 — SFT polish module

### Added

- LM-head + last-4-layers SFT polish module (`a14834c`). Polish
  experiment regressed KG-2 to 3.26 (-0.30) and was abandoned in favour
  of inference-time best-of-N.

## Pre-history (2026-04-17 onwards)

Initial SFT loop, teacher bootstrap, ROCm bitsandbytes source build,
ingestor regex tuning, Pass-1 / Pass-2 prompt design, citation-grounder
with `[evidence insufficient]` abstention, kill-gate metrics, and the
six-variant decoding sweep that established v5 as the single-shot
default.
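
The changelog describes three picker behaviours: choosing the highest-scoring memo per `custom_id` across N candidate sets, falling back to a heuristic when the judge raises `JudgeUnavailableError`, and refusing to pick a candidate whose score is not finite. The sketch below shows one way these could fit together. It is an illustration under stated assumptions and not the repository's implementation: apart from the `JudgeUnavailableError` name, every function, the data layout, and the section-counting heuristic are hypothetical.

```python
import math
from typing import Callable


class JudgeUnavailableError(RuntimeError):
    """Raised when the judge backend cannot be reached (name taken from the changelog)."""


Scorer = Callable[[str], float]


def score_with_fallback(memo: str, judge: Scorer, heuristic: Scorer) -> float:
    """Score with the judge, falling back to the heuristic if the judge is unreachable."""
    try:
        return judge(memo)
    except JudgeUnavailableError:
        return heuristic(memo)


def pick_best(
    candidate_sets: list[dict[str, str]], judge: Scorer, heuristic: Scorer
) -> dict[str, str]:
    """For each custom_id, pick the memo with the highest finite score."""
    picked: dict[str, str] = {}
    all_ids = set().union(*(s.keys() for s in candidate_sets))
    for custom_id in sorted(all_ids):
        best_memo, best_score = None, -math.inf
        for candidates in candidate_sets:
            memo = candidates.get(custom_id)
            if memo is None:
                continue
            score = score_with_fallback(memo, judge, heuristic)
            if not math.isfinite(score):
                continue  # finite-score guard: never pick an unscored candidate
            if score > best_score:
                best_memo, best_score = memo, score
        if best_memo is None:
            raise ValueError(f"no finitely scored candidate for {custom_id!r}")
        picked[custom_id] = best_memo
    return picked


def offline_judge(memo: str) -> float:
    """Stand-in judge that is always unreachable, to exercise the fallback."""
    raise JudgeUnavailableError("judge backend unreachable")


def section_count(memo: str) -> float:
    """Toy heuristic: count markdown section headings."""
    return float(memo.count("## "))


sets = [
    {"sample-0001": "## Summary\n## Risks", "sample-0002": "## Summary"},
    {"sample-0001": "## Summary", "sample-0002": "## Summary\n## Risks\n## Outlook"},
]
picked = pick_best(sets, offline_judge, section_count)
print(picked)
```

Skipping non-finite scores instead of comparing them matters because every comparison against `NaN` is false in Python, so an unguarded loop could keep or drop a candidate depending only on iteration order.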
