Commit 591e083

chore(release): prep HuggingFace upload for BF16 + GGUF
- scripts/hf_upload.py: exclude training-state files (optimizer.pt, scheduler.pt, rng_state.pth, training_args.bin, trainer_state.json, global_step*/, checkpoint-*/) from the upload payload so the 33 GB optimizer state stays local.
- scripts/hf_upload_gguf.py: new helper that discovers the release set in data/eval/gguf, validates the five expected quants, copies them with the GGUF README into a temp staging dir, and pushes to yuholens/yuholens-14b-GGUF. The f16 build intermediate is filtered.
- docs/gguf_readme.md: the README that ships with the GGUF repo: files table with verified sizes, RTX 4070 Q3_K_M smoke result (12.2 gen tok/s, 65.5 prompt tok/s), Qwen1 ChatML prompt format, build provenance with the layer_norm_rms_epsilon override note.
- docs/model-card.md: replace TBD GGUF table with verified on-disk sizes (Q3_K_M 7.18 GB ... Q8_0 14.03 GB) and the Q3_K_M smoke numbers.
- docs/hf_upload_runbook.md: end-to-end procedure including the generation_config patch step, README staging, and post-push verification.
1 parent f903174 commit 591e083

5 files changed

Lines changed: 492 additions & 7 deletions

docs/gguf_readme.md

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
1+
---
2+
license: other
3+
license_name: tongyi-qianwen
4+
license_link: https://huggingface.co/Qwen/Qwen-14B/blob/main/LICENSE
5+
language:
6+
- ja
7+
- en
8+
base_model: yuholens/yuholens-14b
9+
tags:
10+
- gguf
11+
- llama-cpp
12+
- quantized
13+
- japanese-finance
14+
- yuho
15+
- edinet
16+
- qwen
17+
pipeline_tag: text-generation
18+
---
19+
20+
# YuhoLens-14B GGUF
21+
22+
GGUF-quantized release of [`yuholens/yuholens-14b`](https://huggingface.co/yuholens/yuholens-14b) for offline inference via [`llama.cpp`](https://github.com/ggerganov/llama.cpp). The BF16 source is a full-parameter SFT of [`pfnet/nekomata-14b-pfn-qfin`](https://huggingface.co/pfnet/nekomata-14b-pfn-qfin) (Qwen1, 14B, Japanese-finance CPT) on teacher-bootstrapped English investor memos derived from [`SakanaAI/EDINET-Bench`](https://huggingface.co/datasets/SakanaAI/EDINET-Bench).
23+
24+
## Files
25+
26+
| File | Quant | Size | Bits/weight | Recommended hardware |
27+
|------|-------|-----:|:-----------:|----------------------|
28+
| `yuholens-14b-Q3_K_M.gguf` | Q3_K_M | 7.18 GB | 4.35 | 8 GB GPU (RTX 4070 Laptop, 3060 Ti) — fits with `--ctx-size 2048` |
29+
| `yuholens-14b-Q4_K_M.gguf` | Q4_K_M | 8.81 GB | ~5.0 | 12-16 GB GPU (RTX 4060 Ti 16 GB, RTX 3080) |
30+
| `yuholens-14b-Q5_K_M.gguf` | Q5_K_M | 9.94 GB | ~5.7 | 16 GB GPU |
31+
| `yuholens-14b-Q6_K.gguf` | Q6_K | 11.46 GB | ~6.6 | 16-24 GB GPU |
32+
| `yuholens-14b-Q8_0.gguf` | Q8_0 | 14.03 GB | 8.5 | 24 GB GPU or CPU offload |
33+
34+
Sizes are the actual on-disk byte counts, reported in binary units (1 GB = 1024³ bytes).
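As a rough cross-check of the *Bits/weight* column, the figure can be derived from the file size and the parameter count. The sketch below assumes about 14.17 billion parameters for the 14B architecture (an assumption, not a number stated in this README) and the 1024³ convention above. `bits_per_weight` is an illustrative helper, not part of the repo.

```python
GIB = 1024 ** 3  # the table's "GB" means 1024^3 bytes

# Assumed parameter count for the 14B architecture; not taken from this README.
ASSUMED_PARAMS = 14.17e9


def bits_per_weight(size_gib: float, n_params: float = ASSUMED_PARAMS) -> float:
    """Average bits stored per weight for a file of `size_gib` (1024^3-byte units)."""
    return size_gib * GIB * 8 / n_params


if __name__ == "__main__":
    # Sizes copied from the Files table.
    for quant, size in [("Q3_K_M", 7.18), ("Q8_0", 14.03)]:
        print(f"{quant}: {bits_per_weight(size):.2f} bits/weight")
```

Because the result is file size divided by weight count, it includes metadata and any tensors kept at higher precision, so it can sit above the nominal bit width of the quant type.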
35+
36+
## Smoke test
37+
38+
The Q3_K_M quant has been smoke-tested on an NVIDIA RTX 4070 Laptop (8 GB VRAM, compute 8.9) using `llama-completion` from llama.cpp build `b8966`.
39+
40+
- Prompt: ChatML-wrapped Yuho fixture with the YuhoLens system prompt (see *Prompt format* below).
41+
- Settings: `--n-gpu-layers 99 -c 2048 --temp 0.1 --top-p 0.9 --repeat-penalty 1.15`.
42+
- Throughput: prompt eval **65.5 tok/s**, generation **12.2 tok/s**.
43+
- VRAM occupancy: ~7.0 GB model + ~1.6 GB context + ~0.3 GB compute, fits within the 8.0 GB card.
44+
- Output: coherent English investor memo with section headings (*Executive summary*, *Going-concern assessment*, *Accrual quality*, ...) following the SFT teacher template.
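The context term in the VRAM breakdown can be approximated from the KV-cache formula. This is a minimal sketch, assuming 40 transformer layers, a hidden size of 5120, full multi-head attention (no grouped-query sharing), and a 16-bit cache. These values are assumptions about the Qwen1 14B configuration and are not stated in this README.

```python
def kv_cache_bytes(n_layers: int, hidden_size: int, n_ctx: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for every layer and position.

    The leading factor of 2 covers the separate key and value tensors.
    Assumes full multi-head attention, so each position stores `hidden_size`
    elements per tensor per layer.
    """
    return 2 * n_layers * n_ctx * hidden_size * bytes_per_elem


# Assumed architecture values (not taken from this README); -c 2048 as in the smoke test.
size = kv_cache_bytes(n_layers=40, hidden_size=5120, n_ctx=2048)
print(f"KV cache at 2048 tokens: {size / 1024**3:.2f} GiB")
```

Under these assumptions the cache grows linearly with `-c`, which is why the context size is the first knob to lower on an 8 GB card.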
45+
46+
Q4_K_M and larger quants exceed 8 GB VRAM and require either a 12 GB+ GPU or partial CPU offload (`--n-gpu-layers` < 99).
47+
48+
## Quickstart
49+
50+
```bash
51+
# Q3_K_M, fits 8 GB GPU
52+
llama-completion \
53+
-m yuholens-14b-Q3_K_M.gguf \
54+
-f your_prompt.txt \
55+
--n-gpu-layers 99 \
56+
-c 2048 \
57+
--temp 0.1 --top-p 0.9 --repeat-penalty 1.15 \
58+
-n 512 \
59+
--no-display-prompt
60+
```
61+
62+
For larger quants on a smaller GPU, drop `--n-gpu-layers` to a value that fits VRAM (e.g. `--n-gpu-layers 30` for Q4_K_M on 8 GB).
63+
64+
## Prompt format
65+
66+
YuhoLens-14B was fine-tuned on Qwen1 ChatML with a fixed system prompt. **Raw Japanese Yuho text without the wrapper produces Japanese continuations** instead of English memos. Wrap inputs as:
67+
68+
```
69+
<|im_start|>system
70+
{SYSTEM_PROMPT}<|im_end|>
71+
<|im_start|>user
72+
Company metadata (JSON):
73+
{...}
74+
75+
Balance sheet (JSON):
76+
{...}
77+
78+
P&L (JSON):
79+
{...}
80+
81+
Cash flow (JSON):
82+
{...}
83+
84+
Japanese annual-report text (truncated at ~20K chars):
85+
<<<
86+
{Japanese Yuho text}
87+
>>>
88+
89+
Produce the two-page English investor memo now.<|im_end|>
90+
<|im_start|>assistant
91+
```
92+
93+
The full system prompt is published in the BF16 model card and in `src/yuholens/training/teacher.py` of the GitHub repo. A pre-formatted smoke fixture is included in the GitHub repo at `data/sample/smoke_prompt_chatml.txt`.
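The wrapper above can be assembled programmatically. The following is a minimal sketch, not the repo's own prompt builder: `build_prompt`, its parameter names, the compact JSON serialization, and the hard 20,000-character cut are all assumptions made for illustration.

```python
import json


def build_prompt(system_prompt, metadata, balance_sheet, pnl, cash_flow,
                 yuho_text, max_chars=20_000):
    """Assemble a ChatML prompt in the layout shown above.

    `max_chars` is an assumed hard cut standing in for the "~20K chars" truncation.
    """
    def dump(obj):
        # ensure_ascii=False keeps Japanese text readable instead of \u escapes.
        return json.dumps(obj, ensure_ascii=False)

    user = (
        f"Company metadata (JSON):\n{dump(metadata)}\n\n"
        f"Balance sheet (JSON):\n{dump(balance_sheet)}\n\n"
        f"P&L (JSON):\n{dump(pnl)}\n\n"
        f"Cash flow (JSON):\n{dump(cash_flow)}\n\n"
        "Japanese annual-report text (truncated at ~20K chars):\n"
        f"<<<\n{yuho_text[:max_chars]}\n>>>\n\n"
        "Produce the two-page English investor memo now."
    )
    # The prompt ends with an open assistant turn so generation starts there.
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```

Writing the returned string to a file gives something that can be passed to `llama-completion` with `-f`, as in the Quickstart.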
94+
95+
## Build provenance
96+
97+
Source checkpoint: `output/yuholens-14b-sft/checkpoint-212` (28.3 GB BF16 safetensors), tagged at git commit `f903174`.
98+
99+
Built with:
100+
101+
- `llama.cpp` build `b8966` (commit `7b8443ac7`), Windows CUDA 12.4 prebuilt binaries.
102+
- Convert script `convert_hf_to_gguf.py` patched locally to fall back to an inline GPT-2 byte-to-unicode mapping when transformers ≥ 5.x removes `bytes_to_unicode`. See the GitHub repo for the patch.
103+
- `llama-quantize` invoked with `--override-kv qwen.attention.layer_norm_rms_epsilon=float:0.000001` because Qwen1 uses RMSNorm internally but its HF config exposes the field as `layer_norm_epsilon`; this override is now baked into `scripts/build_gguf.sh`.
104+
105+
Reproduce locally:
106+
107+
```bash
108+
LLAMACPP_REPO=/path/to/llama.cpp \
109+
LLAMACPP_BIN=/path/to/llama.cpp/build/bin \
110+
scripts/build_gguf.sh output/yuholens-14b-sft/checkpoint-212 data/eval/gguf
111+
```
112+
113+
## Limitations
114+
115+
- **Output language asymmetry.** The model expects Japanese Yuho input and emits English memos. Japanese output is outside the training distribution and unsupported.
116+
- **Citation accuracy unaudited.** The model produces inline `(ref: '<span>' p.N)` citations, but verbatim-correctness against the underlying Yuho text has not been audited.
117+
- **Quantization quality bands.** Q3_K_M sacrifices some fidelity for the 8 GB VRAM target. Prefer Q4_K_M or higher when memory permits.
118+
- **Domain.** Trained on Yuho 有価証券報告書 only. Earnings transcripts, 決算短信, and non-Japanese filings are out-of-scope.
119+
- **Not financial advice.** Outputs are model-generated and may contain factual errors. Verify every material claim against the source.
120+
121+
## License
122+
123+
Released under the Tongyi Qianwen License inherited from the Qwen1 base via `pfnet/nekomata-14b-pfn-qfin`. See the linked license for terms. Wrapper code in the GitHub repo (LangGraph pipeline, training and evaluation scripts, prompt modules) is MIT.
124+
125+
## Citation
126+
127+
```bibtex
128+
@misc{dejesus2026yuholens,
129+
author = {De Jesus, Javier},
130+
title = {YuhoLens-14B: A Japanese-Finance Fine-Tune for
131+
Span-Grounded Investor Memo Generation},
132+
year = {2026},
133+
howpublished = {Hugging Face model repository},
134+
url = {https://huggingface.co/yuholens/yuholens-14b},
135+
note = {DOI: TBD}
136+
}
137+
```
138+
139+
## Links
140+
141+
- BF16 reference: https://huggingface.co/yuholens/yuholens-14b
142+
- GitHub: https://github.com/javierdejesusda/YuhoLens
143+
- Base model: https://huggingface.co/pfnet/nekomata-14b-pfn-qfin
144+
- Training data: https://huggingface.co/datasets/SakanaAI/EDINET-Bench

docs/hf_upload_runbook.md

Lines changed: 139 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,139 @@
1+
# HuggingFace upload runbook
2+
3+
End-to-end procedure for publishing the YuhoLens-14B release. Two
4+
artefacts: the BF16 reference checkpoint (`yuholens/yuholens-14b`) and
5+
the GGUF release set (`yuholens/yuholens-14b-GGUF`).
6+
7+
## 0. Prerequisites
8+
9+
1. HuggingFace account with write access to the `yuholens` org. If the
10+
org does not yet exist, create it at <https://huggingface.co/organizations/new>
11+
before step 2.
12+
2. `huggingface_hub` Python package in the active environment:
13+
```bash
14+
pip install --upgrade huggingface_hub
15+
```
16+
3. Authenticate. Pick one:
17+
```bash
18+
huggingface-cli login # interactive, persists to ~/.cache/huggingface/token
19+
# or
20+
export HF_TOKEN=hf_xxx_yyy_zzz
21+
```
22+
4. Confirm the token works and that your account belongs to the `yuholens` org (the
23+
repos themselves are created on first push if they don't exist):
24+
```bash
25+
huggingface-cli whoami
26+
```
27+
28+
## 1. BF16 checkpoint -> yuholens/yuholens-14b
29+
30+
### 1a. Pre-flight verify
31+
32+
```bash
33+
python scripts/check_release_set.py \
34+
--model-path output/yuholens-14b-sft/checkpoint-212
35+
```
36+
37+
Expected: `RESULT: PASS` with the four checks (tokenizer, weights,
38+
generation_config, architecture). If `generation_config.json` is not
39+
yet patched, run the patcher (does not push):
40+
41+
```bash
42+
python scripts/hf_upload.py \
43+
--model-path output/yuholens-14b-sft/checkpoint-212 \
44+
--hf-repo placeholder \
45+
--skip-upload
46+
```
47+
48+
### 1b. Stage the model card
49+
50+
The model card lives at `docs/model-card.md`. Copy it into the
51+
checkpoint folder as `README.md` so `upload_folder` picks it up:
52+
53+
```bash
54+
cp docs/model-card.md output/yuholens-14b-sft/checkpoint-212/README.md
55+
```
56+
57+
### 1c. Push
58+
59+
`scripts/hf_upload.py` excludes training-state artefacts
60+
(`optimizer.pt`, `scheduler.pt`, `rng_state.pth`, `training_args.bin`,
61+
`trainer_state.json`, plus any `global_step*/` and `checkpoint-*/`
62+
subdirs) so the 33 GB optimizer state stays local.
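To see which files the exclusion list would drop before pushing, the patterns can be tried against relative paths locally. This sketch copies the pattern tuple from `scripts/hf_upload.py` and assumes the Hub client matches `ignore_patterns` fnmatch-style against paths relative to the uploaded folder, which is worth confirming against the installed `huggingface_hub` version. `is_ignored` is a hypothetical helper.

```python
from fnmatch import fnmatch

# Copied from TRAINING_STATE_IGNORE_PATTERNS in scripts/hf_upload.py.
TRAINING_STATE_IGNORE_PATTERNS = (
    "optimizer.pt",
    "scheduler.pt",
    "rng_state.pth",
    "training_args.bin",
    "trainer_state.json",
    "global_step*/**",
    "checkpoint-*/**",
)


def is_ignored(rel_path: str) -> bool:
    """True if `rel_path` (POSIX-style, relative to the checkpoint folder) matches a pattern."""
    return any(fnmatch(rel_path, pattern) for pattern in TRAINING_STATE_IGNORE_PATTERNS)


if __name__ == "__main__":
    for path in ("optimizer.pt", "global_step212/state.pt",
                 "model-00001-of-00006.safetensors", "config.json"):
        print(f"{path}: {'skip' if is_ignored(path) else 'upload'}")
```

Note that in `fnmatch` a `*` also matches `/`, so `global_step*/**` covers nested files at any depth.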
63+
64+
Public release:
65+
66+
```bash
67+
python scripts/hf_upload.py \
68+
--model-path output/yuholens-14b-sft/checkpoint-212 \
69+
--hf-repo yuholens/yuholens-14b
70+
```
71+
72+
To sanity-check before the public release, push to a private staging repo first:
73+
74+
```bash
75+
python scripts/hf_upload.py \
76+
--model-path output/yuholens-14b-sft/checkpoint-212 \
77+
--hf-repo yuholens/yuholens-14b-staging \
78+
--private
79+
```
80+
81+
The shipped upload payload is approximately 28 GB across 6 safetensors
82+
shards plus tokenizer / config / model-card files. Expect 30-90 minutes
83+
on residential broadband.
84+
85+
## 2. GGUF release set -> yuholens/yuholens-14b-GGUF
86+
87+
### 2a. Verify the set
88+
89+
```bash
90+
python scripts/hf_upload_gguf.py \
91+
--gguf-dir data/eval/gguf \
92+
--readme docs/gguf_readme.md \
93+
--hf-repo yuholens/yuholens-14b-GGUF \
94+
--dry-run
95+
```
96+
97+
Expected: `release set OK` with all five quants listed
98+
(Q3_K_M / Q4_K_M / Q5_K_M / Q6_K / Q8_0). The script refuses to push if
99+
any expected quant is missing or under 1 GB.
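The completeness rule described above (all five quants present, none under 1 GB) can be sketched as follows. This is an illustration of the rule, not the code in `scripts/hf_upload_gguf.py`: `check_release_set` and its return shape are hypothetical, and the file-name pattern is taken from the GGUF README's files table.

```python
from pathlib import Path

EXPECTED_QUANTS = ("Q3_K_M", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0")


def check_release_set(gguf_dir: Path, min_bytes: int = 1024 ** 3) -> list[str]:
    """Return a list of problems; an empty list means the set is complete."""
    problems = []
    for quant in EXPECTED_QUANTS:
        path = gguf_dir / f"yuholens-14b-{quant}.gguf"
        if not path.is_file():
            problems.append(f"missing: {path.name}")
        elif path.stat().st_size < min_bytes:
            # A tiny file usually indicates an interrupted quantize run.
            problems.append(f"too small: {path.name}")
    return problems


if __name__ == "__main__":
    issues = check_release_set(Path("data/eval/gguf"))
    print("release set OK" if not issues else "\n".join(issues))
```

Returning every problem at once, rather than stopping at the first, makes a failed dry run easier to act on.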
100+
101+
### 2b. Push
102+
103+
```bash
104+
python scripts/hf_upload_gguf.py \
105+
--gguf-dir data/eval/gguf \
106+
--readme docs/gguf_readme.md \
107+
--hf-repo yuholens/yuholens-14b-GGUF
108+
```
109+
110+
Total payload is ~52 GB across 5 GGUF files plus the README. The
111+
script copies into a temp staging dir before uploading so the f16
112+
intermediate (if it still exists locally) is not pushed.
113+
114+
## 3. Post-push checklist
115+
116+
- [ ] Browse <https://huggingface.co/yuholens/yuholens-14b> and confirm
117+
the model card renders, no `optimizer.pt` is visible, and the
118+
*Files and versions* tab lists all 6 safetensors shards.
119+
- [ ] Browse <https://huggingface.co/yuholens/yuholens-14b-GGUF> and
120+
confirm all five quants appear with the sizes listed in the
121+
README.
122+
- [ ] Update the GitHub README with the live HuggingFace URLs and
123+
the verified GGUF table (sizes + smoke result already in
124+
`docs/model-card.md`).
125+
- [ ] Tag both repos with the corresponding GitHub commit hash via
126+
the *Settings -> Tags* tab on the Hub for traceability
127+
(currently `f903174` on `main`).
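If a scripted alternative to the UI is preferred for the tagging step, `huggingface_hub` exposes `HfApi.create_tag`. The sketch below is one possible wrapper: `tag_release` and the `src-<hash>` tag naming are hypothetical choices, not something the repo defines.

```python
def tag_release(api, repo_ids, source_commit):
    """Create the same git tag on each Hub repo, naming the GitHub commit it was built from.

    `api` is expected to be a `huggingface_hub.HfApi` instance (or any object
    with a compatible `create_tag` method).
    """
    tag = f"src-{source_commit}"  # hypothetical naming scheme
    for repo_id in repo_ids:
        api.create_tag(
            repo_id,
            tag=tag,
            tag_message=f"Built from GitHub commit {source_commit}",
        )
    return tag


# Example call (needs network access and a write token):
#   from huggingface_hub import HfApi
#   tag_release(HfApi(), ["yuholens/yuholens-14b", "yuholens/yuholens-14b-GGUF"], "f903174")
```

Passing the client in as an argument keeps the function testable without network access.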
128+
129+
## 4. Rollback
130+
131+
To unpublish (e.g., if a wrong file slipped in), use the Hub UI's
132+
*Settings -> Delete repository*. Do not git-revert; the Hub commit
133+
history is independent of GitHub.
134+
135+
## 5. Bandwidth cost
136+
137+
- BF16 push: 28 GB once.
138+
- GGUF push: 52 GB once.
139+
- Total: ~80 GB upstream. Plan for ~2-4 hours on a 100 Mbps connection.
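The time estimate can be reproduced with a line-rate calculation. This sketch assumes decimal gigabytes and introduces an `efficiency` factor (an assumption, standing in for protocol overhead and uplinks that run below their rated speed). `upload_hours` is an illustrative helper.

```python
def upload_hours(size_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Hours to push `size_gb` (decimal GB) over a `link_mbps` uplink.

    `efficiency` is the assumed fraction of the rated speed actually achieved.
    """
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    for eff in (1.0, 0.9, 0.45):
        print(f"efficiency {eff:.2f}: {upload_hours(80, 100, eff):.2f} h")
```

At the full rated speed the 80 GB total takes a little under two hours, so the 2-4 hour plan corresponds to roughly 45 to 90 percent effective utilisation of a 100 Mbps uplink.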

docs/model-card.md

Lines changed: 15 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -206,13 +206,21 @@ checkpoint by `scripts/build_gguf.sh`, which calls llama.cpp's
206206
target quant. See the script's prereq header for the required
207207
llama.cpp checkout and disk-budget notes.
208208

209-
| Quant | Approx. size | Intended hardware | Target throughput (tok/s) |
210-
|-----------|--------------|---------------------------------------------------------|---------------------------|
211-
| Q3_K_M | ~6.5 GB | 8 GB consumer GPU (RTX 4070 Laptop, RTX 3060 Ti) | TBD |
212-
| Q4_K_M | ~9.45 GB | 12-16 GB consumer GPU (RTX 4060 Ti 16 GB, RTX 3080) | TBD |
213-
| Q5_K_M | ~10.5 GB | 16-24 GB consumer GPU | TBD |
214-
| Q6_K | ~12.1 GB | 24 GB+ consumer or prosumer | TBD |
215-
| Q8_0 | ~15.7 GB | 24 GB+ prosumer / dual-GPU CPU offload | TBD |
209+
| Quant | Verified size | Intended hardware | Throughput (tok/s) |
210+
|-----------|---------------|---------------------------------------------------------|---------------------------|
211+
| Q3_K_M | 7.18 GB | 8 GB consumer GPU (RTX 4070 Laptop, RTX 3060 Ti) | 12.2 gen / 65.5 prompt on RTX 4070 Laptop, `-c 2048` |
212+
| Q4_K_M | 8.81 GB | 12-16 GB consumer GPU (RTX 4060 Ti 16 GB, RTX 3080) | TBD |
213+
| Q5_K_M | 9.94 GB | 16-24 GB consumer GPU | TBD |
214+
| Q6_K | 11.46 GB | 24 GB+ consumer or prosumer | TBD |
215+
| Q8_0 | 14.03 GB | 24 GB+ prosumer / dual-GPU CPU offload | TBD |
216+
217+
Sizes above are the actual on-disk byte counts of the released GGUFs
218+
(1024³ GB). Q3_K_M was smoke-tested end-to-end on an RTX 4070 Laptop
219+
(8 GB) at `--ctx-size 2048` with `--n-gpu-layers 99`; the model and
220+
context fit fully in VRAM and produce a coherent English investor memo
221+
from the ChatML-wrapped fixture in `data/sample/smoke_prompt_chatml.txt`.
222+
Q4_K_M and larger quants exceed 8 GB VRAM and require a 12 GB+ GPU or
223+
partial CPU offload (`--n-gpu-layers` < 99).
216224

217225
Pass-1 per-section context of 4-6K tokens is the supported consumer
218226
operating point; longer contexts require the BF16 checkpoint served via

scripts/hf_upload.py

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,16 @@
4343
"tokenization_qwen.py",
4444
)
4545

46+
TRAINING_STATE_IGNORE_PATTERNS: tuple[str, ...] = (
47+
"optimizer.pt",
48+
"scheduler.pt",
49+
"rng_state.pth",
50+
"training_args.bin",
51+
"trainer_state.json",
52+
"global_step*/**",
53+
"checkpoint-*/**",
54+
)
55+
4656

4757
def patch_generation_config(model_path: Path) -> Path:
4858
"""Write v5 sampling defaults into ``generation_config.json``.
@@ -114,6 +124,7 @@ def upload_to_hub(
114124
folder_path=str(model_path),
115125
repo_id=hf_repo,
116126
commit_message=commit_message,
127+
ignore_patterns=list(TRAINING_STATE_IGNORE_PATTERNS),
117128
)
118129

119130
