chore(release): prep HuggingFace upload for BF16 + GGUF

javierdejesusda · javierdejesusda · commit 591e08341bac · 2026-04-29T18:25:42.000+02:00
- scripts/hf_upload.py: exclude training-state files (optimizer.pt,
  scheduler.pt, rng_state.pth, training_args.bin, trainer_state.json,
  global_step*/, checkpoint-*/) from the upload payload so the 33 GB
  optimizer state stays local.
- scripts/hf_upload_gguf.py: new helper that discovers the release
  set in data/eval/gguf, validates the five expected quants, copies
  them with the GGUF README into a temp staging dir, and pushes to
  yuholens/yuholens-14b-GGUF. The f16 build intermediate is filtered.
- docs/gguf_readme.md: the README that ships with the GGUF repo —
  files table with verified sizes, RTX 4070 Q3_K_M smoke result
  (12.2 gen tok/s, 65.5 prompt tok/s), Qwen1 ChatML prompt format,
  build provenance with the layer_norm_rms_epsilon override note.
- docs/model-card.md: replace TBD GGUF table with verified on-disk
  sizes (Q3_K_M 7.18 GB ... Q8_0 14.03 GB) and the Q3_K_M smoke
  numbers.
- docs/hf_upload_runbook.md: end-to-end procedure including the
  generation_config patch step, README staging, and post-push
  verification.
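
The release-set check described above for scripts/hf_upload_gguf.py
can be sketched roughly as follows. This is an illustration, not the
script itself: the names EXPECTED_QUANTS, MIN_BYTES and
discover_release_set, and the <model>-<QUANT>.gguf naming rule, are
assumptions. The five quants, the f16 filtering and the 1 GB floor
come from this commit's message and runbook.

```python
from pathlib import Path

# Hypothetical names; the real scripts/hf_upload_gguf.py may differ.
EXPECTED_QUANTS = ("Q3_K_M", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0")
MIN_BYTES = 1024**3  # the runbook says files under 1 GB block the push


def discover_release_set(gguf_dir: Path, min_bytes: int = MIN_BYTES) -> dict[str, Path]:
    """Map each expected quant to its GGUF file, or raise if the set is incomplete."""
    found: dict[str, Path] = {}
    for path in sorted(gguf_dir.glob("*.gguf")):
        # Assumed naming scheme: <model>-<QUANT>.gguf, e.g. yuholens-14b-Q4_K_M.gguf.
        quant = path.stem.rsplit("-", 1)[-1]
        if quant in EXPECTED_QUANTS:
            found[quant] = path
        # Anything else, such as the f16 build intermediate, is skipped.

    missing = [q for q in EXPECTED_QUANTS if q not in found]
    if missing:
        raise FileNotFoundError(f"missing quants: {', '.join(missing)}")

    too_small = [q for q, p in found.items() if p.stat().st_size < min_bytes]
    if too_small:
        raise ValueError(f"quants under {min_bytes} bytes: {', '.join(too_small)}")
    return found
```

Only the files returned by such a check would be copied into the temp
staging dir, so an f16 intermediate left in data/eval/gguf never
reaches the upload payload.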
diff --git a/docs/gguf_readme.md b/docs/gguf_readme.md
@@ -0,0 +1,144 @@
+---
+license: other
+license_name: tongyi-qianwen
+license_link: https://huggingface.co/Qwen/Qwen-14B/blob/main/LICENSE
+language:
+  - ja
+  - en
+base_model: yuholens/yuholens-14b
+tags:
+  - gguf
+  - llama-cpp
+  - quantized
+  - japanese-finance
+  - yuho
+  - edinet
+  - qwen
+pipeline_tag: text-generation
+---
+
+# YuhoLens-14B GGUF
+
+GGUF-quantized release of [`yuholens/yuholens-14b`](https://huggingface.co/yuholens/yuholens-14b) for offline inference via [`llama.cpp`](https://github.com/ggerganov/llama.cpp). The BF16 source is a full-parameter SFT of [`pfnet/nekomata-14b-pfn-qfin`](https://huggingface.co/pfnet/nekomata-14b-pfn-qfin) (Qwen1, 14B, Japanese-finance CPT) on teacher-bootstrapped English investor memos derived from [`SakanaAI/EDINET-Bench`](https://huggingface.co/datasets/SakanaAI/EDINET-Bench).
+
+## Files
+
+| File | Quant | Size | Bits/weight | Recommended hardware |
+|------|-------|-----:|:-----------:|----------------------|
+| `yuholens-14b-Q3_K_M.gguf` | Q3_K_M | 7.18 GB |  4.35 | 8 GB GPU (RTX 4070 Laptop, 3060 Ti) — fits with `--ctx-size 2048` |
+| `yuholens-14b-Q4_K_M.gguf` | Q4_K_M | 8.81 GB | ~5.0 | 12-16 GB GPU (RTX 4060 Ti 16 GB, RTX 3080) |
+| `yuholens-14b-Q5_K_M.gguf` | Q5_K_M | 9.94 GB | ~5.7 | 16 GB GPU |
+| `yuholens-14b-Q6_K.gguf`   | Q6_K   | 11.46 GB | ~6.6 | 16-24 GB GPU |
+| `yuholens-14b-Q8_0.gguf`   | Q8_0   | 14.03 GB | 8.5 | 24 GB GPU or CPU offload |
+
+Sizes are computed from the actual on-disk byte counts; 1 GB here means 1024³ bytes.
+
+## Smoke test
+
+The Q3_K_M quant has been smoke-tested on an NVIDIA RTX 4070 Laptop (8 GB VRAM, compute 8.9) using `llama-completion` from llama.cpp build `b8966`.
+
+- Prompt: ChatML-wrapped Yuho fixture with the YuhoLens system prompt (see *Prompt format* below).
+- Settings: `--n-gpu-layers 99 -c 2048 --temp 0.1 --top-p 0.9 --repeat-penalty 1.15`.
+- Throughput: prompt eval **65.5 tok/s**, generation **12.2 tok/s**.
+- VRAM occupancy: ~7.0 GB model + ~1.6 GB context + ~0.3 GB compute, fits within the 8.0 GB card.
+- Output: coherent English investor memo with section headings (*Executive summary*, *Going-concern assessment*, *Accrual quality*, ...) following the SFT teacher template.
+
+Q4_K_M and larger quants exceed 8 GB VRAM and require either a 12 GB+ GPU or partial CPU offload (`--n-gpu-layers` < 99).
+
+## Quickstart
+
+```bash
+# Q3_K_M, fits 8 GB GPU
+llama-completion \
+  -m yuholens-14b-Q3_K_M.gguf \
+  -f your_prompt.txt \
+  --n-gpu-layers 99 \
+  -c 2048 \
+  --temp 0.1 --top-p 0.9 --repeat-penalty 1.15 \
+  -n 512 \
+  --no-display-prompt
+```
+
+For larger quants on a smaller GPU, lower `--n-gpu-layers` to a value that fits VRAM (e.g. `--n-gpu-layers 30` for Q4_K_M on 8 GB); the remaining layers run on the CPU.
+
+## Prompt format
+
+YuhoLens-14B was fine-tuned on Qwen1 ChatML with a fixed system prompt. **Raw Japanese Yuho text without the wrapper produces Japanese continuations** instead of English memos. Wrap inputs as:
+
+```
+<|im_start|>system
+{SYSTEM_PROMPT}<|im_end|>
+<|im_start|>user
+Company metadata (JSON):
+{...}
+
+Balance sheet (JSON):
+{...}
+
+P&L (JSON):
+{...}
+
+Cash flow (JSON):
+{...}
+
+Japanese annual-report text (truncated at ~20K chars):
+<<<
+{Japanese Yuho text}
+>>>
+
+Produce the two-page English investor memo now.<|im_end|>
+<|im_start|>assistant
+```
+
+The full system prompt is published in the BF16 model card and in `src/yuholens/training/teacher.py` of the GitHub repo. A pre-formatted smoke fixture is included in the GitHub repo at `data/sample/smoke_prompt_chatml.txt`.
+
+## Build provenance
+
+Source checkpoint: `output/yuholens-14b-sft/checkpoint-212` (28.3 GB BF16 safetensors), tagged at git commit `f903174`.
+
+Built with:
+
+- `llama.cpp` build `b8966` (commit `7b8443ac7`), Windows CUDA 12.4 prebuilt binaries.
+- Convert script `convert_hf_to_gguf.py`, patched locally to fall back to an inline GPT-2 byte-to-unicode mapping when the installed transformers (≥ 5.x) no longer provides `bytes_to_unicode`. See the GitHub repo for the patch.
+- `llama-quantize` invoked with `--override-kv qwen.attention.layer_norm_rms_epsilon=float:0.000001` because Qwen1 uses RMSNorm internally but its HF config exposes the field as `layer_norm_epsilon`; this override is now baked into `scripts/build_gguf.sh`.
+
+Reproduce locally:
+
+```bash
+LLAMACPP_REPO=/path/to/llama.cpp \
+LLAMACPP_BIN=/path/to/llama.cpp/build/bin \
+  scripts/build_gguf.sh output/yuholens-14b-sft/checkpoint-212 data/eval/gguf
+```
+
+## Limitations
+
+- **Output language asymmetry.** The model expects Japanese Yuho input and emits English memos. Japanese output is outside the training distribution and unsupported.
+- **Citation accuracy unaudited.** The model produces inline `(ref: '<span>' p.N)` citations, but verbatim-correctness against the underlying Yuho text has not been audited.
+- **Quantization quality bands.** Q3_K_M sacrifices some fidelity for the 8 GB VRAM target. Prefer Q4_K_M or higher when memory permits.
+- **Domain.** Trained on Yuho 有価証券報告書 only. Earnings transcripts, 決算短信, and non-Japanese filings are out-of-scope.
+- **Not financial advice.** Outputs are model-generated and may contain factual errors. Verify every material claim against the source.
+
+## License
+
+Released under the Tongyi Qianwen License inherited from the Qwen1 base via `pfnet/nekomata-14b-pfn-qfin`. See the linked license for terms. Wrapper code in the GitHub repo (LangGraph pipeline, training and evaluation scripts, prompt modules) is MIT.
+
+## Citation
+
+```bibtex
+@misc{dejesus2026yuholens,
+  author       = {De Jesus, Javier},
+  title        = {YuhoLens-14B: A Japanese-Finance Fine-Tune for
+                  Span-Grounded Investor Memo Generation},
+  year         = {2026},
+  howpublished = {Hugging Face model repository},
+  url          = {https://huggingface.co/yuholens/yuholens-14b},
+  note         = {DOI: TBD}
+}
+```
+
+## Links
+
+- BF16 reference: https://huggingface.co/yuholens/yuholens-14b
+- GitHub: https://github.com/javierdejesusda/YuhoLens
+- Base model: https://huggingface.co/pfnet/nekomata-14b-pfn-qfin
+- Training data: https://huggingface.co/datasets/SakanaAI/EDINET-Bench
diff --git a/docs/hf_upload_runbook.md b/docs/hf_upload_runbook.md
@@ -0,0 +1,139 @@
+# HuggingFace upload runbook
+
+End-to-end procedure for publishing the YuhoLens-14B release. Two
+artefacts: the BF16 reference checkpoint (`yuholens/yuholens-14b`) and
+the GGUF release set (`yuholens/yuholens-14b-GGUF`).
+
+## 0. Prerequisites
+
+1. HuggingFace account with write access to the `yuholens` org. If the
+   org does not yet exist, create it at <https://huggingface.co/organizations/new>
+   before step 2.
+2. `huggingface_hub` Python package in the active environment:
+   ```bash
+   pip install --upgrade huggingface_hub
+   ```
+3. Authenticate. Pick one:
+   ```bash
+   huggingface-cli login        # interactive, persists to ~/.cache/huggingface/token
+   # or
+   export HF_TOKEN=hf_xxx_yyy_zzz
+   ```
+4. Confirm the token is valid and has access to the `yuholens` org (the
+   repos themselves are created on first push if they don't exist):
+   ```bash
+   huggingface-cli whoami
+   ```
+
+## 1. BF16 checkpoint -> yuholens/yuholens-14b
+
+### 1a. Pre-flight verify
+
+```bash
+python scripts/check_release_set.py \
+  --model-path output/yuholens-14b-sft/checkpoint-212
+```
+
+Expected: `RESULT: PASS` with the four checks (tokenizer, weights,
+generation_config, architecture). If `generation_config.json` is not
+yet patched, run the patcher (does not push):
+
+```bash
+python scripts/hf_upload.py \
+  --model-path output/yuholens-14b-sft/checkpoint-212 \
+  --hf-repo placeholder \
+  --skip-upload
+```
+
+### 1b. Stage the model card
+
+The model card lives at `docs/model-card.md`. Copy it into the
+checkpoint folder as `README.md` so `upload_folder` picks it up:
+
+```bash
+cp docs/model-card.md output/yuholens-14b-sft/checkpoint-212/README.md
+```
+
+### 1c. Push
+
+`scripts/hf_upload.py` excludes training-state artefacts
+(`optimizer.pt`, `scheduler.pt`, `rng_state.pth`, `training_args.bin`,
+`trainer_state.json`, plus any `global_step*/` and `checkpoint-*/`
+subdirs) so the 33 GB optimizer state stays local.
+
+Public release:
+
+```bash
+python scripts/hf_upload.py \
+  --model-path output/yuholens-14b-sft/checkpoint-212 \
+  --hf-repo yuholens/yuholens-14b
+```
+
+To sanity-check first, push to a private staging repo (this is a real upload):
+
+```bash
+python scripts/hf_upload.py \
+  --model-path output/yuholens-14b-sft/checkpoint-212 \
+  --hf-repo yuholens/yuholens-14b-staging \
+  --private
+```
+
+The shipped upload payload is approximately 28 GB across 6 safetensors
+shards plus tokenizer / config / model-card files. Expect 30-90 minutes
+on residential broadband.
+
+## 2. GGUF release set -> yuholens/yuholens-14b-GGUF
+
+### 2a. Verify the set
+
+```bash
+python scripts/hf_upload_gguf.py \
+  --gguf-dir data/eval/gguf \
+  --readme docs/gguf_readme.md \
+  --hf-repo yuholens/yuholens-14b-GGUF \
+  --dry-run
+```
+
+Expected: `release set OK` with all five quants listed
+(Q3_K_M / Q4_K_M / Q5_K_M / Q6_K / Q8_0). The script refuses to push if
+any expected quant is missing or under 1 GB.
+
+### 2b. Push
+
+```bash
+python scripts/hf_upload_gguf.py \
+  --gguf-dir data/eval/gguf \
+  --readme docs/gguf_readme.md \
+  --hf-repo yuholens/yuholens-14b-GGUF
+```
+
+Total payload is ~52 GB across 5 GGUF files plus the README. The
+script copies into a temp staging dir before uploading so the f16
+intermediate (if it still exists locally) is not pushed.
+
+## 3. Post-push checklist
+
+- [ ] Browse <https://huggingface.co/yuholens/yuholens-14b> and confirm
+      the model card renders, no `optimizer.pt` is visible, and the
+      *Files and versions* tab lists all 6 safetensors shards.
+- [ ] Browse <https://huggingface.co/yuholens/yuholens-14b-GGUF> and
+      confirm all five quants appear with the sizes listed in the
+      README.
+- [ ] Update the GitHub README with the live HuggingFace URLs and
+      the verified GGUF table (sizes + smoke result already in
+      `docs/model-card.md`).
+- [ ] Tag both repos with the corresponding GitHub commit hash for
+      traceability (currently `f903174` on `main`), for example with
+      `HfApi.create_tag` from `huggingface_hub`.
+
+## 4. Rollback
+
+To unpublish (e.g., if a wrong file slipped in), use the Hub UI's
+*Settings -> Delete repository*. Do not git-revert; the Hub commit
+history is independent of GitHub.
+
+## 5. Bandwidth cost
+
+- BF16 push: 28 GB once.
+- GGUF push: 52 GB once.
+- Total: ~80 GB upstream. Plan for ~2-4 hours on a 100 Mbps connection.
diff --git a/docs/model-card.md b/docs/model-card.md
@@ -206,13 +206,21 @@ checkpoint by `scripts/build_gguf.sh`, which calls llama.cpp's
 target quant. See the script's prereq header for the required
 llama.cpp checkout and disk-budget notes.
 
-| Quant     | Approx. size | Intended hardware                                       | Target throughput (tok/s) |
-|-----------|--------------|---------------------------------------------------------|---------------------------|
-| Q3_K_M    | ~6.5 GB      | 8 GB consumer GPU (RTX 4070 Laptop, RTX 3060 Ti)        | TBD                       |
-| Q4_K_M    | ~9.45 GB     | 12-16 GB consumer GPU (RTX 4060 Ti 16 GB, RTX 3080)     | TBD                       |
-| Q5_K_M    | ~10.5 GB     | 16-24 GB consumer GPU                                   | TBD                       |
-| Q6_K      | ~12.1 GB     | 24 GB+ consumer or prosumer                             | TBD                       |
-| Q8_0      | ~15.7 GB     | 24 GB+ prosumer / dual-GPU CPU offload                  | TBD                       |
+| Quant     | Verified size | Intended hardware                                       | Throughput (tok/s)        |
+|-----------|---------------|---------------------------------------------------------|---------------------------|
+| Q3_K_M    | 7.18 GB       | 8 GB consumer GPU (RTX 4070 Laptop, RTX 3060 Ti)        | 12.2 gen / 65.5 prompt on RTX 4070 Laptop, `-c 2048` |
+| Q4_K_M    | 8.81 GB       | 12-16 GB consumer GPU (RTX 4060 Ti 16 GB, RTX 3080)     | TBD                       |
+| Q5_K_M    | 9.94 GB       | 16-24 GB consumer GPU                                   | TBD                       |
+| Q6_K      | 11.46 GB      | 24 GB+ consumer or prosumer                             | TBD                       |
+| Q8_0      | 14.03 GB      | 24 GB+ prosumer / dual-GPU CPU offload                  | TBD                       |
+
+Sizes above are computed from the on-disk byte counts of the released
+GGUFs (1 GB = 1024³ bytes). Q3_K_M was smoke-tested end-to-end on an RTX 4070 Laptop
+(8 GB) at `--ctx-size 2048` with `--n-gpu-layers 99`; the model and
+context fit fully in VRAM and produce a coherent English investor memo
+from the ChatML-wrapped fixture in `data/sample/smoke_prompt_chatml.txt`.
+Q4_K_M and larger quants exceed 8 GB VRAM and require a 12 GB+ GPU or
+partial CPU offload (`--n-gpu-layers` < 99).
 
 Pass-1 per-section context of 4-6K tokens is the supported consumer
 operating point; longer contexts require the BF16 checkpoint served via
diff --git a/scripts/hf_upload.py b/scripts/hf_upload.py
@@ -43,6 +43,16 @@
     "tokenization_qwen.py",
 )
 
+TRAINING_STATE_IGNORE_PATTERNS: tuple[str, ...] = (
+    "optimizer.pt",
+    "scheduler.pt",
+    "rng_state.pth",
+    "training_args.bin",
+    "trainer_state.json",
+    "global_step*/**",
+    "checkpoint-*/**",
+)
+
 
 def patch_generation_config(model_path: Path) -> Path:
     """Write v5 sampling defaults into ``generation_config.json``.
@@ -114,6 +124,7 @@ def upload_to_hub(
         folder_path=str(model_path),
         repo_id=hf_repo,
         commit_message=commit_message,
+        ignore_patterns=list(TRAINING_STATE_IGNORE_PATTERNS),
     )
 
 
diff --git a/scripts/hf_upload_gguf.py b/scripts/hf_upload_gguf.py