All notable engineering milestones for YuhoLens-Pipeline. Dates are
hackathon calendar days; commit hashes refer to main on
github.com/javierdejesusda/YuhoLens.
The format follows Keep a Changelog; this project does not yet follow semantic versioning because the public artefact is the HuggingFace checkpoint, not a Python package release.
- 5 GGUF quants built locally from the SFT BF16 checkpoint: Q3_K_M (7.18 GiB), Q4_K_M (8.81 GiB), Q5_K_M (9.94 GiB), Q6_K (11.46 GiB), Q8_0 (14.03 GiB). Total release set 51.4 GiB.
- scripts/hf_upload_gguf.py: discovers, verifies, and stages the GGUF set into a temp dir and pushes to a dedicated HF repo. Filters the f16 build intermediate.
- docs/gguf_readme.md: model card published alongside the GGUF repo, including the Qwen1 ChatML prompt template, build provenance, and per-quant smoke results.
- docs/hf_upload_runbook.md: end-to-end upload procedure with pre-flight, README staging, and post-push verification.
- Q3_K_M smoke on RTX 4070 Laptop (8 GB): 10.06 gen tok/s, 109.05 prompt tok/s, full GPU offload, -c 2048.
- Q4_K_M smoke on RTX 4070 Laptop with --n-gpu-layers 30 partial offload: 7.92 gen tok/s, 152.30 prompt tok/s.
- scripts/build_gguf.sh permanently injects --override-kv qwen.attention.layer_norm_rms_epsilon=float:0.000001. Qwen1 uses RMSNorm internally but its HF config exposes the field as layer_norm_epsilon; without the override, llama-quantize b8966+ aborts on the first quant with "key not found in model".
- scripts/hf_upload.py now ignores training-state artefacts (optimizer.pt, scheduler.pt, rng_state.pth, training_args.bin, trainer_state.json, plus global_step*/ and checkpoint-*/ subdirs). Keeps 33 GB of optimizer state from leaking into the public release.
- docs/model-card.md: GGUF table now carries verified on-disk sizes and measured tok/s for Q3_K_M and Q4_K_M (was "TBD" placeholders).
- README badge / Releases table / BibTeX URL all point at the live javierdejesusda/yuholens-14b and -GGUF repos.
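
The training-state filter described above can be pictured with a minimal sketch. The function names is_training_state and filter_upload and the tuple layout are assumptions for illustration; only the file and directory names come from the entry, and the real scripts/hf_upload.py may be organised differently.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# File names listed in the changelog entry; the tuple layout is an assumption.
IGNORED_FILES = (
    "optimizer.pt",
    "scheduler.pt",
    "rng_state.pth",
    "training_args.bin",
    "trainer_state.json",
)
# Directory patterns listed in the changelog entry.
IGNORED_DIRS = ("global_step*", "checkpoint-*")


def is_training_state(rel_path: str) -> bool:
    """Return True when a repo-relative path should be kept out of the upload."""
    parts = PurePosixPath(rel_path).parts
    if not parts:
        return False
    # Any parent directory matching an ignored pattern excludes the file.
    if any(fnmatch(d, pat) for d in parts[:-1] for pat in IGNORED_DIRS):
        return True
    # Otherwise exclude only the known training-state file names.
    return parts[-1] in IGNORED_FILES


def filter_upload(paths: list[str]) -> list[str]:
    """Keep only the paths that are safe to publish."""
    return [p for p in paths if not is_training_state(p)]
```

huggingface_hub's upload_folder also accepts an ignore_patterns argument, which could express the same rule declaratively; whether the project uses it is not stated here.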
Both HuggingFace artefacts are public:
- BF16: https://huggingface.co/javierdejesusda/yuholens-14b — 19 files, 28.3 GB (26.4 GiB), 6 safetensors shards.
- GGUF: https://huggingface.co/javierdejesusda/yuholens-14b-GGUF — 7 files, 51.4 GiB, 5 quants Q3_K_M..Q8_0 plus README.
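
As a quick arithmetic check on the sizes quoted above, the following sketch converts the BF16 figure from GB to GiB and sums the per-quant sizes. The helper name gb_to_gib is illustrative; the numbers are copied from this changelog.

```python
GIB = 2**30  # bytes per gibibyte
GB = 10**9   # bytes per gigabyte


def gb_to_gib(gb: float) -> float:
    """Convert decimal gigabytes to binary gibibytes."""
    return gb * GB / GIB


# Per-quant sizes in GiB, copied from the GGUF entry in this changelog.
QUANT_GIB = {
    "Q3_K_M": 7.18,
    "Q4_K_M": 8.81,
    "Q5_K_M": 9.94,
    "Q6_K": 11.46,
    "Q8_0": 14.03,
}

print(f"BF16: {gb_to_gib(28.3):.1f} GiB")                # BF16: 26.4 GiB
print(f"GGUF total: {sum(QUANT_GIB.values()):.1f} GiB")  # GGUF total: 51.4 GiB
```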
scripts/check_release_set.py reports RESULT: PASS on the
checkpoint; HF API spot-check confirms no optimizer.pt and no f16
intermediate leaked. Build provenance and smoke numbers are in
data/eval/gguf_done.json.
Phase-9 ship-readiness: operator CLI, offline picker, release validator, README rewrite, sample fixture, social-media refresh, CI, pre-commit.
- python -m yuholens.agents operator CLI for the 4-agent composer with --best-of-n / --judge-mode / --n-candidates flags.
- scripts/run_bestofn_offline.py for heuristic-only best-of-N picking without OpenAI calls.
- scripts/check_release_set.py pre-HF-upload validator (tokenizer, generation_config v5 invariants, weights, architecture).
- scripts/hf_upload.py that patches generation_config.json to v5 defaults before pushing to the Hub.
- scripts/build_gguf.sh covering Q4_K_M / Q5_K_M / Q6_K / Q8_0.
- data/sample/sample_yuho.jsonl so the README quickstart works on a fresh clone.
- .github/workflows/test.yml running pytest on push and PR.
- .pre-commit-config.yaml wired to the existing ruff config.
- docs/CHANGELOG.md (this file).
- MemoCriticAgent LangGraph node + decoder_profiles.py catalogue.
- JudgeUnavailableError with auto-fallback to the heuristic when the judge backend is unreachable, and a finite-score guard against silently picking an unscored candidate.
- README rewritten with the KG-2 PASS headline, metric arc, mermaid 4-agent diagram, cost table, and a sharper quickstart.
- docs/social_media.md refreshed with the real PASS metrics (3.88 coherence, 1.000 citation, 0.994 section coverage).
- docs/blog_post.md numbers replaced with the metric arc and the cross-decoder vs cross-seed finding.
- docs/demo_script.md adds a 5-minute live walkthrough alongside the 90-second submission video script.
- docs/model-card.md quantization table now lists Q8_0 and references the new build script.
- scripts/mi300x_preflight.py performs a real OpenAI auth probe instead of bare env-var presence.
- pyproject.toml and requirements.txt add huggingface_hub and safetensors to runtime deps; new "release" extra collects matplotlib for figure rendering.
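
The judge fallback and finite-score guard introduced in this release can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: pick_best and its signature are hypothetical, and the exception class here is a local stand-in for the project's JudgeUnavailableError.

```python
import math


class JudgeUnavailableError(RuntimeError):
    """Local stand-in for the project's exception of the same name."""


def pick_best(candidates, judge, heuristic):
    """Pick the highest-scoring candidate memo.

    `judge` and `heuristic` are callables mapping a memo to a score.
    If the judge raises JudgeUnavailableError, every candidate is
    rescored with the heuristic instead.
    """
    try:
        scores = [judge(c) for c in candidates]
    except JudgeUnavailableError:
        scores = [heuristic(c) for c in candidates]

    # Finite-score guard: drop None / NaN / inf so an unscored memo cannot win.
    scored = [
        (s, i)
        for i, s in enumerate(scores)
        if isinstance(s, (int, float)) and math.isfinite(s)
    ]
    if not scored:
        raise ValueError("no candidate received a finite score")

    # Highest score wins; ties go to the earliest candidate.
    best_score, best_idx = max(scored, key=lambda t: (t[0], -t[1]))
    return candidates[best_idx], best_score
```

Raising when nothing is scored, rather than returning the first candidate, is one way to make the failure visible instead of silent.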
- judge_engine={openai,anthropic} dispatch in src/yuholens/eval/metrics.py::judge_coherence, with claude-opus-4-7 as the default model on the anthropic path. Engine selectable via the new --judge-engine flag on scripts/rescore_kg2.py and scripts/bestofn_judge.py; the OpenAI Batch path is unchanged when the flag is omitted.
- anthropic runtime dependency in requirements.txt and pyproject.toml.
- Four new tests in tests/test_metrics_judge.py covering the Anthropic engine path (model default, header construction, retry semantics, parse fallback). Suite is now 93 green (89 prior + 4 new).
Re-judged the 50-memo bo5 and bo9 picked sets same-pass blinded under
claude-opus-4-7 (Path B / interactive, n=100 memos pooled, mirror-
graded duplicates) as a methodological calibration check against the
shipping gpt-5-mini judge. Opus applies a stricter rubric (no
score-5 assignments on any LLM-generated memo in this corpus; modal
score 2): mean coherence bo5 = 2.60, bo9 = 2.48; paired delta
bo9 − bo5 = −0.12, 95% bootstrap CI [−0.36, +0.10] (10,000 resamples,
rng=20260428), sign-test two-sided p = 0.648. Cohen's κ Opus vs
gpt-5-mini = 0.017 unweighted, 0.080 quadratic-weighted (n=100). Read
as rubric calibration, not a contradicting verdict.
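
The statistics quoted above can be reproduced in spirit with a stdlib-only sketch. It assumes a percentile bootstrap, an exact two-sided sign test that ignores ties, and the disagreement-weight form of Cohen's κ; the changelog does not say which variants the project used, and all function names are illustrative.

```python
import math
import random
from collections import Counter


def paired_bootstrap_ci(a, b, n_resamples=10_000, seed=20260428, alpha=0.05):
    """Percentile bootstrap CI for mean(b - a) over paired scores."""
    deltas = [y - x for x, y in zip(a, b)]
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lo, hi


def sign_test_p(a, b):
    """Two-sided exact sign test on paired scores, ignoring ties."""
    wins = sum(y > x for x, y in zip(a, b))
    losses = sum(y < x for x, y in zip(a, b))
    n, k = wins + losses, min(wins, losses)
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


def cohen_kappa(r1, r2, categories=(1, 2, 3, 4, 5), weighted=False):
    """Cohen's kappa; weighted=True uses quadratic disagreement weights."""
    n = len(r1)
    observed = Counter(zip(r1, r2))
    m1, m2 = Counter(r1), Counter(r2)
    span = (len(categories) - 1) ** 2
    num = den = 0.0
    for p, i in enumerate(categories):
        for q, j in enumerate(categories):
            w = (p - q) ** 2 / span if weighted else float(p != q)
            num += w * observed[(i, j)] / n      # observed disagreement
            den += w * m1[i] * m2[j] / n**2      # chance disagreement
    # Kappa is undefined when chance disagreement is zero.
    return 1.0 - num / den if den else float("nan")
```

Called with the two judges' per-memo scores, cohen_kappa(..., weighted=True) gives the quadratic-weighted figure and the default gives the unweighted one.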
The shipping verdict is unchanged: bo5 picked under gpt-5-mini at
3.88 PASS is the released configuration. SFT checkpoint
output/yuholens-14b-sft/checkpoint-212 and the bo5/bo9 picked memo
files are unchanged. README and model-card lead with the gpt-5-mini
PASS headline; the Opus calibration check is documented in a small
end-of-doc note in each.
Best-of-9 mixed-decoder inference on the existing SFT checkpoint
output/yuholens-14b-sft/checkpoint-212 reached KG-2 PASS coherence
4.04, citation 1.000, section coverage 0.997 (n=50, fresh
gpt-5-mini judge). The +0.16 lift over the previously-locked bo5
baseline of 3.88 (session 1.7) came from extending the candidate pool
with 4 new decoder profiles unsampled by bo5: s8 tight (temp 0.05, rep 1.20), s9 creative (temp 0.30, rep 1.05), s10 mid (temp 0.15, rep 1.125), and s11 ngram-block (temp 0.10, rep 1.15, no_repeat_ngram_size 3). Pick share over 50 prompts: v4v5 mixed 62.0%,
bo3 seeds 22.0% combined, s8/s9/s10 16.0% combined, s11 0% (a
dud, mean 2.58 — ngram-blocking fragments financial-terminology reuse;
the constraint is empirically refuted at this checkpoint and should
not be tried again). Final picked-fresh distribution 0/1/5/35/9
versus bo5's 0/2/7/36/5; 4 prompts moved 4→5 and 1 weak-tail prompt
lifted out of score 2. The locked SFT checkpoint and bo5 fallback
files were SHA256-verified unchanged at session start and end via
scripts/_pipeline/bo9_anticlobber.py. No SFT or ORPO retraining; the
shipping artifact is now either bo5 @ 3.88 (smaller candidate pool,
less inference compute) or bo9 @ 4.04 (+0.16 lift, defensible
headline). The session was driven by a four-option enhancement
recommendation plan (Option A bo9 chosen over Option B 1k-pref ORPO V4
expansion, Option C DPO, and Option D 2nd SFT seed ensemble); A's
EV-per-dollar was 6× higher than the runner-up. Operator pipeline
(scripts/_pipeline/bo9_*) and the recommendation plan
(docs/superpowers/plans/yuholens_enhancement.md) are local-only by
convention. Session spend ~$16.00 ($15.66 droplet 7.87 h × $1.99/hr,
$0.34 OpenAI 490 judge calls); cumulative project ~$49.64 of $65 cap.
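
The headline means follow directly from the picked-fresh distributions. The sketch below assumes the slash-separated figures are counts of scores 1 through 5 in that order, which is consistent with the reported means but is not stated explicitly; the helper name is illustrative.

```python
def mean_from_histogram(counts):
    """Mean score from counts of scores 1..5, in that order."""
    total = sum(counts)
    return sum(score * n for score, n in enumerate(counts, start=1)) / total


bo9 = [0, 1, 5, 35, 9]  # picked-fresh distribution for best-of-9
bo5 = [0, 2, 7, 36, 5]  # picked-fresh distribution for best-of-5

print(f"bo9 mean: {mean_from_histogram(bo9):.2f}")  # bo9 mean: 4.04
print(f"bo5 mean: {mean_from_histogram(bo5):.2f}")  # bo5 mean: 3.88
lift = mean_from_histogram(bo9) - mean_from_histogram(bo5)
print(f"lift: {lift:+.2f}")                         # lift: +0.16
```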
- configs/orpo_v3.yaml: V3 trainer hyperparameters (4 epochs, lr 3.0e-6, beta 0.05, save_steps 50). Differs from the V2.2 baseline (configs/orpo.yaml) along three axes: 4× the epochs, a lower lr to keep gradients stable over 4× more steps, and a softer beta to relax SFT-anchor regularisation. Hypothesis: the V2.2 trained-but-flat result (50 steps, rewards/accuracies stuck at 0.0) was budget-bounded rather than data-bounded; widening the budget should let the preference signal transfer (902b87e).
V3 trained 100 of 200 planned steps (epoch 2.0 of 4) on the
canonical-gate-clean 790-row preference set from session 1.9, then was
killed by the G1.7 mid-train kill switch. Across all 10 logged step
blocks, rewards/accuracies stayed exactly 0.0; rewards/margins
improved monotonically from -0.0209 at step 10 to -0.0158 at step 70
(a 24% reduction) then plateaued through epoch 2.0 (steps 70–100) at
margins ≈ -0.0158 and log_odds_chosen ≈ -0.78 to -0.80. Neither PASS
condition (rewards/accuracies > 0.0 OR rewards/margins > 0.0) was met
at step 100; the trainer was terminated cleanly and the droplet
destroyed without snapshot retention. Pre-flight 5-step smoke
(introduced this session as G1.5 to validate plumbing before
committing the full $10 run) passed cleanly in ~9.5 min — no plumbing
bugs found, but the gate is now standard for any future
preference-optimisation attempt. Cross-iteration pattern: V2.2 (1 epoch,
beta 0.1, lr 5e-6) and V3 (≥2 epochs, beta 0.05, lr 3e-6) both showed
~24% reduction in reward margins before plateauing; doubling the budget
did not double the headroom. The signal-to-noise floor at this corpus
size appears asymptotic at margins ≈ -0.015, not zero. Shipping
artifact unchanged from session 1.7 (SFT bo5 @ KG-2 PASS 3.88). Session
spend ~$6.79 (3.41 h MI300X, no OAI calls); cumulative ~$33.64 of $65
cap.
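
The kill-switch condition and the margin arithmetic above can be written out as a small sketch. The function names are hypothetical and the real G1.7 switch may read the trainer logs differently; the metric keys and values are the ones quoted in this entry.

```python
def passes_gate(log: dict) -> bool:
    """PASS if either reward metric is strictly positive."""
    return log["rewards/accuracies"] > 0.0 or log["rewards/margins"] > 0.0


def relative_reduction(start: float, end: float) -> float:
    """Fractional shrink in magnitude from start to end."""
    return (abs(start) - abs(end)) / abs(start)


# Values at step 100 as reported in this entry.
step_100 = {"rewards/accuracies": 0.0, "rewards/margins": -0.0158}

print(passes_gate(step_100))                          # False
print(f"{relative_reduction(-0.0209, -0.0158):.0%}")  # 24%
```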
- CRITIQUE_SYSTEM in src/yuholens/training/orpo_data.py now references the canonical singular (ref: "...") citation marker form explicitly, with two CITATION_RE-matching examples and a counter-example that forbids rewriting markers as "(refs:", "[ref:", or "(citation:". The V2.1 patch had referenced the plural "(refs:" form, which does not match the canonical evaluator regex; under the canonical regex the V2.1 hedge "PASS" actually scored a chosen citation rate of 0.6926 against rejected 0.9950, a measurement bug rather than a real gate clearance.
- tests/test_orpo_data.py::test_critique_system_requires_citation_preservation strengthened to (a) require the singular "(ref:" form to dominate the plural in count, (b) require at least 5 occurrences of the singular form, and (c) assert that CITATION_RE.search(CRITIQUE_SYSTEM) matches a real example. Future drifts between prompt language and evaluator regex now fail this test loudly.
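
To illustrate why the plural form slipped past the evaluator, here is a sketch with a hypothetical stand-in regex. The project's actual CITATION_RE is not shown in this changelog and may differ; the helper functions and the sample memo strings are invented for the example.

```python
import re

# Hypothetical stand-in for the project's CITATION_RE; the real pattern
# is not shown in this changelog and may differ.
CITATION_RE = re.compile(r'\(ref:\s*"[^"]+"\)')


def citation_rate(memos):
    """Fraction of memos containing at least one canonical citation marker."""
    if not memos:
        return 0.0
    return sum(bool(CITATION_RE.search(m)) for m in memos) / len(memos)


def preserves_all(draft, rewrite):
    """True when every marker in the draft reappears verbatim in the rewrite."""
    return all(marker in rewrite for marker in CITATION_RE.findall(draft))


# Made-up memo fragments: one canonical marker, one plural variant.
singular = 'Revenue rose 4% (ref: "10-K p.12").'
plural = 'Revenue rose 4% (refs: "10-K p.12").'

print(bool(CITATION_RE.search(singular)))  # True
print(bool(CITATION_RE.search(plural)))    # False
print(citation_rate([singular, plural]))   # 0.5
print(preserves_all(singular, plural))     # False
```

Under a pattern of this shape, a rewriter that pluralises the marker produces memos that count as uncited, which is the kind of prompt/regex drift the strengthened test is meant to catch.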
V2.2 critique batch (gpt-5-mini, 800 prompts, ~$3) cleared the
canonical-regex data gate cleanly: chosen citation rate 1.0000,
rejected 0.9962, median length ratio 1.221, 70% of rows fully
preserve all citations, 790 prefs after empty-rewrite filtering. ORPO
trained 50 steps in ~71 min on a fresh MI300X (atl1, snapshot 225943366,
~$3.75 compute) at the configs/orpo.yaml defaults. Across all five
logged step blocks, rewards/accuracies stayed 0.0 and reward
margins stayed negative — the preference signal did not transfer at
this data scale and step count. Smoke at the v5 decoder (n=7) produced
mean coherence 3.571, citation 1.000, section coverage 1.000 —
indistinguishable from the SFT v5 single-shot baseline of 3.56 within
judge noise. Best-of-7 generation was skipped on the basis of the smoke
result + training-time signal (expected bo7 mean would tie SFT bo5 at
3.88, not exceed it). Shipping artifact unchanged from session 1.7
(SFT bo5 @ KG-2 PASS 3.88). Negative ORPO arc now spans three failed
data-gate iterations plus one trained-and-evaluated tie, documenting
the bound for the build narrative.
- CRITIQUE_SYSTEM in src/yuholens/training/orpo_data.py requires the rewriter to PRESERVE EVERY (refs: ...) citation marker from the SFT draft verbatim and never delete, rename, merge, reword, or reorder existing tags (55db47b). Locked in by tests/test_orpo_data.py::test_critique_system_requires_citation_preservation.
- docs/blog_post.md corrected: the previous draft claimed ORPO ran on ~1,000 preference pairs and that infrastructure was staged but not exercised. ORPO was wired end-to-end and tried twice; both attempts failed at a pre-training data-quality gate before any GPU training step (cd0f0cf).
- docs/model-card.md abstract no longer claims "supervised fine-tuning and reference-free preference optimization"; the shipped artifact is SFT only. Training, hyperparameter table, evaluation, and limitations sections all updated to reflect the two ORPO data-gate failures (e01823d).
ORPO V2 critique batch (gpt-5-mini, 800 prompts) landed at chosen citation rate 0.305 versus rejected 0.995 — a hard fail on the 0.80 gate. No GPU training step was run. Shipping artifact unchanged from session 1.7 (best-of-5 SFT @ 3.88 KG-2 PASS).
- src/yuholens/eval/run_sft_drafts.py for ORPO draft generation at v5 decoding (b16e8d7).
- scripts/bestofn_pick.py to pick the highest-coherence memo per custom_id from N candidate sets via cached judge scores (b16e8d7).
- scripts/bestofn_judge.py fresh-pass scorer that judges every memo across N candidate sets in a single session (f6ac0d6).
- scripts/bo3_finalise.sh orchestrating the post-best-of-3 pipeline (15ac06c).
- --seed and --skip-judge flags on run_kg2.py so candidate sets are independently reproducible (f6ac0d6).
- ORPO CRITIQUE_SYSTEM rewritten to embed the seven-section coherence rubric, replacing citation-grounded language that was orthogonal to what the KG-2 judge actually scores (b16e8d7).
- configs/orpo.yaml model_id corrected to checkpoint-212.
KG-2 PASS at coherence 3.88, citation rate 1.000, section
coverage 0.994 under the best-of-5 mixed-decoder composer
(verdict committed in 9b17222).
- LM-head + last-4-layers SFT polish module (a14834c). Polish experiment regressed KG-2 to 3.26 (-0.30) and was abandoned in favour of inference-time best-of-N.
Initial SFT loop, teacher bootstrap, ROCm bitsandbytes source build,
ingestor regex tuning, Pass-1 / Pass-2 prompt design, citation-grounder
with [evidence insufficient] abstention, kill-gate metrics, and the
six-variant decoding sweep that established v5 as the single-shot
default.