Changelog

All notable engineering milestones for YuhoLens-Pipeline. Dates are hackathon calendar days; commit hashes refer to main on github.com/javierdejesusda/YuhoLens.

The format follows Keep a Changelog; this project does not yet follow semantic versioning because the public artefact is the HuggingFace checkpoint, not a Python package release.

[Unreleased]

[2026-04-29] — Session 1.13 — GGUF release set + HuggingFace push

Added

  • 5 GGUF quants built locally from the SFT BF16 checkpoint: Q3_K_M (7.18 GiB), Q4_K_M (8.81 GiB), Q5_K_M (9.94 GiB), Q6_K (11.46 GiB), Q8_0 (14.03 GiB). Total release set 51.4 GiB.
  • scripts/hf_upload_gguf.py — discovers / verifies / stages the GGUF set into a temp dir and pushes to a dedicated HF repo. Filters the f16 build intermediate.
  • docs/gguf_readme.md — model card published alongside the GGUF repo, including the Qwen1 ChatML prompt template, build provenance, and per-quant smoke results.
  • docs/hf_upload_runbook.md — end-to-end upload procedure with pre-flight, README staging, and post-push verification.
  • Q3_K_M smoke on RTX 4070 Laptop (8 GB): 10.06 gen tok/s, 109.05 prompt tok/s, full GPU offload, -c 2048.
  • Q4_K_M smoke on RTX 4070 Laptop with --n-gpu-layers 30 partial offload: 7.92 gen tok/s, 152.30 prompt tok/s.

Changed

  • scripts/build_gguf.sh permanently injects --override-kv qwen.attention.layer_norm_rms_epsilon=float:0.000001. Qwen1 uses RMSNorm internally but its HF config exposes the field as layer_norm_epsilon; without the override, llama-quantize b8966+ aborts on the first quant with "key not found in model".
  • scripts/hf_upload.py now ignores training-state artefacts (optimizer.pt, scheduler.pt, rng_state.pth, training_args.bin, trainer_state.json, plus global_step*/ and checkpoint-*/ subdirs). This keeps 33 GB of optimizer state out of the public release.
  • docs/model-card.md: GGUF table now carries verified on-disk sizes and measured tok/s for Q3_K_M and Q4_K_M (was "TBD" placeholders).
  • README badge / Releases table / BibTeX URL all point at the live javierdejesusda/yuholens-14b and -GGUF repos.
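The ignore rules described for scripts/hf_upload.py can be pictured as a glob filter over relative paths. The following is a minimal sketch, not the script's actual code: the function name `is_release_file`, the exact glob spellings, and the sample file names are assumptions made for illustration.

```python
from fnmatch import fnmatch

# File and directory names come from the changelog entry;
# the glob spellings below are an assumption.
IGNORE_PATTERNS = [
    "optimizer.pt",
    "scheduler.pt",
    "rng_state.pth",
    "training_args.bin",
    "trainer_state.json",
    "global_step*/*",
    "checkpoint-*/*",
]


def is_release_file(rel_path: str) -> bool:
    """Return True if rel_path (POSIX-style, relative to the checkpoint dir) should be uploaded."""
    return not any(fnmatch(rel_path, pattern) for pattern in IGNORE_PATTERNS)


# Hypothetical directory listing.
files = [
    "config.json",
    "model-00001-of-00006.safetensors",
    "optimizer.pt",
    "global_step212/mp_rank_00_model_states.pt",
    "checkpoint-100/trainer_state.json",
]
kept = [f for f in files if is_release_file(f)]
print(kept)
```

huggingface_hub's `upload_folder` accepts an `ignore_patterns` argument, so a list like this could plausibly be passed straight through; whether the script does so is not stated here.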

Result

Both HuggingFace artefacts are public: the javierdejesusda/yuholens-14b checkpoint repo and its -GGUF companion repo.

scripts/check_release_set.py reports RESULT: PASS on the checkpoint; HF API spot-check confirms no optimizer.pt and no f16 intermediate leaked. Build provenance and smoke numbers are in data/eval/gguf_done.json.

[Earlier] — Phase-9 ship-readiness

Phase-9 ship-readiness: operator CLI, offline picker, release validator, README rewrite, sample fixture, social-media refresh, CI, pre-commit.

Added

  • python -m yuholens.agents operator CLI for the 4-agent composer with --best-of-n / --judge-mode / --n-candidates flags.
  • scripts/run_bestofn_offline.py for heuristic-only best-of-N picking without OpenAI calls.
  • scripts/check_release_set.py pre-HF-upload validator (tokenizer, generation_config v5 invariants, weights, architecture).
  • scripts/hf_upload.py that patches generation_config.json to v5 defaults before pushing to the Hub.
  • scripts/build_gguf.sh covering Q4_K_M / Q5_K_M / Q6_K / Q8_0.
  • data/sample/sample_yuho.jsonl so the README quickstart works on a fresh clone.
  • .github/workflows/test.yml running pytest on push and PR.
  • .pre-commit-config.yaml wired to the existing ruff config.
  • docs/CHANGELOG.md (this file).
  • MemoCriticAgent LangGraph node + decoder_profiles.py catalogue.
  • JudgeUnavailableError with auto-fallback to the heuristic when the judge backend is unreachable, and a finite-score guard against silently picking an unscored candidate.
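The judge fallback and finite-score guard from the last bullet can be sketched roughly as follows. Only the exception name comes from the changelog; `pick_best`, its signature, and the toy heuristic are hypothetical and stand in for whatever the composer actually does.

```python
import math
from typing import Callable, Sequence


class JudgeUnavailableError(RuntimeError):
    """Raised when the judge backend cannot be reached (name taken from the changelog)."""


def pick_best(
    candidates: Sequence[str],
    judge: Callable[[str], float],
    heuristic: Callable[[str], float],
) -> str:
    """Score candidates with the judge, falling back to the heuristic if it is unreachable."""
    try:
        scores = [judge(c) for c in candidates]
    except JudgeUnavailableError:
        scores = [heuristic(c) for c in candidates]
    # Finite-score guard: a NaN or infinite score must never be picked by accident.
    scored = [(s, c) for s, c in zip(scores, candidates) if math.isfinite(s)]
    if not scored:
        raise ValueError("no candidate received a finite score")
    return max(scored, key=lambda pair: pair[0])[1]


def offline_judge(_: str) -> float:
    """Simulates an unreachable judge backend."""
    raise JudgeUnavailableError("backend unreachable")


# Toy heuristic (memo length) used only for this illustration.
best = pick_best(["short", "a longer memo"], offline_judge, heuristic=lambda c: float(len(c)))
print(best)
```

Filtering non-finite scores before taking the maximum matters because comparisons against NaN are always false, so an unscored candidate could otherwise win depending on list order.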

Changed

  • README rewritten with the KG-2 PASS headline, metric arc, mermaid 4-agent diagram, cost table, and a sharper quickstart.
  • docs/social_media.md refreshed with the real PASS metrics (3.88 coherence, 1.000 citation, 0.994 section coverage).
  • docs/blog_post.md numbers replaced with the metric arc and the cross-decoder vs cross-seed finding.
  • docs/demo_script.md adds a 5-minute live walkthrough alongside the 90-second submission video script.
  • docs/model-card.md quantization table now lists Q8_0 and references the new build script.
  • scripts/mi300x_preflight.py performs a real OpenAI auth probe instead of bare env-var presence.
  • pyproject.toml and requirements.txt add huggingface_hub and safetensors to runtime deps; new release extra collects matplotlib for figure rendering.

[2026-04-28] — Session 1.12 — Anthropic judge engine + Opus 4.7 primary re-eval

Added

  • judge_engine={openai,anthropic} dispatch in src/yuholens/eval/metrics.py::judge_coherence, with claude-opus-4-7 as the default model on the anthropic path. Engine selectable via the new --judge-engine flag on scripts/rescore_kg2.py and scripts/bestofn_judge.py; the OpenAI Batch path is unchanged when the flag is omitted.
  • anthropic runtime dependency in requirements.txt and pyproject.toml.
  • Four new tests in tests/test_metrics_judge.py covering the Anthropic engine path (model default, header construction, retry semantics, parse fallback). Suite is now 93 green (89 prior + 4 new).

Result

Re-judged the 50-memo bo5 and bo9 picked sets same-pass and blinded under claude-opus-4-7 (Path B / interactive, n=100 memos pooled, mirror-graded duplicates) as a methodological calibration check against the shipping gpt-5-mini judge. Opus applies a stricter rubric: it assigned no score of 5 to any LLM-generated memo in this corpus, and its modal score was 2. Mean coherence was bo5 = 2.60, bo9 = 2.48; paired delta bo9 − bo5 = −0.12, 95% bootstrap CI [−0.36, +0.10] (10,000 resamples, rng=20260428), sign-test two-sided p = 0.648. Cohen's κ for Opus vs gpt-5-mini was 0.017 unweighted and 0.080 quadratic-weighted (n=100). Read this as rubric calibration, not a contradicting verdict.
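For readers unfamiliar with the statistics quoted above, here is a minimal sketch of a paired percentile bootstrap and an exact two-sided sign test using only the standard library. It assumes a percentile bootstrap and a tie-dropping sign test, which the changelog does not confirm; the function names and the eight score pairs are invented for illustration and are not the project's data.

```python
import random
from math import comb


def paired_bootstrap_ci(a, b, n_resamples=10_000, seed=20260428, alpha=0.05):
    """Percentile bootstrap CI for mean(b - a) over paired scores."""
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


def sign_test_two_sided(a, b):
    """Exact two-sided sign test on paired scores; ties are dropped."""
    pos = sum(1 for x, y in zip(a, b) if y > x)
    neg = sum(1 for x, y in zip(a, b) if y < x)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical judge scores for the same eight prompts under two configurations.
scores_bo5 = [3, 2, 2, 4, 3, 2, 1, 3]
scores_bo9 = [2, 2, 3, 3, 3, 2, 2, 2]

delta, ci_lo, ci_hi = paired_bootstrap_ci(scores_bo5, scores_bo9)
p_value = sign_test_two_sided(scores_bo5, scores_bo9)
print(delta, ci_lo, ci_hi, p_value)
```

A confidence interval that straddles zero, together with a large sign-test p-value, is what supports reading a small negative delta as indistinguishable from no difference.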

The shipping verdict is unchanged: bo5 picked under gpt-5-mini at 3.88 PASS is the released configuration. SFT checkpoint output/yuholens-14b-sft/checkpoint-212 and the bo5/bo9 picked memo files are unchanged. README and model-card lead with the gpt-5-mini PASS headline; the Opus calibration check is documented in a small end-of-doc note in each.

[2026-04-28] — Session 1.11 — bo9 KG-2 PASS at coherence 4.04 (+0.16 over bo5)

Result

Best-of-9 mixed-decoder inference on the existing SFT checkpoint output/yuholens-14b-sft/checkpoint-212 reached KG-2 PASS coherence 4.04, citation 1.000, section coverage 0.997 (n=50, fresh gpt-5-mini judge). The +0.16 lift over the previously-locked bo5 baseline of 3.88 (session 1.7) came from extending the candidate pool with 4 new decoder profiles unsampled by bo5:

  • s8 tight (temp 0.05, rep 1.20)
  • s9 creative (temp 0.30, rep 1.05)
  • s10 mid (temp 0.15, rep 1.125)
  • s11 ngram-block (temp 0.10, rep 1.15, no_repeat_ngram_size 3)

Pick share over 50 prompts: v4v5 mixed 62.0%, bo3 seeds 22.0% combined, s8/s9/s10 16.0% combined, s11 0%. s11 was a dud (mean 2.58): ngram-blocking fragments financial-terminology reuse; the constraint is empirically refuted at this checkpoint and should not be tried again. Final picked-fresh distribution 0/1/5/35/9 versus bo5's 0/2/7/36/5; 4 prompts moved 4→5 and 1 weak-tail prompt lifted out of score 2.

The locked SFT checkpoint and bo5 fallback files were SHA256-verified unchanged at session start and end via scripts/_pipeline/bo9_anticlobber.py. No SFT or ORPO retraining; the shipping artifact is now either bo5 @ 3.88 (smaller candidate pool, less inference compute) or bo9 @ 4.04 (+0.16 lift, defensible headline).

The session was driven by a four-option enhancement recommendation plan (Option A bo9 chosen over Option B 1k-pref ORPO V4 expansion, Option C DPO, and Option D 2nd SFT seed ensemble); A's EV-per-dollar was 6× higher than the runner-up. Operator pipeline (scripts/_pipeline/bo9_*) and the recommendation plan (docs/superpowers/plans/yuholens_enhancement.md) are local-only by convention.

Session spend ~$16.00 ($15.66 droplet 7.87 h × $1.99/hr, $0.34 OpenAI 490 judge calls); cumulative project ~$49.64 of $65 cap.
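The two picked-fresh distributions quoted above reproduce the headline means if the five slots are read as counts for scores 1 through 5. That ordering is an assumption, though it is the one consistent with the reported 4.04 and 3.88. A short arithmetic check, with a hypothetical helper name:

```python
def mean_from_histogram(counts):
    """counts[i] is the number of memos that received score i + 1 (scores 1..5)."""
    total = sum(counts)
    return sum((i + 1) * c for i, c in enumerate(counts)) / total


bo9_counts = [0, 1, 5, 35, 9]   # 0/1/5/35/9 from the entry above
bo5_counts = [0, 2, 7, 36, 5]   # 0/2/7/36/5 from the entry above

bo9_mean = mean_from_histogram(bo9_counts)
bo5_mean = mean_from_histogram(bo5_counts)
print(bo9_mean)                       # 4.04
print(bo5_mean)                       # 3.88
print(round(bo9_mean - bo5_mean, 2))  # 0.16
```

Both histograms sum to 50, matching n=50.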

[2026-04-28] — Session 1.10 — ORPO V3 plateau at step 100, kill-switch fired

Added

  • configs/orpo_v3.yaml — V3 trainer hyperparameters (4 epochs, lr 3.0e-6, beta 0.05, save_steps 50). Differs from V2.2 baseline (configs/orpo.yaml) along three axes: 4× the epochs, slower lr to keep gradients stable over 4× more steps, softer beta to relax SFT-anchor regularisation. Hypothesis: the V2.2 trained-but-flat result (50 steps, rewards/accuracies stuck at 0.0) was budget-bounded rather than data-bounded; widening the budget should let the preference signal transfer (902b87e).

Result

V3 trained 100 of 200 planned steps (epoch 2.0 of 4) on the canonical-gate-clean 790-row preference set from session 1.9, then was killed by the G1.7 mid-train kill switch. Across all 10 logged step blocks, rewards/accuracies stayed exactly 0.0; rewards/margins improved monotonically from -0.0209 at step 10 to -0.0158 at step 70 (a 24% reduction) then plateaued through epoch 2.0 (steps 70–100) at margins ≈ -0.0158 and log_odds_chosen ≈ -0.78 to -0.80. Neither PASS condition (rewards/accuracies > 0.0 OR rewards/margins > 0.0) was met at step 100; the trainer was terminated cleanly and the droplet destroyed without snapshot retention.

Pre-flight 5-step smoke (introduced this session as G1.5 to validate plumbing before committing the full $10 run) passed cleanly in ~9.5 min. No plumbing bugs were found, but the gate is now standard for any future preference-optimisation attempt.

Cross-iteration pattern: V2.2 (1 epoch, beta 0.1, lr 5e-6) and V3 (≥2 epochs, beta 0.05, lr 3e-6) both showed ~24% reduction in reward margins before plateauing; doubling the budget did not double the headroom. The signal-to-noise floor at this corpus size appears asymptotic at margins ≈ -0.015, not zero.

Shipping artifact unchanged from session 1.7 (SFT bo5 @ KG-2 PASS 3.88). Session spend ~$6.79 (3.41 h MI300X, no OAI calls); cumulative ~$33.64 of $65 cap.
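The G1.7 PASS condition and the quoted 24% margin reduction can be checked with a few lines. This is a sketch under assumptions: the function name and the shape of the log entry are hypothetical, and only the metric names and numbers come from the entry above.

```python
def kill_switch_passes(log_entry: dict) -> bool:
    """PASS if either reward metric is strictly positive, per the condition described above."""
    return (
        log_entry.get("rewards/accuracies", 0.0) > 0.0
        or log_entry.get("rewards/margins", float("-inf")) > 0.0
    )


# Values reported for step 100; the dict layout is illustrative.
step_100 = {"step": 100, "rewards/accuracies": 0.0, "rewards/margins": -0.0158}
print(kill_switch_passes(step_100))  # False, so the run is terminated

# Relative improvement in margins from step 10 to step 70.
margin_step_10, margin_step_70 = -0.0209, -0.0158
reduction = (abs(margin_step_10) - abs(margin_step_70)) / abs(margin_step_10)
print(f"{reduction:.1%}")  # 24.4%
```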

[2026-04-27] — Session 1.9 — ORPO V2.1 measurement bug, V2.2 trained-but-flat

Changed

  • CRITIQUE_SYSTEM in src/yuholens/training/orpo_data.py now references the canonical singular (ref: "...") citation marker form explicitly, with two CITATION_RE-matching examples and a counter-example that forbids rewriting markers as (refs:, [ref:, or (citation:. The V2.1 patch had referenced the plural (refs: form, which does not match the canonical evaluator regex. Re-measured under the canonical regex, the V2.1 hedge "PASS" was actually a chosen citation rate of 0.6926 against a rejected rate of 0.9950: a measurement bug, not a real gate clearance.
  • tests/test_orpo_data.py::test_critique_system_requires_citation_preservation strengthened to (a) require the singular (ref: form to dominate the plural in count, (b) require at least 5 occurrences of the singular form, and (c) assert that CITATION_RE.search(CRITIQUE_SYSTEM) matches a real example. Future drifts between prompt language and evaluator regex now fail this test loudly.
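The three checks in the strengthened test can be illustrated as below. The real CITATION_RE and CRITIQUE_SYSTEM are not shown in this changelog, so the regex, the prompt excerpt, and the helper name are all hypothetical stand-ins; the real test requires at least 5 singular occurrences, while this toy prompt only has 2.

```python
import re

# Hypothetical stand-in for the evaluator's CITATION_RE (real pattern not shown here).
CITATION_RE = re.compile(r'\(ref:\s*"[^"]+"\)')

# Hypothetical prompt excerpt, not the project's CRITIQUE_SYSTEM.
PROMPT = (
    'Preserve every (ref: "FY2024 10-K, p. 12") marker verbatim. '
    'Keep (ref: "Q3 call transcript") exactly as written. '
    'Never rewrite markers as (refs: "..."), [ref: "..."], or (citation: "...").'
)


def check_prompt(prompt: str, min_singular: int) -> None:
    """Fail loudly if the prompt's marker language drifts from the evaluator regex."""
    singular = prompt.count("(ref:")
    plural = prompt.count("(refs:")
    assert singular > plural, "singular form must dominate the plural"
    assert singular >= min_singular, "too few singular markers"
    assert CITATION_RE.search(prompt) is not None, "no example matches the evaluator regex"


check_prompt(PROMPT, min_singular=2)

# The failure mode behind the V2.1 measurement bug: the plural form does not match.
print(CITATION_RE.search('(refs: "FY2024 10-K, p. 12")'))  # None
```

Running the evaluator regex against the prompt itself is the check that ties the two together, so a future rewording of either side breaks the test rather than the measurement.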

Result

V2.2 critique batch (gpt-5-mini, 800 prompts, ~$3) cleared the canonical-regex data gate cleanly: chosen citation rate 1.0000, rejected 0.9962, median length ratio 1.221, 70% of rows fully preserve all citations, 790 prefs after empty-rewrite filtering. ORPO trained 50 steps in ~71 min on a fresh MI300X (atl1, snapshot 225943366, ~$3.75 compute) at the configs/orpo.yaml defaults. Across all five logged step blocks, rewards/accuracies stayed 0.0 and reward margins stayed negative — the preference signal did not transfer at this data scale and step count. Smoke at the v5 decoder (n=7) produced mean coherence 3.571, citation 1.000, section coverage 1.000 — indistinguishable from the SFT v5 single-shot baseline of 3.56 within judge noise. Best-of-7 generation was skipped on the basis of the smoke result + training-time signal (expected bo7 mean would tie SFT bo5 at 3.88, not exceed it). Shipping artifact unchanged from session 1.7 (SFT bo5 @ KG-2 PASS 3.88). Negative ORPO arc now spans three failed data-gate iterations plus one trained-and-evaluated tie, documenting the bound for the build narrative.

[2026-04-26] — Session 1.8 — ORPO V2 negative result

Changed

  • CRITIQUE_SYSTEM in src/yuholens/training/orpo_data.py requires the rewriter to PRESERVE EVERY (refs: ...) citation marker from the SFT draft verbatim and never delete, rename, merge, reword, or reorder existing tags (55db47b). Locked in by tests/test_orpo_data.py::test_critique_system_requires_citation_preservation.
  • docs/blog_post.md corrected: the previous draft claimed ORPO ran on ~1,000 preference pairs and that infrastructure was staged but not exercised. ORPO was wired end-to-end and tried twice; both attempts failed at a pre-training data-quality gate before any GPU training step (cd0f0cf).
  • docs/model-card.md abstract no longer claims "supervised fine-tuning and reference-free preference optimization"; the shipped artifact is SFT only. Training, hyperparameter table, evaluation, and limitations sections all updated to reflect the two ORPO data-gate failures (e01823d).

Result

ORPO V2 critique batch (gpt-5-mini, 800 prompts) landed at chosen citation rate 0.305 versus rejected 0.995 — a hard fail on the 0.80 gate. No GPU training step was run. Shipping artifact unchanged from session 1.7 (best-of-5 SFT @ 3.88 KG-2 PASS).
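As a rough illustration of the pre-training data gate, the sketch below applies a 0.80 threshold to the chosen citation rates reported across the ORPO iterations in this changelog. The function name and the use of `>=` rather than `>` are assumptions; how the citation rate itself is computed is not specified here and is not modelled.

```python
GATE_THRESHOLD = 0.80  # the gate value quoted in the entry above


def passes_data_gate(chosen_citation_rate: float, threshold: float = GATE_THRESHOLD) -> bool:
    """Return True if the rewritten (chosen) memos keep enough citations to train on."""
    return chosen_citation_rate >= threshold


# Chosen citation rates as reported in sessions 1.8 and 1.9.
reported = {
    "V2": 0.305,
    "V2.1 (canonical regex)": 0.6926,
    "V2.2": 1.0000,
}
verdicts = {name: passes_data_gate(rate) for name, rate in reported.items()}
print(verdicts)
```

Gating on the chosen side is what catches a rewriter that improves prose while dropping citations, since the rejected drafts stay near 0.995 in every iteration.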

[2026-04-25] — Session 1.7 — KG-2 PASS

Added

  • src/yuholens/eval/run_sft_drafts.py for ORPO draft generation at v5 decoding (b16e8d7).
  • scripts/bestofn_pick.py to pick the highest-coherence memo per custom_id from N candidate sets via cached judge scores (b16e8d7).
  • scripts/bestofn_judge.py fresh-pass scorer that judges every memo across N candidate sets in a single session (f6ac0d6).
  • scripts/bo3_finalise.sh orchestrating the post-best-of-3 pipeline (15ac06c).
  • --seed and --skip-judge flags on run_kg2.py so candidate sets are independently reproducible (f6ac0d6).
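The per-prompt selection that scripts/bestofn_pick.py is described as doing can be sketched as a max over candidate sets keyed by custom_id. The record fields other than custom_id, the function name, and the tie-breaking rule (earliest set wins) are assumptions for illustration.

```python
def pick_best_per_id(candidate_sets):
    """Keep the highest-coherence memo per custom_id across N candidate sets.

    Ties keep the memo from the earliest set, because only a strictly
    higher score replaces the current pick.
    """
    best = {}
    for candidate_set in candidate_sets:
        for row in candidate_set:
            current = best.get(row["custom_id"])
            if current is None or row["coherence"] > current["coherence"]:
                best[row["custom_id"]] = row
    return best


# Hypothetical cached judge scores for two candidate sets.
set_a = [
    {"custom_id": "p1", "memo": "A1", "coherence": 3.0},
    {"custom_id": "p2", "memo": "A2", "coherence": 4.0},
]
set_b = [
    {"custom_id": "p1", "memo": "B1", "coherence": 4.0},
    {"custom_id": "p2", "memo": "B2", "coherence": 4.0},
]
picked = pick_best_per_id([set_a, set_b])
print({cid: row["memo"] for cid, row in picked.items()})
```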

Changed

  • ORPO CRITIQUE_SYSTEM rewritten to embed the seven-section coherence rubric, replacing citation-grounded language that was orthogonal to what the KG-2 judge actually scores (b16e8d7).
  • configs/orpo.yaml model_id corrected to checkpoint-212.

Result

KG-2 PASS at coherence 3.88, citation rate 1.000, section coverage 0.994 under the best-of-5 mixed-decoder composer (verdict committed in 9b17222).

[2026-04-22] — Session 1.6 — SFT polish module

Added

  • LM-head + last-4-layers SFT polish module (a14834c). The polish experiment regressed KG-2 coherence to 3.26 (-0.30) and was abandoned in favour of inference-time best-of-N.

Pre-history (2026-04-17 onwards)

Initial SFT loop, teacher bootstrap, ROCm bitsandbytes source build, ingestor regex tuning, Pass-1 / Pass-2 prompt design, citation-grounder with [evidence insufficient] abstention, kill-gate metrics, and the six-variant decoding sweep that established v5 as the single-shot default.