docs(research): LOCK learning-curve diagnostic (facit data-sensitivity, Commit 1)

JohnCCarter · claude · JohnCCarter · commit d1d0845dbd2b · 2026-06-25T13:19:26.000+02:00
Blind lean lock for a cheap data-sensitivity shot toward north-star step 1: is
the Stage-2 model data-starved or saturated w.r.t. facit size? Reuse Stage-2
verbatim, fixed held-out test set, vary only training-facit fraction (whole
human legs), build-once-vary-labels, R=64, finer grid near f=1.0. Advisor-refined
before lock: ASYMMETRIC verdict (model ~1-effective-param via cleanliness ->
saturation is the EXPECTED default; flat curve routes to feature/crux, NOT "don't
grow facit"); inconclusive is a likely branch with 65 test positives; report
addable-supply. Diagnostic only, no edge/behaviour/PnL/Genesis claim; does not
resolve the cleanliness crux. Commit 2 (build/run) needs a separate explicit GO.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/docs/research_wiki/handoff.md b/docs/research_wiki/handoff.md
@@ -27,20 +27,18 @@ append-only trail lives in [log.md](log.md).
 
 **ETH/USD:** blocked until BTC protocol approved.
 
-## Next Step (requires explicit GO)
-
-Enrichment shot **DONE** (Commit 2, 4h k=3): blind verdict **`enriched_worse_check`** — `exclusivity`
-is significantly *worse* than Stage-2 (AP-lift CI [−0.070, −0.0019]); validity checks pass (not a bug).
-**Per-leg-feature line is CLOSED.** Next is a **GO-fork** — direction is the user's call (AGENTS.md §1);
-recommendation: **B**.
-
-- **B — grow facit (recommended):** park modeling, return to the main quest — more/better human labels
-  ([main-quest reset](reviews/btc-fib-selection-learning-main-quest-reset-20260624.md) §5). Binding
-  constraint is now **data, not features**.
-- **A′ — decorrelated exclusivity (low prior, NEW lock):** `exclusivity` was 0.80 collinear with
-  `cleanliness`; an orthogonalized variant needs its **own** Commit-1 lock (reopens a closed line). Not free.
-
-**No code/run/build until a separate explicit GO.**
+## Next Step — learning-curve diagnostic LOCKED, awaiting GO to build
+
+Per-leg-feature line is **CLOSED** (`enriched_worse_check`). Toward [north-star](north-star.md) step 1,
+the user picked **learning curve first** (measure before investing in labeling). **Commit 1 lock
+written** — [learning-curve LOCK](reviews/btc-fib-selection-learning-learning-curve-lock-20260625.md):
+is the Stage-2 model **data-starved or saturated** w.r.t. facit size? Reuse Stage-2 verbatim, fixed
+test set, vary only training-facit fraction (whole legs), build-once-vary-labels, R=64, finer grid near
+f=1.0. **ASYMMETRIC verdict** (model ≈1-effective-param → saturation expected; flat ⇒ feature/crux,
+**not** "don't grow facit"); `inconclusive_underpowered` is a likely branch (65 test pos). Diagnostic only, no claim.
+
+**Awaiting separate explicit GO for Commit 2 (build harness + tests + run).** Then: results doc +
+handoff/log. Other open fork if this routes there: A′ decorrelated `exclusivity` (NEW lock, low prior).
 
 ## Recent Changes
 
diff --git a/docs/research_wiki/log.md b/docs/research_wiki/log.md
@@ -16,6 +16,21 @@ Types: `ingest`, `decision`, `review`, `question`, `maintenance`.
 > Pre-reset (2026-06-10 and earlier): [part 3](log-archive-pre-btc-reset-part3.md) →
 > [part 2](log-archive-pre-btc-reset-part2.md) → [part 1](log-archive-pre-btc-reset-part1.md)
 
+## [2026-06-25] decision | Learning-curve diagnostic LOCKED (Commit 1, docs-only); awaiting GO
+
+User picked "learning curve first" as the next step toward [north-star](north-star.md) step 1 (does
+the engine select like the human). Lean blind lock for a cheap data-sensitivity shot: is the Stage-2
+model **data-starved or saturated** w.r.t. facit size? Reuses Stage-2 verbatim, fixes the held-out
+test set, varies only the **training-facit fraction** (whole human legs dropped), build-once-vary-
+labels. Advisor-refined before lock: **(1) ASYMMETRIC verdict** — model is ≈1-effective-parameter
+(`cleanliness` carries the lift) so saturation is the EXPECTED default; a flat curve means "this
+1-feature model is capacity-bound → back to the feature/crux", **NOT** "don't grow facit". (2) finer
+grid near f=1.0 + R=64 (build is once, relabel+refit is cheap). (3) `inconclusive_underpowered` is a
+LIKELY branch with 65 test positives. (4) report addable-supply (365 labeled vs ~86k candidates;
+if starved but little history left → more history/symbols, not grind same chart). Diagnostic only —
+no edge/behaviour/PnL/Genesis; does NOT resolve the cleanliness crux. Commit 2 (build/run) needs a
+separate explicit GO. [Lock](reviews/btc-fib-selection-learning-learning-curve-lock-20260625.md).
+
 ## [2026-06-25] decision | North-star vision documented (canonical [north-star.md](north-star.md))
 
 Docs-only. Captured the user's original intent verbatim: *lär maskinen att se chartet som människan*
diff --git a/docs/research_wiki/reviews/btc-fib-selection-learning-learning-curve-lock-20260625.md b/docs/research_wiki/reviews/btc-fib-selection-learning-learning-curve-lock-20260625.md
@@ -0,0 +1,93 @@
+# BTC Fib Selection-Learning — learning-curve LOCK (facit data-sensitivity) (2026-06-25)
+
+**DOCS-ONLY. Diagnostic — no claim, no code, no run, no build, no dependency, no new universe, no
+label/corpus change, no push.** Lean **blind Commit-1 lock** for one cheap learning-curve shot: is the
+Stage-2 selection model **data-starved or saturated** w.r.t. facit size? Reuses Stage-2 verbatim; only
+the **training-facit fraction** varies. Execution needs a **separate explicit GO** (Commit 2).
+
+**Blindness:** no learning-curve harness exists; **no AP at any fraction has been computed or seen.**
+Every rule below is fixed from the [Stage-2 headline](btc-fib-selection-learning-results-20260618.md),
+the frozen config, and existing code — not from any learning-curve result.
+
+> **Honest framing:** this answers *"would more human-labeled 4h legs plausibly raise OOS AP?"* — a
+> **data-sensitivity diagnostic**, a step toward [north star](../north-star.md) step 1. It is **not**
+> a headline, adds **no** positive claim, and does **not** resolve the `cleanliness` crux.
+
+## L1. Question
+
+> **Is the current Stage-2 model data-starved (OOS-AP curve still rising at full facit) or saturated
+> (flat) as a function of training-facit size, on the 4h primary at k=3?**
+
+## L2. Mechanics (reuse Stage-2 verbatim — locked)
+
+- **Cell = 4h primary, k=3.** `build_candidates`, ε-match, purged/embargoed split, `fit_logreg` (the 5
+  live features), pooled test **Average Precision**, decision-point cluster bootstrap — all **verbatim**
+  from the Stage-2 headline. Frozen data (no `--refresh`).
+- **FIXED test set = the Stage-2 held-out split** (65 positives / 24 852 candidates). **Never
+  subsampled** → AP is comparable across all fractions.
+- **Vary ONLY the training facit:** drop **whole human legs** whose `anchor_b` ∈ the train period; a
+  training candidate is positive iff it ε-matches a **retained** human leg. The candidate universe and
+  the features are **unchanged** — only which training rows are labeled positive shrinks.
+- **Subsample unit = whole human legs** (the unit you would actually add when "growing facit"), uniform
+  random **without replacement**.
+
+## L3. Grid + repeats (locked)
+
+- **Fractions** `f ∈ {0.25, 0.50, 0.75, 0.80, 0.90, 0.95, 1.00}` — finer near the top, because the
+  **local slope at f=1.0** is what speaks to "would the *next* labels help".
+- **R = 64** independent subsamples per fraction (`f=1.0` is the single full-facit point). Seeds =
+  `20260618 + fraction_index*1000 + repeat_index`. Report **mean AP + [p5, p95] band** per fraction;
+  ROC-AUC secondary (same shape check).
+- **BUILD ONCE (build-time requirement):** the universe, features, and per-candidate ε-match are
+  computed **once**; per `(f, r)` only **relabel train + refit logreg + recompute test AP** (all cheap).
+  If the harness rebuilds the universe per fraction the cost argument collapses — it must not.
+
+## L4. Verdict (pre-stated, ASYMMETRIC — fixed blind)
+
+The Stage-2 lift is carried almost entirely by **one** feature (`cleanliness` ~0.20; the rest ≈ 0) →
+**≈ 1 effective parameter → saturation is the EXPECTED default** and must not be over-read.
+
+- **`data_starved`** — `AP(1.0)` minus mean `AP(0.95)` **exceeds the f=0.95 band half-width** (the last
+  increment moves AP beyond train-subsample noise) and the curve is increasing: **genuinely informative,
+  strong green light** — more facit helps *even this model*.
+- **`saturated`** — the last increment is **within** the band (flat): **ambiguous AND expected.** It
+  means the **current 1-feature set is capacity-bound**, **NOT** that facit is big enough or that
+  labeling is pointless. Routes back to the **feature / `cleanliness` crux** (matched-null), **not** away
+  from labeling.
+- **`inconclusive_underpowered`** — bands overlap heavily across fractions (a **live, LIKELY** outcome
+  with 65 test positives): **no verdict.** A within-band wiggle is not a result.
+
+## L5. Variance naming (locked)
+
+The R-band = train-side **"which legs were dropped"** variance. `AP(1.0)` is a single point (no
+train-subsample variance) but still carries **test-side noise from 65 positives** — shared across
+fractions, so it **partly cancels in fraction differences**. The verdict reads **differences**, not
+absolute levels.
+
+## L6. Addable-supply context (reported, not a gate)
+
+Report alongside the verdict: **labeled human legs (365)** vs the detector's **candidate universe
+(~86 244)** on frozen 4h, and the **bounded** nature of addable supply (true human-meaningful count is
+unknown without a human pass and is capped by available history). **If `data_starved` BUT little
+human-meaningful 4h history remains, the route is more history / symbols (a protocol change), not
+grinding the same chart** — this directly informs the "it takes forever" cost.
+
+## L7. Non-claims (binding)
+
+Diagnostic of **data-sensitivity only**. **No edge / behaviour / PnL / backtest / Genesis /
+auto-fib-as-truth.** Does **not** resolve the `cleanliness` crux — a rising curve means more data
+improves human-selection reproduction **regardless** of whether `cleanliness` is judgment or artifact.
+Frozen data, no `--refresh`; **only the 4h primary is powered**; 1M/1w/1d are **context, never refuted**.
+
+## L8. Implementation (Commit 2 — NOT executed here)
+
+- **New module `src/fibengine/research/selection_learning_curve.py`** with its **own CLI**; **no code
+  added to byte-capped `selection_learning.py`**. Reuses `build_candidates`, `fit_logreg`,
+  `predict_proba`, `average_precision`, `roc_auc`, the decision-point machinery, `window_of`, ε, and the
+  `FROZEN_SNAPSHOT` preflight; **build-once-vary-labels** per L3.
+- **Tests** `tests/research/test_selection_learning_curve.py`: fixed-test-set invariance, subsample
+  unit = whole legs, **build-once** (features identical across fractions), verdict branches incl. the
+  L4 asymmetry + `inconclusive_underpowered`, seed determinism.
+- **Results doc** later (Observed / Inferred / Unverified). Artifacts under
+  `experiments/review/fib_selection_learning/curve/` (**gitignored**). **Preflight FIRST**, frozen-data
+  parity. **Separate explicit GO before any build/run.**
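
The lock's L2, L3, and L4 mechanics can be illustrated with a minimal Python sketch. It is illustrative only and not part of this commit or of the planned `selection_learning_curve.py`. All function names are hypothetical. The rounding of `f * n_legs`, the linear-interpolation percentile, and the reading of "band half-width" as half the [p5, p95] width are assumptions the lock does not spell out. The "curve is increasing" condition and the `inconclusive_underpowered` overlap test are deliberately left out because the lock does not give a formula for them.

```python
import random
from statistics import mean

FRACTIONS = [0.25, 0.50, 0.75, 0.80, 0.90, 0.95, 1.00]  # grid from L3
R = 64                                                   # repeats per fraction
BASE_SEED = 20260618


def seed_for(fraction_index, repeat_index):
    """Seed scheme from L3: base + fraction_index*1000 + repeat_index."""
    return BASE_SEED + fraction_index * 1000 + repeat_index


def retained_legs(leg_ids, fraction, seed):
    """Uniform random subset of whole human legs, without replacement.

    Assumption: the number kept is round(fraction * n_legs).
    """
    ordered = sorted(leg_ids)
    n_keep = round(fraction * len(ordered))
    return set(random.Random(seed).sample(ordered, n_keep))


def relabel(matched_leg, kept):
    """Relabel training candidates without rebuilding anything.

    matched_leg[i] is the human leg that candidate i eps-matches (computed
    once, up front), or None. A candidate is positive iff that leg is kept.
    """
    return [1 if leg is not None and leg in kept else 0 for leg in matched_leg]


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 1] (assumed convention)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def last_increment_exceeds_band(ap_full, aps_at_095):
    """True if AP(1.0) - mean AP(0.95) is larger than half the [p5, p95] width."""
    half_width = (percentile(aps_at_095, 0.95) - percentile(aps_at_095, 0.05)) / 2
    return ap_full - mean(aps_at_095) > half_width
```

Because `relabel` only reads a precomputed match list, the universe and features never need to be rebuilt per `(f, r)`, which is the build-once property L3 requires.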