|
| 1 | +# BTC Fib Selection-Learning — learning-curve LOCK (facit data-sensitivity) (2026-06-25) |
| 2 | + |
| 3 | +**DOCS-ONLY. Diagnostic — no claim, no code, no run, no build, no dependency, no new universe, no |
| 4 | +label/corpus change, no push.** Lean **blind Commit-1 lock** for one cheap learning-curve shot: is the |
| 5 | +Stage-2 selection model **data-starved or saturated** w.r.t. facit size? Reuses Stage-2 verbatim; only |
| 6 | +the **training-facit fraction** varies. Execution needs a **separate explicit GO** (Commit 2). |
| 7 | + |
| 8 | +**Blindness:** no learning-curve harness exists; **no AP at any fraction has been computed or seen.** |
| 9 | +Every rule below is fixed from the [Stage-2 headline](btc-fib-selection-learning-results-20260618.md), |
| 10 | +the frozen config, and existing code — not from any learning-curve result. |
| 11 | + |
| 12 | +> **Honest framing:** this answers *"would more human-labeled 4h legs plausibly raise OOS AP?"* — a |
| 13 | +> **data-sensitivity diagnostic** and a step toward step 1 of the [north star](../north-star.md). It is **not** |
| 14 | +> a headline, adds **no** positive claim, and does **not** resolve the `cleanliness` crux. |
| 15 | + |
| 16 | +## L1. Question |
| 17 | + |
| 18 | +> **Is the current Stage-2 model data-starved (OOS-AP curve still rising at full facit) or saturated |
| 19 | +> (flat) as a function of training-facit size, on the 4h primary at k=3?** |
| 20 | +
|
| 21 | +## L2. Mechanics (reuse Stage-2 verbatim — locked) |
| 22 | + |
| 23 | +- **Cell = 4h primary, k=3.** `build_candidates`, ε-match, purged/embargoed split, `fit_logreg` (the 5 |
| 24 | + live features), pooled test **Average Precision**, decision-point cluster bootstrap — all **verbatim** |
| 25 | + from the Stage-2 headline. Frozen data (no `--refresh`). |
| 26 | +- **FIXED test set = the Stage-2 held-out split** (65 positives / 24 852 candidates). **Never |
| 27 | + subsampled** → AP is comparable across all fractions. |
| 28 | +- **Vary ONLY the training facit:** drop **whole human legs** whose `anchor_b` ∈ the train period; a |
| 29 | + training candidate is positive iff it ε-matches a **retained** human leg. The candidate universe and |
| 30 | + the features are **unchanged** — only which training rows are labeled positive shrinks. |
| 31 | +- **Subsample unit = whole human legs** (the unit you would actually add when "growing facit"), uniform |
| 32 | + random **without replacement**. |
| 33 | + |
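The relabeling step in L2 can be sketched as below. This is a minimal illustration under stated assumptions, not the harness: `subsample_legs`, `relabel`, and the `matches` structure (one set of ε-matched human-leg ids per training candidate) are hypothetical names, and the rule for rounding `f × n_legs` to a whole leg count is an assumption the lock does not specify.

```python
import math
import random


def subsample_legs(train_leg_ids, fraction, seed):
    """Retain a uniform random subset of whole human legs, without replacement."""
    legs = sorted(train_leg_ids)  # fixed order so the draw depends only on the seed
    # ASSUMPTION: round half up to a whole number of legs.
    n_keep = math.floor(fraction * len(legs) + 0.5)
    rng = random.Random(seed)
    return set(rng.sample(legs, n_keep))


def relabel(matches, retained):
    """Label a training candidate positive iff it matches a retained leg.

    matches[i] is the (possibly empty) set of human-leg ids that candidate i
    epsilon-matches. It is computed once and never changes across fractions.
    """
    return [1 if legs & retained else 0 for legs in matches]
```

Only the label vector changes between `(f, r)` draws; the candidate list and feature matrix are never touched, which is what keeps the per-draw cost at a relabel plus a refit.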
| 34 | +## L3. Grid + repeats (locked) |
| 35 | + |
| 36 | +- **Fractions** `f ∈ {0.25, 0.50, 0.75, 0.80, 0.90, 0.95, 1.00}` — finer near the top, because the |
| 37 | + **local slope at f=1.0** is what speaks to "would the *next* labels help". |
| 38 | +- **R = 64** independent subsamples per fraction (`f=1.0` is the single full-facit point). Seeds = |
| 39 | + `20260618 + fraction_index*1000 + repeat_index`. Report **mean AP + [p5, p95] band** per fraction; |
| 40 | + ROC-AUC secondary (same shape check). |
| 41 | +- **BUILD ONCE (build-time requirement):** the universe, features, and per-candidate ε-match are |
| 42 | + computed **once**; per `(f, r)` only **relabel train + refit logreg + recompute test AP** (all cheap). |
| 43 | + If the harness rebuilds the universe per fraction the cost argument collapses — it must not. |
| 44 | + |
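A small sketch of the L3 grid, seed rule, and per-fraction summary, assuming zero-based `fraction_index` and `repeat_index` and linear-interpolation percentiles (the lock fixes the formula and the `[p5, p95]` band but names neither convention). `seed_for`, `grid`, and `summarize` are hypothetical helper names.

```python
import statistics

FRACTIONS = (0.25, 0.50, 0.75, 0.80, 0.90, 0.95, 1.00)
R = 64
BASE_SEED = 20260618


def seed_for(fraction_index, repeat_index):
    # ASSUMPTION: both indices are zero-based; the lock only gives the formula.
    return BASE_SEED + fraction_index * 1000 + repeat_index


def grid():
    """Yield (fraction, repeat_index, seed); f=1.0 is a single full-facit point."""
    for fi, f in enumerate(FRACTIONS):
        repeats = 1 if f == 1.00 else R
        for r in range(repeats):
            yield f, r, seed_for(fi, r)


def summarize(aps):
    """Return (mean, p5, p95) of the AP values collected for one fraction."""
    if len(aps) == 1:
        return aps[0], aps[0], aps[0]  # single point: degenerate band
    # ASSUMPTION: linear-interpolation percentiles; the lock names no method.
    cuts = statistics.quantiles(aps, n=20, method="inclusive")
    return statistics.fmean(aps), cuts[0], cuts[-1]
```

Because `R = 64` is below the 1000 stride, no two `(f, r)` cells share a seed. The universe, features, and ε-match would be built once outside this loop, as the BUILD ONCE bullet requires.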
| 45 | +## L4. Verdict (pre-stated, ASYMMETRIC — fixed blind) |
| 46 | + |
| 47 | +The Stage-2 lift is carried almost entirely by **one** feature (`cleanliness` ~0.20; the rest ≈ 0) → |
| 48 | +**≈ 1 effective parameter → saturation is the EXPECTED default** and must not be over-read. |
| 49 | + |
| 50 | +- **`data_starved`** — mean `AP(1.0) − AP(0.95)` **exceeds the f=0.95 band half-width** (the last |
| 51 | + increment moves AP beyond train-subsample noise) and the curve is increasing: **genuinely informative, |
| 52 | + strong green light** — more facit helps *even this model*. |
| 53 | +- **`saturated`** — the last increment is **within** the band (flat): **ambiguous AND expected.** It |
| 54 | + means the **current 1-feature set is capacity-bound**, **NOT** that facit is big enough or that |
| 55 | + labeling is pointless. Routes back to the **feature / `cleanliness` crux** (matched-null), **not** away |
| 56 | + from labeling. |
| 57 | +- **`inconclusive_underpowered`** — bands overlap heavily across fractions (a **live, LIKELY** outcome |
| 58 | + with 65 test positives): **no verdict.** A within-band wiggle is not a result. |
| 59 | + |
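The `data_starved` test above can be sketched as a single check. This is an illustration only: it reads "band half-width" as `(p95 − p5) / 2` and "the curve is increasing" as full-facit AP exceeding mean AP at the smallest fraction, both of which are assumptions rather than locked definitions. It deliberately does not separate `saturated` from `inconclusive_underpowered`, because the lock gives no numeric rule for "overlap heavily". `last_increment_check` is a hypothetical name.

```python
def last_increment_check(mean_ap, p5_at_095, p95_at_095, ap_full):
    """Evaluate the last-increment rule for the data_starved branch.

    mean_ap    : dict mapping each fraction below 1.0 to its mean AP over repeats
    p5_at_095  : lower edge of the f=0.95 band
    p95_at_095 : upper edge of the f=0.95 band
    ap_full    : the single AP value at f=1.0
    """
    # ASSUMPTION: half-width is half the [p5, p95] span at f=0.95.
    half_width = (p95_at_095 - p5_at_095) / 2.0
    delta = ap_full - mean_ap[0.95]
    # ASSUMPTION: "increasing" means AP at full facit beats the smallest fraction.
    increasing = ap_full > mean_ap[min(mean_ap)]
    return {
        "delta": delta,
        "half_width": half_width,
        "data_starved": delta > half_width and increasing,
    }
```

A `False` result here is not a verdict by itself: per the asymmetry above, it still has to be routed to `saturated` or `inconclusive_underpowered`.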
| 60 | +## L5. Variance naming (locked) |
| 61 | + |
| 62 | +The R-band = train-side **"which legs were dropped"** variance. `AP(1.0)` is a single point (no |
| 63 | +train-subsample variance) but still carries **test-side noise from 65 positives** — shared across |
| 64 | +fractions, so it **partly cancels in fraction differences**. The verdict reads **differences**, not |
| 65 | +absolute levels. |
| 66 | + |
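The "partly cancels" statement follows from the standard variance identity for a difference. The notation is assumed here, not taken from the lock: $\widehat{AP}(f)$ is the AP estimated at fraction $f$ on the fixed test set.

$$
\operatorname{Var}\big[\widehat{AP}(f_1) - \widehat{AP}(f_2)\big]
= \operatorname{Var}\big[\widehat{AP}(f_1)\big]
+ \operatorname{Var}\big[\widehat{AP}(f_2)\big]
- 2\,\operatorname{Cov}\big[\widehat{AP}(f_1), \widehat{AP}(f_2)\big]
$$

Since every fraction is scored on the same 65 test positives, the test-side noise is plausibly positively correlated across fractions. To the extent that it is, the covariance term is positive and the difference is less noisy than two independently tested APs would be. The cancellation is only partial, because the fitted models differ between fractions.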
| 67 | +## L6. Addable-supply context (reported, not a gate) |
| 68 | + |
| 69 | +Report alongside the verdict: **labeled human legs (365)** vs the detector's **candidate universe |
| 70 | +(~86 244)** on frozen 4h, and the **bounded** nature of addable supply (true human-meaningful count is |
| 71 | +unknown without a human pass and is capped by available history). **If `data_starved` BUT little |
| 72 | +human-meaningful 4h history remains, the route is more history / symbols (a protocol change), not |
| 73 | +grinding the same chart** — this directly informs the "it takes forever" cost. |
| 74 | + |
| 75 | +## L7. Non-claims (binding) |
| 76 | + |
| 77 | +Diagnostic of **data-sensitivity only**. **No edge / behaviour / PnL / backtest / Genesis / |
| 78 | +auto-fib-as-truth.** Does **not** resolve the `cleanliness` crux — a rising curve means more data |
| 79 | +improves human-selection reproduction **regardless** of whether `cleanliness` is judgment or artifact. |
| 80 | +Frozen data, no `--refresh`; **only the 4h primary is powered**; 1M/1w/1d are **context, never refuted**. |
| 81 | + |
| 82 | +## L8. Implementation (Commit 2 — NOT executed here) |
| 83 | + |
| 84 | +- **New module `src/fibengine/research/selection_learning_curve.py`** with its **own CLI**; **no code |
| 85 | + added to byte-capped `selection_learning.py`**. Reuses `build_candidates`, `fit_logreg`, |
| 86 | + `predict_proba`, `average_precision`, `roc_auc`, the decision-point machinery, `window_of`, ε, and the |
| 87 | + `FROZEN_SNAPSHOT` preflight; **build-once-vary-labels** per L3. |
| 88 | +- **Tests** `tests/research/test_selection_learning_curve.py`: fixed-test-set invariance, subsample |
| 89 | + unit = whole legs, **build-once** (features identical across fractions), verdict branches incl. the |
| 90 | + L4 asymmetry + `inconclusive`, seed determinism. |
| 91 | +- **Results doc** later (Observed / Inferred / Unverified). Artifacts under |
| 92 | + `experiments/review/fib_selection_learning/curve/` (**gitignored**). **Preflight FIRST**, frozen-data |
| 93 | + parity. **Separate explicit GO before any build/run.** |