| 1 | +# BTC Fib Selection-Learning — model-ENRICHMENT RESULTS (leg-completeness) (2026-06-25) |
| 2 | + |
| 3 | +Blind Commit-2 execution of the [enrichment LOCK |
| 4 | +(2026-06-24)](btc-fib-selection-learning-enrichment-lock-20260624.md). One pre-specified feature |
| 5 | +(`exclusivity` / leg-completeness, E1), one nested comparison vs the current Stage-2 model (E2), one |
| 6 | +blind verdict (E4). **No edge / behaviour / PnL / backtest claim** (E7). Harness: |
| 7 | +[`selection_learning_enrich.py`](../../../src/fibengine/research/selection_learning_enrich.py) |
| 8 | +(commit `c80acb0`); seed `20260618`; frozen-data parity (no `--refresh`); preflight READY before run. |
| 9 | + |
| 10 | +> **Verdict (blind, 4h primary k=3): `enriched_worse_check`.** The enriched model is *significantly |
| 11 | +> worse* than current Stage-2 (AP-lift 95% CI entirely below 0). The lock's direction-guard checks |
| 12 | +> (parity, no look-ahead, bootstrap unit, power) all pass → **not a bug**. For the north-star this is |
| 13 | +> a negative shot: the locked per-leg `exclusivity` feature does **not** add human-like leg-selection |
| 14 | +> signal over the model we already have. The per-leg-feature modeling line is **closed**. |
| 15 | + |
| 16 | +## Observed (measured — 4h primary, powered: 65 test positives) |
| 17 | + |
| 18 | +| quantity | value | |
| 19 | +|---|---| |
| 20 | +| AP baseline (current Stage-2, nested) | **0.056737** | |
| 21 | +| AP enriched (Stage-2 + `exclusivity`) | **0.038744** | |
| 22 | +| AP-lift (point) | **−0.017993** | |
| 23 | +| AP-lift bootstrap mean | −0.023651 | |
| 24 | +| **AP-lift 95% CI** | **[−0.070026, −0.001895]** (excludes 0, below) | |
| 25 | +| p(lift ≤ 0), one-sided | 0.994 | |
| 26 | +| bootstrap | decision-point cluster by `anchor_b`, 2000 resamples, 2071 groups | |
| 27 | +| ROC-AUC enriched (secondary) | 0.9252 | |
| 28 | +| `corr(exclusivity, cleanliness)` (train) | **0.804** | |
| 29 | +| `exclusivity` standardized weight | +0.1142 (`cleanliness` +0.1502 still leads) | |
| 30 | +| n_candidates / n_train / n_test | 86244 / 61368 / 24852 | |
| 31 | +| rows excluded (endpoint beyond data / not reconstructible) | 0 / 0 | |
| 32 | +| `exclusivity` dist | mean 0.275, std 0.345, frac@0 0.497, frac@1 0.093 | |
| 33 | + |
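The bootstrap row in the table above (cluster by `anchor_b`, 2000 resamples) can be illustrated with a minimal standard-library sketch. It is not the code in `selection_learning_enrich.py`: the function names, the tie handling in `average_precision`, and the floor-index percentile rule are all assumptions made for illustration. Only the seed and the resample count are taken from this document.

```python
import random
from collections import defaultdict


def average_precision(labels, scores):
    """Mean of precision@rank taken at each positive's rank (ties keep input order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")


def cluster_bootstrap_lift(labels, base_scores, enr_scores, groups,
                           n_resamples=2000, seed=20260618):
    """Resample whole groups with replacement and recompute the pooled AP lift."""
    rng = random.Random(seed)
    rows_by_group = defaultdict(list)
    for row, group in enumerate(groups):
        rows_by_group[group].append(row)
    keys = list(rows_by_group)

    lifts = []
    for _ in range(n_resamples):
        # Draw as many groups as exist; every row of a drawn group comes along.
        rows = [r for g in rng.choices(keys, k=len(keys)) for r in rows_by_group[g]]
        y = [labels[r] for r in rows]
        if not any(y):
            continue  # AP is undefined when a resample has no positives
        base = average_precision(y, [base_scores[r] for r in rows])
        enriched = average_precision(y, [enr_scores[r] for r in rows])
        lifts.append(enriched - base)

    lifts.sort()
    lo = lifts[int(0.025 * (len(lifts) - 1))]  # simple floor-index percentiles
    hi = lifts[int(0.975 * (len(lifts) - 1))]
    return {
        "mean": sum(lifts) / len(lifts),
        "ci95": (lo, hi),
        "p_lift_le_0": sum(v <= 0 for v in lifts) / len(lifts),
    }
```

Resampling whole `anchor_b` groups rather than single rows keeps candidates from the same decision point together, so the interval is not narrowed by treating correlated rows as independent.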
| 34 | +**Parity gate (proves the nested baseline IS the current model):** `ap_baseline_stage2` = |
| 35 | +**0.056737** = the Stage-2 headline **0.0567**; `n_test_positives` = **65**, matching Stage-2. |
| 36 | +*Spec-reconciliation:* the pre-run note "n_candidates ≈ 24852" was a label mix-up — **24852 = n_test**; |
| 37 | +the full candidate universe is **86244** (= Stage-2's universe). Substantive parity holds. |
| 38 | +`rows_excluded = 0` confirms every row reconstructs causally (no look-ahead, no endpoint dropped). |
| 39 | + |
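A parity gate of this kind could be expressed as a small check over the summary values. The sketch below is illustrative only: `parity_gate` is a hypothetical helper, and treating `rows_excluded` as a single integer in a flat dictionary is an assumption about the layout of `summary.json`, which this document does not show. The key names `ap_baseline_stage2`, `n_test_positives` and `rows_excluded` are the ones quoted above.

```python
def parity_gate(summary, headline_ap=0.0567, headline_digits=4,
                expected_test_positives=65):
    """Return a list of parity failures; an empty list means the gate passes."""
    failures = []
    # The Stage-2 headline is quoted to 4 decimals, so allow half a unit in the last place.
    tolerance = 0.5 * 10 ** -headline_digits
    if abs(summary["ap_baseline_stage2"] - headline_ap) >= tolerance:
        failures.append("nested baseline AP does not round to the Stage-2 headline")
    if summary["n_test_positives"] != expected_test_positives:
        failures.append("test-positive count differs from Stage-2")
    if summary["rows_excluded"] != 0:
        failures.append("some rows were excluded, so the universes differ")
    return failures


# Values reported in this document pass the gate.
print(parity_gate({"ap_baseline_stage2": 0.056737,
                   "n_test_positives": 65,
                   "rows_excluded": 0}))  # []
```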
| 40 | +**Context cells (underpowered, never refuted — E3 power floor ≥10 positives):** |
| 41 | + |
| 42 | +| TF | test pos | AP base | AP enr | lift | note | |
| 43 | +|---|---|---|---|---|---| |
| 44 | +| 1M | 5 | 0.2636 | 0.2789 | +0.0153 | underpowered; corr 0.878 | |
| 45 | +| 1w | 0 | — | — | — | no positives | |
| 46 | +| 1d | 7 | 0.1617 | 0.1599 | −0.0018 | underpowered; corr 0.808 | |
| 47 | + |
| 48 | +Context is reported for completeness only; the verdict rests solely on the 4h powered cell (E3/E4). |
| 49 | + |
| 50 | +## Inferred (interpretation — not measured) |
| 51 | + |
| 52 | +- **The locked `exclusivity` definition does not enrich the current model.** A negative powered lift |
| 53 | + with CI excluding 0 means the 6th feature does not help and, as fitted, costs net out-of-sample ranking power. |
| 54 | +- **Most likely mechanism (reported per E1, *not* a reason to discount the verdict): collinearity.** |
| 55 | + `corr(exclusivity, cleanliness) = 0.804` on train — `exclusivity` is largely a `cleanliness` proxy. |
| 56 | + Adding a near-collinear, noisier regressor on only 65 test positives plausibly inflates variance and |
| 57 | + drags pooled test AP down. This is a *mechanism*, not grounds to soften the blind result. |
| 58 | +- **North-star read:** this closes the per-leg-feature line cleanly. The locked honest prior was low |
| 59 | + (four per-leg features already ~0 at k=3; only `cleanliness` stuck); the shot confirms the per-leg |
| 60 | + approach has hit its ceiling on this corpus. The binding constraint is now **data, not features**. |
| 61 | +- **Lock-routing nuance:** E8 pre-committed `no_enrichment_signal → grow the facit`. The realized |
| 62 | + branch is `enriched_worse_check`, whose *substantive* north-star implication is the same (per-leg |
| 63 | + features do not beat Stage-2 → grow the facit), but the **direction choice is not pre-committed** for |
| 64 | + this branch — it is surfaced to the user (see handoff Next Step). |
| 65 | + |
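The collinearity mechanism above can be given a rough scale with the variance inflation factor. In ordinary least squares, a coefficient's variance is inflated by 1 / (1 - R²), where R² comes from regressing that feature on all the others. That R² is at least the squared pairwise correlation, so the pairwise figure gives a lower bound. This document does not state the Stage-2 estimator, so for a regularized or logistic fit the number below is only indicative. `vif_lower_bound` is a hypothetical helper, not part of the harness.

```python
def vif_lower_bound(pairwise_corr):
    """Variance inflation implied by a single pairwise correlation: 1 / (1 - r^2)."""
    return 1.0 / (1.0 - pairwise_corr ** 2)


# corr(exclusivity, cleanliness) on train, as reported above
print(round(vif_lower_bound(0.804), 2))  # 2.83
```

Under these assumptions the `exclusivity` coefficient's sampling variance is at least about 2.8 times what it would be for an uncorrelated feature, which is consistent with the "plausibly inflates variance" reading and does not change the blind verdict.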
| 66 | +## Unverified (open — would need a NEW lock) |
| 67 | + |
| 68 | +- Whether a **decorrelated / residualized** exclusivity (orthogonalized vs `cleanliness`) carries any |
| 69 | + orthogonal signal. This is a **different feature needing its own Commit-1 lock**, not a continuation |
| 70 | + of this one, and the prior is **low**: with `cleanliness` already in the model, the enriched fit could |
| 71 | + already use the part of `exclusivity` not explained by it, and the result was worse. This is not a free, natural next step; it would reopen a closed line. |
| 72 | +- The `cleanliness`-as-genuine-signal crux stays **OPEN** (E7) — this shot does not resolve it. |
| 73 | +- Absolute reproduction of human selection remains capped by the ~0.83 coverage ceiling (E7); no |
| 74 | + edge / behaviour / PnL / Genesis / auto-fib-as-truth / label-mutation claim is made or implied. |
| 75 | + |
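For readers unfamiliar with the term, "residualized" in the first bullet above means replacing `exclusivity` with what is left after a regression on `cleanliness`. The sketch below shows one simple way to do that, assuming a one-variable least-squares fit on the training split only. `fit_residualizer` and its arguments are hypothetical names. It is not a locked feature definition and no result is claimed for it.

```python
def fit_residualizer(exclusivity, cleanliness):
    """Fit exclusivity ~ a + b * cleanliness by least squares; return a residual function."""
    n = len(exclusivity)
    mean_c = sum(cleanliness) / n
    mean_e = sum(exclusivity) / n
    scc = sum((c - mean_c) ** 2 for c in cleanliness)
    sce = sum((c - mean_c) * (e - mean_e) for c, e in zip(cleanliness, exclusivity))
    slope = sce / scc
    intercept = mean_e - slope * mean_c
    # The returned function removes the part of exclusivity that cleanliness predicts.
    return lambda e, c: e - (intercept + slope * c)
```

The coefficients would be fitted on training rows and then applied unchanged to test rows, so no test information leaks into the feature.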
| 76 | +## Artifacts |
| 77 | + |
| 78 | +- Summary JSON: `experiments/review/fib_selection_learning/enrich/summary.json` (**gitignored**). |
| 79 | +- Per-cell checkpoints: `experiments/review/fib_selection_learning/enrich/cells/*.json` (**gitignored**). |
| 80 | +- Harness + tests: `selection_learning_enrich.py`, |
| 81 | + `tests/research/test_selection_learning_enrich.py` (commit `c80acb0`; gates green — 601 pass, 74% cov). |