Commit d1d0845

JohnCCarter and claude committed
docs(research): LOCK learning-curve diagnostic (facit data-sensitivity, Commit 1)
Blind lean lock for a cheap data-sensitivity shot toward north-star step 1: is the Stage-2 model data-starved or saturated wrt facit size? Reuse Stage-2 verbatim, fixed held-out test set, vary only training-facit fraction (whole human legs), build-once-vary-labels, R=64, finer grid near f=1.0.

Advisor-refined before lock: ASYMMETRIC verdict (model ~1-effective-param via cleanliness -> saturation is the EXPECTED default; flat curve routes to feature/crux, NOT "don't grow facit"); inconclusive is a likely branch with 65 test positives; report addable-supply.

Diagnostic only, no edge/behaviour/PnL/Genesis claim; does not resolve the cleanliness crux. Commit 2 (build/run) needs a separate explicit GO.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 050a0b4 commit d1d0845

3 files changed

Lines changed: 120 additions & 14 deletions

docs/research_wiki/handoff.md

Lines changed: 12 additions & 14 deletions
@@ -27,20 +27,18 @@ append-only trail lives in [log.md](log.md).
 
 **ETH/USD:** blocked until BTC protocol approved.
 
-## Next Step (requires explicit GO)
-
-Enrichment shot **DONE** (Commit 2, 4h k=3): blind verdict **`enriched_worse_check`**: `exclusivity`
-is significantly *worse* than Stage-2 (AP-lift CI [−0.070, −0.0019]); validity checks pass (not a bug).
-**Per-leg-feature line is CLOSED.** Next is a **GO-fork** — direction is the user's call (AGENTS.md §1);
-recommendation: **B**.
-
-- **B — grow facit (recommended):** park modeling, return to the main quest — more/better human labels
-([main-quest reset](reviews/btc-fib-selection-learning-main-quest-reset-20260624.md) §5). Binding
-constraint is now **data, not features**.
-- **A′ — decorrelated exclusivity (low prior, NEW lock):** `exclusivity` was 0.80 collinear with
-`cleanliness`; an orthogonalized variant needs its **own** Commit-1 lock (reopens a closed line). Not free.
-
-**No code/run/build until a separate explicit GO.**
+## Next Step — learning-curve diagnostic LOCKED, awaiting GO to build
+
+Per-leg-feature line is **CLOSED** (`enriched_worse_check`). Toward [north-star](north-star.md) step 1,
+the user picked **learning curve first** (measure before investing in labeling). **Commit 1 lock
+written** in [learning-curve LOCK](reviews/btc-fib-selection-learning-learning-curve-lock-20260625.md):
+is the Stage-2 model **data-starved or saturated** w.r.t. facit size? Reuse Stage-2 verbatim, fixed
+test set, vary only training-facit fraction (whole legs), build-once-vary-labels, R=64, finer grid near
+f=1.0. **ASYMMETRIC verdict** (model ≈1-effective-param → saturation expected; flat ⇒ feature/crux,
+**not** "don't grow facit"); `inconclusive` is a likely branch (65 test pos). Diagnostic only, no claim.
+
+**Awaiting separate explicit GO for Commit 2 (build harness + tests + run).** Then: results doc +
+handoff/log. Other open fork if this routes there: A′ decorrelated `exclusivity` (NEW lock, low prior).
 
 ## Recent Changes
 
docs/research_wiki/log.md

Lines changed: 15 additions & 0 deletions
@@ -16,6 +16,21 @@ Types: `ingest`, `decision`, `review`, `question`, `maintenance`.
 > Pre-reset (2026-06-10 and earlier): [part 3](log-archive-pre-btc-reset-part3.md),
 > [part 2](log-archive-pre-btc-reset-part2.md), [part 1](log-archive-pre-btc-reset-part1.md)
 
+## [2026-06-25] decision | Learning-curve diagnostic LOCKED (Commit 1, docs-only); awaiting GO
+
+User picked "learning curve first" as the next step toward [north-star](north-star.md) step 1 (does
+the engine select like the human). Lean blind lock for a cheap data-sensitivity shot: is the Stage-2
+model **data-starved or saturated** w.r.t. facit size? Reuses Stage-2 verbatim, fixes the held-out
+test set, varies only the **training-facit fraction** (whole human legs dropped), build-once-vary-
+labels. Advisor-refined before lock: **(1) ASYMMETRIC verdict** — model is ≈1-effective-parameter
+(`cleanliness` carries the lift) so saturation is the EXPECTED default; a flat curve means "this
+1-feature model is capacity-bound → back to the feature/crux", **NOT** "don't grow facit". (2) finer
+grid near f=1.0 + R=64 (build is once, relabel+refit is cheap). (3) `inconclusive_underpowered` is a
+LIKELY branch with 65 test positives. (4) report addable-supply (365 labeled vs ~86k candidates;
+if starved but little history left → more history/symbols, not grind same chart). Diagnostic only —
+no edge/behaviour/PnL/Genesis; does NOT resolve the cleanliness crux. Commit 2 (build/run) needs a
+separate explicit GO. [Lock](reviews/btc-fib-selection-learning-learning-curve-lock-20260625.md).
+
 ## [2026-06-25] decision | North-star vision documented (canonical [north-star.md](north-star.md))
 
 Docs-only. Captured the user's original intent verbatim: *lär maskinen att se chartet som människan*

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+# BTC Fib Selection-Learning — learning-curve LOCK (facit data-sensitivity) (2026-06-25)
+
+**DOCS-ONLY. Diagnostic — no claim, no code, no run, no build, no dependency, no new universe, no
+label/corpus change, no push.** Lean **blind Commit-1 lock** for one cheap learning-curve shot: is the
+Stage-2 selection model **data-starved or saturated** w.r.t. facit size? Reuses Stage-2 verbatim; only
+the **training-facit fraction** varies. Execution needs a **separate explicit GO** (Commit 2).
+
+**Blindness:** no learning-curve harness exists; **no AP at any fraction has been computed or seen.**
+Every rule below is fixed from the [Stage-2 headline](btc-fib-selection-learning-results-20260618.md),
+the frozen config, and existing code — not from any learning-curve result.
+
+> **Honest framing:** this answers *"would more human-labeled 4h legs plausibly raise OOS AP?"* — a
+> **data-sensitivity diagnostic**, step toward the [north star](../north-star.md) step 1. It is **not**
+> a headline, adds **no** positive claim, and does **not** resolve the `cleanliness` crux.
+
+## L1. Question
+
+> **Is the current Stage-2 model data-starved (OOS-AP curve still rising at full facit) or saturated
+> (flat) as a function of training-facit size, on the 4h primary at k=3?**
+
+## L2. Mechanics (reuse Stage-2 verbatim — locked)
+
+- **Cell = 4h primary, k=3.** `build_candidates`, ε-match, purged/embargoed split, `fit_logreg` (the 5
+live features), pooled test **Average Precision**, decision-point cluster bootstrap — all **verbatim**
+from the Stage-2 headline. Frozen data (no `--refresh`).
+- **FIXED test set = the Stage-2 held-out split** (65 positives / 24 852 candidates). **Never
+subsampled** → AP is comparable across all fractions.
+- **Vary ONLY the training facit:** drop **whole human legs** whose `anchor_b` ∈ the train period; a
+training candidate is positive iff it ε-matches a **retained** human leg. The candidate universe and
+the features are **unchanged** — only which training rows are labeled positive shrinks.
+- **Subsample unit = whole human legs** (the unit you would actually add when "growing facit"), uniform
+random **without replacement**.
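
The whole-leg subsampling and relabeling described above could be sketched as follows. This is a minimal illustration, not the harness: `Candidate`, `matched_leg`, `retained_legs`, and `relabel` are hypothetical names, the ε-match is assumed to be precomputed once per candidate, and rounding `f * n` to the nearest integer is an assumption the lock does not specify.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical training row: an id plus the human leg it ε-matches, if any."""
    cand_id: int
    matched_leg: Optional[int]  # assumed precomputed once; None means no ε-match


def retained_legs(train_leg_ids: list, f: float, seed: int) -> set:
    """Keep round(f * n) whole legs, uniformly at random without replacement."""
    rng = random.Random(seed)
    n_keep = round(f * len(train_leg_ids))  # rounding rule is an assumption
    return set(rng.sample(sorted(train_leg_ids), n_keep))


def relabel(candidates: list, kept: set) -> list:
    """A training candidate is positive iff it ε-matches a retained leg."""
    return [1 if c.matched_leg in kept else 0 for c in candidates]
```

Only the label vector changes between fractions here; the candidate list and its features are never touched, which is the property the fixed-universe rule relies on.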
+
+## L3. Grid + repeats (locked)
+
+- **Fractions** `f ∈ {0.25, 0.50, 0.75, 0.80, 0.90, 0.95, 1.00}` — finer near the top, because the
+**local slope at f=1.0** is what speaks to "would the *next* labels help".
+- **R = 64** independent subsamples per fraction (`f=1.0` is the single full-facit point). Seeds =
+`20260618 + fraction_index*1000 + repeat_index`. Report **mean AP + [p5, p95] band** per fraction;
+ROC-AUC secondary (same shape check).
+- **BUILD ONCE (build-time requirement):** the universe, features, and per-candidate ε-match are
+computed **once**; per `(f, r)` only **relabel train + refit logreg + recompute test AP** (all cheap).
+If the harness rebuilds the universe per fraction the cost argument collapses — it must not.
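
The grid and seed rule above can be enumerated in a few lines. The sketch below assumes zero-based `fraction_index` and `repeat_index` (the lock does not say which base is used); `seed_for` and `schedule` are hypothetical helper names.

```python
FRACTIONS = (0.25, 0.50, 0.75, 0.80, 0.90, 0.95, 1.00)
R = 64
BASE_SEED = 20260618


def seed_for(fraction_index: int, repeat_index: int) -> int:
    """Seed rule from the lock; zero-based indices are an assumption."""
    return BASE_SEED + fraction_index * 1000 + repeat_index


def schedule() -> list:
    """Return (f, repeat_index, seed) for every relabel-and-refit run."""
    runs = []
    for i, f in enumerate(FRACTIONS):
        # f = 1.0 keeps every leg, so it contributes a single point.
        repeats = 1 if f == 1.0 else R
        for r in range(repeats):
            runs.append((f, r, seed_for(i, r)))
    return runs
```

Under these assumptions the schedule has 6 × 64 + 1 = 385 refits, and because R is below 1000 no two runs share a seed.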
+
+## L4. Verdict (pre-stated, ASYMMETRIC — fixed blind)
+
+The Stage-2 lift is carried almost entirely by **one** feature (`cleanliness` ~0.20; the rest ≈ 0) →
+**≈ 1 effective parameter → saturation is the EXPECTED default** and must not be over-read.
+
+- **`data_starved`** — mean `AP(1.0) − AP(0.95)` **exceeds the f=0.95 band half-width** (the last
+increment moves AP beyond train-subsample noise) and the curve is increasing: **genuinely informative,
+strong green light** — more facit helps *even this model*.
+- **`saturated`** — the last increment is **within** the band (flat): **ambiguous AND expected.** It
+means the **current 1-feature set is capacity-bound**, **NOT** that facit is big enough or that
+labeling is pointless. Routes back to the **feature / `cleanliness` crux** (matched-null), **not** away
+from labeling.
+- **`inconclusive_underpowered`** — bands overlap heavily across fractions (a **live, LIKELY** outcome
+with 65 test positives): **no verdict.** A within-band wiggle is not a result.
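
One possible reading of the three branches above, as a sketch only. Several pieces are assumptions rather than locked rules: the percentile convention in `band`, the definition of "increasing" (full-facit AP above the mean AP at the smallest fraction), and the handling of a large negative last step, which the lock does not cover and which is treated here as no verdict. The lock also gives no numeric threshold for "bands overlap heavily", so that judgment is passed in as a flag instead of being computed.

```python
from statistics import mean, quantiles


def band(aps: list) -> tuple:
    """[p5, p95] of the per-repeat APs (one of several percentile conventions)."""
    cuts = quantiles(aps, n=20, method="inclusive")  # 19 cut points: 5%, ..., 95%
    return cuts[0], cuts[-1]


def verdict(ap_by_f: dict, bands_overlap_heavily: bool) -> str:
    """Classify a learning curve given APs per fraction (f=1.0 holds one value)."""
    if bands_overlap_heavily:  # threshold is not quantified in the lock
        return "inconclusive_underpowered"
    lo, hi = band(ap_by_f[0.95])
    half_width = (hi - lo) / 2
    last_step = ap_by_f[1.0][0] - mean(ap_by_f[0.95])
    fractions = sorted(ap_by_f)
    # Assumed meaning of "the curve is increasing".
    increasing = mean(ap_by_f[fractions[-1]]) > mean(ap_by_f[fractions[0]])
    if last_step > half_width and increasing:
        return "data_starved"
    if abs(last_step) <= half_width:
        return "saturated"
    return "inconclusive_underpowered"  # unlisted case, treated as no verdict
```

Keeping the overlap decision outside the function avoids inventing a threshold that was never pre-stated; a real harness would need that rule fixed before any AP is seen.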
+
+## L5. Variance naming (locked)
+
+The R-band = train-side **"which legs were dropped"** variance. `AP(1.0)` is a single point (no
+train-subsample variance) but still carries **test-side noise from 65 positives** — shared across
+fractions, so it **partly cancels in fraction differences**. The verdict reads **differences**, not
+absolute levels.
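
The "partly cancels" argument rests on the standard identity for the variance of a difference. Writing $\widehat{AP}(f)$ for the test AP at fraction $f$ (notation introduced here for illustration, not part of the lock):

```latex
\operatorname{Var}\!\left[\widehat{AP}(1.0) - \widehat{AP}(f)\right]
  = \operatorname{Var}\!\left[\widehat{AP}(1.0)\right]
  + \operatorname{Var}\!\left[\widehat{AP}(f)\right]
  - 2\,\operatorname{Cov}\!\left[\widehat{AP}(1.0),\, \widehat{AP}(f)\right]
```

Because every fraction is scored on the same 65-positive test set, the covariance term is plausibly positive, which is what would make a difference less noisy than the sum of the two level variances. How large that covariance is has not been measured.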
+
+## L6. Addable-supply context (reported, not a gate)
+
+Report alongside the verdict: **labeled human legs (365)** vs the detector's **candidate universe
+(~86 244)** on frozen 4h, and the **bounded** nature of addable supply (true human-meaningful count is
+unknown without a human pass and is capped by available history). **If `data_starved` BUT little
+human-meaningful 4h history remains, the route is more history / symbols (a protocol change), not
+grinding the same chart** — this directly informs the "it takes forever" cost.
+
+## L7. Non-claims (binding)
+
+Diagnostic of **data-sensitivity only**. **No edge / behaviour / PnL / backtest / Genesis /
+auto-fib-as-truth.** Does **not** resolve the `cleanliness` crux — a rising curve means more data
+improves human-selection reproduction **regardless** of whether `cleanliness` is judgment or artifact.
+Frozen data, no `--refresh`, **4h primary only powered**; 1M/1w/1d are **context, never refuted**.
+
+## L8. Implementation (Commit 2 — NOT executed here)
+
+- **New module `src/fibengine/research/selection_learning_curve.py`** with its **own CLI**; **no code
+added to byte-capped `selection_learning.py`**. Reuses `build_candidates`, `fit_logreg`,
+`predict_proba`, `average_precision`, `roc_auc`, the decision-point machinery, `window_of`, ε, and the
+`FROZEN_SNAPSHOT` preflight; **build-once-vary-labels** per L3.
+- **Tests** `tests/research/test_selection_learning_curve.py`: fixed-test-set invariance, subsample
+unit = whole legs, **build-once** (features identical across fractions), verdict branches incl. the
+L4 asymmetry + `inconclusive`, seed determinism.
+- **Results doc** later (Observed / Inferred / Unverified). Artifacts under
+`experiments/review/fib_selection_learning/curve/` (**gitignored**). **Preflight FIRST**, frozen-data
+parity. **Separate explicit GO before any build/run.**
