
Commit 7cf39ec

bewest and Copilot committed
EXP-2859 + PROD: bootstrap P(simpson) replaces noisy boolean flag
Block bootstrap (N=200, 48h chunks) per patient gives explicit P(simpson) confidence:
- 2/26 high-confidence Simpson (P>=0.9)
- 12/26 high-confidence clean (P<=0.1)
- 12/26 boundary (0.1<P<0.9)

Of the 9 patients EXP-2853 boolean-flagged as Simpson, only 2 are statistically robust; the rest sit near the regime boundary (median P=0.76: boundary, not confident).

PROD:
- AuditionInputs.p_simpson: Optional[float] takes precedence
- P>=0.9 → medium severity, 0.1<P<0.9 → low (boundary), P<=0.1 → suppress
- SimpsonFactsLoader extended with bootstrap_path
- 3 new audition tests; 19/19 pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 1adb9c4 commit 7cf39ec

5 files changed

Lines changed: 229 additions & 8 deletions

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
# EXP-2859: Bootstrap Confidence Replaces Boolean Simpson Flag (2026-04-22)

**Stream**: B (operational); also methodological for Stream A
**Predecessor**: EXP-2853 (point Simpson), EXP-2856 (rolling stability), EXP-2858 (no flip drivers)
**Productionized**: ✅ `p_simpson` field + 3 new severity rules

## Headline

Block bootstrap (N=200, 48h block size) over per-patient β_fast and
β_slow gives the **explicit confidence** that the noisy boolean Simpson
flag was missing:

| Cohort | n |
|--------|---|
| **High-confidence Simpson** (P ≥ 0.9) | **2/26** |
| **High-confidence non-Simpson** (P ≤ 0.1) | **12/26** |
| **Boundary / uncertain** (0.1 < P < 0.9) | **12/26** |

The boolean EXP-2853 Simpson flag tagged 9/29 patients as Simpson.
The bootstrap shows that **only 2 of those are statistically robust**;
the other 7 sit near the regime boundary. The median P across the
flagged patients is 0.76 (still "more likely than not" but far from
confident). EXP-2856 saw this as an agreement fraction; the bootstrap
quantifies it as a probability.

## Method

Block bootstrap to preserve within-window correlation:

1. Slice each patient's data into non-overlapping 48h chunks
   (`WIN_SIZE = 48 × 12`, i.e. 576 five-minute samples per chunk).
2. Resample chunks with replacement N=200 times.
3. For each replicate, compute β_fast (5-min OLS over flattened
   chunks) and β_slow (OLS over per-chunk means).
4. P(simpson) = fraction of replicates with `sign(β_fast) ≠ sign(β_slow)`
   AND both magnitudes > 1e-6.

Block bootstrap is essential: naive sample-with-replacement of
5-min rows would destroy the slow-window structure (β_slow is
defined on 48h means).
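
The four steps above can be sketched in plain Python. This is a minimal illustration, not the EXP-2859 driver: the report does not say which variables the slopes are regressed on, so the `(x, y)` rows, the helper names `ols_slope`, `p_simpson` and `toy_chunks`, and the toy data are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

Row = Tuple[float, float]  # hypothetical (x, y) sample at 5-min cadence


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Simple-regression slope of y on x; 0.0 when x has no variance."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def p_simpson(chunks: List[List[Row]], n_boot: int = 200,
              eps: float = 1e-6, seed: int = 0) -> float:
    """Fraction of block-bootstrap replicates whose fast and slow slopes differ in sign."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        # Resample whole chunks with replacement so rows inside a chunk stay together.
        sample = rng.choices(chunks, k=len(chunks))
        flat = [row for chunk in sample for row in chunk]
        beta_fast = ols_slope([x for x, _ in flat], [y for _, y in flat])
        mean_x = [sum(x for x, _ in c) / len(c) for c in sample]
        mean_y = [sum(y for _, y in c) / len(c) for c in sample]
        beta_slow = ols_slope(mean_x, mean_y)
        opposite = (beta_fast > 0) != (beta_slow > 0)
        if opposite and abs(beta_fast) > eps and abs(beta_slow) > eps:
            hits += 1
    return hits / n_boot


def toy_chunks(slow_sign: float, n_chunks: int = 8) -> List[List[Row]]:
    """Toy data: y rises with x inside every chunk; chunk means trend by slow_sign."""
    return [
        [(0.1 * k + t, t + slow_sign * k) for t in range(-10, 11)]
        for k in range(n_chunks)
    ]


p_paradox = p_simpson(toy_chunks(-1.0))  # chunk means fall while rows rise
p_clean = p_simpson(toy_chunks(+1.0))    # both timescales rise together
```

In the toy data the within-chunk trend dominates the pooled slope, so `p_paradox` should land in the high-confidence band and `p_clean` in the suppress band. Real data would also need a tolerance for near-degenerate replicates, which this sketch handles only for exactly zero variance.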

## Results

- N=26 patients with ≥7 chunks (336h = 14 days) of data.
- Median P(simpson) overall: 0.16
- Median P(simpson) | point Simpson = True: **0.76**
- Median P(simpson) | point Simpson = False: **0.01**

The point flag and bootstrap agree on direction (median P 0.76 vs
0.01 across the two subsets), but the bootstrap reveals that the
"True" subset is heterogeneous: only 2 are confidently Simpson;
7 are boundary cases.
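
The cohort counts and conditional medians above could be tabulated along these lines. This is only a sketch: `cohort`, `summarize` and the `toy` rows are hypothetical names and made-up values for illustration, not the EXP-2859 data or the contents of `exp-2859_summary.json`.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple


def cohort(p: float) -> str:
    """Bin a bootstrap probability using the 0.1 / 0.9 cut lines."""
    if p >= 0.9:
        return "simpson"
    if p <= 0.1:
        return "clean"
    return "boundary"


def summarize(rows: List[Tuple[bool, float]]) -> Dict[str, object]:
    """rows holds (point_simpson_flag, p_simpson) pairs, one per patient."""
    counts = {"simpson": 0, "clean": 0, "boundary": 0}
    for _, p in rows:
        counts[cohort(p)] += 1
    flagged = [p for flag, p in rows if flag]
    unflagged = [p for flag, p in rows if not flag]
    med_flagged: Optional[float] = median(flagged) if flagged else None
    med_unflagged: Optional[float] = median(unflagged) if unflagged else None
    return {
        "counts": counts,
        "median_flagged": med_flagged,
        "median_unflagged": med_unflagged,
    }


# Made-up values for illustration only.
toy = [(True, 0.95), (True, 0.76), (True, 0.40),
       (False, 0.01), (False, 0.05), (False, 0.30)]
summary = summarize(toy)
```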

## Visualization (Charter V8)

![Bootstrap P(simpson) histogram and beta_fast×beta_slow with CI](figures/exp-2859_bootstrap_simpson.png)

Left: P(simpson) distribution with 0.1 / 0.9 cut lines.
Right: β_fast × β_slow scatter with bootstrap 95% CI bars colored
by P(simpson). Points near the axes have wide CIs and intermediate
P; points far from axes have tight CIs and clear classification.

## Production change

`AuditionInputs` gains an optional `p_simpson: Optional[float]` field
(audition_matrix.py:69). New top-priority branch in
`classify_triage_flags`:

| `p_simpson` | Severity | Action |
|---|---|---|
| ≥ 0.9 | **medium** | "high-confidence Simpson regime" |
| 0.1 < P < 0.9 | **low** | "boundary case ... sanity-check" |
| ≤ 0.1 | (suppress) | "confidently non-Simpson" |
| `None` | fall through | EXP-2854/2856 boolean+stability path |
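
As a standalone illustration of the table, the mapping can be written as one small pure function. The name `simpson_severity` and its string return values are hypothetical; the production logic lives inside `classify_triage_flags` and emits `AuditionFlag` objects rather than strings.

```python
from typing import Optional


def simpson_severity(p_simpson: Optional[float]) -> str:
    """Map a bootstrap probability onto the table's four outcomes."""
    if p_simpson is None:
        return "fallback"   # no bootstrap data: use the boolean + stability path
    if p_simpson >= 0.9:
        return "medium"     # high-confidence Simpson regime
    if p_simpson > 0.1:
        return "low"        # boundary case, worth a sanity check
    return "suppress"       # confidently non-Simpson


outcomes = [simpson_severity(p) for p in (None, 0.95, 0.9, 0.5, 0.1, 0.05)]
```

Both thresholds are inclusive on the confident side: exactly 0.9 maps to medium and exactly 0.1 is suppressed, which matches the `>=` and `>` comparisons in the diff.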

3 new tests:
- `test_p_simpson_high_emits_medium`
- `test_p_simpson_boundary_emits_low`
- `test_p_simpson_low_suppresses` (overrides up_shift phenotype proxy)

`SimpsonFactsLoader` extended:
- New `bootstrap_path` arg (defaults to
  `externals/experiments/exp-2859_bootstrap_simpson.parquet`).
- `SimpsonAuditionFacts` gains `p_simpson: Optional[float]` field.
- Live smoke-test: 30 patients indexed (29 from EXP-2853 ∪ EXP-2856,
  unioned with ~26 from EXP-2859); `b` returns `(True, 0.20, 0.39)`,
  a boundary case as expected.

19/19 audition + loader tests pass.

## Findings invariants

- **Bootstrap sharpens classification**: 12/26 confidently clean,
  2/26 confidently Simpson, 12/26 boundary. For the 7 flagged patients
  that land in the boundary band, the boolean flag overstated
  certainty: their Simpson classification flips across bootstrap
  replicates.
- **Block bootstrap is mandatory**: resampling individual 5-min rows
  would treat autocorrelated samples as exchangeable and break up the
  48h chunks on which β_slow is defined.
- **2 confident-Simpson patients** (P ≥ 0.9) merit medium-severity
  attention; **12 confident-clean** can have the Simpson flag suppressed
  outright; **12 boundary** get low-severity acknowledgment.
- The point Simpson flag from EXP-2853 stays as a fallback when
  bootstrap data isn't available.

## Deliverables

| File | Purpose |
|------|---------|
| `tools/cgmencode/exp_bootstrap_simpson_2859.py` | Driver |
| `externals/experiments/exp-2859_bootstrap_simpson.parquet` | Per-patient P + β_fast/β_slow CIs |
| `externals/experiments/exp-2859_summary.json` | Cohort tabulation |
| `docs/60-research/figures/exp-2859_bootstrap_simpson.png` | Two-panel chart |
| `tools/cgmencode/production/audition_matrix.py` | `p_simpson` field + severity rules |
| `tools/cgmencode/production/simpson_facts_loader.py` | Bootstrap artifact loader |
| `tools/cgmencode/production/test_audition_matrix.py` | 3 new tests |
| `tools/cgmencode/production/test_simpson_facts_loader.py` | bootstrap-path test fixtures |

## Next experiments

- **EXP-2860**: bootstrap CI on per-(patient, TOD) Simpson.
  Combine EXP-2855's TOD slicing with EXP-2859's bootstrap to give
  TOD-aware confidence (do TOD buckets stabilize the boundary
  cases?).
- **EXP-2861**: extend bootstrap to other audition signals (ISF gap,
  recovery fraction) to generalize the "confidence-band" pattern.
- **viz-meal-overlay-absorption** (carryover): meal-event chart
  with declared vs modeled carb absorption.

tools/cgmencode/production/audition_matrix.py

Lines changed: 30 additions & 1 deletion
@@ -155,7 +155,36 @@ def classify_triage_flags(inputs: AuditionInputs) -> List[AuditionFlag]:
     # patients have only 25% median agreement across rolling 30d windows
     # (vs 87.5% for Simpson-negative), so a single-window Simpson=True without
     # stability evidence is LOW confidence.
-    if inputs.simpson_paradox is True:
+    # EXP-2859: bootstrap P(simpson) takes precedence when available — gives
+    # explicit confidence (only 2/26 patients are P>=0.9, 12/26 are P<=0.1
+    # confidently clean, 12/26 are uncertain boundary).
+    if inputs.p_simpson is not None:
+        if inputs.p_simpson >= 0.9:
+            flags.append(AuditionFlag(
+                name="window_dependence_warning",
+                severity="medium",
+                rationale=(
+                    f"EXP-2859 bootstrap P(simpson)={inputs.p_simpson:.0%} — "
+                    "high-confidence Simpson regime. β_fast (5-min reactive) "
+                    "and β_slow (48h structural) sign-mismatch is robust to "
+                    "data resampling; conflicting timescale recommendations "
+                    "are expected."
+                ),
+            ))
+        elif inputs.p_simpson > 0.1:
+            flags.append(AuditionFlag(
+                name="window_dependence_warning",
+                severity="low",
+                rationale=(
+                    f"EXP-2859 bootstrap P(simpson)={inputs.p_simpson:.0%} — "
+                    "boundary case. Patient sits near the β_fast=0 / β_slow=0 "
+                    "regime boundary; Simpson classification is uncertain. "
+                    "Recommendations from one timescale MAY conflict with the "
+                    "other — sanity-check before applying."
+                ),
+            ))
+        # P<=0.1: confidently non-Simpson, suppress flag
+    elif inputs.simpson_paradox is True:
         if (
             inputs.simpson_stability_frac is not None
             and inputs.simpson_stability_frac >= 0.75

tools/cgmencode/production/simpson_facts_loader.py

Lines changed: 15 additions & 1 deletion
@@ -25,13 +25,17 @@
 DEFAULT_STABILITY_PARQUET = (
     _REPO / "externals" / "experiments" / "exp-2856_per_patient_stability.parquet"
 )
+DEFAULT_BOOTSTRAP_PARQUET = (
+    _REPO / "externals" / "experiments" / "exp-2859_bootstrap_simpson.parquet"
+)
 
 
 @dataclass(frozen=True)
 class SimpsonAuditionFacts:
     """Per-patient Simpson facts ready for AuditionInputs."""
     simpson_paradox: Optional[bool]
     simpson_stability_frac: Optional[float]
+    p_simpson: Optional[float] = None
 
 
 class SimpsonFactsLoader:
@@ -46,9 +50,11 @@ def __init__(
         self,
         simpson_path: Path = DEFAULT_SIMPSON_PARQUET,
         stability_path: Path = DEFAULT_STABILITY_PARQUET,
+        bootstrap_path: Path = DEFAULT_BOOTSTRAP_PARQUET,
     ) -> None:
         self._simpson_path = Path(simpson_path)
         self._stability_path = Path(stability_path)
+        self._bootstrap_path = Path(bootstrap_path)
         self._index: Optional[dict[str, SimpsonAuditionFacts]] = None
 
     def _load(self) -> dict[str, SimpsonAuditionFacts]:
@@ -72,11 +78,19 @@ def _load(self) -> dict[str, SimpsonAuditionFacts]:
                     stab_by_pid[str(r["patient_id"])] = float(
                         r["frac_agree_with_overall"]
                     )
-        all_pids = set(flag_by_pid) | set(stab_by_pid)
+        # Bootstrap P(simpson) (EXP-2859)
+        psim_by_pid: dict[str, float] = {}
+        if self._bootstrap_path.exists():
+            df = pd.read_parquet(self._bootstrap_path)
+            if "patient_id" in df.columns and "p_simpson" in df.columns:
+                for _, r in df.iterrows():
+                    psim_by_pid[str(r["patient_id"])] = float(r["p_simpson"])
+        all_pids = set(flag_by_pid) | set(stab_by_pid) | set(psim_by_pid)
         for pid in all_pids:
             idx[pid] = SimpsonAuditionFacts(
                 simpson_paradox=flag_by_pid.get(pid),
                 simpson_stability_frac=stab_by_pid.get(pid),
+                p_simpson=psim_by_pid.get(pid),
             )
         return idx
 
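
The union-of-keys merge in the `_load` hunk above can be sketched without pandas or parquet files. `Facts` is a hypothetical stand-in for `SimpsonAuditionFacts`, `merge_facts` is an illustrative helper that does not exist in the repository, and the patient ids and values are made up.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Facts:  # hypothetical stand-in for SimpsonAuditionFacts
    simpson_paradox: Optional[bool]
    simpson_stability_frac: Optional[float]
    p_simpson: Optional[float] = None


def merge_facts(flags: Dict[str, bool], stability: Dict[str, float],
                p_sim: Dict[str, float]) -> Dict[str, Facts]:
    """Index every patient seen in any source; absent fields become None."""
    pids = set(flags) | set(stability) | set(p_sim)
    return {
        pid: Facts(flags.get(pid), stability.get(pid), p_sim.get(pid))
        for pid in pids
    }


# Made-up inputs: "z" appears only in the bootstrap source.
index = merge_facts({"a": False, "b": True}, {"a": 0.9}, {"b": 0.39, "z": 0.02})
```

Because a patient missing from the bootstrap source gets `p_simpson=None`, downstream code can fall back to the boolean + stability path for that patient, which is the behavior the report describes.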

tools/cgmencode/production/test_audition_matrix.py

Lines changed: 45 additions & 0 deletions
@@ -238,3 +238,48 @@ def test_simpson_false_suppresses_phenotype_proxy():
     assert "window_dependence_warning" not in names, (
         "Explicit Simpson=False should suppress phenotype-proxy warning"
     )
+
+
+def test_p_simpson_high_emits_medium():
+    """EXP-2859: P(simpson) >= 0.9 → MEDIUM severity, takes precedence."""
+    inputs = AuditionInputs(
+        controller=ControllerType.LOOP,
+        smb_capable=False,
+        phenotype="flat",
+        median_recovery_fraction=0.6,
+        simpson_paradox=False,  # would suppress under EXP-2854 logic
+        p_simpson=0.95,  # but bootstrap says high-confidence Simpson
+    )
+    flags = classify_triage_flags(inputs)
+    warn = [f for f in flags if f.name == "window_dependence_warning"]
+    assert warn and warn[0].severity == "medium"
+    assert "P(simpson)=95%" in warn[0].rationale
+
+
+def test_p_simpson_boundary_emits_low():
+    """EXP-2859: 0.1 < P(simpson) < 0.9 → LOW severity, boundary case."""
+    inputs = AuditionInputs(
+        controller=ControllerType.LOOP,
+        smb_capable=False,
+        phenotype="flat",
+        median_recovery_fraction=0.6,
+        p_simpson=0.5,
+    )
+    flags = classify_triage_flags(inputs)
+    warn = [f for f in flags if f.name == "window_dependence_warning"]
+    assert warn and warn[0].severity == "low"
+    assert "boundary" in warn[0].rationale.lower()
+
+
+def test_p_simpson_low_suppresses():
+    """EXP-2859: P(simpson) <= 0.1 → confidently non-Simpson, suppress."""
+    inputs = AuditionInputs(
+        controller=ControllerType.LOOP,
+        smb_capable=False,
+        phenotype="up_shift",  # would normally trigger phenotype proxy
+        median_recovery_fraction=0.6,
+        p_simpson=0.05,  # but bootstrap says definitely not
+    )
+    flags = classify_triage_flags(inputs)
+    names = {f.name for f in flags}
+    assert "window_dependence_warning" not in names

tools/cgmencode/production/test_simpson_facts_loader.py

Lines changed: 14 additions & 6 deletions
@@ -11,9 +11,10 @@
 )
 
 
-def _write_artifacts(tmp_path: Path) -> tuple[Path, Path]:
+def _write_artifacts(tmp_path: Path) -> tuple[Path, Path, Path]:
     sim = tmp_path / "exp-2853_simpson_decomposition.parquet"
     stab = tmp_path / "exp-2856_per_patient_stability.parquet"
+    boot = tmp_path / "exp-2859_bootstrap_simpson.parquet"
     pd.DataFrame({
         "patient_id": ["a", "b", "c"],
         "simpson_paradox": [False, True, True],
@@ -23,12 +24,16 @@ def _write_artifacts(tmp_path: Path) -> tuple[Path, Path]:
         "patient_id": ["a", "b"],
         "frac_agree_with_overall": [0.9, 0.25],
     }).to_parquet(stab, index=False)
-    return sim, stab
+    # Empty bootstrap by default; tests that need it can write more
+    pd.DataFrame({"patient_id": [], "p_simpson": []}).to_parquet(boot, index=False)
+    return sim, stab, boot
 
 
 def test_loader_round_trip(tmp_path):
-    sim, stab = _write_artifacts(tmp_path)
-    loader = SimpsonFactsLoader(simpson_path=sim, stability_path=stab)
+    sim, stab, boot = _write_artifacts(tmp_path)
+    loader = SimpsonFactsLoader(
+        simpson_path=sim, stability_path=stab, bootstrap_path=boot,
+    )
 
     a = loader.get("a")
     assert a.simpson_paradox is False
@@ -52,6 +57,7 @@ def test_loader_missing_files_returns_empty(tmp_path):
     loader = SimpsonFactsLoader(
         simpson_path=tmp_path / "nope1.parquet",
         stability_path=tmp_path / "nope2.parquet",
+        bootstrap_path=tmp_path / "nope3.parquet",
     )
     assert loader.n_patients == 0
     assert loader.get("a") == SimpsonAuditionFacts(None, None)
@@ -66,8 +72,10 @@ def test_loader_integration_with_audition_inputs(tmp_path):
         classify_triage_flags,
     )
 
-    sim, stab = _write_artifacts(tmp_path)
-    loader = SimpsonFactsLoader(simpson_path=sim, stability_path=stab)
+    sim, stab, boot = _write_artifacts(tmp_path)
+    loader = SimpsonFactsLoader(
+        simpson_path=sim, stability_path=stab, bootstrap_path=boot,
+    )
 
     # Patient b: Simpson=True, stability=0.25 → LOW severity
     facts_b = loader.get("b")
