EXP-2859 + PROD: bootstrap P(simpson) replaces noisy boolean flag

bewest · Copilot · bewest · commit 7cf39ec01b7b · 2026-04-22T11:36:01.000-07:00
Block bootstrap (N=200, 48h chunks) per patient gives explicit
P(simpson) confidence:
- 2/26 high-confidence Simpson (P>=0.9)
- 12/26 high-confidence clean (P<=0.1)
- 12/26 boundary (0.1<P<0.9)

Of the 9 patients EXP-2853 boolean-flagged as Simpson, only 2 are
statistically robust; the rest sit near the regime boundary
(median P=0.76 — boundary, not confident).
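
A minimal, standard-library sketch of the block bootstrap summarized above. The regression variables are not named in this commit message, so each 48h chunk is assumed to be a generic `(xs, ys)` pair of 5-min samples; `ols_slope`, `is_simpson`, and `bootstrap_p_simpson` are hypothetical names, not the functions in the EXP-2859 driver, and the demo data is synthetic:

```python
import random
from statistics import fmean


def ols_slope(xs, ys):
    """Least-squares slope of y on x; returns 0.0 when x has no variance."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def is_simpson(beta_fast, beta_slow, eps=1e-6):
    """Opposite signs, with both magnitudes above eps."""
    if abs(beta_fast) <= eps or abs(beta_slow) <= eps:
        return False
    return (beta_fast > 0) != (beta_slow > 0)


def bootstrap_p_simpson(chunks, n_boot=200, seed=0):
    """Fraction of chunk-level resamples whose fast and slow slopes disagree.

    chunks: list of (xs, ys) pairs, one pair per non-overlapping chunk.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        # Resample whole chunks with replacement (the "block" in block bootstrap).
        sample = rng.choices(chunks, k=len(chunks))
        # beta_fast: pooled regression over every row of the resampled chunks.
        flat_x = [x for xs, _ in sample for x in xs]
        flat_y = [y for _, ys in sample for y in ys]
        beta_fast = ols_slope(flat_x, flat_y)
        # beta_slow: regression over one (mean x, mean y) point per chunk.
        beta_slow = ols_slope(
            [fmean(xs) for xs, _ in sample],
            [fmean(ys) for _, ys in sample],
        )
        hits += is_simpson(beta_fast, beta_slow)
    return hits / n_boot


# Synthetic demo: slope +2 inside each chunk, slope -3 across chunk means.
simpson_chunks = [
    ([0.1 * k + t for t in range(-5, 6)], [2.0 * t - 0.3 * k for t in range(-5, 6)])
    for k in range(8)
]
# Synthetic control: y = 2x everywhere, so both slopes agree.
clean_chunks = [(xs, [2.0 * x for x in xs]) for xs, _ in simpson_chunks]

p_sim = bootstrap_p_simpson(simpson_chunks)
p_clean = bootstrap_p_simpson(clean_chunks)
```

Resampling whole chunks keeps every 48h window intact, so the per-chunk means behind β_slow stay meaningful in each replicate.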

PROD:
- AuditionInputs.p_simpson: Optional[float] takes precedence
- P>=0.9 → medium severity, 0.1<P<0.9 → low (boundary), P<=0.1 → suppress
- SimpsonFactsLoader extended with bootstrap_path
- 3 new audition tests; 19/19 pass
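
The threshold mapping can be sketched as a standalone function. `p_simpson_action` and its string return values are hypothetical and for illustration only; in this commit the logic lives inline in `classify_triage_flags` and emits `AuditionFlag` objects:

```python
from typing import Optional


def p_simpson_action(p: Optional[float]) -> str:
    """Map a bootstrap P(simpson) onto the action listed in the commit message."""
    if p is None:
        # No bootstrap data: the boolean + stability path decides instead.
        return "fallback"
    if p >= 0.9:
        return "medium"    # high-confidence Simpson
    if p > 0.1:
        return "low"       # boundary case
    return "suppress"      # confidently non-Simpson


examples = {p: p_simpson_action(p) for p in (None, 0.95, 0.9, 0.5, 0.1, 0.05)}
```

Both cut points are inclusive on the confident side: 0.9 maps to medium and 0.1 to suppress. A NaN compares false against both thresholds, so in this sketch it would also land in suppress.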

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
diff --git a/docs/60-research/exp-2859-bootstrap-simpson-report-2026-04-22.md b/docs/60-research/exp-2859-bootstrap-simpson-report-2026-04-22.md
@@ -0,0 +1,125 @@
+# EXP-2859 — Bootstrap Confidence Replaces Boolean Simpson Flag (2026-04-22)
+
+**Stream**: B (operational) — also methodological for Stream A
+**Predecessor**: EXP-2853 (point Simpson), EXP-2856 (rolling stability), EXP-2858 (no flip drivers)
+**Productionized**: ✅ `p_simpson` field + 3 new severity rules
+
+## Headline
+
+Block bootstrap (N=200, 48h block size) over per-patient β_fast and
+β_slow gives the **explicit confidence** that the noisy boolean
+Simpson flag lacked:
+
+| Cohort | n |
+|--------|---|
+| **High-confidence Simpson** (P ≥ 0.9) | **2/26** |
+| **High-confidence non-Simpson** (P ≤ 0.1) | **12/26** |
+| **Boundary / uncertain** (0.1 < P < 0.9) | **12/26** |
+
+The boolean EXP-2853 Simpson flag tagged 9/29 patients as Simpson.
+The bootstrap shows **only 2 of those are statistically robust** —
+the rest sit near the regime boundary with median P=0.76 (still
+"more likely than not" but far from confident). EXP-2856 saw this
+as agreement-fraction; bootstrap quantifies it as a probability.
+
+## Method
+
+Block bootstrap to preserve within-window correlation:
+
+1. Slice each patient's data into non-overlapping 48h chunks
+   (`WIN_SIZE = 48 × 12` = 576 five-minute samples).
+2. Resample chunks with replacement N=200 times.
+3. For each replicate, compute β_fast (5-min OLS over flattened
+   chunks) and β_slow (OLS over per-chunk means).
+4. P(simpson) = fraction of replicates with `sign(β_fast) ≠ sign(β_slow)`
+   AND both magnitudes > 1e-6.
+
+Block bootstrap is essential — naive sample-with-replacement of
+5-min rows would destroy the slow-window structure (β_slow is
+defined on 48h means).
+
+## Results
+
+- N=26 patients with ≥7 chunks (336h = 14 days) of data.
+- Median P(simpson) overall: 0.16
+- Median P(simpson) | point Simpson = True: **0.76**
+- Median P(simpson) | point Simpson = False: **0.01**
+
+The point flag and bootstrap agree on direction (median P 0.76 vs
+0.01 across the two subsets), but bootstrap reveals that the
+"True" subset is heterogeneous — only 2 are confidently Simpson;
+7 are boundary cases.
+
+## Visualization (Charter V8)
+
+![Bootstrap P(simpson) histogram and beta_fast×beta_slow with CI](figures/exp-2859_bootstrap_simpson.png)
+
+Left: P(simpson) distribution with 0.1 / 0.9 cut lines.
+Right: β_fast × β_slow scatter with bootstrap 95% CI bars colored
+by P(simpson). Points near the axes have wide CIs and intermediate
+P; points far from axes have tight CIs and clear classification.
+
+## Production change
+
+`AuditionInputs` gains optional `p_simpson: Optional[float]` field
+(audition_matrix.py:69). New top-priority branch in
+`classify_triage_flags`:
+
+| `p_simpson` | Severity | Action |
+|---|---|---|
+| ≥ 0.9 | **medium** | "high-confidence Simpson regime" |
+| 0.1 < P < 0.9 | **low** | "boundary case ... sanity-check" |
+| ≤ 0.1 | (suppress) | "confidently non-Simpson" |
+| `None` | fall through | EXP-2854/2856 boolean+stability path |
+
+3 new tests:
+- `test_p_simpson_high_emits_medium`
+- `test_p_simpson_boundary_emits_low`
+- `test_p_simpson_low_suppresses` (overrides up_shift phenotype proxy)
+
+`SimpsonFactsLoader` extended:
+- New `bootstrap_path` arg (defaults to
+  `externals/experiments/exp-2859_bootstrap_simpson.parquet`).
+- `SimpsonAuditionFacts` gains `p_simpson: Optional[float]` field.
+- Live smoke-test: 30 patients indexed (29 from EXP-2853 ∪ EXP-2856
+  + ~26 from EXP-2859); patient `b` returns `(True, 0.20, 0.39)`, i.e.
+  `p_simpson = 0.39`, a boundary case as expected.
+
+19/19 audition + loader tests pass.
+
+## Findings invariants
+
+- **Bootstrap sharpens classification**: 12/26 confidently clean,
+  2/26 confidently Simpson, 12/26 boundary. The boolean flag hid
+  the uncertainty of the 7 "boundary-Simpson" patients (0.1 < P < 0.9).
+- **Block bootstrap is mandatory**: resampling individual 5-min rows
+  would break the 48h chunk structure on which β_slow is defined.
+- **2 confident-Simpson patients** (P ≥ 0.9) merit medium-severity
+  attention; **12 confident-clean** can have Simpson flag suppressed
+  outright; **12 boundary** get low-severity acknowledgment.
+- The point Simpson flag from EXP-2853 stays as a fallback when
+  bootstrap data isn't available.
+
+## Deliverables
+
+| File | Purpose |
+|------|---------|
+| `tools/cgmencode/exp_bootstrap_simpson_2859.py` | Driver |
+| `externals/experiments/exp-2859_bootstrap_simpson.parquet` | Per-patient P + β_fast/β_slow CIs |
+| `externals/experiments/exp-2859_summary.json` | Cohort tabulation |
+| `docs/60-research/figures/exp-2859_bootstrap_simpson.png` | Two-panel chart |
+| `tools/cgmencode/production/audition_matrix.py` | `p_simpson` field + severity rules |
+| `tools/cgmencode/production/simpson_facts_loader.py` | Bootstrap artifact loader |
+| `tools/cgmencode/production/test_audition_matrix.py` | 3 new tests |
+| `tools/cgmencode/production/test_simpson_facts_loader.py` | bootstrap-path test fixtures |
+
+## Next experiments
+
+- **EXP-2860**: bootstrap CI on per-(patient, TOD) Simpson —
+  combine EXP-2855's TOD slicing with EXP-2859's bootstrap to give
+  TOD-aware confidence (do TOD buckets stabilize the boundary
+  cases?).
+- **EXP-2861**: extend bootstrap to other audition signals (ISF gap,
+  recovery fraction) — generalize the "confidence-band" pattern.
+- **viz-meal-overlay-absorption** (carryover): meal-event chart
+  with declared vs modeled carb absorption.
diff --git a/tools/cgmencode/production/audition_matrix.py b/tools/cgmencode/production/audition_matrix.py
@@ -155,7 +155,36 @@ def classify_triage_flags(inputs: AuditionInputs) -> List[AuditionFlag]:
     # patients have only 25% median agreement across rolling 30d windows
     # (vs 87.5% for Simpson-negative), so a single-window Simpson=True without
     # stability evidence is LOW confidence.
-    if inputs.simpson_paradox is True:
+    # EXP-2859: bootstrap P(simpson) takes precedence when available — gives
+    # explicit confidence (only 2/26 patients are P>=0.9, 12/26 are P<=0.1
+    # confidently clean, 12/26 are uncertain boundary).
+    if inputs.p_simpson is not None:
+        if inputs.p_simpson >= 0.9:
+            flags.append(AuditionFlag(
+                name="window_dependence_warning",
+                severity="medium",
+                rationale=(
+                    f"EXP-2859 bootstrap P(simpson)={inputs.p_simpson:.0%} — "
+                    "high-confidence Simpson regime. β_fast (5-min reactive) "
+                    "and β_slow (48h structural) sign-mismatch is robust to "
+                    "data resampling; conflicting timescale recommendations "
+                    "are expected."
+                ),
+            ))
+        elif inputs.p_simpson > 0.1:
+            flags.append(AuditionFlag(
+                name="window_dependence_warning",
+                severity="low",
+                rationale=(
+                    f"EXP-2859 bootstrap P(simpson)={inputs.p_simpson:.0%} — "
+                    "boundary case. Patient sits near the β_fast=0 / β_slow=0 "
+                    "regime boundary; Simpson classification is uncertain. "
+                    "Recommendations from one timescale MAY conflict with the "
+                    "other — sanity-check before applying."
+                ),
+            ))
+        # P<=0.1: confidently non-Simpson, suppress flag
+    elif inputs.simpson_paradox is True:
         if (
             inputs.simpson_stability_frac is not None
             and inputs.simpson_stability_frac >= 0.75
diff --git a/tools/cgmencode/production/simpson_facts_loader.py b/tools/cgmencode/production/simpson_facts_loader.py
@@ -25,13 +25,17 @@
 DEFAULT_STABILITY_PARQUET = (
     _REPO / "externals" / "experiments" / "exp-2856_per_patient_stability.parquet"
 )
+DEFAULT_BOOTSTRAP_PARQUET = (
+    _REPO / "externals" / "experiments" / "exp-2859_bootstrap_simpson.parquet"
+)
 
 
 @dataclass(frozen=True)
 class SimpsonAuditionFacts:
     """Per-patient Simpson facts ready for AuditionInputs."""
     simpson_paradox: Optional[bool]
     simpson_stability_frac: Optional[float]
+    p_simpson: Optional[float] = None
 
 
 class SimpsonFactsLoader:
@@ -46,9 +50,11 @@ def __init__(
         self,
         simpson_path: Path = DEFAULT_SIMPSON_PARQUET,
         stability_path: Path = DEFAULT_STABILITY_PARQUET,
+        bootstrap_path: Path = DEFAULT_BOOTSTRAP_PARQUET,
     ) -> None:
         self._simpson_path = Path(simpson_path)
         self._stability_path = Path(stability_path)
+        self._bootstrap_path = Path(bootstrap_path)
         self._index: Optional[dict[str, SimpsonAuditionFacts]] = None
 
     def _load(self) -> dict[str, SimpsonAuditionFacts]:
@@ -72,11 +78,19 @@ def _load(self) -> dict[str, SimpsonAuditionFacts]:
                     stab_by_pid[str(r["patient_id"])] = float(
                         r["frac_agree_with_overall"]
                     )
-        all_pids = set(flag_by_pid) | set(stab_by_pid)
+        # Bootstrap P(simpson) (EXP-2859)
+        psim_by_pid: dict[str, float] = {}
+        if self._bootstrap_path.exists():
+            df = pd.read_parquet(self._bootstrap_path)
+            if "patient_id" in df.columns and "p_simpson" in df.columns:
+                for _, r in df.iterrows():
+                    psim_by_pid[str(r["patient_id"])] = float(r["p_simpson"])
+        all_pids = set(flag_by_pid) | set(stab_by_pid) | set(psim_by_pid)
         for pid in all_pids:
             idx[pid] = SimpsonAuditionFacts(
                 simpson_paradox=flag_by_pid.get(pid),
                 simpson_stability_frac=stab_by_pid.get(pid),
+                p_simpson=psim_by_pid.get(pid),
             )
         return idx
 
diff --git a/tools/cgmencode/production/test_audition_matrix.py b/tools/cgmencode/production/test_audition_matrix.py
@@ -238,3 +238,48 @@ def test_simpson_false_suppresses_phenotype_proxy():
     assert "window_dependence_warning" not in names, (
         "Explicit Simpson=False should suppress phenotype-proxy warning"
     )
+
+
+def test_p_simpson_high_emits_medium():
+    """EXP-2859: P(simpson) >= 0.9 → MEDIUM severity, takes precedence."""
+    inputs = AuditionInputs(
+        controller=ControllerType.LOOP,
+        smb_capable=False,
+        phenotype="flat",
+        median_recovery_fraction=0.6,
+        simpson_paradox=False,  # would suppress under EXP-2854 logic
+        p_simpson=0.95,         # but bootstrap says high-confidence Simpson
+    )
+    flags = classify_triage_flags(inputs)
+    warn = [f for f in flags if f.name == "window_dependence_warning"]
+    assert warn and warn[0].severity == "medium"
+    assert "P(simpson)=95%" in warn[0].rationale
+
+
+def test_p_simpson_boundary_emits_low():
+    """EXP-2859: 0.1 < P(simpson) < 0.9 → LOW severity, boundary case."""
+    inputs = AuditionInputs(
+        controller=ControllerType.LOOP,
+        smb_capable=False,
+        phenotype="flat",
+        median_recovery_fraction=0.6,
+        p_simpson=0.5,
+    )
+    flags = classify_triage_flags(inputs)
+    warn = [f for f in flags if f.name == "window_dependence_warning"]
+    assert warn and warn[0].severity == "low"
+    assert "boundary" in warn[0].rationale.lower()
+
+
+def test_p_simpson_low_suppresses():
+    """EXP-2859: P(simpson) <= 0.1 → confidently non-Simpson, suppress."""
+    inputs = AuditionInputs(
+        controller=ControllerType.LOOP,
+        smb_capable=False,
+        phenotype="up_shift",   # would normally trigger phenotype proxy
+        median_recovery_fraction=0.6,
+        p_simpson=0.05,         # but bootstrap says definitely not
+    )
+    flags = classify_triage_flags(inputs)
+    names = {f.name for f in flags}
+    assert "window_dependence_warning" not in names
diff --git a/tools/cgmencode/production/test_simpson_facts_loader.py b/tools/cgmencode/production/test_simpson_facts_loader.py
@@ -11,9 +11,10 @@
 )
 
 
-def _write_artifacts(tmp_path: Path) -> tuple[Path, Path]:
+def _write_artifacts(tmp_path: Path) -> tuple[Path, Path, Path]:
     sim = tmp_path / "exp-2853_simpson_decomposition.parquet"
     stab = tmp_path / "exp-2856_per_patient_stability.parquet"
+    boot = tmp_path / "exp-2859_bootstrap_simpson.parquet"
     pd.DataFrame({
         "patient_id": ["a", "b", "c"],
         "simpson_paradox": [False, True, True],
@@ -23,12 +24,16 @@ def _write_artifacts(tmp_path: Path) -> tuple[Path, Path]:
         "patient_id": ["a", "b"],
         "frac_agree_with_overall": [0.9, 0.25],
     }).to_parquet(stab, index=False)
-    return sim, stab
+    # Empty bootstrap by default; tests that need it can write more
+    pd.DataFrame({"patient_id": [], "p_simpson": []}).to_parquet(boot, index=False)
+    return sim, stab, boot
 
 
 def test_loader_round_trip(tmp_path):
-    sim, stab = _write_artifacts(tmp_path)
-    loader = SimpsonFactsLoader(simpson_path=sim, stability_path=stab)
+    sim, stab, boot = _write_artifacts(tmp_path)
+    loader = SimpsonFactsLoader(
+        simpson_path=sim, stability_path=stab, bootstrap_path=boot,
+    )
 
     a = loader.get("a")
     assert a.simpson_paradox is False
@@ -52,6 +57,7 @@ def test_loader_missing_files_returns_empty(tmp_path):
     loader = SimpsonFactsLoader(
         simpson_path=tmp_path / "nope1.parquet",
         stability_path=tmp_path / "nope2.parquet",
+        bootstrap_path=tmp_path / "nope3.parquet",
     )
     assert loader.n_patients == 0
     assert loader.get("a") == SimpsonAuditionFacts(None, None)
@@ -66,8 +72,10 @@ def test_loader_integration_with_audition_inputs(tmp_path):
         classify_triage_flags,
     )
 
-    sim, stab = _write_artifacts(tmp_path)
-    loader = SimpsonFactsLoader(simpson_path=sim, stability_path=stab)
+    sim, stab, boot = _write_artifacts(tmp_path)
+    loader = SimpsonFactsLoader(
+        simpson_path=sim, stability_path=stab, bootstrap_path=boot,
+    )
 
     # Patient b: Simpson=True, stability=0.25 → LOW severity
     facts_b = loader.get("b")