feat: close Phase 3E, open Phase 4

Your Name · claude · Your Name · commit b8cb558a96c5 · 2026-06-13T13:19:07.000+08:00
Phase 3E closure report documents 6 deliverables: lane baseline, intervention
annotation, capture instrumentation, parser detection gate, Phase 2 auto-detection,
and Tier 2 honest deferral. superpowers_workflow_intervention gate is OPEN at 5
tagged sessions. routed_prompt_intervention and controlled_prompt_morphology remain
BLOCKED.

Update lane baseline with current numbers (992 metadata sessions, 131,952
events, 8 superpowers_workflow_intervention sessions). Update Phase 3E README
status to complete. Update research README to mark Phase 4 as open.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/docs/research/README.md b/docs/research/README.md
@@ -11,8 +11,8 @@ This directory groups the research tracks and branch studies that sit alongside
 | Phase 3B | complete | Topology taxonomy |
 | Phase 3C | complete | Metadata & provenance |
 | [Phase 3D](phase3d/README.md) | **complete** | Hypothesis registry + Tier 1 validation |
-| [Phase 3E](phase3e/README.md) | **active** | Controlled transition & intervention-aware validation |
-| Phase 4 | **not open** | Theory finalization |
+| [Phase 3E](phase3e/README.md) | **complete** | Controlled transition & intervention-aware validation |
+| Phase 4 | **open** | Theory finalization |
 
 ## Current Corpus Snapshot
 
@@ -30,16 +30,9 @@ This directory groups the research tracks and branch studies that sit alongside
 
 Phase 3D delivered the hypothesis registry (19 hypotheses, 8 categories), completed Tier 1 validation (3 supported, 1 inconclusive, 1 not supported), and honestly deferred Tier 2 (failure samples genuinely rare in real agent behavior: 1/100 native failure, 0/100 near-failure). See [closure report](phase3d/closure_report_v0.2.5.md).
 
-## Phase 3E Active Scope
+## Phase 3E Closure Summary
 
-Controlled transition and intervention-aware validation. Lanes kept separate:
-
-- `direct_prompt_native`
-- `routed_prompt_intervention`
-- `superpowers_workflow_intervention`
-- `controlled_prompt_morphology`
-
-Deferred hypotheses from Phase 3D Tier 2/3/4 carried forward. Tier 2 validation is opportunistic (background acquisition), not a phase gate. See [Phase 3E README](phase3e/README.md).
+Phase 3E delivered the intervention lane infrastructure (4 lanes, parser detection gate, auto-detection in enrichment), completed 3 sub-phases (baseline, annotation, instrumentation), opened the superpowers_workflow_intervention gate (5 tagged sessions), and honestly deferred Tier 2 validation (failure samples genuinely rare: 1/101 native failure, 5/101 near-failure). Phase 2 auto-detection is operational for superpowers lane. See [closure report](phase3e/closure_report_v0.2.5.md).
 
 ## Cross-project Branch Studies
 
diff --git a/docs/research/phase3e/README.md b/docs/research/phase3e/README.md
@@ -5,8 +5,8 @@ Phase 3E validates selected runtime morphology hypotheses under controlled or in
 ## Position
 
 - Phase 3D: complete
-- Phase 3E: active
-- Phase 4: not open
+- Phase 3E: **complete**
+- Phase 4: open
 
 ## Mission
 
@@ -159,6 +159,7 @@ Premature parser detection on 1-2 samples risks encoding heuristic patterns that
 - [Intervention Lane Annotation Plan](intervention_lane_annotation_plan_v0.2.5.md) — Annotation criteria, non-inference rule, process (Phase 3E-2)
 - [Intervention Lane Candidates](intervention_lane_candidates_v0.2.5.md) — Candidate review table with evidence and decisions (Phase 3E-2)
 - [Intervention Capture Instrumentation Plan](intervention_capture_instrumentation_plan_v0.2.5.md) — Capture requirements, tag format specs, enrichment recognition plan (Phase 3E-3)
+- [Closure Report](closure_report_v0.2.5.md) — Phase 3E closure: deliverables, gate status, deferred hypotheses, graduation assessment
 
 ## Current State
 
diff --git a/docs/research/phase3e/closure_report_v0.2.5.md b/docs/research/phase3e/closure_report_v0.2.5.md
@@ -0,0 +1,141 @@
+# Phase 3E Closure Report v0.2.5
+
+Phase 3E is the controlled transition and intervention-aware validation layer. This report assesses whether its deliverables are complete and whether the phase can graduate.
+
+## Phase Mission (from charter)
+
+Validate the relationship between events, observations, interventions, workflow conditions and topology transitions under controlled or intervention-aware conditions, without entering Phase 4 theory finalization.
+
+## Corpus Snapshot at Closure
+
+| Metric | Value |
+|--------|-------|
+| Data sessions | 1,517 |
+| Metadata sessions | 992 |
+| Events | 131,952 |
+| Runtime breadth | 7 |
+| Task breadth | 9 |
+
+**Lane distribution**:
+
+| Lane | Sessions | Events |
+|------|----------|--------|
+| `direct_prompt_native` | 101 | 32,141 |
+| `superpowers_workflow_intervention` | 8 | 42,465 |
+| `controlled_prompt_morphology` | 3 | 135 |
+| `routed_prompt_intervention` | 0 | 0 |
+| unlabeled | 880 | 16,160 |
+
+## Deliverable 1: Lane Baseline (3E-1) — COMPLETE
+
+Four lanes characterized with per-lane session counts, event counts, runtime distributions, and task type distributions. Baseline frozen at `lane_baseline_v0.2.5.md`.
+
+Key finding: native lane (101 sessions) is the only lane with statistical mass. Intervention lanes are infrastructure-ready but corpus-sparse.
+
+## Deliverable 2: Intervention Lane Annotation (3E-2) — COMPLETE
+
+Annotation pass reviewed 18 Skill-tool candidate sessions. Results:
+
+| Evidence Level | Count | Action |
+|----------------|-------|--------|
+| Strong (Skill+PlanMode+Workflow) | 1 | Annotated as superpowers |
+| Moderate (Skill+PlanMode) | 2 | Annotated as superpowers |
+| Weak (Skill only) | 15 | Deferred (insufficient evidence) |
+| Routed | 0 | Honestly reported |
+
+Annotation criteria, evidence levels, and non-inference rule documented in `intervention_lane_annotation_plan_v0.2.5.md` and `intervention_lane_candidates_v0.2.5.md`.
+
+## Deliverable 3: Capture Instrumentation (3E-3) — COMPLETE
+
+- `SOURCES` expanded to accept `routed_prompt_intervention`, `superpowers_workflow_intervention`, `controlled_prompt_morphology`
+- `INTERVENTION_LANES` constant defined
+- `SessionMetadata` expanded with `intervention_lane`, `causetrace_tags`, `intervention_evidence_source`, `intervention_evidence_level`
+- Capture tag format specs defined for prompt-routing-skill, superpowers, and controlled prompt morphology
+- Upstream tools updated (prompt-routing-skill SKILL.md, superpowers using-superpowers SKILL.md)
+- `causetrace metadata-set --intervention-lane`, `annotate --tag`, `corpus --lane` CLI tooling built
+- `corpus lane-count` and `corpus gate-status` subcommands operational
+- `detect-tags` command for scanning session JSONL for causetrace_tags YAML blocks
+
+## Deliverable 4: Parser Detection Gate — COMPLETE
+
+Activation gate system implemented: parser detection for an intervention lane requires >=5 explicitly tagged sessions before activation. Rationale: premature detection on 1-2 samples risks encoding heuristic patterns that mislabel future sessions.
+
+| Lane | Tagged | Required | Gate |
+|------|--------|----------|------|
+| `superpowers_workflow_intervention` | 5 | 5 | **OPEN** |
+| `routed_prompt_intervention` | 0 | 5 | BLOCKED |
+| `controlled_prompt_morphology` | 0 | 5 | BLOCKED |
+
+Gate opened for superpowers_workflow_intervention on 2026-06-13 after 5 headless Claude Code sessions accumulated workflow intervention tags.
+
+## Deliverable 5: Phase 2 Auto-Detection — COMPLETE
+
+With the superpowers gate OPEN, Phase 2 enrichment recognition is implemented:
+
+- `_auto_detect_intervention_tags()` scans newly enriched session JSONL for causetrace_tags YAML blocks
+- Auto-sets `task_source`, `intervention_lane`, `causetrace_tags`, `intervention_evidence_level`, `intervention_evidence_source` in metadata sidecar
+- Wired into all three enrichment handlers (`enrich`, `enrich-opencode`, `enrich-codex`)
+- JSON-escaped newline handling fixed in `detect_causetrace_tags`
+
+Tag format detection was verified against session 7e8574ec (actual YAML blocks in tool_input). The other 4 tagged sessions carry tags in metadata sidecars only (manually annotated during Phase 3E-3 headless runs).
+
+## Deliverable 6: Tier 2 Readiness — DEFERRED (honest)
+
+Tier 2 requires failure/near-failure density that the current corpus does not provide:
+
+| Criterion | Current | Required | Status |
+|-----------|---------|----------|--------|
+| Native failure sessions (success=False) | 1 | 10 | NOT MET |
+| Native near-failure (human_intervention=True) | 5 | 10 | NOT MET |
+| Multi-runtime failure coverage | 6 | 3 | MET |
+
+Failure and near-failure samples remain genuinely rare in real agent behavior. This mirrors the Phase 3D Tier 2 deferral finding. Background acquisition continues.
+
+## Deferred Hypotheses — Carried Forward
+
+### Tier 2 (failure / intervention morphology)
+
+- H-FM-001, H-FM-002, H-IM-001, H-IM-002, H-EV-004, H-EV-005
+
+Target: opportunistic validation when native failure >= 10, near-failure >= 10.
+
+### Tier 3 (controlled benchmark / external lane)
+
+- H-OT-001, H-OT-002, H-EG-001, H-EG-002, H-EV-002, H-EV-003
+
+Activate when controlled benchmark protocol is operational.
+
+### Tier 4 (literature-inspired, registry-only)
+
+- H-EV-001, H-LH-001, H-LH-002
+
+Maintain in registry for future corpus expansion.
+
+## What Phase 3E Did NOT Do (per charter)
+
+- Did not enter Phase 4 theory finalization
+- Did not merge intervention lanes into native baseline
+- Did not implement heuristic parser detection for blocked lanes
+- Did not implement prediction, anomaly detection, or auto-diagnosis
+- Did not promote hypotheses to conclusions without corpus-backed validation
+- Did not change topology taxonomy or readiness gates without justification
+- Did not make cross-lane comparisons beyond trend reporting
+- Did not make universal prompt policy recommendations
+
+## Phase 3E Graduation Assessment
+
+Phase 3E infrastructure work is complete. All designed sub-phases (3E-1 through 3E-3) delivered. Phase 2 auto-detection is operational for the one lane that met the gate threshold.
+
+Tier 2 validation is honestly deferred — the bottleneck is corpus failure density, not methodology or infrastructure. This is a data problem, not a design problem.
+
+**Recommendation: Graduate Phase 3E. Mark complete. Carry deferred hypotheses and background acquisition forward.**
+
+## Next Phase
+
+Phase 4 (Theory Finalization) can be opened. Phase 4 scope is constrained by the Phase 3E operating rules that remain in effect:
+
+- All claims must bind to a specific corpus snapshot and lane
+- Every percentage must include its denominator
+- Negative results are first-class entries
+- Do not promote hypotheses without corpus-backed validation
+- Intervention lane findings do not become universal policy without additional validation
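
The activation gate from Deliverable 4 of the closure report (parser detection requires >=5 explicitly tagged sessions per lane) can be sketched as below. This is a minimal illustration, not the project's implementation: `GateStatus`, `gate_status`, and `GATE_THRESHOLD` are hypothetical names, and the tagged counts are copied from the gate table in the report rather than read from the corpus.

```python
from dataclasses import dataclass

# Threshold stated in the closure report: >=5 explicitly tagged sessions.
GATE_THRESHOLD = 5  # hypothetical constant name


@dataclass(frozen=True)
class GateStatus:
    """Gate state for one intervention lane (hypothetical helper type)."""

    lane: str
    tagged: int
    required: int

    @property
    def state(self) -> str:
        # A lane opens only once its tagged-session count reaches the threshold.
        return "OPEN" if self.tagged >= self.required else "BLOCKED"


def gate_status(tagged_by_lane: dict[str, int],
                required: int = GATE_THRESHOLD) -> list[GateStatus]:
    """Evaluate every lane independently against the same threshold."""
    return [GateStatus(lane, count, required)
            for lane, count in tagged_by_lane.items()]


# Tagged counts as reported in the Deliverable 4 table.
tagged = {
    "superpowers_workflow_intervention": 5,
    "routed_prompt_intervention": 0,
    "controlled_prompt_morphology": 0,
}

for gate in gate_status(tagged):
    print(f"{gate.lane}: {gate.tagged}/{gate.required} {gate.state}")
```

Evaluating each lane on its own count keeps one lane's tagged sessions from opening another lane's gate, which matches the report's rule that blocked lanes get no heuristic detection.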
diff --git a/docs/research/phase3e/lane_baseline_v0.2.5.md b/docs/research/phase3e/lane_baseline_v0.2.5.md
@@ -4,21 +4,23 @@ This document records the first intervention-aware lane baseline for Phase 3E. I
 
 ## Corpus Snapshot
 
-- metadata sessions: `983`
-- events: `128,552`
-- strict research-grade sessions: `157`
-- native strict sessions: `100`
-- data_origin coverage: `100%`
+- metadata sessions: `992`
+- events: `131,952`
+- data sessions: `1,517`
+- runtime breadth: `7`
+- task breadth: `9`
 
 ## Lane Distribution
 
 | Lane | Sessions | Events | % of corpus |
 |------|----------|--------|-------------|
-| `direct_prompt_native` | 101 | 32,141 | 25.0% |
+| `direct_prompt_native` | 101 | 32,141 | 24.4% |
 | `controlled_prompt_morphology` | 3 | 135 | 0.1% |
+| `superpowers_workflow_intervention` | 8 | 42,465 | 32.2% |
 | `routed_prompt_intervention` | 0 | 0 | 0% |
-| `superpowers_workflow_intervention` | 0 | 0 | 0% |
-| unlabeled | 879 | ~96,276 | 74.9% |
+| unlabeled | 880 | 16,160 | 12.2% |
+
+Note: superpowers_workflow_intervention event count inflated by 3 large sessions (10K-19K events each) from Phase 3E-2 manual annotation. These sessions have no causetrace_tags in event content; classification is via metadata sidecar.
 
 ## Lane: `direct_prompt_native`
 
@@ -110,45 +112,48 @@ The `prompt-routing-skill` is deployed but routing metadata has not yet been pro
 
 ## Lane: `superpowers_workflow_intervention`
 
-No sessions explicitly labeled with `task_source=superpowers_workflow_intervention`.
+8 sessions labeled with `task_source=superpowers_workflow_intervention`. 5 carry explicit causetrace_tags in metadata sidecars (Phase 3E-3 headless runs); 3 were manually annotated during Phase 3E-2.
 
-Structured workflow plugins (superpowers) are in active use, but workflow intervention metadata has not yet been propagated into the causetrace metadata system. This lane is defined and scoped but carries zero labeled sessions at this baseline.
+| Metric | Value |
+|--------|-------|
+| Sessions | 8 |
+| Events | 42,465 |
+| Tagged (causetrace_tags in metadata) | 5 |
+| Untagged (manual annotation only) | 3 |
+| Evidence level: strong | 5 (tagged) |
+| Evidence level: moderate | 3 (manual) |
+| Agent | claude-code (8) |
+| Runtime | claude-code (8) |
 
-## Unlabeled Sessions
+Note: event count inflated by 3 large sessions (10K-19K events each) from Phase 3E-2 annotation. The 5 tagged headless sessions are small (10-26 events each, except 7e8574ec at 1,180 events).
 
-879 sessions (74.9% of metadata corpus) lack explicit lane labels. These sessions have `data_origin` set but do not match any of the four Phase 3E lane criteria:
+Parser detection gate is OPEN for this lane. Phase 2 auto-detection in enrichment pipeline is operational.
 
-- No `direct_prompt_native` / `native` / `real_work` data_origin
-- No `controlled_benchmark` data_origin
-- No `routed_prompt_intervention` task_source
-- No `superpowers_workflow_intervention` task_source
+## Unlabeled Sessions
 
-These sessions remain in the corpus but are excluded from lane-separated analysis until labeled.
+880 sessions lack explicit lane labels. These sessions have `data_origin` set but do not match any of the four Phase 3E lane criteria.
 
 ## Lane Comparison Summary
 
 | Metric | direct_prompt_native | controlled_prompt_morphology | routed_prompt_intervention | superpowers_workflow_intervention |
 |--------|---------------------|------------------------------|---------------------------|----------------------------------|
-| Sessions | 101 | 3 | 0 | 0 |
-| Events | 32,141 | 135 | 0 | 0 |
-| Avg events/session | 318 | 45 | - | - |
-| Long sessions | 38 | 0 | 0 | 0 |
-| AUQ sessions | 5 | 0 | 0 | 0 |
-| Human intervention | 5 | 0 | 0 | 0 |
-| Failure | 1 | 0 | 0 | 0 |
-| Agent breadth | 5 | 1 | 0 | 0 |
-| Runtime breadth | 6 | 0 | 0 | 0 |
-| Task breadth | 8 | 0 | 0 | 0 |
+| Sessions | 101 | 3 | 0 | 8 |
+| Events | 32,141 | 135 | 0 | 42,465 |
+| Tagged | N/A (native) | 0 | 0 | 5 |
+| Evidence: strong | N/A | 0 | 0 | 5 |
+| Evidence: moderate | N/A | 0 | 0 | 3 |
+| Agent breadth | 5 | 1 | 0 | 1 |
 
 ## Current Cautions
 
-- `direct_prompt_native` is the only lane with sufficient sample size for any hypothesis check.
+- `direct_prompt_native` is the only lane with sufficient sample size for statistical hypothesis checks.
+- `superpowers_workflow_intervention` (8 sessions) is dominated by 3 large manually-annotated sessions; descriptive observation only.
 - `controlled_prompt_morphology` (3 sessions) is too small for validation — it exists only as a lane marker.
-- `routed_prompt_intervention` and `superpowers_workflow_intervention` are definitionally scoped but carry no labeled data — they are placeholder lanes.
+- `routed_prompt_intervention` carries zero labeled sessions — it is a placeholder lane.
 - Do not merge intervention lanes into the native baseline.
-- Do not draw cross-lane conclusions when only one lane has data.
-- The 879 unlabeled sessions are not a lane — they are a labeling gap.
+- Do not draw cross-lane conclusions when only one lane has statistical mass.
+- The 880 unlabeled sessions are not a lane — they are a labeling gap.
 
 ## Next Action
 
-Establish a labeling pipeline so that `routed_prompt_intervention` and `superpowers_workflow_intervention` sessions accumulate in the corpus. Until then, Phase 3E validation is limited to within-lane analysis of `direct_prompt_native`.
+Phase 3E infrastructure is complete. Parser detection gate is OPEN for superpowers_workflow_intervention. Background acquisition continues for all lanes. Tier 2 validation deferred until native failure >= 10, near-failure >= 10. See [closure report](closure_report_v0.2.5.md).
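
The "% of corpus" column in the updated lane distribution table is consistent with each lane's share of the 131,952 events, not its share of sessions. The sketch below reproduces those figures and prints each percentage with its denominator, as the Phase 3E operating rules require. The names `lane_events` and `share` are hypothetical, and the values are copied from the table. The listed rows sum to 90,901 events; the diff does not say where the remaining events are attributed.

```python
# Corpus total and per-lane event counts as given in the lane baseline table.
TOTAL_EVENTS = 131_952

lane_events = {
    "direct_prompt_native": 32_141,
    "controlled_prompt_morphology": 135,
    "superpowers_workflow_intervention": 42_465,
    "routed_prompt_intervention": 0,
    "unlabeled": 16_160,
}


def share(events: int, total: int) -> str:
    """Format a percentage together with its numerator and denominator."""
    return f"{100 * events / total:.1f}% ({events:,}/{total:,})"


for lane, events in lane_events.items():
    print(f"{lane}: {share(events, TOTAL_EVENTS)}")

# Events in the listed rows versus the stated corpus total.
listed = sum(lane_events.values())
print(f"listed rows: {listed:,} of {TOTAL_EVENTS:,} events")
```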