Commit b8cb558

Your Name and claude committed
feat: close Phase 3E, open Phase 4
Phase 3E closure report documents 6 deliverables: lane baseline, intervention annotation, capture instrumentation, parser detection gate, Phase 2 auto-detection, and Tier 2 honest deferral.

superpowers_workflow_intervention gate is OPEN at 5 tagged sessions. routed_prompt_intervention and controlled_prompt_morphology remain BLOCKED.

Update lane baseline with current numbers (992 metadata, 131,952 events, 8 SP sessions). Update Phase 3E README status to complete. Update research README to mark Phase 4 as open.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 403dd73 commit b8cb558

4 files changed

Lines changed: 185 additions & 45 deletions

docs/research/README.md

Lines changed: 4 additions & 11 deletions
Original file line number | Diff line number | Diff line change
@@ -11,8 +11,8 @@ This directory groups the research tracks and branch studies that sit alongside
1111
| Phase 3B | complete | Topology taxonomy |
1212
| Phase 3C | complete | Metadata & provenance |
1313
| [Phase 3D](phase3d/README.md) | **complete** | Hypothesis registry + Tier 1 validation |
14-
| [Phase 3E](phase3e/README.md) | **active** | Controlled transition & intervention-aware validation |
15-
| Phase 4 | **not open** | Theory finalization |
14+
| [Phase 3E](phase3e/README.md) | **complete** | Controlled transition & intervention-aware validation |
15+
| Phase 4 | **open** | Theory finalization |
1616

1717
## Current Corpus Snapshot
1818

@@ -30,16 +30,9 @@ This directory groups the research tracks and branch studies that sit alongside
3030

3131
Phase 3D delivered the hypothesis registry (19 hypotheses, 8 categories), completed Tier 1 validation (3 supported, 1 inconclusive, 1 not supported), and honestly deferred Tier 2 (failure samples genuinely rare in real agent behavior: 1/100 native failure, 0/100 near-failure). See [closure report](phase3d/closure_report_v0.2.5.md).
3232

33-
## Phase 3E Active Scope
33+
## Phase 3E Closure Summary
3434

35-
Controlled transition and intervention-aware validation. Lanes kept separate:
36-
37-
- `direct_prompt_native`
38-
- `routed_prompt_intervention`
39-
- `superpowers_workflow_intervention`
40-
- `controlled_prompt_morphology`
41-
42-
Deferred hypotheses from Phase 3D Tier 2/3/4 carried forward. Tier 2 validation is opportunistic (background acquisition), not a phase gate. See [Phase 3E README](phase3e/README.md).
35+
Phase 3E delivered the intervention lane infrastructure (4 lanes, parser detection gate, auto-detection in enrichment), completed 3 sub-phases (baseline, annotation, instrumentation), opened the superpowers_workflow_intervention gate (5 tagged sessions), and honestly deferred Tier 2 validation (failure samples genuinely rare: 1/101 native failure, 5/101 near-failure). Phase 2 auto-detection is operational for superpowers lane. See [closure report](phase3e/closure_report_v0.2.5.md).
4336

4437
## Cross-project Branch Studies
4538

docs/research/phase3e/README.md

Lines changed: 3 additions & 2 deletions
@@ -5,8 +5,8 @@ Phase 3E validates selected runtime morphology hypotheses under controlled or in
55
## Position
66

77
- Phase 3D: complete
8-
- Phase 3E: active
9-
- Phase 4: not open
8+
- Phase 3E: **complete**
9+
- Phase 4: open
1010

1111
## Mission
1212

@@ -159,6 +159,7 @@ Premature parser detection on 1-2 samples risks encoding heuristic patterns that
159159
- [Intervention Lane Annotation Plan](intervention_lane_annotation_plan_v0.2.5.md) — Annotation criteria, non-inference rule, process (Phase 3E-2)
160160
- [Intervention Lane Candidates](intervention_lane_candidates_v0.2.5.md) — Candidate review table with evidence and decisions (Phase 3E-2)
161161
- [Intervention Capture Instrumentation Plan](intervention_capture_instrumentation_plan_v0.2.5.md) — Capture requirements, tag format specs, enrichment recognition plan (Phase 3E-3)
162+
- [Closure Report](closure_report_v0.2.5.md) — Phase 3E closure: deliverables, gate status, deferred hypotheses, graduation assessment
162163

163164
## Current State
164165

docs/research/phase3e/closure_report_v0.2.5.md

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
1+
# Phase 3E Closure Report v0.2.5
2+
3+
Phase 3E is the controlled transition and intervention-aware validation layer. This report assesses whether its deliverables are complete and whether the phase can graduate.
4+
5+
## Phase Mission (from charter)
6+
7+
Validate the relationship between events, observations, interventions, workflow conditions and topology transitions under controlled or intervention-aware conditions, without entering Phase 4 theory finalization.
8+
9+
## Corpus Snapshot at Closure
10+
11+
| Metric | Value |
12+
|--------|-------|
13+
| Data sessions | 1,517 |
14+
| Metadata sessions | 992 |
15+
| Events | 131,952 |
16+
| Runtime breadth | 7 |
17+
| Task breadth | 9 |
18+
19+
**Lane distribution**:
20+
21+
| Lane | Sessions | Events |
22+
|------|----------|--------|
23+
| `direct_prompt_native` | 101 | 32,141 |
24+
| `superpowers_workflow_intervention` | 8 | 42,465 |
25+
| `controlled_prompt_morphology` | 3 | 135 |
26+
| `routed_prompt_intervention` | 0 | 0 |
27+
| unlabeled | 880 | 16,160 |
28+
29+
## Deliverable 1: Lane Baseline (3E-1) — COMPLETE
30+
31+
Four lanes characterized with per-lane session counts, event counts, runtime distributions, and task type distributions. Baseline frozen at `lane_baseline_v0.2.5.md`.
32+
33+
Key finding: native lane (101 sessions) is the only lane with statistical mass. Intervention lanes are infrastructure-ready but corpus-sparse.
34+
35+
## Deliverable 2: Intervention Lane Annotation (3E-2) — COMPLETE
36+
37+
Annotation pass reviewed 18 Skill-tool candidate sessions. Results:
38+
39+
| Evidence Level | Count | Action |
40+
|----------------|-------|--------|
41+
| Strong (Skill+PlanMode+Workflow) | 1 | Annotated as superpowers |
42+
| Moderate (Skill+PlanMode) | 2 | Annotated as superpowers |
43+
| Weak (Skill only) | 15 | Deferred (insufficient evidence) |
44+
| Routed | 0 | Honestly reported |
45+
46+
Annotation criteria, evidence levels, and non-inference rule documented in `intervention_lane_annotation_plan_v0.2.5.md` and `intervention_lane_candidates_v0.2.5.md`.
47+
48+
## Deliverable 3: Capture Instrumentation (3E-3) — COMPLETE
49+
50+
- `SOURCES` expanded to accept `routed_prompt_intervention`, `superpowers_workflow_intervention`, `controlled_prompt_morphology`
51+
- `INTERVENTION_LANES` constant defined
52+
- `SessionMetadata` expanded with `intervention_lane`, `causetrace_tags`, `intervention_evidence_source`, `intervention_evidence_level`
53+
- Capture tag format specs defined for prompt-routing-skill, superpowers, and controlled prompt morphology
54+
- Upstream tools updated (prompt-routing-skill SKILL.md, superpowers using-superpowers SKILL.md)
55+
- `causetrace metadata-set --intervention-lane`, `annotate --tag`, `corpus --lane` CLI tooling built
56+
- `corpus lane-count` and `corpus gate-status` subcommands operational
57+
- `detect-tags` command for scanning session JSONL for causetrace_tags YAML blocks
58+
59+
## Deliverable 4: Parser Detection Gate — COMPLETE
60+
61+
Activation gate system implemented: parser detection for an intervention lane requires >=5 explicitly tagged sessions before activation. Rationale: premature detection on 1-2 samples risks encoding heuristic patterns that mislabel future sessions.
62+
63+
| Lane | Tagged | Required | Gate |
64+
|------|--------|----------|------|
65+
| `superpowers_workflow_intervention` | 5 | 5 | **OPEN** |
66+
| `routed_prompt_intervention` | 0 | 5 | BLOCKED |
67+
| `controlled_prompt_morphology` | 0 | 5 | BLOCKED |
68+
69+
Gate opened for superpowers_workflow_intervention on 2026-06-13 after 5 headless Claude Code sessions accumulated workflow intervention tags.
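The gate rule above reduces to a threshold comparison on the number of explicitly tagged sessions per lane. A minimal Python sketch follows, assuming the tagged counts are already known. `gate_status` and `REQUIRED_TAGGED` are illustrative names, not the actual `corpus gate-status` implementation.

```python
# Hypothetical sketch of the activation gate; not the causetrace implementation.
REQUIRED_TAGGED = 5  # threshold stated in the report: >= 5 explicitly tagged sessions


def gate_status(tagged_sessions: int, required: int = REQUIRED_TAGGED) -> str:
    """Return "OPEN" once a lane has enough explicitly tagged sessions."""
    return "OPEN" if tagged_sessions >= required else "BLOCKED"


# Tagged counts copied from the gate table above.
tagged_by_lane = {
    "superpowers_workflow_intervention": 5,
    "routed_prompt_intervention": 0,
    "controlled_prompt_morphology": 0,
}

statuses = {lane: gate_status(count) for lane, count in tagged_by_lane.items()}
for lane, status in statuses.items():
    print(f"{lane}: {status}")
```

Keeping the threshold in a single constant means a lane with 1-2 samples cannot activate parser detection by accident, which is the risk the rationale describes.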
70+
71+
## Deliverable 5: Phase 2 Auto-Detection — COMPLETE
72+
73+
With superpowers gate OPEN, Phase 2 enrichment recognition implemented:
74+
75+
- `_auto_detect_intervention_tags()` scans newly enriched session JSONL for causetrace_tags YAML blocks
76+
- Auto-sets `task_source`, `intervention_lane`, `causetrace_tags`, `intervention_evidence_level`, `intervention_evidence_source` in metadata sidecar
77+
- Wired into all three enrichment handlers (`enrich`, `enrich-opencode`, `enrich-codex`)
78+
- JSON-escaped newline handling fixed in `detect_causetrace_tags`
79+
80+
Tag format detection was verified against session 7e8574ec (actual YAML blocks in tool_input). The other 4 tagged sessions carry tags in metadata sidecars only (manually annotated during Phase 3E-3 headless runs).
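The report does not reproduce the tag format, so the following is only a sketch of why JSON-escaped newlines matter when scanning session JSONL. It assumes a tag block is a `causetrace_tags:` key followed by `- item` lines embedded in a string field. The block shape, the `detect_tags` helper, and the sample tag value are all hypothetical and are not the real `detect_causetrace_tags`. In the raw JSONL text a newline inside a string appears as the two characters `\n`, so a line-oriented pattern only matches after each event has been decoded.

```python
import json
import re

# Assumed tag block shape (hypothetical; the real spec lives in the
# instrumentation plan): a "causetrace_tags:" key followed by "- item" lines.
TAG_BLOCK = re.compile(r"causetrace_tags:[ \t]*\n((?:[ \t]*-[ \t]*\S[^\n]*\n?)+)")


def iter_strings(value):
    """Yield every string nested anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def detect_tags(jsonl_text: str) -> list[str]:
    """Collect tags from causetrace_tags blocks found in JSONL event content."""
    tags: list[str] = []
    for line in jsonl_text.split("\n"):
        if not line.strip():
            continue
        try:
            event = json.loads(line)  # decoding turns JSON "\n" escapes into real newlines
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        for text in iter_strings(event):
            for block in TAG_BLOCK.findall(text):
                for item in block.split("\n"):
                    tag = item.strip().lstrip("-").strip()
                    if tag and tag not in tags:
                        tags.append(tag)
    return tags


# Synthetic event: the tag block sits inside a tool_input string.
sample = json.dumps(
    {"tool_input": {"command": "causetrace_tags:\n  - superpowers_workflow_intervention\n"}}
)
print(detect_tags(sample))
```

Searching `sample` directly with `TAG_BLOCK` finds nothing, because the serialized line contains an escaped `\n` rather than a real newline. Decoding first avoids that mismatch.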
81+
82+
## Deliverable 6: Tier 2 Readiness — DEFERRED (honest)
83+
84+
Tier 2 requires failure/near-failure density that the current corpus does not provide:
85+
86+
| Criterion | Current | Required | Status |
87+
|-----------|---------|----------|--------|
88+
| Native failure sessions (success=False) | 1 | 10 | NOT MET |
89+
| Native near-failure (human_intervention=True) | 5 | 10 | NOT MET |
90+
| Multi-runtime failure coverage | 6 | 3 | MET |
91+
92+
Failure and near-failure samples remain genuinely rare in real agent behavior. This mirrors the Phase 3D Tier 2 deferral finding. Background acquisition continues.
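The two native-lane criteria in the table can be expressed as counts over session metadata compared against fixed thresholds. The sketch below assumes each session record exposes `lane`, `success`, and `human_intervention` fields. Those field names are modeled on the labels in the table and are not confirmed against `SessionMetadata`. The multi-runtime coverage criterion is left out because the report does not define how it is counted.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # Field names are assumptions modeled on the criteria labels in the table.
    lane: str
    success: bool
    human_intervention: bool


# Thresholds from the Tier 2 table: 10 native failures, 10 native near-failures.
REQUIRED = {"native_failure": 10, "native_near_failure": 10}


def tier2_readiness(sessions: list[Session]) -> dict[str, tuple[int, int, bool]]:
    """Map each criterion to (current, required, met) for the native lane only."""
    native = [s for s in sessions if s.lane == "direct_prompt_native"]
    current = {
        "native_failure": sum(1 for s in native if not s.success),
        "native_near_failure": sum(1 for s in native if s.human_intervention),
    }
    return {
        name: (count, REQUIRED[name], count >= REQUIRED[name])
        for name, count in current.items()
    }


def tier2_ready(sessions: list[Session]) -> bool:
    """Tier 2 stays deferred until every criterion is met."""
    return all(met for _, _, met in tier2_readiness(sessions).values())
```

Restricting the counts to `direct_prompt_native` keeps intervention-lane failures out of the native totals, in line with the rule against merging lanes.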
93+
94+
## Deferred Hypotheses — Carried Forward
95+
96+
### Tier 2 (failure / intervention morphology)
97+
98+
- H-FM-001, H-FM-002, H-IM-001, H-IM-002, H-EV-004, H-EV-005
99+
100+
Target: opportunistic validation when native failure >= 10, near-failure >= 10.
101+
102+
### Tier 3 (controlled benchmark / external lane)
103+
104+
- H-OT-001, H-OT-002, H-EG-001, H-EG-002, H-EV-002, H-EV-003
105+
106+
Activate when controlled benchmark protocol is operational.
107+
108+
### Tier 4 (literature-inspired, registry-only)
109+
110+
- H-EV-001, H-LH-001, H-LH-002
111+
112+
Maintain in registry for future corpus expansion.
113+
114+
## What Phase 3E Did NOT Do (per charter)
115+
116+
- Did not enter Phase 4 theory finalization
117+
- Did not merge intervention lanes into native baseline
118+
- Did not implement heuristic parser detection for blocked lanes
119+
- Did not implement prediction, anomaly detection, or auto-diagnosis
120+
- Did not promote hypotheses to conclusions without corpus-backed validation
121+
- Did not change topology taxonomy or readiness gates without justification
122+
- Did not make cross-lane comparisons beyond trend reporting
123+
- Did not make universal prompt policy recommendations
124+
125+
## Phase 3E Graduation Assessment
126+
127+
Phase 3E infrastructure work is complete. All designed sub-phases (3E-1 through 3E-3) delivered. Phase 2 auto-detection is operational for the one lane that met the gate threshold.
128+
129+
Tier 2 validation is honestly deferred — the bottleneck is corpus failure density, not methodology or infrastructure. This is a data problem, not a design problem.
130+
131+
**Recommendation: Graduate Phase 3E. Mark complete. Carry deferred hypotheses and background acquisition forward.**
132+
133+
## Next Phase
134+
135+
Phase 4 (Theory Finalization) can be opened. Phase 4 scope is constrained by the Phase 3E operating rules that remain in effect:
136+
137+
- All claims must bind to a specific corpus snapshot and lane
138+
- Every percentage must include its denominator
139+
- Negative results are first-class entries
140+
- Do not promote hypotheses without corpus-backed validation
141+
- Intervention lane findings do not become universal policy without additional validation

docs/research/phase3e/lane_baseline_v0.2.5.md

Lines changed: 37 additions & 32 deletions
@@ -4,21 +4,23 @@ This document records the first intervention-aware lane baseline for Phase 3E. I
44

55
## Corpus Snapshot
66

7-
- metadata sessions: `983`
8-
- events: `128,552`
9-
- strict research-grade sessions: `157`
10-
- native strict sessions: `100`
11-
- data_origin coverage: `100%`
7+
- metadata sessions: `992`
8+
- events: `131,952`
9+
- data sessions: `1,517`
10+
- runtime breadth: `7`
11+
- task breadth: `9`
1212

1313
## Lane Distribution
1414

1515
| Lane | Sessions | Events | % of corpus |
1616
|------|----------|--------|-------------|
17-
| `direct_prompt_native` | 101 | 32,141 | 25.0% |
17+
| `direct_prompt_native` | 101 | 32,141 | 24.4% |
1818
| `controlled_prompt_morphology` | 3 | 135 | 0.1% |
19+
| `superpowers_workflow_intervention` | 8 | 42,465 | 32.2% |
1920
| `routed_prompt_intervention` | 0 | 0 | 0% |
20-
| `superpowers_workflow_intervention` | 0 | 0 | 0% |
21-
| unlabeled | 879 | ~96,276 | 74.9% |
21+
| unlabeled | 880 | 16,160 | 12.2% |
22+
23+
Note: superpowers_workflow_intervention event count inflated by 3 large sessions (10K-19K events each) from Phase 3E-2 manual annotation. These sessions have no causetrace_tags in event content; classification is via metadata sidecar.
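The "% of corpus" figures in the updated table match each lane's events divided by the 131,952 total events, so they appear to be event shares rather than session shares. A small Python sketch, using only the numbers in the table and the snapshot, prints every share with its denominator and reconciles the row totals against the snapshot totals. The `share` helper is an illustrative name.

```python
# Figures copied from the lane distribution table and corpus snapshot above.
TOTAL_EVENTS = 131_952
TOTAL_SESSIONS = 992

lanes = {
    "direct_prompt_native": (101, 32_141),
    "controlled_prompt_morphology": (3, 135),
    "superpowers_workflow_intervention": (8, 42_465),
    "routed_prompt_intervention": (0, 0),
    "unlabeled": (880, 16_160),
}


def share(part: int, whole: int) -> str:
    """Format a percentage together with its numerator and denominator."""
    return f"{100 * part / whole:.1f}% ({part:,}/{whole:,})"


for lane, (sessions, events) in lanes.items():
    print(f"{lane}: events {share(events, TOTAL_EVENTS)}, sessions {share(sessions, TOTAL_SESSIONS)}")

# Reconciliation: per-lane totals compared with the snapshot totals.
session_gap = TOTAL_SESSIONS - sum(s for s, _ in lanes.values())
event_gap = TOTAL_EVENTS - sum(e for _, e in lanes.values())
print(f"sessions not attributed to a row: {session_gap}")
print(f"events not attributed to a row: {event_gap:,}")
```

With these inputs the session rows sum to the 992 metadata sessions, while the event rows sum to 90,901 and leave 41,051 of the 131,952 events unattributed. The sketch only surfaces that difference. It does not explain it.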
2224

2325
## Lane: `direct_prompt_native`
2426

@@ -110,45 +112,48 @@ The `prompt-routing-skill` is deployed but routing metadata has not yet been pro
110112

111113
## Lane: `superpowers_workflow_intervention`
112114

113-
No sessions explicitly labeled with `task_source=superpowers_workflow_intervention`.
115+
8 sessions labeled with `task_source=superpowers_workflow_intervention`. 5 carry explicit causetrace_tags in metadata sidecars (Phase 3E-3 headless runs); 3 were manually annotated during Phase 3E-2.
114116

115-
Structured workflow plugins (superpowers) are in active use, but workflow intervention metadata has not yet been propagated into the causetrace metadata system. This lane is defined and scoped but carries zero labeled sessions at this baseline.
117+
| Metric | Value |
118+
|--------|-------|
119+
| Sessions | 8 |
120+
| Events | 42,465 |
121+
| Tagged (causetrace_tags in metadata) | 5 |
122+
| Untagged (manual annotation only) | 3 |
123+
| Evidence level: strong | 5 (tagged) |
124+
| Evidence level: moderate | 3 (manual) |
125+
| Agent | claude-code (8) |
126+
| Runtime | claude-code (8) |
116127

117-
## Unlabeled Sessions
128+
Note: event count inflated by 3 large sessions (10K-19K events each) from Phase 3E-2 annotation. The 5 tagged headless sessions are small (10-26 events each, except 7e8574ec at 1,180 events).
118129

119-
879 sessions (74.9% of metadata corpus) lack explicit lane labels. These sessions have `data_origin` set but do not match any of the four Phase 3E lane criteria:
130+
Parser detection gate is OPEN for this lane. Phase 2 auto-detection in enrichment pipeline is operational.
120131

121-
- No `direct_prompt_native` / `native` / `real_work` data_origin
122-
- No `controlled_benchmark` data_origin
123-
- No `routed_prompt_intervention` task_source
124-
- No `superpowers_workflow_intervention` task_source
132+
## Unlabeled Sessions
125133

126-
These sessions remain in the corpus but are excluded from lane-separated analysis until labeled.
134+
880 sessions lack explicit lane labels. These sessions have `data_origin` set but do not match any of the four Phase 3E lane criteria.
127135

128136
## Lane Comparison Summary
129137

130138
| Metric | direct_prompt_native | controlled_prompt_morphology | routed_prompt_intervention | superpowers_workflow_intervention |
131139
|--------|---------------------|------------------------------|---------------------------|----------------------------------|
132-
| Sessions | 101 | 3 | 0 | 0 |
133-
| Events | 32,141 | 135 | 0 | 0 |
134-
| Avg events/session | 318 | 45 | - | - |
135-
| Long sessions | 38 | 0 | 0 | 0 |
136-
| AUQ sessions | 5 | 0 | 0 | 0 |
137-
| Human intervention | 5 | 0 | 0 | 0 |
138-
| Failure | 1 | 0 | 0 | 0 |
139-
| Agent breadth | 5 | 1 | 0 | 0 |
140-
| Runtime breadth | 6 | 0 | 0 | 0 |
141-
| Task breadth | 8 | 0 | 0 | 0 |
140+
| Sessions | 101 | 3 | 0 | 8 |
141+
| Events | 32,141 | 135 | 0 | 42,465 |
142+
| Tagged | N/A (native) | 0 | 0 | 5 |
143+
| Evidence: strong | N/A | 0 | 0 | 5 |
144+
| Evidence: moderate | N/A | 0 | 0 | 3 |
145+
| Agent breadth | 5 | 1 | 0 | 1 |
142146

143147
## Current Cautions
144148

145-
- `direct_prompt_native` is the only lane with sufficient sample size for any hypothesis check.
149+
- `direct_prompt_native` is the only lane with sufficient sample size for statistical hypothesis checks.
150+
- `superpowers_workflow_intervention` (8 sessions) is dominated by 3 large manually-annotated sessions; descriptive observation only.
146151
- `controlled_prompt_morphology` (3 sessions) is too small for validation — it exists only as a lane marker.
147-
- `routed_prompt_intervention` and `superpowers_workflow_intervention` are definitionally scoped but carry no labeled data; they are placeholder lanes.
152+
- `routed_prompt_intervention` carries zero labeled sessions; it is a placeholder lane.
148153
- Do not merge intervention lanes into the native baseline.
149-
- Do not draw cross-lane conclusions when only one lane has data.
150-
- The 879 unlabeled sessions are not a lane — they are a labeling gap.
154+
- Do not draw cross-lane conclusions when only one lane has statistical mass.
155+
- The 880 unlabeled sessions are not a lane — they are a labeling gap.
151156

152157
## Next Action
153158

154-
Establish a labeling pipeline so that `routed_prompt_intervention` and `superpowers_workflow_intervention` sessions accumulate in the corpus. Until then, Phase 3E validation is limited to within-lane analysis of `direct_prompt_native`.
159+
Phase 3E infrastructure is complete. Parser detection gate is OPEN for superpowers_workflow_intervention. Background acquisition continues for all lanes. Tier 2 validation deferred until native failure >= 10, near-failure >= 10. See [closure report](closure_report_v0.2.5.md).
