Commit d57c7c0

Your Name and claude committed
feat: open Phase 4 for evidence-graded theory drafting
Create Phase 4 README with precise scope: theory drafting and consolidation only, not prediction/anomaly/auto-diagnosis/universal policy.

Add theory candidate inventory with 7 candidates across 5 domains, each carrying evidence grade (supported/supported_with_caveat/exploratory/deferred), corpus snapshot, lane scope, denominator, runtime/task caveats, and falsification condition.

Update Phase 3E closure report with precise Phase 4 boundary. Update research index with Phase 4 active (theory drafting) and Phase 5 not open.

Evidence grade distribution: 2 supported, 1 supported_with_caveat, 1 exploratory, 3 deferred.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent b8cb558 commit d57c7c0

4 files changed

Lines changed: 285 additions & 7 deletions

docs/research/README.md

Lines changed: 12 additions & 1 deletion
@@ -12,7 +12,8 @@ This directory groups the research tracks and branch studies that sit alongside
 | Phase 3C | complete | Metadata & provenance |
 | [Phase 3D](phase3d/README.md) | **complete** | Hypothesis registry + Tier 1 validation |
 | [Phase 3E](phase3e/README.md) | **complete** | Controlled transition & intervention-aware validation |
-| Phase 4 | **open** | Theory finalization |
+| [Phase 4](phase4/README.md) | **active** | Runtime morphology theory drafting (evidence-graded, not finalized) |
+| Phase 5 | **not open** | Evaluation, diagnostics, prediction |
 
 ## Current Corpus Snapshot
 
@@ -34,6 +35,16 @@ Phase 3D delivered the hypothesis registry (19 hypotheses, 8 categories), comple
 
 Phase 3E delivered the intervention lane infrastructure (4 lanes, parser detection gate, auto-detection in enrichment), completed 3 sub-phases (baseline, annotation, instrumentation), opened the superpowers_workflow_intervention gate (5 tagged sessions), and honestly deferred Tier 2 validation (failure samples genuinely rare: 1/101 native failure, 5/101 near-failure). Phase 2 auto-detection is operational for superpowers lane. See [closure report](phase3e/closure_report_v0.2.5.md).
 
+## Phase 4 Boundary
+
+Phase 4 is open for **evidence-graded theory drafting and consolidation only**. It must not enter:
+
+- Prediction, anomaly detection, or automatic diagnosis
+- Universal prompt policy defaulting
+- Cross-lane aggregation without lane disclosure
+- Promotion of exploratory findings to stable theory without additional evidence
+- Phase 5 (evaluation / diagnostics)
+
 ## Cross-project Branch Studies
 
 - [Cross-project Prompt Morphology Study](branches/cross_project_prompt_morphology/README.md)

docs/research/phase3e/closure_report_v0.2.5.md

Lines changed: 10 additions & 6 deletions
@@ -132,10 +132,14 @@ Tier 2 validation is honestly deferred — the bottleneck is corpus failure dens
 
 ## Next Phase
 
-Phase 4 (Theory Finalization) can be opened. Phase 4 scope is constrained by the Phase 3E operating rules that remain in effect:
+Phase 4 is open for **evidence-graded theory drafting and consolidation only**. It is explicitly NOT open for:
 
-- All claims must bind to a specific corpus snapshot and lane
-- Every percentage must include its denominator
-- Negative results are first-class entries
-- Do not promote hypotheses without corpus-backed validation
-- Intervention lane findings do not become universal policy without additional validation
+- Prediction, anomaly detection, or automatic diagnosis
+- Universal prompt policy defaulting
+- Promotion of exploratory findings to stable theory without additional evidence
+- Cross-lane aggregation without lane disclosure
+- Phase 5 (evaluation / diagnostics)
+
+Phase 4-1 deliverable: theory candidate inventory with evidence grading (`supported`, `supported_with_caveat`, `exploratory`, `inconclusive`, `deferred`). Each candidate must carry corpus snapshot, lane scope, denominator, runtime/task caveats, and falsification condition.
+
+Phase 3E operating rules carry forward into Phase 4 unchanged.

docs/research/phase4/README.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# Phase 4: Runtime Morphology Theory Drafting
+
+Phase 4 consolidates evidence-graded theory candidates from Phase 3D (hypothesis registry) and Phase 3E (intervention-aware validation). It drafts, grades, and organizes theory statements. It does not finalize, productize, or operationalize them.
+
+## Position
+
+- Phase 3D: complete
+- Phase 3E: complete
+- Phase 4: **active** (theory drafting only)
+- Phase 5: not open (evaluation, diagnostics, prediction)
+
+## Mission
+
+Convert the strongest evidence-backed findings from Phase 3D and Phase 3E into graded theory candidates. Each candidate must carry an evidence grade, a corpus snapshot, a lane scope, a denominator, runtime/task caveats, and a falsification condition.
+
+## What Phase 4 Is
+
+- Evidence-graded theory drafting
+- Consolidation of Phase 3D + Phase 3E findings into theory statements
+- Organization of theory candidates by domain (runtime morphology, workflow intervention, failure, prompt posture)
+- Honest documentation of what is underdetermined
+- Maintenance of the hypothesis registry as a living document
+
+## What Phase 4 Is NOT
+
+- Theory finalization or publication of stable conclusions
+- Prediction of agent behavior
+- Anomaly detection or scoring
+- Automatic diagnosis of trace quality
+- Universal prompt policy recommendations
+- Cross-lane aggregation without lane disclosure
+- Promotion of exploratory findings to stable theory
+- Merging intervention lane findings into native baseline conclusions
+- Phase 5 (evaluation / diagnostics)
+
+## Evidence Grades
+
+Every theory candidate must carry exactly one grade:
+
+| Grade | Meaning | Criteria |
+|-------|---------|----------|
+| `supported` | Evidence sufficient under current corpus constraints | Multiple independent sessions, disclosed denominator, runtime/task distribution reported, falsification condition stated |
+| `supported_with_caveat` | Evidence present but sample-limited or lane-restricted | Same as supported but gated on lane scope or corpus size |
+| `exploratory` | Trend visible but sample too small for confidence | <10 sessions in relevant lane, or single-runtime only |
+| `inconclusive` | Cannot determine from current corpus | Conflicting signals, or insufficient per-condition samples |
+| `deferred` | Explicitly not evaluated | Gated on corpus growth, controlled benchmark, or tag accumulation |
+
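As an illustration outside the committed diff, the `exploratory` row is the one criterion in the grade table that is stated numerically, so it can be checked mechanically. The sketch below is a minimal example under that reading; the function name `meets_exploratory_criteria` and its parameters are hypothetical and do not appear in the repository, and the other grades are left out because their criteria are qualitative.

```python
def meets_exploratory_criteria(lane_sessions: int, runtimes: set[str]) -> bool:
    """Return True when a candidate matches the `exploratory` row of the grade table:
    fewer than 10 sessions in the relevant lane, or evidence from a single runtime.

    Hypothetical helper for illustration; not part of the commit.
    """
    return lane_sessions < 10 or len(runtimes) == 1


# T-WI-001 in the inventory: 8 sessions, claude-code only.
print(meets_exploratory_criteria(8, {"claude-code"}))  # True
```

A `True` result only says the row's stated criteria are matched; deciding the final grade still requires the human evidence review the README describes.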
+## Theory Candidate Structure
+
+Every candidate must include:
+
+- **Claim**: one-sentence theory statement (falsifiable)
+- **Evidence grade**: from the table above
+- **Supporting corpus snapshot**: date and metrics
+- **Lane**: which lane(s) the evidence comes from
+- **Denominator**: session count the claim is based on
+- **Runtime/task caveats**: distribution limitations
+- **Falsification condition**: what evidence would disprove it
+- **Status**: `active`, `under_review`, `superseded`, `retracted`
+- **Source hypotheses**: Phase 3D registry entries that fed this candidate
+
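The required fields listed above map naturally onto a record type. The following is a minimal sketch, assuming a Python representation that the commit itself does not define; the class names `Grade` and `TheoryCandidate`, the field names, and the `missing_fields` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Grade(Enum):
    """The five evidence grades named in the Phase 4 README."""
    SUPPORTED = "supported"
    SUPPORTED_WITH_CAVEAT = "supported_with_caveat"
    EXPLORATORY = "exploratory"
    INCONCLUSIVE = "inconclusive"
    DEFERRED = "deferred"


VALID_STATUSES = {"active", "under_review", "superseded", "retracted"}


@dataclass
class TheoryCandidate:
    # Hypothetical field names; they mirror the bullet list in the README.
    candidate_id: str
    claim: str
    grade: Grade                      # exactly one grade per candidate
    corpus_snapshot: str
    lanes: list[str]
    denominator: int                  # 0 is legal (an unpopulated lane)
    caveats: str
    falsification_condition: str
    status: str = "active"
    source_hypotheses: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return names of required fields that are empty or invalid."""
        problems = [
            name
            for name in ("claim", "corpus_snapshot", "caveats", "falsification_condition")
            if not getattr(self, name).strip()
        ]
        if not self.lanes:
            problems.append("lanes")
        if self.status not in VALID_STATUSES:
            problems.append("status")
        return problems


t_rm_001 = TheoryCandidate(
    candidate_id="T-RM-001",
    claim="In the current native strict lane, dominant_chain is the default runtime morphology.",
    grade=Grade.SUPPORTED,
    corpus_snapshot="2026-06-13",
    lanes=["direct_prompt_native"],
    denominator=100,
    caveats="Runtime distribution is uneven.",
    falsification_condition=">=15% non-dominant_chain sessions in a new runtime.",
    source_hypotheses=["H-RM-001"],
)
print(t_rm_001.missing_fields())  # []
```

Typing the grade as an enum makes the "exactly one grade" rule structural, while an empty `source_hypotheses` list stays valid because two inventory entries record no source hypothesis.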
+## Documents
+
+- [Theory Candidate Inventory](theory_candidate_inventory_v0.2.5.md) — All current theory candidates with evidence grades, supporting data, and caveats
+
+## Operating Rules
+
+- Do not remove or downgrade negative results.
+- Do not promote a candidate beyond its evidence grade.
+- Do not merge intervention lane evidence into native lane theory statements.
+- Every claim must bind to a specific corpus snapshot and lane.
+- Every percentage must include its denominator.
+- Every runtime conclusion must disclose runtime distribution.
+- Cross-lane comparison may report trends only.
+- Do not enter Phase 5.
+- Do not implement prediction, anomaly detection, or auto-diagnosis.
+- Do not create universal prompt policy defaults.
+- Do not modify topology taxonomy or readiness gates unless explicitly justified by evidence review.
+
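The rule that every percentage must include its denominator is easy to enforce at the point where a share is rendered. This is a small sketch of one way to do that, not something the commit provides; `format_share` is a hypothetical helper name.

```python
def format_share(count: int, denominator: int) -> str:
    """Render a share as 'count/denominator (pct%)' so the denominator is never dropped.

    Hypothetical helper for illustration; not part of the commit.
    """
    if denominator <= 0:
        raise ValueError("denominator must be positive to report a percentage")
    return f"{count}/{denominator} ({100 * count / denominator:.1f}%)"


# Figures quoted elsewhere in this commit:
print(format_share(93, 100))  # 93/100 (93.0%)
print(format_share(5, 101))   # 5/101 (5.0%)
```

Refusing a zero denominator keeps an unpopulated lane, such as the 0-session routed-prompt lane, from being reported as a percentage at all.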
+## Current State
+
+Phase 4-1 is active. First deliverable: theory candidate inventory with evidence grading. Seven candidates identified from Phase 3D + Phase 3E evidence. No new hypotheses are being registered — Phase 4 consolidates, it does not expand.

docs/research/phase4/theory_candidate_inventory_v0.2.5.md

Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
+# Phase 4 Theory Candidate Inventory v0.2.5
+
+This document lists all current runtime morphology theory candidates with evidence grades, supporting data, caveats, and falsification conditions. It consolidates Phase 3D hypothesis validation results and Phase 3E intervention-aware findings.
+
+No candidate here is a finalized theory. All are drafts with explicit evidence boundaries.
+
+## Corpus Snapshot
+
+- Date: 2026-06-13
+- Metadata sessions: 992
+- Data sessions: 1,517
+- Events: 131,952
+- Runtime breadth: 7
+- Task breadth: 9
+- Native strict sessions: 100
+- Lanes: direct_prompt_native (101), superpowers_workflow_intervention (8), controlled_prompt_morphology (3), routed_prompt_intervention (0)
+
+---
+
+## T-RM-001: Dominant Chain as Default Native Morphology
+
+| Field | Value |
+|-------|-------|
+| **Claim** | In the current native strict lane, `dominant_chain` is the default runtime morphology. |
+| **Evidence grade** | `supported` |
+| **Lane** | `direct_prompt_native` |
+| **Denominator** | 100 native strict sessions |
+| **Supporting data** | 93/100 native strict sessions exhibit dominant_chain topology. |
+| **Runtime distribution** | claude-code (50), opencode (46), codex (3), aider (1), Sisyphus (1) — 5 runtimes |
+| **Task distribution** | 8 task types represented; feature_add (37), exploration (28), bug_fix (12) are top 3 |
+| **Caveats** | Runtime distribution is uneven (claude-code + opencode = 96%). Aider and Sisyphus under-represented. |
+| **Falsification condition** | If >=15% of native strict sessions in a new runtime show non-dominant_chain default morphology, this candidate must be qualified per-runtime. |
+| **Status** | `active` |
+| **Source hypotheses** | H-RM-001 (Phase 3D Tier 1, supported) |
+
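The T-RM-001 falsification condition is a simple threshold test per runtime. A minimal sketch of that test follows, assuming per-runtime session counts are available; the function name and parameters are hypothetical, and the example counts are invented inputs rather than corpus data.

```python
def needs_per_runtime_qualification(
    non_dominant: int, runtime_sessions: int, threshold_pct: int = 15
) -> bool:
    """T-RM-001 falsification check for one new runtime.

    True when at least `threshold_pct` percent of that runtime's native strict
    sessions show a non-dominant_chain morphology. Integer arithmetic avoids
    floating-point edge cases exactly at the threshold.
    Hypothetical helper for illustration; not part of the commit.
    """
    if runtime_sessions <= 0:
        raise ValueError("no sessions observed for this runtime")
    return non_dominant * 100 >= threshold_pct * runtime_sessions


# Invented example inputs (not corpus data):
print(needs_per_runtime_qualification(3, 20))  # True: 3/20 is exactly 15%
print(needs_per_runtime_qualification(2, 20))  # False: 2/20 is 10%
```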
+## T-RM-002: Multi-Root Exploration as Minority Morphology
+
+| Field | Value |
+|-------|-------|
+| **Claim** | `multi_root_exploration` is a minority morphology in native real_work sessions, not a default path. |
+| **Evidence grade** | `supported` |
+| **Lane** | `direct_prompt_native` |
+| **Denominator** | 100 native strict sessions |
+| **Supporting data** | 1/100 native strict sessions exhibit multi_root_exploration. |
+| **Runtime distribution** | The single multi_root session is opencode. |
+| **Task distribution** | N/A (single session) |
+| **Caveats** | Low incidence rate may be a property of the current task mix (dominated by feature_add and exploration), not a universal property. |
+| **Falsification condition** | If >=5% of native sessions in exploration or review task types show multi_root_exploration, the "minority" claim needs qualification. |
+| **Status** | `active` |
+| **Source hypotheses** | H-RM-003 (Phase 3D Tier 1, supported) |
+
+## T-RM-003: Feature_Add Tendency Toward Dominant Chain
+
+| Field | Value |
+|-------|-------|
+| **Claim** | In the current native lane, `feature_add` tasks tend toward `dominant_chain` topology. |
+| **Evidence grade** | `supported_with_caveat` |
+| **Lane** | `direct_prompt_native` |
+| **Denominator** | 37 feature_add sessions in native strict |
+| **Supporting data** | 37/37 feature_add sessions exhibit dominant_chain. Branch collapse was not testable (insufficient collapse samples). |
+| **Runtime distribution** | Primarily claude-code and opencode |
+| **Caveats** | Single topology outcome may be an artifact of task simplicity in the current corpus, not a structural property of feature_add. Branch collapse claim could not be evaluated. |
+| **Falsification condition** | If a feature_add session with >=100 events shows non-dominant_chain topology, or if a multi-file feature_add session shows multi_root or branchy topology, the claim must be qualified. |
+| **Status** | `active` |
+| **Source hypotheses** | H-TT-002 (Phase 3D Tier 1, supported with caveat) |
+
+## T-WI-001: Superpowers Workflow May Amplify Trace Volume
+
+| Field | Value |
+|-------|-------|
+| **Claim** | `superpowers_workflow_intervention` sessions may exhibit amplified event density and long-chain structure compared to native direct-prompt sessions, but sample size is insufficient for stable comparison. |
+| **Evidence grade** | `exploratory` |
+| **Lane** | `superpowers_workflow_intervention` |
+| **Denominator** | 8 sessions (5 tagged, 3 manual annotation) |
+| **Supporting data** | 3 large SP sessions account for 41,221 of 42,465 lane events (avg ~13,740 events/session). Native lane avg: 318 events/session. No formal comparison performed (cross-lane comparison restricted to trend reporting only). |
+| **Runtime distribution** | claude-code only (8/8) |
+| **Task distribution** | Not annotated for SP lane sessions |
+| **Caveats** | Single-runtime. 3 outlier sessions dominate lane metrics. Not a validated finding — exploratory observation only. Must not be generalized to "superpowers always amplifies trace volume." |
+| **Falsification condition** | If 10+ additional SP sessions across >=2 runtimes show event density within native range (200-500 events/session), the amplification signal may be an artifact of the 3 large annotation sessions. |
+| **Status** | `active` |
+| **Source hypotheses** | None direct; derived from Phase 3E-1 lane baseline observation |
+
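The outlier caveat for T-WI-001 can be made concrete by redoing the arithmetic on the figures in the table. The sketch below uses only numbers quoted above and assumes that the 42,465 lane events cover the same 8 sessions given as the denominator; the variable names are illustrative.

```python
# Figures quoted in the T-WI-001 table.
lane_events = 42_465
large_session_events = 41_221
lane_sessions = 8
large_sessions = 3

# Average over the 3 large sessions (the "~13,740" in the table).
large_avg = large_session_events / large_sessions

# Events and average for the remaining 5 sessions (derived, not stated in the table).
rest_events = lane_events - large_session_events
rest_avg = rest_events / (lane_sessions - large_sessions)

print(f"large sessions: {large_avg:.0f} events/session")        # 13740
print(f"remaining sessions: {rest_avg:.1f} events/session")     # 248.8
print(f"share held by large sessions: {large_session_events}/{lane_events}")
```

Under that assumption the five remaining sessions average about 249 events, which sits inside the 200-500 native range named in the falsification condition. This is trend arithmetic only and is consistent with the table's statement that no formal cross-lane comparison was performed.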
+## T-FM-001: Failure Morphology Underdetermined
+
+| Field | Value |
+|-------|-------|
+| **Claim** | Current failure and near-failure sample density is insufficient to characterize failure morphology. Failure topology cannot be typed. |
+| **Evidence grade** | `deferred` |
+| **Lane** | `direct_prompt_native` |
+| **Denominator** | 1 native failure (success=False), 5 near-failure (human_intervention=True) out of 101 native sessions |
+| **Supporting data** | 1/101 native failure, 5/101 near-failure. Tier 2 readiness: failure 1/10 NOT MET, near-failure 5/10 NOT MET. |
+| **Runtime distribution** | N/A |
+| **Task distribution** | N/A |
+| **Caveats** | Low failure rate may reflect genuine agent effectiveness for current task types, or insufficient coverage of failure-prone task categories. |
+| **Falsification condition** | When native failure >= 10 and near-failure >= 10, re-evaluate. If failure topology is then characterizable, this deferral is resolved. |
+| **Status** | `active` |
+| **Source hypotheses** | H-FM-001, H-FM-002, H-EV-004, H-EV-005 (Phase 3D Tier 2, all deferred) |
+
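The Tier 2 readiness line in T-FM-001 follows a fixed pattern: a count, a minimum of 10, and a MET or NOT MET label. A minimal sketch of that gate is shown below; the function name `tier2_readiness` and the dictionary keys are hypothetical, and the real readiness gate in the repository may be implemented differently.

```python
def tier2_readiness(failures: int, near_failures: int, minimum: int = 10) -> dict[str, str]:
    """Label each Tier 2 sample count against the minimum stated in T-FM-001.

    Hypothetical helper for illustration; not part of the commit.
    """
    def label(count: int) -> str:
        status = "MET" if count >= minimum else "NOT MET"
        return f"{count}/{minimum} {status}"

    return {"failure": label(failures), "near_failure": label(near_failures)}


# Counts from the T-FM-001 table: 1 native failure, 5 near-failures.
print(tier2_readiness(1, 5))
# {'failure': '1/10 NOT MET', 'near_failure': '5/10 NOT MET'}
```

Both labels must read `MET` before the re-evaluation described in the falsification condition applies.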
+## T-RP-001: Routed-Prompt Morphology Unobserved
+
+| Field | Value |
+|-------|-------|
+| **Claim** | `routed_prompt_intervention` morphology is currently unobserved. No theory statement can be made about the effect of prompt routing on topology. |
+| **Evidence grade** | `deferred` |
+| **Lane** | `routed_prompt_intervention` |
+| **Denominator** | 0 sessions |
+| **Supporting data** | prompt-routing-skill tag emission spec is defined. Capture path exists. 0 tagged sessions in corpus. Parser detection gate BLOCKED. |
+| **Runtime distribution** | N/A |
+| **Task distribution** | N/A |
+| **Caveats** | Absence is a corpus gap, not evidence that routing has no effect. |
+| **Falsification condition** | When >=5 routed sessions carry causetrace_tags, gate opens and basic lane characterization can begin. |
+| **Status** | `active` |
+| **Source hypotheses** | None (lane unpopulated) |
+
+## T-PM-001: Controlled Prompt Morphology at Pilot-Level Evidence
+
+| Field | Value |
+|-------|-------|
+| **Claim** | Controlled prompt morphology comparison is at pilot-level evidence only. Prompt posture effects on topology are not characterized. |
+| **Evidence grade** | `deferred` |
+| **Lane** | `controlled_prompt_morphology` |
+| **Denominator** | 3 pilot sessions |
+| **Supporting data** | 3 sessions, 135 events total, avg 45 events/session. No prompt variant labeling. Parser detection gate BLOCKED. |
+| **Runtime distribution** | claude-code only |
+| **Task distribution** | Not annotated |
+| **Caveats** | Pilot sessions are minimal and lack variant tagging. Cannot distinguish A/B/C prompt postures. |
+| **Falsification condition** | When controlled benchmark protocol is operational and >=5 sessions per variant carry prompt tags, re-evaluate. |
+| **Status** | `active` |
+| **Source hypotheses** | H-EG-001 (Phase 3D Tier 3, deferred) |
+
+---
+
+## Evidence Grade Distribution
+
+| Grade | Count | Candidates |
+|-------|-------|------------|
+| `supported` | 2 | T-RM-001, T-RM-002 |
+| `supported_with_caveat` | 1 | T-RM-003 |
+| `exploratory` | 1 | T-WI-001 |
+| `deferred` | 3 | T-FM-001, T-RP-001, T-PM-001 |
+| `inconclusive` | 0 | |
+
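The distribution table can be recomputed from the per-candidate grades, which is a cheap consistency check whenever a grade changes. The sketch below hard-codes the seven grades from this inventory; the `grades` mapping is an illustrative structure, not a file in the repository.

```python
from collections import Counter

# Grades as recorded in the seven candidate tables above.
grades = {
    "T-RM-001": "supported",
    "T-RM-002": "supported",
    "T-RM-003": "supported_with_caveat",
    "T-WI-001": "exploratory",
    "T-FM-001": "deferred",
    "T-RP-001": "deferred",
    "T-PM-001": "deferred",
}

distribution = Counter(grades.values())

# Counter returns 0 for grades no candidate carries, such as `inconclusive`.
for grade in ("supported", "supported_with_caveat", "exploratory", "deferred", "inconclusive"):
    print(f"{grade}: {distribution[grade]}")
```

The printed counts are 2, 1, 1, 3, and 0, matching the table and the commit message.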
+## Theory Domain Map
+
+```
+Runtime Morphology (T-RM)
+├── T-RM-001: dominant_chain as default [supported]
+├── T-RM-002: multi_root as minority [supported]
+└── T-RM-003: feature_add → dominant_chain [supported_with_caveat]
+
+Workflow Intervention (T-WI)
+└── T-WI-001: SP may amplify trace volume [exploratory]
+
+Failure Morphology (T-FM)
+└── T-FM-001: failure morphology underdetermined [deferred]
+
+Routed Prompt (T-RP)
+└── T-RP-001: routed-prompt unobserved [deferred]
+
+Prompt Morphology (T-PM)
+└── T-PM-001: controlled prompt pilot-only [deferred]
+```
+
+## Operating Rules
+
+- Do not promote a candidate beyond its evidence grade without new corpus evidence.
+- Do not remove deferred candidates — they document gaps, not failures.
+- Do not merge T-WI-001 into native morphology conclusions.
+- Do not use T-RM-001 as a universal claim — it is scoped to the current native strict lane.
+- All deferred candidates carry explicit re-evaluation criteria.
+- Negative spaces (T-FM-001, T-RP-001, T-PM-001) are first-class entries.
+
+## What Is NOT Here
+
+- Prediction models or anomaly scorers
+- Automatic diagnosis rules
+- Universal prompt policy recommendations
+- Cross-lane aggregated claims
+- Claims without denominators
+- Claims without falsification conditions
+- Tool-specific topology prescriptions
