
Commit cd29045

Your Name and claude committed
feat: add Safety-control Runtime Morphology as Phase 4 theory direction
Add 5 exploratory theory candidates (T-SC-001 through T-SC-005) defining a fifth morphology domain: how coding-agent runtime behavior changes when task-completion objectives interact with safety boundaries, need_review rules, fallback paths, hard-stops, and human intervention.

Explicitly scope this as runtime control morphology study, not jailbreak reproduction, attack research, content safety classification, or model safety benchmarking.

All candidates start at exploratory grade. None are validated. All require corpus evidence before promotion.

Define 11 candidate observable signals and 4 corpus requirements for future evidence gathering. Cross-reference with Phase 3D deferred hypotheses and Phase 4 theory inventory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent d57c7c0 commit cd29045

4 files changed

Lines changed: 303 additions & 2 deletions


docs/research/README.md

Lines changed: 7 additions & 1 deletion
@@ -37,13 +37,19 @@ Phase 3E delivered the intervention lane infrastructure (4 lanes, parser detecti
3737

3838
## Phase 4 Boundary
3939

40-
Phase 4 is open for **evidence-graded theory drafting and consolidation only**. It must not enter:
40+
Phase 4 is open for **evidence-graded theory drafting and consolidation only**. Current directions:
41+
42+
- [Theory Candidate Inventory](phase4/theory_candidate_inventory_v0.2.5.md) — 7 candidates across 4 domains
43+
- [Safety-Control Runtime Morphology](phase4/safety_control_morphology_candidates_v0.2.5.md) — 5 exploratory candidates studying runtime behavior at safety boundaries
44+
45+
Phase 4 must not enter:
4146

4247
- Prediction, anomaly detection, or automatic diagnosis
4348
- Universal prompt policy defaulting
4449
- Cross-lane aggregation without lane disclosure
4550
- Promotion of exploratory findings to stable theory without additional evidence
4651
- Phase 5 (evaluation / diagnostics)
52+
- Jailbreak reproduction or attack research
4753

4854
## Cross-project Branch Studies
4955

docs/research/phase4/README.md

Lines changed: 7 additions & 1 deletion
@@ -62,6 +62,7 @@ Every candidate must include:
6262
## Documents
6363

6464
- [Theory Candidate Inventory](theory_candidate_inventory_v0.2.5.md) — All current theory candidates with evidence grades, supporting data, and caveats
65+
- [Safety-Control Runtime Morphology](safety_control_morphology_candidates_v0.2.5.md) — Phase 4 theory candidate direction studying runtime control morphology at safety boundaries (exploratory, not validated)
6566

6667
## Operating Rules
6768

@@ -79,4 +80,9 @@ Every candidate must include:
7980

8081
## Current State
8182

82-
Phase 4-1 is active. First deliverable: theory candidate inventory with evidence grading. Seven candidates identified from Phase 3D + Phase 3E evidence. No new hypotheses are being registered — Phase 4 consolidates, it does not expand.
83+
Phase 4-1 is active. Two documents published:
84+
85+
- **Theory candidate inventory**: 7 candidates across 4 domains (default morphology, workflow intervention, failure, prompt/routed/controlled)
86+
- **Safety-control runtime morphology**: 5 exploratory candidates (T-SC-001 through T-SC-005) defining a fifth domain studying runtime behavior at safety boundaries
87+
88+
All candidates are evidence-graded. No candidate has been promoted beyond its grade. No new hypotheses are being registered: Phase 4 consolidates and extends; it does not finalize.

docs/research/phase4/safety_control_morphology_candidates_v0.2.5.md

Lines changed: 280 additions & 0 deletions
@@ -0,0 +1,280 @@
1+
# Safety-Control Runtime Morphology v0.2.5
2+
3+
A Phase 4 theory candidate direction studying how coding-agent runtime behavior changes when task-completion objectives interact with safety boundaries, workflow gates, need_review rules, fallback paths, hard-stops, and human intervention.
4+
5+
## Position
6+
7+
- Phase 4: active (theory drafting)
8+
- Direction: Safety-control Runtime Morphology
9+
- Status: **theory candidate direction, exploratory, not validated**
10+
- Parent: Phase 4 Theory Candidate Inventory
11+
12+
## What This Is
13+
14+
A research direction within causetrace runtime morphology that studies the observable topology patterns produced when agents navigate conflicts between:
15+
16+
1. **Task-completion pressure**: the agent's objective to finish the task
17+
2. **Safety-control boundaries**: need_review rules, hard-stops, fallback paths, uncertainty gates, human confirmation requirements, business rule constraints
18+
19+
The core question: when these forces conflict, what runtime topology patterns emerge?
20+
21+
This direction extends causetrace's existing morphology categories (default, exploration, failure, intervention) with a fifth: **safety-control morphology**.
22+
23+
## What This Is NOT
24+
25+
- **NOT** jailbreak reproduction or attack research
26+
- **NOT** a study of how to circumvent model safety training
27+
- **NOT** a content safety classifier or harmful-output detector
28+
- **NOT** a universal model safety benchmark
29+
- **NOT** a red-teaming framework
30+
- **NOT** an adversarial prompt engineering guide
31+
- **NOT** a replacement for formal safety evaluation (e.g., METR, Apollo, UK AISI)
32+
- **NOT** a core schema or topology taxonomy change
33+
34+
This direction studies **observable runtime control morphology**, not model-internal safety mechanisms. Causetrace sees tool calls, causal chains, topology transitions, and intervention events — it does not see model weights, activations, or training data.
35+
36+
## Motivation
37+
38+
Recent frontier-model safety incidents and internal safety-collapse research suggest that strong models may fail not only through external jailbreaks, but through **structural conflict between task-completion pressure and safety-control boundaries**.
39+
40+
In coding-agent runtimes, this conflict can manifest as:
41+
42+
- An agent continuing toward completion despite uncertainty (OCR failure, low confidence, ambiguous template match)
43+
- An agent auto-processing what should require human review (false_positive_tables, unsigned documents)
44+
- An agent bypassing a safety gate because the task objective dominates
45+
- A workflow intervention that reduces unsafe completion but increases trace length
46+
- A human correction that triggers a topology regime shift
47+
48+
These are not abstract safety research questions. They are runtime phenomena already observed in causetrace traces, particularly in the automatic-signature domain where OCR reliability, template matching, and business rule compliance interact with agent autonomy.
49+
50+
## Relationship to Causetrace Main Line
51+
52+
This direction is **compatible with and enhances** the causetrace main line. It studies the same primitives:
53+
54+
| Causetrace Primitive | Safety-Control Lens |
55+
|---------------------|---------------------|
56+
| Runtime causality | What caused the agent to stop / continue / bypass? |
57+
| Topology morphology | Does safety pressure change branch/retry/collapse patterns? |
58+
| Intervention | Do workflow / human interventions reduce safety-control collapse? |
59+
| Failure / near-failure | Is near-failure more informative than final failure for safety analysis? |
60+
| Control transition | What does a safety-control transition look like in the event DAG? |
61+
62+
It shifts the object of study from "what dangerous content did the model output?" to "how did the runtime behave at the safety boundary — stop, continue, bypass, fallback, escalate, or collapse?"
63+
64+
## Core Research Questions
65+
66+
### Q1: Do safety-control boundaries alter runtime topology?
67+
68+
When explicit need_review, hard-stop, or fallback rules are present, does the agent's runtime topology differ from tasks without such boundaries?
69+
70+
Candidate observable differences:
71+
- Fewer invalid retries (agent stops instead of retrying)
72+
- Fewer unsafe auto-completions (agent escalates instead of guessing)
73+
- More branch_collapse (agent converges to safe fallback)
74+
- More AskUserQuestion events (agent requests human decision)
75+
- Shorter chains after first safety signal
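
The candidate differences above could be measured per session roughly as follows. This is a minimal sketch, not causetrace's implementation: it assumes a session is available as an ordered list of event-type strings, and the `Q1Metrics` and `q1_metrics` names, the `SAFETY_SIGNAL_TYPES` set, and the `Read`/`Edit` tool names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed event-type names. Only AskUserQuestion and branch_collapse are named
# in this document; treating the other signals as event types is hypothetical.
SAFETY_SIGNAL_TYPES = {"need_review_triggered", "hard_stop", "AskUserQuestion"}


@dataclass
class Q1Metrics:
    ask_user_count: int
    branch_collapse_count: int
    # Events recorded after the first safety signal; None if no signal occurred.
    events_after_first_signal: Optional[int]


def q1_metrics(event_types: List[str]) -> Q1Metrics:
    """Count Q1 candidate observables for one session's ordered event types."""
    first = next(
        (i for i, t in enumerate(event_types) if t in SAFETY_SIGNAL_TYPES), None
    )
    return Q1Metrics(
        ask_user_count=event_types.count("AskUserQuestion"),
        branch_collapse_count=event_types.count("branch_collapse"),
        events_after_first_signal=(
            None if first is None else len(event_types) - first - 1
        ),
    )


# Toy session, not corpus data.
metrics = q1_metrics(["Read", "AskUserQuestion", "Edit", "branch_collapse"])
```

Comparing these metrics between boundary and non-boundary sessions would still require matched tasks, as the falsification condition for T-SC-001 states.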
76+
77+
### Q2: When does the agent bypass safety boundaries?
78+
79+
Under what runtime conditions does the agent continue toward completion despite uncertainty, missing evidence, or required human review?
80+
81+
Candidate triggers:
82+
- OCR unavailable or low confidence
83+
- Template matching unstable
84+
- Business rule conflict (e.g., false_positive_tables)
85+
- Test failure without clear recovery path
86+
- User not available for confirmation
87+
- Task-completion pressure from prompt framing
88+
89+
Morphology questions:
90+
- Does the agent retry with different parameters (tool-level bypass)?
91+
- Does it proceed with a fallback value (data-level bypass)?
92+
- Does it skip the gating check entirely (control-flow bypass)?
93+
- Does it mark the result as complete despite uncertainty (label-level bypass)?
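
The four bypass levels in these questions could be recorded as an annotation vocabulary. The sketch below is illustrative only: `BypassLevel`, `classify_bypass`, and the four boolean flags are hypothetical names for what an annotator might mark on one boundary encounter, not existing causetrace fields.

```python
from enum import Enum
from typing import List


class BypassLevel(Enum):
    TOOL = "tool-level"            # retried with different parameters
    DATA = "data-level"            # proceeded with a fallback value
    CONTROL_FLOW = "control-flow"  # skipped the gating check entirely
    LABEL = "label-level"          # marked complete despite uncertainty


def classify_bypass(
    retried_with_new_params: bool,
    used_fallback_value: bool,
    skipped_gate: bool,
    marked_complete_uncertain: bool,
) -> List[BypassLevel]:
    """Return every bypass level flagged for one boundary encounter."""
    flags = [
        (BypassLevel.TOOL, retried_with_new_params),
        (BypassLevel.DATA, used_fallback_value),
        (BypassLevel.CONTROL_FLOW, skipped_gate),
        (BypassLevel.LABEL, marked_complete_uncertain),
    ]
    return [level for level, flagged in flags if flagged]
```

Returning a list rather than a single value reflects the assumption that one encounter can show several bypass levels at once.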
94+
95+
### Q3: Can workflow intervention reduce safety-control collapse?
96+
97+
Compare intervention lanes for their effect on unsafe continuation:
98+
99+
| Lane | Hypothesis |
100+
|------|-----------|
101+
| `direct_prompt_native` | Highest risk of safety-control collapse (no guardrails) |
102+
| `expanded_constrained_prompt` | May reduce collapse through explicit constraints |
103+
| `routed_prompt_intervention` | May select safer posture for safety-sensitive tasks |
104+
| `superpowers_workflow_intervention` | Staged verification may catch collapse before completion |
105+
106+
Observable signals:
107+
- Unsafe continuation rate
108+
- False positive rate (auto-processed when should be reviewed)
109+
- Invalid retry rate
110+
- Human rescue rate
111+
- Late-stage rollback rate
112+
113+
Note: this comparison requires tagged sessions in all lanes. Currently only `direct_prompt_native` and `superpowers_workflow_intervention` have sessions. Cross-lane comparison is restricted to trend reporting only.
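
A per-lane rate such as the unsafe continuation rate could be computed as sketched below. This assumes each session has already been annotated with a boolean `unsafe_continuation` flag, which is not instrumented today; the function name and the input shape are hypothetical. The denominator is returned alongside the rate so that lane sizes stay visible in any trend report.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def unsafe_continuation_by_lane(
    sessions: Iterable[Tuple[str, bool]],
) -> Dict[str, Tuple[int, int, float]]:
    """Map lane -> (unsafe_count, denominator, rate).

    `sessions` yields (lane, unsafe_continuation) pairs, one per annotated session.
    """
    counts = defaultdict(lambda: [0, 0])
    for lane, unsafe in sessions:
        counts[lane][0] += int(unsafe)
        counts[lane][1] += 1
    return {lane: (u, n, u / n) for lane, (u, n) in counts.items()}


# Toy annotations for illustration only, not corpus results.
toy = [
    ("direct_prompt_native", True),
    ("direct_prompt_native", False),
    ("superpowers_workflow_intervention", False),
]
rates = unsafe_continuation_by_lane(toy)
```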
114+
115+
### Q4: What is the role of human intervention in safety control?
116+
117+
Human intervention may function as more than failure recovery — it may be an **external safety-control signal**:
118+
119+
```
120+
agent drift (toward unsafe completion)
121+
→ human correction (observation or explicit correction mark)
122+
→ topology regime shift (branch collapse, hard-stop, safe fallback)
123+
→ recovery / safe completion / documented refusal
124+
```
125+
126+
Key questions:
127+
- Does human intervention produce a detectable topology regime shift?
128+
- Is the shift different from self-correction (tool error → retry)?
129+
- Does the shift depend on intervention timing (early vs. late-stage)?
130+
- Does workflow structure (superpowers staged verification) make human intervention more effective?
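
One possible way to make "detectable topology regime shift" concrete is to compare the mix of event types before and after the intervention. The sketch below uses total variation distance between the two event-type distributions. This metric is an assumption chosen for illustration, not a method the direction has adopted, and `regime_shift_score` and the example event names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def type_distribution(event_types: List[str]) -> Dict[str, float]:
    """Relative frequency of each event type in a non-empty window."""
    total = len(event_types)
    return {t: c / total for t, c in Counter(event_types).items()}


def regime_shift_score(
    event_types: List[str], intervention_index: int
) -> Optional[float]:
    """Total variation distance between the event-type mix before and after
    the event at `intervention_index`. 0.0 means an identical mix, 1.0 means
    no shared event types. Returns None if either window is empty."""
    before = event_types[:intervention_index]
    after = event_types[intervention_index + 1:]
    if not before or not after:
        return None
    p, q = type_distribution(before), type_distribution(after)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
```

Applying the same score around a self-correction event (tool error followed by retry) would give a baseline for the second key question. A distribution-level score ignores ordering and causal structure, so it can only be a first-pass signal.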
131+
132+
### Q5: Is near-failure more informative than final failure for safety analysis?
133+
134+
Current corpus observation: coding agents frequently rollback, retry, and self-repair, leading to very low final failure rates (1/101 native failure, 5/101 near-failure).
135+
136+
This suggests failure may not manifest as `success=false`. It may manifest as:
137+
138+
- Long chains with many internal corrections
139+
- Retry-heavy paths that eventually succeed
140+
- Repeated rollback at safety boundaries
141+
- Near-failure that was rescued (human or self-repair)
142+
- Unsafe path avoided late (last-moment correction)
143+
- need_review triggered but then overridden
144+
- Human rescue that prevented a failure label
145+
146+
If this is true, final success/failure labeling is insufficient for safety-control morphology analysis. Derived signals (need_review_triggered, retry_after_uncertainty, late_stage_correction, manual_rescue) may be more informative.
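
As an illustration of a derived signal, `late_stage_correction` could be computed from causal depth roughly as follows. This sketch assumes each event has at most one causal parent and that the graph is acyclic; the function names, the `min_fraction` threshold, and the set of correction event ids are hypothetical inputs, not existing causetrace fields.

```python
from typing import Dict, List, Optional


def causal_depths(parents: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Depth of every event from its root (root depth is 0).

    `parents` maps event id -> parent id, or None for a root. Iterative to
    avoid recursion limits on long chains."""
    depths: Dict[str, int] = {}
    for event_id in parents:
        path = []
        cur: Optional[str] = event_id
        # Walk up until reaching a root or an event whose depth is known.
        while cur is not None and cur not in depths:
            path.append(cur)
            cur = parents[cur]
        base = -1 if cur is None else depths[cur]
        for node in reversed(path):
            base += 1
            depths[node] = base
    return depths


def late_stage_corrections(
    parents: Dict[str, Optional[str]],
    correction_ids: List[str],
    min_fraction: float = 0.75,
) -> List[str]:
    """Corrections whose depth is at least `min_fraction` of the maximum depth."""
    depths = causal_depths(parents)
    max_depth = max(depths.values(), default=0)
    if max_depth == 0:
        return []
    return [e for e in correction_ids if depths[e] / max_depth >= min_fraction]
```

The 0.75 cutoff is arbitrary; any real threshold would need to be chosen against corpus evidence.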
147+
148+
## Theory Candidates
149+
150+
All candidates start at `exploratory` grade. None are validated. All require corpus evidence before promotion.
151+
152+
### T-SC-001: Safety-Control Boundaries May Alter Runtime Topology
153+
154+
| Field | Value |
155+
|-------|-------|
156+
| **Claim** | Safety-control boundaries such as need_review, hard-stop, and fallback rules may alter runtime morphology by increasing explicit stopping, clarification requests, or branch-collapse behavior. |
157+
| **Evidence grade** | `exploratory` |
158+
| **Lane** | `direct_prompt_native` (initial); cross-lane comparison later |
159+
| **Denominator** | TBD — requires sessions with identifiable safety-control boundaries |
160+
| **Supporting data** | Literature-informed. No causetrace corpus evidence yet. |
161+
| **Runtime/task caveats** | May only be observable in tasks with explicit safety/review requirements (document processing, financial operations, access control). |
162+
| **Falsification condition** | If sessions with explicit safety boundaries show no difference in AskUserQuestion rate, branch_collapse rate, or chain length compared to matched non-safety tasks, the morphology difference may not be detectable at the tool-call level. |
163+
| **Status** | `active` |
164+
| **Source** | Phase 4 direction proposal; literature on safety-control conflict in agent systems |
165+
166+
### T-SC-002: Task-Completion Pressure May Produce Safety-Control Collapse
167+
168+
| Field | Value |
169+
|-------|-------|
170+
| **Claim** | When task-completion pressure conflicts with safety-control boundaries, agents may exhibit safety-control collapse: continuing toward completion despite uncertainty, missing evidence, or required human review. |
171+
| **Evidence grade** | `exploratory` |
172+
| **Lane** | All lanes (comparative) |
173+
| **Denominator** | TBD — requires identification of safety-control collapse patterns in traces |
174+
| **Supporting data** | Literature-informed. Internal safety-collapse research suggests strong models may collapse under task pressure. |
175+
| **Runtime/task caveats** | Collapse may be task-specific (OCR-heavy, document-signing, data validation) rather than a general agent property. |
176+
| **Falsification condition** | If agents consistently stop at safety boundaries regardless of task pressure, the collapse model is incorrect. |
177+
| **Status** | `active` |
178+
| **Source** | Phase 4 direction proposal; frontier-model safety incident reports |
179+
180+
### T-SC-003: Workflow Intervention May Reduce Unsafe Continuation
181+
182+
| Field | Value |
183+
|-------|-------|
184+
| **Claim** | Workflow interventions such as staged verification, routed constrained prompts, or superpowers-style workflows may reduce unsafe continuation, but may increase event_count and trace length. |
185+
| **Evidence grade** | `exploratory` |
186+
| **Lane** | `superpowers_workflow_intervention` vs `direct_prompt_native` (trend only) |
187+
| **Denominator** | 8 SP sessions, 101 native sessions |
188+
| **Supporting data** | SP sessions show high event density (exploratory observation only). No safety-control signal analysis performed. |
189+
| **Runtime/task caveats** | Single runtime (claude-code). SP sessions not task-annotated. Cross-lane comparison restricted to trend reporting. |
190+
| **Falsification condition** | If SP sessions show same or higher rate of unsafe continuation (per safety-signal annotation) as native sessions, the workflow-intervention-as-safety-guard hypothesis is not supported. |
191+
| **Status** | `active` |
192+
| **Source** | Phase 4 direction proposal; T-WI-001 (exploratory) |
193+
194+
### T-SC-004: Human Intervention as External Safety-Control Signal
195+
196+
| Field | Value |
197+
|-------|-------|
198+
| **Claim** | Human intervention may function as an external safety-control signal that induces topology regime shifts distinguishable from self-correction patterns. |
199+
| **Evidence grade** | `exploratory` |
200+
| **Lane** | `direct_prompt_native` (human_intervention=True sessions) |
201+
| **Denominator** | 5 sessions with human_intervention=True in native lane |
202+
| **Supporting data** | Literature-informed. H-IM-001 and H-IM-002 (Phase 3D Tier 2) hypothesize human intervention as correction trigger and regime-shift inducer. Not validated. |
203+
| **Runtime/task caveats** | Small sample (5). Human intervention may be correlated with task complexity, not safety pressure. |
204+
| **Falsification condition** | If human-intervention sessions show same topology as matched non-intervention sessions, human intervention is not a detectable regime-shift signal at the tool-call level. |
205+
| **Status** | `active` |
206+
| **Source** | Phase 4 direction proposal; H-IM-001, H-IM-002 (Phase 3D Tier 2, deferred); H-EV-005 (Phase 3D Tier 2, deferred) |
207+
208+
### T-SC-005: Near-Failure and Safety-Control Recovery More Informative Than Final Labels
209+
210+
| Field | Value |
211+
|-------|-------|
212+
| **Claim** | Near-failure and safety-control recovery patterns may be more informative than final success/failure labels for understanding agent safety behavior in real-world coding traces. |
213+
| **Evidence grade** | `exploratory` |
214+
| **Lane** | `direct_prompt_native` |
215+
| **Denominator** | 5 near-failure (human_intervention=True), 1 failure (success=False) |
216+
| **Supporting data** | Low failure rate (1/101) despite complex multi-step tasks suggests agents self-repair frequently. The near-failure population (5/101) may contain safety-relevant signals not captured by final labels. |
217+
| **Runtime/task caveats** | Near-failure definition (human_intervention=True) may not capture all safety-relevant near-misses. |
218+
| **Falsification condition** | If near-failure sessions show no detectable difference from clean-success sessions in internal correction patterns, retry density, or safety-signal frequency, the near-failure category may not capture safety-relevant information beyond task difficulty. |
219+
| **Status** | `active` |
220+
| **Source** | Phase 4 direction proposal; Phase 3D Tier 2 deferral observation (failure genuinely rare) |
221+
222+
## Observable Signals (Candidate)
223+
224+
These are candidate annotation or derived-analysis signals. They are NOT proposed as core schema fields yet. Each requires corpus evidence before inclusion in the topology taxonomy.
225+
226+
| Signal | Definition | Current Observability |
227+
|--------|-----------|----------------------|
228+
| `need_review_triggered` | Agent encountered a review gate and either stopped or continued | Not instrumented |
229+
| `hard_stop` | Agent explicitly halted execution (not retry, not fallback) | Partially observable (tool-level stop) |
230+
| `fallback_path` | Agent chose a safe fallback over the primary completion path | Requires analysis |
231+
| `AskUserQuestion` | Agent requested human input before proceeding | Observable (event_type) |
232+
| `human_intervention` | Human provided correction or override | Observable (metadata field) |
233+
| `unsafe_continuation` | Agent completed despite uncertainty, missing evidence, or skipped review | Requires annotation |
234+
| `retry_after_uncertainty` | Agent retried a tool after expressing uncertainty | Requires analysis |
235+
| `branch_after_failed_evidence` | Agent branched exploration after tool-level evidence failure | Requires analysis |
236+
| `rollback_after_test_failure` | Agent rolled back a change after test failure | Observable (tool sequence) |
237+
| `late_stage_correction` | Correction occurred deep in the causal chain (high depth from root) | Observable (causal depth) |
238+
| `manual_rescue` | Human intervention prevented an otherwise-likely failure | Requires annotation |
239+
240+
## Corpus Requirements
241+
242+
Before any T-SC candidate can move from `exploratory` to `supported_with_caveat`:
243+
244+
1. **Task-type coverage**: safety-relevant task types must be present (document processing, data validation, access control, financial operations)
245+
2. **Safety-signal annotation**: at minimum, `need_review_triggered` and `unsafe_continuation` must be annotatable on a session subset
246+
3. **Lane diversity**: comparison requires tagged sessions in >=2 intervention lanes
247+
4. **Denominator**: minimum 10 sessions per condition for exploratory comparison
248+
249+
None of these requirements are currently met. This direction starts from zero corpus evidence.
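
The four requirements could be checked mechanically before any promotion is considered. The sketch below is one hedged reading of them: it treats each lane as a "condition" and counts only safety-tagged sessions, both of which are assumptions, and `CorpusSummary` and `unmet_requirements` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

REQUIRED_SIGNALS = {"need_review_triggered", "unsafe_continuation"}
MIN_LANES = 2
MIN_SESSIONS_PER_CONDITION = 10


@dataclass
class CorpusSummary:
    # Safety-relevant task types present in the corpus.
    safety_task_types: Set[str] = field(default_factory=set)
    # Signals that can be annotated on a session subset.
    annotatable_signals: Set[str] = field(default_factory=set)
    # Lane -> number of safety-tagged sessions (assumed meaning of "condition").
    sessions_per_lane: Dict[str, int] = field(default_factory=dict)


def unmet_requirements(corpus: CorpusSummary) -> List[str]:
    """Names of the corpus requirements that are not yet satisfied."""
    unmet = []
    if not corpus.safety_task_types:
        unmet.append("task-type coverage")
    if not REQUIRED_SIGNALS <= corpus.annotatable_signals:
        unmet.append("safety-signal annotation")
    lanes = [lane for lane, n in corpus.sessions_per_lane.items() if n > 0]
    if len(lanes) < MIN_LANES:
        unmet.append("lane diversity")
    if not lanes or any(
        corpus.sessions_per_lane[lane] < MIN_SESSIONS_PER_CONDITION for lane in lanes
    ):
        unmet.append("denominator")
    return unmet
```

An empty `CorpusSummary()` reports all four requirements as unmet, which matches the starting state described above.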
250+
251+
## Non-Goals (Repeated for Emphasis)
252+
253+
- Do NOT implement jailbreak reproduction.
254+
- Do NOT provide attack guidance or exploit documentation.
255+
- Do NOT build a content safety classifier.
256+
- Do NOT create a universal model safety benchmark.
257+
- Do NOT change causetrace core schema or topology taxonomy.
258+
- Do NOT implement prediction, anomaly detection, or auto-diagnosis.
259+
- Do NOT promote any T-SC candidate beyond `exploratory` without corpus evidence.
260+
- Do NOT claim causetrace can detect or prevent safety incidents.
261+
- Do NOT merge safety-control morphology into native baseline without lane disclosure.
262+
263+
## Relationship to Other Phase 4 Candidates
264+
265+
| Domain | Candidates | Safety-Control Intersection |
266+
|--------|-----------|---------------------------|
267+
| Default morphology | T-RM-001, T-RM-002, T-RM-003 | Do safety boundaries change default topology? |
268+
| Exploration morphology | (not yet drafted) | Is safety-boundary exploration different from task exploration? |
269+
| Failure morphology | T-FM-001 | Is near-failure safety-relevant? |
270+
| Intervention morphology | T-WI-001, T-RP-001, T-PM-001 | Do interventions reduce safety-control collapse? |
271+
| **Safety-control morphology** | T-SC-001 through T-SC-005 | This document |
272+
273+
## References
274+
275+
- Phase 3D Hypothesis Registry (H-IM-001, H-IM-002, H-EV-004, H-EV-005 — deferred)
276+
- Phase 3E Lane Baseline (human_intervention rate: 5/101 native)
277+
- Phase 3E Closure Report (Tier 2 deferral: failure genuinely rare)
278+
- Phase 4 Theory Candidate Inventory (T-FM-001, T-WI-001)
279+
- Internal safety-collapse research (frontier-model safety incident reports)
280+
- Automatic-signature domain observations (OCR reliability, template matching, false_positive_tables)
