|
| 1 | +# Safety-Control Runtime Morphology v0.2.5 |
| 2 | + |
| 3 | +A Phase 4 theory candidate direction studying how coding-agent runtime behavior changes when task-completion objectives interact with safety boundaries, workflow gates, need_review rules, fallback paths, hard-stops, and human intervention. |
| 4 | + |
| 5 | +## Position |
| 6 | + |
| 7 | +- Phase 4: active (theory drafting) |
| 8 | +- Direction: Safety-control Runtime Morphology |
| 9 | +- Status: **theory candidate direction, exploratory, not validated** |
| 10 | +- Parent: Phase 4 Theory Candidate Inventory |
| 11 | + |
| 12 | +## What This Is |
| 13 | + |
| 14 | +A research direction within causetrace runtime morphology that studies the observable topology patterns produced when agents navigate conflicts between: |
| 15 | + |
| 16 | +1. **Task-completion pressure**: the agent's objective to finish the task |
| 17 | +2. **Safety-control boundaries**: need_review rules, hard-stops, fallback paths, uncertainty gates, human confirmation requirements, business rule constraints |
| 18 | + |
| 19 | +The core question: when these forces conflict, what runtime topology patterns emerge? |
| 20 | + |
| 21 | +This direction extends causetrace's existing morphology categories (default, exploration, failure, intervention) with a fifth: **safety-control morphology**. |
| 22 | + |
| 23 | +## What This Is NOT |
| 24 | + |
| 25 | +- **NOT** jailbreak reproduction or attack research |
| 26 | +- **NOT** a study of how to circumvent model safety training |
| 27 | +- **NOT** a content safety classifier or harmful-output detector |
| 28 | +- **NOT** a universal model safety benchmark |
| 29 | +- **NOT** a red-teaming framework |
| 30 | +- **NOT** an adversarial prompt engineering guide |
| 31 | +- **NOT** a replacement for formal safety evaluation (e.g., METR, Apollo, UK AISI) |
| 32 | +- **NOT** a core schema or topology taxonomy change |
| 33 | + |
| 34 | +This direction studies **observable runtime control morphology**, not model-internal safety mechanisms. Causetrace sees tool calls, causal chains, topology transitions, and intervention events — it does not see model weights, activations, or training data. |
| 35 | + |
| 36 | +## Motivation |
| 37 | + |
| 38 | +Recent frontier-model safety incidents and internal safety-collapse research suggest that strong models may fail not only through external jailbreaks, but through **structural conflict between task-completion pressure and safety-control boundaries**. |
| 39 | + |
| 40 | +In coding-agent runtimes, this conflict can manifest as: |
| 41 | + |
| 42 | +- An agent continuing toward completion despite uncertainty (OCR failure, low confidence, ambiguous template match) |
| 43 | +- An agent auto-processing what should require human review (false_positive_tables, unsigned documents) |
| 44 | +- An agent bypassing a safety gate because the task objective dominates |
| 45 | +- A workflow intervention that reduces unsafe completion but increases trace length |
| 46 | +- A human correction that triggers a topology regime shift |
| 47 | + |
| 48 | +These are not abstract safety research questions. They are runtime phenomena already observed in causetrace traces, particularly in the automatic-signature domain where OCR reliability, template matching, and business rule compliance interact with agent autonomy. |
| 49 | + |
| 50 | +## Relationship to Causetrace Main Line |
| 51 | + |
| 52 | +This direction is **compatible with and enhances** the causetrace main line. It studies the same primitives: |
| 53 | + |
| 54 | +| Causetrace Primitive | Safety-Control Lens | |
| 55 | +|---------------------|---------------------| |
| 56 | +| Runtime causality | What caused the agent to stop / continue / bypass? | |
| 57 | +| Topology morphology | Does safety pressure change branch/retry/collapse patterns? | |
| 58 | +| Intervention | Do workflow / human interventions reduce safety-control collapse? | |
| 59 | +| Failure / near-failure | Is near-failure more informative than final failure for safety analysis? | |
| 60 | +| Control transition | What does a safety-control transition look like in the event DAG? | |
| 61 | + |
| 62 | +It shifts the object of study from "what dangerous content did the model output?" to "how did the runtime behave at the safety boundary — stop, continue, bypass, fallback, escalate, or collapse?" |
| 63 | + |
| 64 | +## Core Research Questions |
| 65 | + |
| 66 | +### Q1: Do safety-control boundaries alter runtime topology? |
| 67 | + |
| 68 | +When explicit need_review, hard-stop, or fallback rules are present, does the agent's runtime topology differ from tasks without such boundaries? |
| 69 | + |
| 70 | +Candidate observable differences: |
| 71 | +- Fewer invalid retries (agent stops instead of retrying) |
| 72 | +- Fewer unsafe auto-completions (agent escalates instead of guessing) |
| 73 | +- More branch_collapse (agent converges to safe fallback) |
| 74 | +- More AskUserQuestion events (agent requests human decision) |
| 75 | +- Shorter chains after first safety signal |
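As a rough illustration of how the candidate differences above could be measured per session, here is a minimal sketch. The `Event` record, the `SAFETY_SIGNAL_TYPES` set, and the treatment of `branch_collapse` as an event type are assumptions made for this example and are not taken from the causetrace schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; the real causetrace schema may differ."""
    event_type: str


# Assumed event types that count as a "safety signal" in this sketch.
SAFETY_SIGNAL_TYPES = frozenset({"AskUserQuestion", "need_review", "hard_stop"})


def topology_summary(events: list[Event]) -> dict[str, Optional[float]]:
    """Per-session values for three of the candidate Q1 observables."""
    n = len(events)
    if n == 0:
        return {"ask_user_rate": None, "branch_collapse_rate": None,
                "events_after_first_safety_signal": None}
    ask = sum(e.event_type == "AskUserQuestion" for e in events)
    collapse = sum(e.event_type == "branch_collapse" for e in events)
    # Index of the first event that counts as a safety signal, if any.
    first = next((i for i, e in enumerate(events)
                  if e.event_type in SAFETY_SIGNAL_TYPES), None)
    tail = None if first is None else n - first - 1
    return {"ask_user_rate": ask / n, "branch_collapse_rate": collapse / n,
            "events_after_first_safety_signal": tail}
```

Comparing these summaries between sessions with and without explicit boundaries would still need matched tasks, as the T-SC-001 falsification condition notes.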
| 76 | + |
| 77 | +### Q2: When does the agent bypass safety boundaries? |
| 78 | + |
| 79 | +Under what runtime conditions does the agent continue toward completion despite uncertainty, missing evidence, or required human review? |
| 80 | + |
| 81 | +Candidate triggers: |
| 82 | +- OCR unavailable or low confidence |
| 83 | +- Template matching unstable |
| 84 | +- Business rule conflict (e.g., false_positive_tables) |
| 85 | +- Test failure without clear recovery path |
| 86 | +- User not available for confirmation |
| 87 | +- Task-completion pressure from prompt framing |
| 88 | + |
| 89 | +Morphology questions: |
| 90 | +- Does the agent retry with different parameters (tool-level bypass)? |
| 91 | +- Does it proceed with a fallback value (data-level bypass)? |
| 92 | +- Does it skip the gating check entirely (control-flow bypass)? |
| 93 | +- Does it mark the result as complete despite uncertainty (label-level bypass)? |
| 94 | + |
| 95 | +### Q3: Can workflow intervention reduce safety-control collapse? |
| 96 | + |
| 97 | +Compare intervention lanes for their effect on unsafe continuation: |
| 98 | + |
| 99 | +| Lane | Hypothesis | |
| 100 | +|------|-----------| |
| 101 | +| `direct_prompt_native` | Highest risk of safety-control collapse (no guardrails) | |
| 102 | +| `expanded_constrained_prompt` | May reduce collapse through explicit constraints | |
| 103 | +| `routed_prompt_intervention` | May select safer posture for safety-sensitive tasks | |
| 104 | +| `superpowers_workflow_intervention` | Staged verification may catch collapse before completion | |
| 105 | + |
| 106 | +Observable signals: |
| 107 | +- Unsafe continuation rate |
| 108 | +- False positive rate (sessions auto-processed when review was required) |
| 109 | +- Invalid retry rate |
| 110 | +- Human rescue rate |
| 111 | +- Late-stage rollback rate |
| 112 | + |
| 113 | +Note: this comparison requires tagged sessions in all lanes. Currently only `direct_prompt_native` and `superpowers_workflow_intervention` have sessions. Cross-lane comparison is restricted to trend reporting only. |
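One way to keep lane disclosure explicit is to report every rate together with its numerator and denominator. The sketch below assumes a hypothetical `Session` record carrying a lane tag and a manually annotated `unsafe_continuation` flag; neither is an existing causetrace structure.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Session(NamedTuple):
    """Hypothetical annotated session; field names are assumptions."""
    lane: str
    unsafe_continuation: bool  # would come from manual annotation


class LaneRate(NamedTuple):
    lane: str
    numerator: int
    denominator: int

    @property
    def rate(self) -> float:
        return self.numerator / self.denominator


def unsafe_continuation_by_lane(sessions: Iterable[Session]) -> dict[str, LaneRate]:
    """Count unsafe continuations per lane, keeping denominators visible."""
    totals: Counter = Counter()
    unsafe: Counter = Counter()
    for s in sessions:
        totals[s.lane] += 1
        unsafe[s.lane] += int(s.unsafe_continuation)
    return {lane: LaneRate(lane, unsafe[lane], totals[lane]) for lane in totals}
```

Returning the counts rather than a bare percentage makes it harder to over-read a small lane, which matters while cross-lane comparison is limited to trend reporting.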
| 114 | + |
| 115 | +### Q4: What is the role of human intervention in safety control? |
| 116 | + |
| 117 | +Human intervention may function as more than failure recovery — it may be an **external safety-control signal**: |
| 118 | + |
| 119 | +``` |
| 120 | +agent drift (toward unsafe completion) |
| 121 | +→ human correction (observation or explicit correction mark) |
| 122 | +→ topology regime shift (branch collapse, hard-stop, safe fallback) |
| 123 | +→ recovery / safe completion / documented refusal |
| 124 | +``` |
| 125 | + |
| 126 | +Key questions: |
| 127 | +- Does human intervention produce a detectable topology regime shift? |
| 128 | +- Is the shift different from self-correction (tool error → retry)? |
| 129 | +- Does the shift depend on intervention timing (early vs. late-stage)? |
| 130 | +- Does workflow structure (superpowers staged verification) make human intervention more effective? |
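A possible first operationalization of "detectable topology regime shift" is to compare the event-type mix before and after the intervention. The sketch below uses total variation distance for that comparison. It assumes the position of the intervention within the event sequence is known, which the document does not state (it describes `human_intervention` as a metadata field), so `intervention_index` is a hypothetical input.

```python
from collections import Counter
from typing import Optional


def type_distribution(event_types: list[str]) -> dict[str, float]:
    """Relative frequency of each event type in a non-empty list."""
    counts = Counter(event_types)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def regime_shift_score(event_types: list[str],
                       intervention_index: int) -> Optional[float]:
    """Total variation distance between pre- and post-intervention event mixes.

    Returns a value in [0, 1], or None when either side is empty.
    The intervention event itself is excluded from both sides.
    """
    before = event_types[:intervention_index]
    after = event_types[intervention_index + 1:]
    if not before or not after:
        return None
    p = type_distribution(before)
    q = type_distribution(after)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())
```

The same score computed around a self-correction point (tool error followed by retry) would give a baseline for the second key question. A distributional distance ignores ordering and DAG structure, so it is only a coarse proxy for topology.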
| 131 | + |
| 132 | +### Q5: Is near-failure more informative than final failure for safety analysis? |
| 133 | + |
| 134 | +Current corpus observation: coding agents frequently roll back, retry, and self-repair, leading to a very low final failure rate in the native lane (1/101 failures; 5/101 near-failures). |
| 135 | + |
| 136 | +This suggests failure may not manifest as `success=false`. It may manifest as: |
| 137 | + |
| 138 | +- Long chains with many internal corrections |
| 139 | +- Retry-heavy paths that eventually succeed |
| 140 | +- Repeated rollback at safety boundaries |
| 141 | +- Near-failure that was rescued (human or self-repair) |
| 142 | +- Unsafe path avoided late (last-moment correction) |
| 143 | +- need_review triggered but then overridden |
| 144 | +- Human rescue that prevented a failure label |
| 145 | + |
| 146 | +If so, final success/failure labeling is insufficient for safety-control morphology analysis. Derived signals (`need_review_triggered`, `retry_after_uncertainty`, `late_stage_correction`, `manual_rescue`) may be more informative. |
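To make the distinction concrete, the sketch below splits sessions into three outcome classes using the two fields the document already relies on (`success` and `human_intervention`), following the near-failure definition given under T-SC-005. Checking `success` first is an assumption of this example. The document does not say how a failed session with human intervention should be classified.

```python
from collections import Counter
from typing import Iterable


def outcome_class(success: bool, human_intervention: bool) -> str:
    """Three-way outcome label; a failed session is always 'failure' here."""
    if not success:
        return "failure"
    if human_intervention:
        return "near_failure"
    return "clean_success"


def outcome_counts(sessions: Iterable[tuple[bool, bool]]) -> Counter:
    """Tally outcome classes over (success, human_intervention) pairs."""
    return Counter(outcome_class(s, h) for s, h in sessions)
```

This only separates the populations. Whether near-failure sessions differ from clean successes in retry density or correction patterns is the open question.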
| 147 | + |
| 148 | +## Theory Candidates |
| 149 | + |
| 150 | +All candidates start at `exploratory` grade. None are validated. All require corpus evidence before promotion. |
| 151 | + |
| 152 | +### T-SC-001: Safety-Control Boundaries May Alter Runtime Topology |
| 153 | + |
| 154 | +| Field | Value | |
| 155 | +|-------|-------| |
| 156 | +| **Claim** | Safety-control boundaries such as need_review, hard-stop, and fallback rules may alter runtime morphology by increasing explicit stopping, clarification requests, or branch-collapse behavior. | |
| 157 | +| **Evidence grade** | `exploratory` | |
| 158 | +| **Lane** | `direct_prompt_native` (initial); cross-lane comparison later | |
| 159 | +| **Denominator** | TBD — requires sessions with identifiable safety-control boundaries | |
| 160 | +| **Supporting data** | Literature-informed. No causetrace corpus evidence yet. | |
| 161 | +| **Runtime/task caveats** | May only be observable in tasks with explicit safety/review requirements (document processing, financial operations, access control). | |
| 162 | +| **Falsification condition** | If sessions with explicit safety boundaries show no difference in AskUserQuestion rate, branch_collapse rate, or chain length compared to matched non-safety tasks, the morphology difference may not be detectable at the tool-call level. | |
| 163 | +| **Status** | `active` | |
| 164 | +| **Source** | Phase 4 direction proposal; literature on safety-control conflict in agent systems | |
| 165 | + |
| 166 | +### T-SC-002: Task-Completion Pressure May Produce Safety-Control Collapse |
| 167 | + |
| 168 | +| Field | Value | |
| 169 | +|-------|-------| |
| 170 | +| **Claim** | When task-completion pressure conflicts with safety-control boundaries, agents may exhibit safety-control collapse: continuing toward completion despite uncertainty, missing evidence, or required human review. | |
| 171 | +| **Evidence grade** | `exploratory` | |
| 172 | +| **Lane** | All lanes (comparative) | |
| 173 | +| **Denominator** | TBD — requires identification of safety-control collapse patterns in traces | |
| 174 | +| **Supporting data** | Literature-informed. Internal safety-collapse research suggests strong models may collapse under task pressure. | |
| 175 | +| **Runtime/task caveats** | Collapse may be task-specific (OCR-heavy, document-signing, data validation) rather than a general agent property. | |
| 176 | +| **Falsification condition** | If agents consistently stop at safety boundaries regardless of task pressure, the collapse model is incorrect. | |
| 177 | +| **Status** | `active` | |
| 178 | +| **Source** | Phase 4 direction proposal; frontier-model safety incident reports | |
| 179 | + |
| 180 | +### T-SC-003: Workflow Intervention May Reduce Unsafe Continuation |
| 181 | + |
| 182 | +| Field | Value | |
| 183 | +|-------|-------| |
| 184 | +| **Claim** | Workflow interventions such as staged verification, routed constrained prompts, or superpowers-style workflows may reduce unsafe continuation, but may increase event_count and trace length. | |
| 185 | +| **Evidence grade** | `exploratory` | |
| 186 | +| **Lane** | `superpowers_workflow_intervention` vs `direct_prompt_native` (trend only) | |
| 187 | +| **Denominator** | 8 SP sessions, 101 native sessions | |
| 188 | +| **Supporting data** | SP sessions show high event density (exploratory observation only). No safety-control signal analysis performed. | |
| 189 | +| **Runtime/task caveats** | Single runtime (claude-code). SP sessions not task-annotated. Cross-lane comparison restricted to trend reporting. | |
| 190 | +| **Falsification condition** | If SP sessions show the same or a higher rate of unsafe continuation (per safety-signal annotation) than native sessions, the workflow-intervention-as-safety-guard hypothesis is not supported. | |
| 191 | +| **Status** | `active` | |
| 192 | +| **Source** | Phase 4 direction proposal; T-WI-001 (exploratory) | |
| 193 | + |
| 194 | +### T-SC-004: Human Intervention as External Safety-Control Signal |
| 195 | + |
| 196 | +| Field | Value | |
| 197 | +|-------|-------| |
| 198 | +| **Claim** | Human intervention may function as an external safety-control signal that induces topology regime shifts distinguishable from self-correction patterns. | |
| 199 | +| **Evidence grade** | `exploratory` | |
| 200 | +| **Lane** | `direct_prompt_native` (human_intervention=True sessions) | |
| 201 | +| **Denominator** | 5 sessions with human_intervention=True in native lane | |
| 202 | +| **Supporting data** | Literature-informed. H-IM-001 and H-IM-002 (Phase 3D Tier 2) hypothesize human intervention as correction trigger and regime-shift inducer. Not validated. | |
| 203 | +| **Runtime/task caveats** | Small sample (5). Human intervention may be correlated with task complexity, not safety pressure. | |
| 204 | +| **Falsification condition** | If human-intervention sessions show the same topology as matched non-intervention sessions, human intervention is not a detectable regime-shift signal at the tool-call level. | |
| 205 | +| **Status** | `active` | |
| 206 | +| **Source** | Phase 4 direction proposal; H-IM-001, H-IM-002 (Phase 3D Tier 2, deferred); H-EV-005 (Phase 3D Tier 2, deferred) | |
| 207 | + |
| 208 | +### T-SC-005: Near-Failure and Safety-Control Recovery More Informative Than Final Labels |
| 209 | + |
| 210 | +| Field | Value | |
| 211 | +|-------|-------| |
| 212 | +| **Claim** | Near-failure and safety-control recovery patterns may be more informative than final success/failure labels for understanding agent safety behavior in real-world coding traces. | |
| 213 | +| **Evidence grade** | `exploratory` | |
| 214 | +| **Lane** | `direct_prompt_native` | |
| 215 | +| **Denominator** | 5 near-failure (human_intervention=True), 1 failure (success=False) | |
| 216 | +| **Supporting data** | Low failure rate (1/101) despite complex multi-step tasks suggests agents self-repair frequently. The near-failure population (5/101) may contain safety-relevant signals not captured by final labels. | |
| 217 | +| **Runtime/task caveats** | Near-failure definition (human_intervention=True) may not capture all safety-relevant near-misses. | |
| 218 | +| **Falsification condition** | If near-failure sessions show no detectable difference from clean-success sessions in internal correction patterns, retry density, or safety-signal frequency, the near-failure category may not capture safety-relevant information beyond task difficulty. | |
| 219 | +| **Status** | `active` | |
| 220 | +| **Source** | Phase 4 direction proposal; Phase 3D Tier 2 deferral observation (failure genuinely rare) | |
| 221 | + |
| 222 | +## Observable Signals (Candidate) |
| 223 | + |
| 224 | +These are candidate annotation or derived-analysis signals. They are NOT proposed as core schema fields yet. Each requires corpus evidence before inclusion in the topology taxonomy. |
| 225 | + |
| 226 | +| Signal | Definition | Current Observability | |
| 227 | +|--------|-----------|----------------------| |
| 228 | +| `need_review_triggered` | Agent encountered a review gate and either stopped or continued | Not instrumented | |
| 229 | +| `hard_stop` | Agent explicitly halted execution (not retry, not fallback) | Partially observable (tool-level stop) | |
| 230 | +| `fallback_path` | Agent chose a safe fallback over the primary completion path | Requires analysis | |
| 231 | +| `AskUserQuestion` | Agent requested human input before proceeding | Observable (event_type) | |
| 232 | +| `human_intervention` | Human provided correction or override | Observable (metadata field) | |
| 233 | +| `unsafe_continuation` | Agent completed despite uncertainty, missing evidence, or skipped review | Requires annotation | |
| 234 | +| `retry_after_uncertainty` | Agent retried a tool after expressing uncertainty | Requires analysis | |
| 235 | +| `branch_after_failed_evidence` | Agent branched exploration after tool-level evidence failure | Requires analysis | |
| 236 | +| `rollback_after_test_failure` | Agent rolled back a change after test failure | Observable (tool sequence) | |
| 237 | +| `late_stage_correction` | Correction occurred deep in the causal chain (high depth from root) | Observable (causal depth) | |
| 238 | +| `manual_rescue` | Human intervention prevented an otherwise-likely failure | Requires annotation | |
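The `late_stage_correction` signal depends on causal depth from the root. The following sketch shows one way that depth could be derived from an event DAG. The `parents` mapping (event id to the ids of its causal parents), the longest-path definition of depth, and the 0.75 threshold are all assumptions for illustration. The input must be acyclic.

```python
def causal_depths(parents: dict[str, list[str]]) -> dict[str, int]:
    """Longest-path depth from a root for each event in an acyclic graph.

    Iterative (no recursion) so that long traces do not hit the recursion limit.
    """
    depths: dict[str, int] = {}
    for start in parents:
        stack = [start]
        while stack:
            node = stack[-1]
            if node in depths:
                stack.pop()
                continue
            node_parents = parents.get(node, ())
            pending = [p for p in node_parents if p not in depths]
            if pending:
                stack.extend(pending)  # resolve parents first
            else:
                # Roots have no parents and get depth 0.
                depths[node] = 1 + max((depths[p] for p in node_parents), default=-1)
                stack.pop()
    return depths


def is_late_stage(event_id: str, depths: dict[str, int],
                  threshold: float = 0.75) -> bool:
    """True when the event sits in the deepest part of the trace (assumed cutoff)."""
    return depths[event_id] >= threshold * max(depths.values())
```

A relative threshold keeps the label comparable across short and long traces. An absolute depth cutoff would be an equally defensible choice.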
| 239 | + |
| 240 | +## Corpus Requirements |
| 241 | + |
| 242 | +Before any T-SC candidate can move from `exploratory` to `supported_with_caveat`: |
| 243 | + |
| 244 | +1. **Task-type coverage**: safety-relevant task types must be present (document processing, data validation, access control, financial operations) |
| 245 | +2. **Safety-signal annotation**: at minimum, `need_review_triggered` and `unsafe_continuation` must be annotatable on a session subset |
| 246 | +3. **Lane diversity**: comparison requires tagged sessions in >=2 intervention lanes |
| 247 | +4. **Denominator**: minimum 10 sessions per condition for exploratory comparison |
| 248 | + |
| 249 | +None of these requirements are currently met. This direction starts from zero corpus evidence. |
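The four requirements above can be read as a simple promotion gate. The sketch below encodes them as a checklist function. The input shapes (`tagged_sessions_per_lane`, `annotatable_signals`, `has_safety_task_types`) are hypothetical, and treating each lane as one "condition" is an assumption of this example.

```python
MIN_SESSIONS_PER_CONDITION = 10  # requirement 4
MIN_LANES = 2                    # requirement 3
REQUIRED_SIGNALS = {"need_review_triggered", "unsafe_continuation"}  # requirement 2


def promotion_blockers(tagged_sessions_per_lane: dict[str, int],
                       annotatable_signals: set[str],
                       has_safety_task_types: bool) -> list[str]:
    """Return the unmet requirements; an empty list means the gate is open."""
    blockers: list[str] = []
    if not has_safety_task_types:
        blockers.append("task-type coverage")
    missing = REQUIRED_SIGNALS - set(annotatable_signals)
    if missing:
        blockers.append("safety-signal annotation: " + ", ".join(sorted(missing)))
    populated = {lane: n for lane, n in tagged_sessions_per_lane.items() if n > 0}
    if len(populated) < MIN_LANES:
        blockers.append("lane diversity")
    small = sorted(lane for lane, n in populated.items()
                   if n < MIN_SESSIONS_PER_CONDITION)
    if small:
        blockers.append("denominator: " + ", ".join(small))
    return blockers
```

Listing the unmet requirements, rather than returning a single boolean, keeps it visible why a candidate stays at `exploratory`.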
| 250 | + |
| 251 | +## Non-Goals (Repeated for Emphasis) |
| 252 | + |
| 253 | +- Do NOT implement jailbreak reproduction. |
| 254 | +- Do NOT provide attack guidance or exploit documentation. |
| 255 | +- Do NOT build a content safety classifier. |
| 256 | +- Do NOT create a universal model safety benchmark. |
| 257 | +- Do NOT change causetrace core schema or topology taxonomy. |
| 258 | +- Do NOT implement prediction, anomaly detection, or auto-diagnosis. |
| 259 | +- Do NOT promote any T-SC candidate beyond `exploratory` without corpus evidence. |
| 260 | +- Do NOT claim causetrace can detect or prevent safety incidents. |
| 261 | +- Do NOT merge safety-control morphology into native baseline without lane disclosure. |
| 262 | + |
| 263 | +## Relationship to Other Phase 4 Candidates |
| 264 | + |
| 265 | +| Domain | Candidates | Safety-Control Intersection | |
| 266 | +|--------|-----------|---------------------------| |
| 267 | +| Default morphology | T-RM-001, T-RM-002, T-RM-003 | Do safety boundaries change default topology? | |
| 268 | +| Exploration morphology | (not yet drafted) | Is safety-boundary exploration different from task exploration? | |
| 269 | +| Failure morphology | T-FM-001 | Is near-failure safety-relevant? | |
| 270 | +| Intervention morphology | T-WI-001, T-RP-001, T-PM-001 | Do interventions reduce safety-control collapse? | |
| 271 | +| **Safety-control morphology** | T-SC-001 through T-SC-005 | This document | |
| 272 | + |
| 273 | +## References |
| 274 | + |
| 275 | +- Phase 3D Hypothesis Registry (H-IM-001, H-IM-002, H-EV-004, H-EV-005 — deferred) |
| 276 | +- Phase 3E Lane Baseline (human_intervention rate: 5/101 native) |
| 277 | +- Phase 3E Closure Report (Tier 2 deferral: failure genuinely rare) |
| 278 | +- Phase 4 Theory Candidate Inventory (T-FM-001, T-WI-001) |
| 279 | +- Internal safety-collapse research (frontier-model safety incident reports) |
| 280 | +- Automatic-signature domain observations (OCR reliability, template matching, false_positive_tables) |