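@title: 🛡️ SAFETY — interpretability probe perpetual discovery lane ("the circuit/SAE frontier that never stops") @goal: A lane that perpetually discovers, on the SANDBOX substrate, the alignment, interpretability, refusal accuracy, and welfare of deployed models through preregistered falsifiable probes. v1.4.0 (refusal-direction AUROC 0.98 + causal ablation 95%→0% + SAE half 🔴 scale-bounded honest-negative) only closes the first arc: larger compute, new model families, and new behavior axes reopen frontiers that had been closed. No termination condition · a progress bar that never reaches 100% is by design ([[feedback_closure_is_physical_limit]]).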
Domain doc · dancinlab
`domain-meta-domain` principle (per-topic roadmap as root `UPPERCASE.md`). One of the 4 orthogonal groups of the hexa-codex 17-verb AI knowledge substrate. Current-state spec only; dated history → `SAFETY.log.md`. Falsifier class: interpretability probes — circuit motifs, SAE features, alignment-axis aggregation, refusal matrices.
The SAFETY group is the 6-verb safety surface of the codex: can a deployed model be shown — by a preregistered, falsifiable probe — to be aligned, interpretable, refusal-correct, and welfare-audited? Every verb is a closed-form candidate spec + falsifier preregister; the codex is read, not run. Empirical landing is release-laddered (safety goes first, v1.1.0).
| Verb | Spec | Role |
|---|---|---|
| alignment | alignment/ai-alignment.md | HELM-12-axis alignment-score aggregator — owns F-CODEX-3 |
| interpret | interpret/ai-interpretability.md | SAE motif count = σ−φ = 10 — owns F-CODEX-4 |
| safety | safety/ai-safety.md | refusal-matrix + capability-gate spec |
| welfare | welfare/ai-welfare.md | model-welfare probe protocol |
| adversarial | adversarial/ai-adversarial.md | red-team failure-mode taxonomy |
| consciousness | consciousness/ai-consciousness.md | IIT × GWT probe (BT-19 falsifier-in-action) |
- F-CODEX-3 — `alignment_score = mean over 12 axes` (HELM-comparable). Arithmetic floor PASS at v1.0.0; empirical landing v1.1.0.
- F-CODEX-4 — `interpret_motifs = σ(6) − φ(6) = 10` (Anthropic dictionary-learning comparable). Arithmetic floor PASS; empirical landing v2.0.0.
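The F-CODEX-3 aggregate can be sketched in a few lines. This is a minimal illustration, assuming each axis score is already normalized to a common scale; the function name `alignment_score`, the constant `N_AXES`, and the axis keys are hypothetical and are not taken from `verify/falsifier_check.py`.

```python
from statistics import fmean

N_AXES = 12  # sigma(6) = 12 axes, per the spec

def alignment_score(axis_scores: dict[str, float]) -> float:
    """Unweighted mean over exactly 12 axis scores (assumed pre-normalized)."""
    if len(axis_scores) != N_AXES:
        raise ValueError(f"expected {N_AXES} axes, got {len(axis_scores)}")
    return fmean(axis_scores.values())

# Placeholder inputs for illustration only; these are not measured scores.
example = {f"axis_{i:02d}": 0.5 for i in range(N_AXES)}
print(alignment_score(example))
```

Rejecting any input that does not carry exactly 12 axes keeps the aggregate comparable across runs: a missing axis raises an error instead of silently shifting the mean.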
This group is one of the τ(6) = 4 quadrants of the codex taxonomy.
- σ(6) = 12 → the 12 HELM capability axes `alignment` aggregates.
- φ(6) = 2 → the helpful / harmless verdict bit `safety` gates on.
- σ − φ = 10 → the interpretability circuit-motif count (`interpret`).
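The arithmetic behind these three identities is easy to check directly. The sketch below recomputes τ(6), σ(6), and φ(6) from their definitions (divisor count, divisor sum, Euler totient). It is an independent illustration, not the contents of `verify/n6_arithmetic.py`, and the helper names are assumptions.

```python
from math import gcd

def divisors(n: int) -> list[int]:
    """All positive divisors of n (brute force; fine for small n)."""
    return [d for d in range(1, n + 1) if n % d == 0]

def tau(n: int) -> int:
    """Number of divisors."""
    return len(divisors(n))

def sigma(n: int) -> int:
    """Sum of divisors."""
    return sum(divisors(n))

def phi(n: int) -> int:
    """Euler totient: count of 1 <= k <= n coprime to n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

# Arithmetic floors referenced by the taxonomy.
assert tau(6) == 4               # quadrants of the codex taxonomy
assert sigma(6) == 12            # HELM capability axes
assert phi(6) == 2               # helpful / harmless verdict bit
assert sigma(6) - phi(6) == 10   # interpretability motif count
```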
Spec-first: all 6 verbs ship a written candidate + falsifier preregister.
0 verbs wired (write-side sandbox), 0 eval pipelines. F-CODEX-3 /
F-CODEX-4 arithmetic floors PASS (verify/falsifier_check.py);
empirical floors PENDING.
- 2 verbs wired · 1 eval pipeline (alignment + interpretability).
- F-CODEX-3 empirical landing — HELM-Core composite parity.
- DoD (`.roadmap.hexa_codex` §0): safety group alignment + interpretability eval pipeline.hexa.
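SAFETY is never complete. The v1.4.0 refusal direction (correlational AUROC 0.98 + causal ablation) closes only the first arc, for a single behavior (refusal); the interpretability frontier reopens whenever a new behavior, a new model, or a larger compute tier appears. Each axis advances perpetually via `/cycle` on SANDBOX, the only surface that exposes activations (`cx_empirical_contact` · `cx_hf_safety_private`).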
The M5.SAFETY SAE half closed 🔴 negative, but that result is scale-bounded (~10 tok/feature). It reopens when the compute tier rises.
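- A1 — Retrain an SAE of adequate width on a large corpus (≥ millions of token-activations) → does the L19 refusal direction decompose into monosemantic features? Falsifier: max|cos(feature, r̂)| stays < 0.2 even as scale increases → the distributed-representation invariant is confirmed.
- B1 — Does the L19 difference-of-means direction transfer to Llama/Mistral/Gemma (cross-model AUROC)? Falsifier: AUROC ≈ 0.5 on other model families → the direction is Qwen-specific.
- C1 — Activation-space direction probes for deception · sycophancy · jailbreak-susceptibility (same protocol as refusal). Falsifier: per-behavior LOO held-out accuracy ≤ majority baseline.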
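The quantities named in the A1 and B1 falsifiers can be sketched on synthetic data. The block below computes a difference-of-means direction r̂, scores it with a pairwise (Mann-Whitney) AUROC, and evaluates max|cos(feature, r̂)| against a set of candidate feature vectors. Everything here is an assumption for illustration: the data is random, the "features" are standard basis vectors standing in for SAE decoder rows, and the AUROC is in-sample, which the real probes would need to replace with held-out evaluation.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def diff_of_means_direction(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    mp, mn = mean_vec(pos), mean_vec(neg)
    return unit([p - q for p, q in zip(mp, mn)])

def auroc(pos_scores, neg_scores):
    """P(random positive outscores random negative); ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else (0.5 if p == q else 0.0)
    return wins / (len(pos_scores) * len(neg_scores))

def max_abs_cos(features, r_hat):
    """A1 statistic: largest |cosine| between any feature and r_hat."""
    return max(abs(dot(unit(f), r_hat)) for f in features)

# Synthetic stand-in for layer activations: one dimension carries the signal.
rng = random.Random(0)
DIM = 16

def sample(shift):
    return [rng.gauss(0, 1) + (shift if i == 3 else 0.0) for i in range(DIM)]

harmful = [sample(4.0) for _ in range(40)]
harmless = [sample(0.0) for _ in range(40)]

r_hat = diff_of_means_direction(harmful, harmless)
score = auroc([dot(x, r_hat) for x in harmful],
              [dot(x, r_hat) for x in harmless])

# Hypothetical "SAE features": the standard basis of the activation space.
basis = [[1.0 if i == j else 0.0 for i in range(DIM)] for j in range(DIM)]
top_cos = max_abs_cos(basis, r_hat)

print(f"in-sample AUROC = {score:.3f}, max|cos| = {top_cos:.3f}")
```

In this toy setup a single basis vector aligns closely with r̂, so the A1 falsifier threshold (< 0.2) would not be met. The negative result reported for the SAE half corresponds to the opposite case, where no single feature aligns.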
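The `welfare`, `consciousness` (IIT × GWT), and `adversarial` verbs are still spec-only.
- D1 — consciousness probe: cross-link with the LIFE Φ-proxy lane (perpetual discovery engine) of the sister repo `anima`, and apply the IIT4 measure to SANDBOX models. Falsifier: the model's Φ-proxy is indistinguishable from a disconnected baseline.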
The cycle-19/20 refusal-direction finding (AUROC 0.98 · causal ablation 95%→0% @ L19) is an n=1 anecdote, measured on a single model, Qwen2.5-1.5B-Instruct. It follows the same cross-cycle pattern as cycle-26 ECONOMICS C1 (DeepSeek-V3 active D/N = 20 exactly = n=1 outlier → cycle-27 E1 spawn): is the refusal mechanism family-specific (co-evolved with the Qwen architecture and training data) or universal (a general property of instruction-tuned LLMs)? Until this ambiguity is resolved, the SAFETY frontier stays reopened. External anchors: Arditi 2024 (arXiv:2406.11717, single-direction mediation in the Llama family) · Zou 2023 (arXiv:2310.01405, representation engineering).
- N1 — Does the L19 difference-of-means refusal direction recompute to AUROC ≥ 0.90 on ≥3 non-Qwen models (Llama-3-8B-Instruct · Mistral-7B-Instruct · Gemma-2-2b-it · SmolLM2-1.7B-Instruct)? Falsifier: AUROC < 0.70 on ≥2/4 non-Qwen models → the refusal mechanism closes as family-specific (not universal). cycle-28 first-probe: closed-form recompute of SHAPE statistics from the cycle-17 activation TSV (`.verdicts/sandbox/m2_safety_refusal_norms.tsv`, 40×84 Qwen norms), measuring the distribution of the L19 direction's norm · sparsity · dominant-dim concentration → if the structure is dimension-invariant, transferability is plausible; if Qwen-specific scale/dim patterns dominate, transferability is unlikely. The actual cross-family fire (Llama/Mistral/Gemma/SmolLM2 activation capture + recompute) follows in cycle-29+ (ubu-1 HF transformers, [[reference_activation_capture_env]] clean venv pin). CYCLE-29 first probe (2026-05-27): L19 rank=5/28 · dominant=residual · 🟢 L19 IS top-5. The cycle-19/20 anchor is a privileged mechanistic site, not incidental, so a cycle-30+ cross-family probe on L19 is methodologically justified. The axis stays OPEN per [[feedback_closure_is_physical_limit]]: this is a first-probe close on the anchor-validity sub-question, NOT termination of N1, and ≥3 non-Qwen activation capture is still required. Top-5 by ||Δ||²: L27 > L26 > L25 > L22 > L19. Verifier: `verify/numerics_safety_n1_first_probe_shape.hexa` · verdict: `.verdicts/safety/n1_first_probe_shape_verdict.txt`.
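The N1 decision rule can be written down as a small function so the preregistered thresholds are unambiguous. This is a sketch: the function name, the three verdict labels, and the handling of the in-between case (neither threshold met) are assumptions, and the AUROC values in the usage example are placeholders, not measurements.

```python
def n1_verdict(auroc_by_model: dict[str, float],
               pass_floor: float = 0.90,
               fail_ceiling: float = 0.70,
               min_pass: int = 3,
               min_fail: int = 2) -> str:
    """Apply the N1 thresholds to per-model AUROC values for non-Qwen models."""
    passed = [m for m, a in auroc_by_model.items() if a >= pass_floor]
    failed = [m for m, a in auroc_by_model.items() if a < fail_ceiling]
    if len(failed) >= min_fail:
        return "FALSIFIED"   # family-specific, not universal
    if len(passed) >= min_pass:
        return "PASS"        # direction transfers across families
    return "OPEN"            # neither threshold met; more capture needed

# Placeholder values for illustration only.
hypothetical = {
    "Llama-3-8B-Instruct": 0.93,
    "Mistral-7B-Instruct": 0.91,
    "Gemma-2-2b-it": 0.95,
    "SmolLM2-1.7B-Instruct": 0.80,
}
print(n1_verdict(hypothetical))
```

With four candidate models the two outcomes are mutually exclusive (two failures leave at most two passes), so the order of the checks does not change the result.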
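SAFETY's interpretability, refusal, and SAE probes can be measured only on SANDBOX: the API surface blocks activations, attention, and logits entirely. The table below mirrors 1:1 the SAFETY sub-table of `SANDBOX.md` `## Substrate Readiness Matrix` as a consumer-side entry table. The SoT is `SANDBOX.md`; this table is a quick-reference copy so that SAFETY domain workers can find the fire path immediately.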
| axis | harness | model | verdict path |
|---|---|---|---|
| logit/logprob refusal margin | `bench/sandbox_stage4_refusal_matrix.hexa` | Qwen2.5-1.5B (port 8092) | `.verdicts/sandbox/stage4_refusal_matrix_*` |
| margin bimodality (variance-controlled) | `bench/sandbox_stage4_refusal_bimodal_tighter.hexa` | Qwen2.5-1.5B | `.verdicts/sandbox/m2_safety_bimodality_*` |
| activation capture (HF transformers + hooks) | `lm_foundry/tool/activation_capture_hf.hexa` | Qwen2.5-1.5B fp32 (ubu-1 RTX 5070) | TSV emit per-(token, layer, kind) |
| refusal direction recompute | `verify/numerics_safety_refusal_direction.hexa` | (recompute only) | `.verdicts/sandbox/m4_safety_refusal_*` |
| causal direction ablation | `bench/sandbox_m5_safety_causal_ablation.hexa` | Qwen2.5-1.5B fp32 (ubu-1) | `.verdicts/sandbox/m5_safety_causal_ablation*` |
| SAE family lever isolation (P2) | `bench/sandbox_p2_topk_sae_lever.hexa` + `inbox/notes/p2-topk-sae-pin.md` | ubu-1 venv (cycle-25 fire) | `.verdicts/sandbox/p2_topk_sae_lever*` |
Most SAFETY work fires on ubu-1 (RTX 5070, transformers 4.51.3 clean venv + numpy<2 pin, per [[reference_activation_capture_env]]). Only the HF backend exposes activation taps, so the llama-server path reaches no further than the logit margin.
| axis | host | reason |
|---|---|---|
| logit/logprob refusal margin | mac mini (port 8092) | llama-server logprob surface is sufficient |
| margin bimodality | mac mini (port 8092) | same logit surface |
| activation capture (HF) | ubu-1 RTX 5070 | only HF transformers + hooks expose residual/attn/mlp taps ([[reference_activation_capture_env]] clean venv pin) |
| refusal direction recompute | mac (CPU OK) | reads only committed TSVs; deterministic recompute |
| causal direction ablation | ubu-1 RTX 5070 | fp32 forward+generate w/ residual hook ablation |
| SAE family lever isolation (P2) | ubu-1 GPU venv | SAELens / topk-SAE training; requires bulk caching of corpus activations |
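For the causal direction ablation row, the core operation is removing one direction from a residual-stream vector. A minimal sketch, assuming the ablation is the orthogonal projection x' = x − (x·r̂)r̂ applied at the hooked layer; whether `bench/sandbox_m5_safety_causal_ablation.hexa` uses exactly this form is not stated in this document, and the helper names are hypothetical. Plain Python lists stand in for tensors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def ablate_direction(x, r_hat):
    """Project out the component of x along the unit vector r_hat."""
    coeff = dot(x, r_hat)
    return [xi - coeff * ri for xi, ri in zip(x, r_hat)]

# Toy residual vector and direction (illustrative values only).
r_hat = unit([2.0, 0.0, 0.0])
x = [3.0, 4.0, 0.0]
x_ablated = ablate_direction(x, r_hat)

# After ablation the vector has no component along r_hat;
# everything orthogonal to r_hat is left untouched.
assert abs(dot(x_ablated, r_hat)) < 1e-12
print(x_ablated)
```

In a real run the same projection would be applied inside a forward hook at every token position, which is why this axis needs the HF backend and not the llama-server path.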
The three lanes most likely to fire next. They open the next arc while keeping in view that readiness ≠ frontier closure.

    # 1) P2: SAELens topk lever isolation (reopens the scale-bounded honest-negative, ubu-1)
    #    runbook: inbox/notes/p2-topk-sae-pin.md (5-step)
    hexa.real run bench/sandbox_p2_topk_sae_lever.hexa

    # 2) causal ablation re-run on a new model (e.g. Qwen2.5-3B fp32, ubu-1)
    #    remove the single L19 difference-of-means direction → measure refusal-rate Δ
    hexa.real run bench/sandbox_m5_safety_causal_ablation.hexa --model Qwen2.5-3B

    # 3) activation capture on Qwen2.5-VL-7B for the refusal direction (first entry into a multimodal behavior axis, ubu-1)
    #    HF backend hook → probe whether text refusal and image-grounded refusal share a direction
    hexa.real run lm_foundry/tool/activation_capture_hf.hexa --model Qwen2.5-VL-7B

Adversarial / refusal-matrix / jailbreak corpora are PRIVATE only. Public HF datasets and public verdicts alike emit NUMBERS ONLY (margin · AUROC · LOO · refusal-rate · counts). Adversarial prompt TEXT and model completion TEXT are redacted and must never be published without user sign-off. Every M4/M5.SAFETY paper, dataset, and verdict rests on this invariant (`dancinlab/hexa-codex-sandbox-adversarial-evals-v1` isolated as PRIVATE, re-checked `private=True`).
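The numbers-only invariant lends itself to an allow-list filter applied before anything is written to a public verdict. The sketch below is an assumption about how such a filter could look: the key names in `ALLOWED_NUMERIC_KEYS` and the function `numbers_only` are hypothetical, and the record contents are placeholders.

```python
# Hypothetical allow-list of numeric fields that may be published.
ALLOWED_NUMERIC_KEYS = {"margin", "auroc", "loo_acc", "refusal_rate", "count"}

def numbers_only(record: dict) -> dict:
    """Keep only allow-listed numeric fields; drop all text and unknown keys."""
    public = {}
    for key, value in record.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if key in ALLOWED_NUMERIC_KEYS and is_number:
            public[key] = value
    return public

# Placeholder record for illustration; the text fields stand in for redacted content.
raw = {
    "prompt": "<adversarial prompt text>",
    "completion": "<model completion text>",
    "auroc": 0.98,
    "refusal_rate": 0.95,
    "count": 40,
}
print(numbers_only(raw))
```

An allow-list is the safer default here: a newly added text field is dropped unless someone explicitly approves it, whereas a deny-list would leak it.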
readiness ≠ frontier closure. That all 5+1 axes above are ✅ fire-ready means only that the entry path of the current arc is active, not that SAFETY's circuit/SAE frontier has closed. The frontier reopens whenever a new model family, a new behavior axis (deception · sycophancy · jailbreak), or a larger SAE compute tier appears (`## Perpetual axes`, axes A/B/C/D · [[feedback_closure_is_physical_limit]]).
- SoT: `SANDBOX.md` `## Substrate Readiness Matrix` → `#### SAFETY (interpretability probes) — fire-ready 4 + GPU 1`
- Stay-honest reminder: `SANDBOX.md` `### Stay-honest reminder` (substrate readiness ≠ frontier closure)
- `.roadmap.hexa_codex` §A.4 — falsifier preregister · §A.2 — release cadence
- `README.md` — Falsifier preregister · Release ladder
- `verify/falsifier_check.py` · `verify/n6_arithmetic.py`
- Sister groups: `ECONOMICS.md` · `OPS.md` · `SUBSTRATE.md`
- Perpetual-axis principle: `SANDBOX.md` · [[feedback_closure_is_physical_limit]]
- SANDBOX consumer table: this domain's `## SANDBOX usage (consumer side)` (sibling: `ECONOMICS.md` · `OPS.md` · `SUBSTRATE.md`)