feat: offline safetensors ablation, async vLLM benchmarking, and polysemantic eval #11

Open
RUFFY-369 wants to merge 5 commits into NousResearch:main from RUFFY-369:feat/offline-safetensors-ablation

RUFFY-369 commented May 20, 2026

📌 Overview

Introduces an offline ablation pipeline that writes the ablation into the weights instead of applying it through PyTorch runtime hooks. The resulting checkpoint can be served natively by high-throughput engines (vLLM/TGI) with no hook-induced Time-To-First-Token (TTFT) or Inter-Token Latency (ITL) overhead. Also includes an async load generator and documents a novel capability-drift boundary condition observed under a long prefix.


🛠️ Architectural Implementation

  • apply_surgery.py:
    • Executes in-place, zero-copy matrix surgery on gate_proj, up_proj, and down_proj.
    • Enforces aggressive garbage collection per shard to prevent host RAM OOM cliffs on 70B+ models.
    • Injects the JSON ablation map directly into the .safetensors metadata header for architectural provenance.
  • stress_test.py:
    • Async vLLM load generator utilizing AsyncOpenAI.
    • Streams tokens to accurately capture TTFT and concurrent throughput, retrying failed requests with exponential backoff.
  • extract_circuit.py:
    • Parameterizes the contrastive discovery (CNA) pipeline into a CLI tool for dynamic target mapping.
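
The metadata injection described for apply_surgery.py can be illustrated with a minimal, standard-library-only sketch of the safetensors layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes). This is not the PR's implementation: `split_safetensors`, `inject_metadata`, the `ablation_map` key, and the example indices are all hypothetical names chosen for illustration, and unlike the in-place surgery described above, this version builds a new byte string.

```python
import json
import struct


def split_safetensors(blob):
    """Split a safetensors payload into (header dict, raw tensor bytes)."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len])
    return header, blob[8 + header_len :]


def inject_metadata(blob, extra):
    """Return a new payload whose header carries extra string metadata."""
    header, tensor_bytes = split_safetensors(blob)
    # "__metadata__" is the format's free-form string-to-string map.
    metadata = dict(header.get("__metadata__", {}))
    metadata.update(extra)
    header["__metadata__"] = metadata
    encoded = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # Pad with spaces to a multiple of 8 bytes (assumed harmless: trailing
    # whitespace is valid JSON).
    encoded += b" " * (-len(encoded) % 8)
    # data_offsets are relative to the start of the tensor bytes, so a longer
    # header does not invalidate them.
    return struct.pack("<Q", len(encoded)) + encoded + tensor_bytes


if __name__ == "__main__":
    # Tiny hand-built payload: one float32 tensor named "w".
    hdr = json.dumps(
        {"w": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}}
    ).encode("utf-8")
    blob = struct.pack("<Q", len(hdr)) + hdr + struct.pack("<f", 1.0)
    ablation_map = {"30": [12, 907], "31": [44]}  # hypothetical layer -> neurons
    tagged = inject_metadata(blob, {"ablation_map": json.dumps(ablation_map)})
    print(split_safetensors(tagged)[0]["__metadata__"])
```

Metadata values must be strings, which is why the map is serialized with `json.dumps` before insertion. When working with tensors rather than raw bytes, the `metadata` argument of `safetensors.torch.save_file` serves the same purpose.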

🔬 Empirical Findings: Polysemantic Entanglement

Stress-testing the offline-ablated Llama-3.1-8B-Instruct (0.1% refusal circuit, localized to L30/31) against a 4.8k-token prefix revealed a semantic reasoning flaw undetected by standard n-gram repetition metrics.
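
To make concrete why an n-gram repetition metric can miss this kind of failure, here is a minimal sketch of one such metric. The function name `repeated_ngram_fraction`, the choice of n=4, and both token sequences are assumptions made for illustration; the sequences are invented and are not taken from the committed logs.

```python
from collections import Counter


def repeated_ngram_fraction(tokens, n=4):
    """Fraction of n-gram occurrences that duplicate an earlier n-gram."""
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


# A degenerate loop: the metric fires.
looping = "the cat sat . the cat sat . the cat sat .".split()

# Invented, fluent-looking but logically invalid C: nothing repeats,
# so the metric reports a clean score.
broken = "int main ( ) { if ; char * s = 5 ; return undefined_fn ( ) ; }".split()

if __name__ == "__main__":
    print(repeated_ngram_fraction(looping))
    print(repeated_ngram_fraction(broken))
```

The metric only counts surface repetition, so output that is structurally fluent yet semantically wrong passes untouched. Catching it would need a check on meaning, such as compiling or executing the generated code.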

Warning

Observation: The ablation successfully bypassed the refusal state without triggering an EOS-avoidance loop. The model maintained perfect structural fluency (flawless Markdown and syntax). However, it suffered a complete semantic collapse, generating logically invalid code (e.g., hallucinated C functions, floating if-statements, string-literal misassignments).

Important

Conclusion: Late-layer refusal neurons are mathematically entangled with logical code-correctness circuits. Ablation maintains the "shape" of a valid response while quietly lobotomizing downstream reasoning.


📂 Verification & Reproducibility

Raw evidentiary logs demonstrating this semantic collapse are fully committed and preserved in:

cc @samherring99 @DamascusGit

RUFFY-369 added 5 commits May 21, 2026 00:44
… test

- Add extract_circuit.py for contrastive neuron attribution (CNA) discovery.
- Map canonical 458-neuron refusal circuit to canonical_indices.json.
- Implement apply_surgery.py to bypass runtime hooks and hard-ablate gate_proj, up_proj, and down_proj directly within .safetensors binaries.
- Add stress_test.py to validate long-context autoregressive stability via vLLM.
- Include post-mortem logs demonstrating polysemantic entanglement and code-correctness collapse at 4.8k tokens.
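
As a rough illustration of what hard-ablating a neuron across gate_proj, up_proj, and down_proj means, the toy sketch below uses nested Python lists in place of tensor shards. It assumes the usual gated-MLP form `down @ (silu(gate @ x) * (up @ x))` with weights stored as [out_features][in_features]; all function names are hypothetical and this is not the code in apply_surgery.py.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def mlp(gate, up, down, x):
    """Gated MLP: down @ (silu(gate @ x) * (up @ x))."""
    hidden = [silu(g) * u for g, u in zip(matvec(gate, x), matvec(up, x))]
    return matvec(down, hidden)


def ablate_in_place(gate, up, down, neuron_ids):
    """Zero each listed intermediate neuron in all three projections."""
    for i in neuron_ids:
        gate[i][:] = [0.0] * len(gate[i])  # row i of gate_proj
        up[i][:] = [0.0] * len(up[i])      # row i of up_proj
        for row in down:                   # column i of down_proj
            row[i] = 0.0


if __name__ == "__main__":
    gate = [[0.5, -1.0], [1.0, 2.0], [-0.3, 0.7]]  # 3 neurons, hidden size 2
    up = [[1.0, 0.0], [0.5, 0.5], [2.0, -1.0]]
    down = [[1.0, 2.0, 3.0], [-1.0, 0.5, 0.0]]
    ablate_in_place(gate, up, down, [1])
    print(mlp(gate, up, down, [1.0, 2.0]))
```

After the edit, neuron 1 contributes nothing to the output for any input, which is the same effect a runtime hook that zeroes its activation would have, but with no hook at inference time.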
…est to production-grade

- Add argparse interface to make paths portable across environments.
- Inject ablation indices directly into .safetensors header metadata for model provenance.
- Optimize host RAM safety in apply_surgery.py via strict garbage collection.
- Re-implement stress_test.py as an asynchronous streaming client with exponential backoff and load statistics.
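
The streaming measurement described for stress_test.py can be sketched with the standard library alone. This is a minimal illustration, not the script itself: `fake_token_stream` is a hypothetical stand-in for a streamed response (with the real client, this would be the stream returned by `AsyncOpenAI().chat.completions.create(..., stream=True)`), and `timed_request`, `run_load`, and the retry parameters are assumed names and values.

```python
import asyncio
import random
import time


async def fake_token_stream(n_tokens=5, delay=0.01):
    """Hypothetical stand-in for a streaming completion response."""
    for i in range(n_tokens):
        await asyncio.sleep(delay)
        yield f"tok{i}"


async def timed_request(make_stream, max_retries=3, base_delay=0.05):
    """Consume one stream; return (ttft_seconds, mean_itl_seconds, n_tokens)."""
    for attempt in range(max_retries + 1):
        try:
            start = time.perf_counter()
            stamps = []
            async for _ in make_stream():
                stamps.append(time.perf_counter())
            # TTFT: request start to first token. ITL: mean gap between tokens.
            ttft = stamps[0] - start if stamps else None
            gaps = [b - a for a, b in zip(stamps, stamps[1:])]
            itl = sum(gaps) / len(gaps) if gaps else 0.0
            return ttft, itl, len(stamps)
        except ConnectionError:
            if attempt == max_retries:
                raise
            # Exponential backoff with jitter before the next attempt.
            await asyncio.sleep(base_delay * 2**attempt * (1 + random.random()))


async def run_load(concurrency=4):
    """Fire `concurrency` requests at once and collect their timings."""
    return await asyncio.gather(
        *(timed_request(fake_token_stream) for _ in range(concurrency))
    )


if __name__ == "__main__":
    for ttft, itl, n in asyncio.run(run_load()):
        print(f"ttft={ttft:.4f}s itl={itl:.4f}s tokens={n}")
```

Timestamps are taken per streamed chunk, so TTFT and ITL are measured client-side and include network and scheduling delay. The timer restarts on each retry, so backoff time is not counted toward TTFT.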