DT-Circuits: Mechanistic Interpretability for Decision Transformers

Hugging Face Spaces | Python 3.9+ | PyTorch 2.x | License: Apache 2.0 | Framework: TransformerLens

DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.

Live Interactive Demo: DT-Explorer on Hugging Face Spaces



Core Objectives

  1. Map Information Flow: Quantify how input tokens (State, Action, Reward-to-Go) contribute to the output action logits.
  2. Causal Verification: Use intervention techniques to identify the minimal set of model components required for specific behaviors.
  3. Feature Decomposition: Use Sparse Autoencoders (SAEs) to identify monosemantic features within the model's residual stream.
  4. Behavioral Control: Modify agent decisions at inference time by manipulating internal activations.

Technical Overview

The framework centers on HookedDT (src/models/hooked_dt.py), a TransformerLens-wrapped Decision Transformer implementation that supports activation hooking and cache management.
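
To make the hooking and caching idea concrete, here is a minimal pure-Python sketch. ToyHookedModel, its forward signature, and the layer names are hypothetical stand-ins chosen for illustration; they are not HookedDT's actual API, which may differ. The sketch caches activations from a clean run and then patches one of them into a corrupted run.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Hook = Callable[[Vector], Vector]


class ToyHookedModel:
    """Hypothetical toy model: two elementwise layers, one hook point after each."""

    def __init__(self) -> None:
        self.layers = [
            ("layer0", lambda x: [2.0 * v for v in x]),
            ("layer1", lambda x: [v + 1.0 for v in x]),
        ]

    def forward(
        self,
        x: Vector,
        hooks: Optional[Dict[str, Hook]] = None,
        cache: Optional[Dict[str, Vector]] = None,
    ) -> Vector:
        hooks = hooks or {}
        for name, fn in self.layers:
            x = fn(x)
            if name in hooks:
                # A hook may inspect or replace the activation at this point.
                x = hooks[name](x)
            if cache is not None:
                # Store a copy so later edits cannot alter the cached value.
                cache[name] = list(x)
        return x


model = ToyHookedModel()

# 1. Clean run: record every intermediate activation.
clean_cache: Dict[str, Vector] = {}
clean_out = model.forward([1.0, 2.0], cache=clean_cache)

# 2. Corrupted run: different input, no intervention.
corrupted_out = model.forward([0.0, 0.0])

# 3. Patched run: corrupted input, but layer0's activation comes from the clean run.
patched_out = model.forward(
    [0.0, 0.0],
    hooks={"layer0": lambda _act: clean_cache["layer0"]},
)
```

In this toy, patching layer0 fully restores the clean output because everything downstream depends only on that activation. In a real transformer, the fraction of behavior a patch restores is the quantity of interest.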

Information Flow Diagram

graph TD
    subgraph Input_Sequence
        S[State Tokens]
        A[Action Tokens]
        RTG[Reward-to-Go Tokens]
    end

    Input_Sequence --> Embed[Embedding Layers]
    Embed --> Hooks[Activation Hooks]
    
    subgraph Transformer_Block
        Hooks --> Attn[Multi-Head Attention]
        Attn --> MLP[MLP Layers]
        MLP --> Res[Residual Stream]
    end

    Res --> DLA[Direct Logit Attribution]
    Res --> SAE[Sparse Autoencoder]
    Res --> Output[Action Logits]

    subgraph Interpretability_&_Safety
        DLA -.-> Analysis
        DLA -.-> MAD[Functional Attribution MAD]
        SAE -.-> Features
        SAE -.-> Auditor[Deceptive Alignment Auditor]
        Intervention[Activation Patching] -.-> Hooks
        
        Output & S --> Directer[Dynamic Rejection Steering]
        Directer -.-> |Feedback Adjust Alpha| Hooks
    end

    subgraph Interactive_Surgeon_Dashboard
        Surgeon[Circuit Surgeon Ablation Engine] -.-> |Dynamic Node/Edge Hooks| Hooks
        Surgeon --> |Format Schema| Neuronpedia[Neuronpedia Export Hub]
        Surgeon --> |Live Loop Execution| MiniGrid[MiniGrid Behavioral Audit]
        Output -.-> Surgeon
    end

Capabilities

Causal Mediation and Attribution

  • Direct Logit Attribution (DLA): Measures the direct contribution of individual attention heads and MLP layers to the final logit output.
  • Activation Patching: Substitutes internal activations from different runs to isolate the causal effect of specific inputs on model behavior.
  • Path Patching: Traces how information flows through specific connections between model components.
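
As a rough illustration of Direct Logit Attribution, the sketch below relies on the fact that the residual stream is a sum of component outputs, so a logit that is linear in the residual splits into one term per component. The component names, the vectors, and the direct_logit_attribution helper are hypothetical. The sketch also ignores the final layer norm, which a real implementation has to handle (for example by freezing its scale).

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def direct_logit_attribution(
    component_outputs: Dict[str, List[float]],
    unembed_direction: List[float],
) -> Dict[str, float]:
    """Project each component's residual-stream write onto one action's unembedding direction."""
    return {
        name: dot(out, unembed_direction)
        for name, out in component_outputs.items()
    }


# Hypothetical per-component writes into a 3-dimensional residual stream.
component_outputs = {
    "embed": [0.5, -1.0, 0.0],
    "L0H0": [1.0, 0.0, 2.0],
    "L0H1": [-0.5, 0.5, 0.0],
    "L0_mlp": [0.0, 1.0, 1.0],
}
# Hypothetical unembedding column for the action of interest.
unembed_direction = [1.0, 2.0, -1.0]

contributions = direct_logit_attribution(component_outputs, unembed_direction)

# The final residual is the sum of all writes, so the contributions add up to the logit.
residual = [sum(vals) for vals in zip(*component_outputs.values())]
total_logit = dot(residual, unembed_direction)
```

The contributions sum to the total logit by linearity, which gives a quick sanity check for any attribution implementation.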

Feature Discovery and Analysis

  • Sparse Autoencoders (SAEs): Decomposes the residual stream into a set of sparse features, helping to resolve polysemanticity.
  • Induction Scanning: Identifies attention heads that perform pattern-matching and temporal sequence recognition.
  • Automated Circuit Discovery (ACDC): Prunes the model to identify the smallest functional subgraph sufficient to perform a specific task.
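
The configuration section lists expansion_factor and k as SAE hyperparameters, which suggests a TopK-style autoencoder. The sketch below shows one common form of that idea in pure Python: keep the k largest encoder pre-activations and zero the rest. The function names, the tied encoder/decoder weights, and the ReLU applied to the kept values are assumptions for illustration, not the repository's implementation.

```python
from typing import List

Matrix = List[List[float]]


def topk_sae_encode(x: List[float], w_enc: Matrix, b_enc: List[float], k: int) -> List[float]:
    """Keep only the k largest pre-activations (clamped at zero); zero everything else."""
    pre = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [max(pre[i], 0.0) if i in keep else 0.0 for i in range(len(pre))]


def sae_decode(features: List[float], w_dec: Matrix, b_dec: List[float]) -> List[float]:
    """Reconstruct the activation as a sparse sum of decoder rows (one row per feature)."""
    return [
        b_dec[j] + sum(features[i] * w_dec[i][j] for i in range(len(features)))
        for j in range(len(b_dec))
    ]


# d_model = 2 with expansion_factor = 2 gives 4 dictionary features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = w_enc  # tied weights, purely to keep the example small
b_dec = [0.0, 0.0]

x = [3.0, -1.0]
features = topk_sae_encode(x, w_enc, b_enc, k=2)
reconstruction = sae_decode(features, w_dec, b_dec)
num_active = sum(1 for f in features if f != 0.0)
```

Because at most k features are active per input, each activation is explained by a handful of dictionary directions, which is what makes the individual features easier to inspect.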

Behavioral Steering & Safety Auditing

  • Activation Steering: Injects specific vectors into the residual stream to bias the agent's decision-making without retraining the weights.
  • Dynamic Rejection Steering (Directer): Integrates a feedback loop during inference to dynamically scale back steering magnitude if it pushes the action distribution toward illegal or dangerous actions.
  • Deceptive Alignment Auditing: Uses SAE feature decomposition to identify the "situational awareness switch" feature in deceptively aligned agents (model organisms evaluated in watched vs. unwatched conditions) and traces the circuit of attention heads that activate it.
  • Functional Attribution MAD: Detects mechanistic anomalies (such as backdoors or reward hacks) by comparing active logit attribution signatures to a cached reference profile, flagging when goals are met using atypical circuits.
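
The sketch below illustrates the general shape of activation steering with a rejection feedback loop: add alpha times a steering vector to the residual stream, check how much probability lands on disallowed actions, and shrink alpha until the distribution is acceptable. The README does not specify Directer's actual update rule, so the halving schedule, the threshold, the linear readout, and all names here are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_logits(resid: List[float], w_out: List[List[float]]) -> List[float]:
    """Hypothetical linear readout: one weight row per action."""
    return [sum(w * r for w, r in zip(row, resid)) for row in w_out]


def steer_with_rejection(
    resid: List[float],
    steer_vec: List[float],
    w_out: List[List[float]],
    illegal_actions: Sequence[int],
    alpha: float = 4.0,
    max_illegal_mass: float = 0.4,
    shrink: float = 0.5,
    max_iters: int = 10,
) -> Tuple[float, List[float]]:
    """Shrink alpha until the steered policy puts little enough mass on illegal actions."""
    for _ in range(max_iters):
        steered = [r + alpha * s for r, s in zip(resid, steer_vec)]
        probs = softmax(action_logits(steered, w_out))
        if sum(probs[a] for a in illegal_actions) <= max_illegal_mass:
            return alpha, probs
        alpha *= shrink  # feedback step: scale the steering magnitude back
    # No acceptable alpha found: fall back to the unsteered policy.
    return 0.0, softmax(action_logits(resid, w_out))


# Toy setup: 2-dimensional residual, 3 actions, action 1 is illegal.
resid = [1.0, 0.0]
steer_vec = [0.0, 1.0]
w_out = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.5]]

final_alpha, final_probs = steer_with_rejection(resid, steer_vec, w_out, illegal_actions=[1])
```

A multiplicative shrink keeps the loop short and always terminates; a real controller might instead adjust alpha continuously across timesteps.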

Interactive Surgical Auditing & Peer Review

  • Interactive Circuit Surgery: Provides real-time interactive node (Heads, MLPs) and communication path (edges) ablation tools. Severed pathways dynamically update the underlying architecture using custom forward hooks.
  • Live Behavioral Audits: Evaluates guided agent behavior inside a live Gymnasium (MiniGrid) environment step-by-step to immediately visualize behavioral changes under currently selected surgical configurations.
  • Neuronpedia Export: Formats the discovered circuit blueprint, active components, and performance metrics into standardized schemas for publishing directly to the Neuronpedia platform for public peer review.

Project Structure

DT-Circuits/
├── src/
│   ├── dashboard/          
│   │   └── app.py          # Streamlit-based visualization UI
│   ├── data/               
│   │   └── harvester.py    # PPO-based expert trajectory harvester
│   ├── interpretability/   
│   │   ├── acdc.py         # Automated Circuit Discovery logic
│   │   ├── attribution.py  # Direct Logit Attribution (DLA)
│   │   ├── circuit_surgeon.py # Interactive node & path ablation engine
│   │   ├── evolution.py    # Training Dynamics Analysis
│   │   ├── induction_scan.py # Induction head detection logic
│   │   ├── neuronpedia.py  # Neuronpedia publishing client
│   │   ├── nla.py          # Natural Language Autoencoder Explainer
│   │   ├── patching.py     # Causal activation patching tools
│   │   ├── path_patching.py # Path-based causal intervention engine
│   │   ├── safety.py       # Safety auditing, directer, and deceptive alignment tools
│   │   ├── sae_manager.py  # SAE deployment and anomaly detection
│   │   ├── steering.py     # Steering vector generation and injection
│   │   └── universality.py # Cross-architecture feature mapping
│   ├── models/             
│   │   └── hooked_dt.py    # TransformerLens-wrapped Decision Transformer
│   ├── config.py           # Centralized hyperparameter management
│   └── utils/              
├── tests/                  # Unit tests for all modules
├── config.yaml             # External hyperparameter storage
├── requirements.txt 
└── docs/                        

Configuration

Hyperparameters are managed through a two-part system that balances ease of use with research reproducibility:

  1. config.yaml: The primary interface for users. You can modify model dimensions, training epochs, and environment settings here without touching the code.
  2. src/config.py: Defines the underlying structure using Python dataclasses. It automatically loads overrides from config.yaml at runtime.
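
One plausible way to wire these two pieces together is sketched below: dataclasses hold the defaults and a recursive helper applies overrides from a nested dictionary, which stands in for the parsed contents of config.yaml. The field names come from the configuration table in this README, but the default values and the apply_overrides helper are hypothetical; the actual src/config.py may be organized differently.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Dict


@dataclass
class ModelConfig:
    n_layers: int = 2      # hypothetical default
    d_model: int = 128     # hypothetical default
    n_heads: int = 4       # hypothetical default
    max_length: int = 20   # hypothetical default


@dataclass
class SAEConfig:
    expansion_factor: int = 4  # hypothetical default
    k: int = 16                # hypothetical default
    num_episodes: int = 500


@dataclass
class Config:
    model: ModelConfig = field(default_factory=ModelConfig)
    sae: SAEConfig = field(default_factory=SAEConfig)


def apply_overrides(cfg: Any, overrides: Dict[str, Any]) -> Any:
    """Recursively copy values from a nested dict onto a dataclass instance."""
    valid = {f.name for f in fields(cfg)}
    for key, value in overrides.items():
        if key not in valid:
            # Fail loudly on typos instead of silently ignoring them.
            raise KeyError(f"Unknown config key: {key!r}")
        current = getattr(cfg, key)
        if is_dataclass(current) and isinstance(value, dict):
            apply_overrides(current, value)
        else:
            setattr(cfg, key, value)
    return cfg


# Stand-in for the dict a YAML parser would return for config.yaml.
yaml_overrides = {"model": {"d_model": 64}, "sae": {"k": 32}}
cfg = apply_overrides(Config(), yaml_overrides)
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled hyperparameter in the YAML file raises an error instead of leaving the default silently in place.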

Key Configuration Sections

| Section | Description | Key Parameters |
|---------|-------------|----------------|
| model | Architecture settings for the Decision Transformer | n_layers, d_model, n_heads, max_length |
| data | Settings for expert trajectory collection | env_id, num_episodes (for DT training) |
| train | DT training hyperparameters | lr, epochs, seed |
| sae | Sparse Autoencoder training hyperparameters | expansion_factor, k, num_episodes (SAE specific) |

Example: Independent Data Control

You can set the number of episodes used for DT training and for SAE activation extraction separately:

data:
  num_episodes: 1000  # Episodes for training the DT teacher

sae:
  num_episodes: 500   # Episodes for extracting SAE activations

Execution Modes: Installation and Usage

There are two primary ways to run and interact with the DT-Circuits framework depending on your research needs:


Way 1: Interactive Cloud Demo (Hugging Face Spaces)

For instant visual exploration, path intervention, and alignment auditing without any local setup, open the DT-Explorer web dashboard on Hugging Face Spaces.

Note

Demo constraints:

  • CPU-Bound Resources: Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations such as ACDC scans may show higher latency than on a local GPU workspace.
  • Sliced Dataset: Trajectory datasets are sliced down to a lightweight demo set under a 10 MB limit (defined in deploy.sh) to fit storage and memory constraints.
  • Read-Only / Ephemeral Container: Uses pre-baked static weights (mini_dt.pt) and pre-trained SAE checkpoints. Training new models and writing persistent state are disabled.

Way 2: Clone and Run Locally (Full Pipeline)

For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.

Local Environment Setup

First, clone the repository, set up a virtual environment, and install dependencies:

git clone https://github.com/sadhumitha-s/DT-Circuits
cd DT-Circuits

python -m venv venv
source venv/bin/activate  

pip install -r requirements.txt

Option 2.1: Simple Workflows via Makefile

The workspace includes a standardized Makefile to orchestrate common research pipelines with single commands:

make setup      # Set up local environment & install requirements
make train      # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
make dashboard  # Run the Streamlit visualization dashboard locally

Option 2.2: Granular Control via Bash & Python

For research flexibility, execute each step of the pipeline manually using granular terminal scripts:

  1. Trajectories & Model Training: Harvest teacher trajectories and train the target Decision Transformer (HookedDT):

    python scripts/train_dt.py

  2. TopK Sparse Autoencoder (SAE) Training: Train sparse autoencoders on target activation layers:

    python scripts/train_sae.py

  3. Interactive Analysis: Launch the Streamlit visualization engine locally to run audits with custom weights:

    streamlit run src/dashboard/app.py

Documentation

Detailed technical documentation for specific modules:


Foundational Research & References

This framework implements and builds upon the following foundational methodologies:


Citation

@software{dt_circuits2026,
  author = {Sadhumitha S.},
  title = {DT-Circuits: Mechanistic Interpretability for Decision Transformers},
  year = {2026},
  url = {https://github.com/sadhumitha-s/DT-Circuits}
}

License

Apache 2.0
