README.md (+60 -21): 60 additions & 21 deletions
@@ -8,6 +8,8 @@
 
 DT-Circuits is a research framework for mechanistic interpretability of Decision Transformers, focused on causal analysis, sparse feature decomposition, and circuit-level understanding of sequential decision-making agents.
 
+**Live Interactive Demo:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
+
 ---
 
 ## Table of Contents
@@ -17,6 +19,7 @@ DT-Circuits is a research framework for mechanistic interpretability of Decision
 - [Project Structure](#project-structure)
 - [Installation and Usage](#installation-and-usage)
 - [Documentation](#documentation)
+- [Foundational Research & References](#foundational-research--references)
 - [Citation](#citation)
 - [License](#license)
 
@@ -167,49 +170,72 @@ sae:
 
 ---
 
-## Installation and Usage
+## Execution Modes: Installation and Usage
+
+There are two primary ways to run and interact with the **DT-Circuits** framework, depending on your research needs:
+
+---
+
+### Way 1: Interactive Cloud Demo (Hugging Face Spaces)
+
+For instant visual exploration, path intervention, and alignment auditing without any local workspace preparation, launch the web dashboard directly:
+
+* **Demo Link:** [DT-Explorer on Hugging Face Spaces](https://huggingface.co/spaces/sadhumitha-s/DT-Explorer)
+
+> [!NOTE]
+> **Concise Demo Constraints:**
+> * **CPU-Bound Resources:** Runs on standard free-tier CPU instances (2 vCPUs, 16 GB RAM); high-overhead operations like ACDC scans may show higher latency than on a local GPU workspace.
+> * **Sliced Dataset:** Trajectory datasets are dynamically sliced down to a lightweight demo set under a **10MB limit** (defined in [deploy.sh](scripts/deploy.sh#L19-L33)) to limit storage and memory footprint.
+> * **Read-Only / Ephemeral Container:** Uses pre-baked static weights (`mini_dt.pt`) and pre-trained SAE checkpoints. Training new models or writing persistent states is disabled.
+
+---
+
+### Way 2: Clone and Run Locally (Full Pipeline)
+
+For full end-to-end research, customized hyperparameter tuning, local data harvesting, and GPU-accelerated model or SAE training, run the workspace on your machine.
 
-### Setup
+#### Local Environment Setup
+First, clone the repository, set up a virtual environment, and install dependencies:
You can access the hosted version on Hugging Face Spaces instantly, or run it locally:
+#### Option 2.1: Simple Workflows via Makefile
+The workspace includes a standardized [Makefile](Makefile) to orchestrate common research pipelines with single commands:
 
-* **Live Hosted Space:** [DT-Explorer Web App](https://sadhumitha-s-dt-explorer.hf.space) (No local installation needed!)
-* **Local Run:** Launch the dashboard on your machine (it will initialize with a random model if no trained weights are detected):
-```bash
-streamlit run src/dashboard/app.py
-```
+```bash
+make setup # Set up local environment & install requirements
+make train # Run the full end-to-end pipeline (Data harvesting -> DT -> SAE training)
+make dashboard # Run the Streamlit visualization dashboard locally
+```
 
-### Workflow
+#### Option 2.2: Granular Control via Bash & Python
+For research flexibility, execute each step of the pipeline manually using granular terminal scripts:
 
-1. **Data Harvesting & Model Training**
+1. **Trajectories & Model Training**
+Harvest teacher trajectories and train the target Decision Transformer (`HookedDT`):
 ```bash
 python scripts/train_dt.py
 ```
 
-2. **SAE Training**
+2. **TopK Sparse Autoencoder (SAE) Training**
+Train sparse autoencoders on target activation layers:
 ```bash
 python scripts/train_sae.py
 ```
 
-3. **Interpretability Analysis**
+3. **Interactive Analysis**
+Launch the Streamlit visualization engine locally to run audits with custom weights:
 ```bash
 streamlit run src/dashboard/app.py
 ```
 
-### Alternative: Makefile
-Common tasks can also be executed via `make`:
-```bash
-make setup # Install dependencies
-make train # Run full training pipeline (DT + SAE)
-make dashboard # Launch DT-Explorer
-```
-
 ---
 
 ## Documentation
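
As a rough illustration of the manual workflow in Option 2.2 above, the three commands could be chained in a small wrapper that stops at the first failing step. This is a hypothetical sketch and not part of the repository: the `run_step` and `run_pipeline` functions and the `DRY_RUN` variable are assumptions introduced here, and only the three underlying commands come from the README. With `DRY_RUN=1` (the default in this sketch) each command is printed rather than executed.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the three manual steps (not from the repo).
# -e aborts on the first failing step, -u flags unset variables.
set -eu

# Assumption: DRY_RUN=1 prints commands only; any other value executes them.
DRY_RUN="${DRY_RUN:-1}"

# Print a labelled command, then run it unless in dry-run mode.
run_step() {
  local label="$1"
  shift
  echo "[${label}] $*"
  if [ "${DRY_RUN}" != "1" ]; then
    "$@"
  fi
}

# The order mirrors the README: DT training, SAE training, dashboard.
run_pipeline() {
  run_step "1/3 train DT"  python scripts/train_dt.py
  run_step "2/3 train SAE" python scripts/train_sae.py
  run_step "3/3 dashboard" streamlit run src/dashboard/app.py
}

run_pipeline
```

Running the script with `DRY_RUN=0` from the repository root would execute the steps for real; the final `streamlit run` step keeps running until the dashboard is stopped.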
@@ -222,6 +248,19 @@ Detailed technical documentation for specific modules:
 
 ---
 
+## Foundational Research & References
+
+This framework implements and builds upon the following foundational methodologies:
+
+* **Decision Transformers**: [Chen et al., 2021](https://arxiv.org/abs/2106.01345) - Reinforcement learning as sequence modeling.
+* **Transformer Circuits**: [Elhage et al., 2021](https://transformer-circuits.pub/2021/framework/index.html) - Mathematical foundations of mechanistic interpretability.
+* **ACDC (Automated Circuit Discovery)**: [Conmy et al., 2023](https://arxiv.org/abs/2304.14997) - Algorithmic discovery of subgraphs.
+* **Sparse Autoencoders (SAEs)**: [Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features/index.html) (monosemantic features) & [Gao et al., 2024](https://arxiv.org/abs/2406.04093) (TopK SAEs).
+* **Activation Steering**: [Turner et al., 2023](https://arxiv.org/abs/2308.10248) - Control via residual stream vector additions.