⭐ A curated list of research papers and resources related to Activation Engineering for foundation models, especially Large Language Models (LLMs).
Note
Activation engineering (with LLMs) refers to the process of modifying or controlling the internal activations, i.e., the intermediate outputs of neurons, to analyze or influence model behavior. It is an emerging research field related to model interpretability, neural network transparency, and controlled generation, and it aims to better understand the internal workings of foundation models, particularly with respect to high-level concepts that align with human cognition.
Important
In line with the objectives of activation engineering, analyzing and steering model behavior with respect to arbitrary concepts, this repo delves into the key related areas of concept representation and extraction, concept activation detection, and concept activation steering. The goal is to investigate how concepts are represented within models, how these concepts can be activated or detected during inference, and how activation vectors can be steered for more targeted control over model behavior.
This repo serves as a resource for researchers and developers interested in the inner workings of neural networks and LLMs, offering methods and experimental findings for advancing the field of activation engineering, which aims to understand and manipulate model activations toward building more transparent, controllable, and capable intelligent systems.
💭 This repo is updated on an ongoing basis; if some related papers are missing, please let me know via a pull request :)
🤗 Please also feel free to point out any mistakes or suggest a better categorization. Thanks!
Papers and resources that explore how concepts are represented in model hidden states.
- [arXiv 2025] - ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features - Alec Helbling, Tuna Han Salih Meral, Ben Hoover, Pinar Yanardag, Duen Horng Chau. [Paper]
- [ICLR 2025] - Not All Language Model Features Are Linear - Joshua Engels, Eric J Michaud, Isaac Liao, Wes Gurnee, Max Tegmark. [Paper][Code]
- [ICLR 2025] - The Geometry of Categorical and Hierarchical Concepts in Large Language Models - Kiho Park, Yo Joong Choe, Yibo Jiang, Victor Veitch. [Paper][Code]
- [NeurIPS 2024] - Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers - Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam. [Paper][Code]
- [NeurIPS 2024] - From Causal to Concept-Based Representation Learning - Goutham Rajendran, Simon Buchholz, Bryon Aragam, Bernhard Schölkopf, Pradeep Kumar Ravikumar. [Paper] [Code]
- [ICML 2024] - The Linear Representation Hypothesis and the Geometry of Large Language Models - Kiho Park, Yo Joong Choe, Victor Veitch. [Paper] [Code]
- [ICLR 2024] - Demystifying Embedding Spaces using Large Language Models - Guy Tennenholtz, Yinlam Chow, Chih-Wei Hsu, Jihwan Jeong, Lior Shani, Azamat Tulepbergenov, Deepak Ramachandran, Martin Mladenov, Craig Boutilier. [Paper]
- [ICLR 2024] - Identifying Representations for Intervention Extrapolation - Sorawit Saengkyongam, Elan Rosenfeld, Pradeep Ravikumar, Niklas Pfister, Jonas Peters. [Paper]
- [COLM 2024] - The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets - Samuel Marks, Max Tegmark. [Paper]
- [ACL 2024] - Language Models Linearly Represent Sentiment - Curt Tigges, Oskar J. Hollinsworth, Atticus Geiger, Neel Nanda. [Paper]
- [ACL 2015] - Linguistic Regularities in Continuous Space Word Representations - Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig. [Paper]
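Many of the papers above build on the linear representation hypothesis: a concept corresponds, at least approximately, to a direction in the model's hidden states. Below is a minimal, illustrative sketch of extracting such a direction as a difference of mean activations over contrastive prompts; the model, layer index, and prompt sets are assumptions for illustration, not taken from any specific paper above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # illustrative; any HF causal LM that exposes hidden states works
LAYER = 6        # illustrative layer index
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True).eval()

def mean_activation(prompts, layer=LAYER):
    """Mean hidden state at the last token position, averaged over prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # hidden_states: tuple of (num_layers + 1) tensors of shape [batch, seq, d_model]
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# Contrastive prompt sets for a toy "positive sentiment" concept (illustrative).
pos_prompts = ["The movie was wonderful and", "I absolutely loved the"]
neg_prompts = ["The movie was terrible and", "I absolutely hated the"]

# Difference-of-means direction for the concept at the chosen layer.
concept_direction = mean_activation(pos_prompts) - mean_activation(neg_prompts)
concept_direction = concept_direction / concept_direction.norm()
```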
Research on methods to detect and identify specific concepts or features in activations.
- [arXiv 2025] - Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts - Shruti Joshi, Andrea Dittadi, Sébastien Lachapelle, Dhanya Sridhar. [Paper]
- [arXiv 2025] - Are Sparse Autoencoders Useful? A Case Study in Sparse Probing - Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda. [Paper]
- [PAKDD 2024] - Interpreting Pretrained Language Models via Concept Bottlenecks - Zhen Tan, Lu Cheng, Song Wang, Bo Yuan, Jundong Li, Huan Liu. [Paper]
- [NeurIPS 2024] - LG-CAV: Train Any Concept Activation Vector with Language Guidance - Qihan Huang, Jie Song, Mengqi Xue, Haofei Zhang, Bingde Hu, Huiqiong Wang, Hao Jiang, Xingen Wang, Mingli Song. [Paper] [Code]
- [NeurIPS 2024] - Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector - Zhihao Xu, Ruixuan Huang, Xiting Wang, Fangzhao Wu, Jing Yao, Xing Xie. [Paper][Code]
- [MICCAI 2024] - TextCAVs: Debugging vision models using text - Angus Nicolson, Yarin Gal, J. Alison Noble. [Paper]
- [arXiv 2024] - Decision Trees for Interpretable Clusters in Mixture Models and Deep Representations - Maximilian Fleissner, Maedeh Zarvandi, Debarghya Ghoshdastidar. [Paper]
- [arXiv 2024] - KTCR: Improving Implicit Hate Detection with Knowledge Transfer driven Concept Refinement - Samarth Garg, Vivek Hruday Kavuri, Gargi Shroff, Rahul Mishra. [Paper]
- [arXiv 2024] - Explaining Explainability: Understanding Concept Activation Vectors - Angus Nicolson, Lisa Schut, J. Alison Noble, Yarin Gal. [Paper]
- [ICLR 2023] - Concept Gradient: Concept-based Interpretation Without Linear Assumption - Andrew Bai, Chih-Kuan Yeh, Pradeep Ravikumar, Neil Y. C. Lin, Cho-Jui Hsieh. [Paper][Code]
- [NeurIPS 2022] - Probing Classifiers are Unreliable for Concept Removal and Detection - Abhinav Kumar, Chenhao Tan, Amit Sharma. [Paper]
- [NeurIPS 2022] - Concept Activation Regions: A Generalized Framework For Concept-Based Explanations - Jonathan Crabbé, Mihaela van der Schaar. [Paper][Code]
- [NeurIPS 2020] - On Completeness-aware Concept-Based Explanations in Deep Neural Networks - Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, Pradeep Ravikumar. [Paper][Code]
- [ICML 2018] - Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) - Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, Rory Sayres. [Paper][Code]
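A recurring recipe in the detection papers above (e.g., TCAV-style concept activation vectors and linear probes) is to fit a linear classifier on activations of concept vs. non-concept examples and use its weight vector as the concept direction. Below is a minimal sketch with scikit-learn; the placeholder arrays and dimensions are assumptions for illustration, and real activations would be collected with a helper like the one in the previous snippet.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assume these were collected beforehand, one activation vector per example:
# concept_acts: [n_concept, d_model] activations for examples containing the concept
# random_acts:  [n_random, d_model]  activations for random / non-concept examples
concept_acts = np.random.randn(64, 768)   # placeholder data for illustration only
random_acts = np.random.randn(64, 768)

X = np.concatenate([concept_acts, random_acts])
y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])

probe = LogisticRegression(max_iter=1000).fit(X, y)

# The probe's weight vector acts as a concept activation vector (CAV);
# projecting a new activation onto it scores how strongly the concept is present.
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
new_activation = np.random.randn(768)     # placeholder for a freshly collected activation
concept_score = float(new_activation @ cav)
print(f"concept score: {concept_score:.3f}")
```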
Methods for steering or manipulating activations to influence model behavior or outputs.
- [arXiv 2025] - Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models - Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz. [Paper]
- [arXiv 2025] - Activation Space Interventions Can Be Transferred Between Large Language Models - Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amirali Abdullah. [Paper]
- [arXiv 2025] - SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models - Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du. [Paper]
- [arXiv 2025] - Uncovering Latent Chain of Thought Vectors in Language Models - Jason Zhang, Scott Viteri. [Paper]
- [arXiv 2025] - AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders - Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts. [Paper][Code]
- [AAAI 2025] - Tuning-Free Accountable Intervention for LLM Deployment - A Metacognitive Approach - Zhen Tan, Jie Peng, Song Wang, Lijie Hu, Tianlong Chen, Huan Liu. [Paper]
- [ICLR 2025] - Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution - Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, Mengnan Du. [Paper]
- [ICLR 2025] - Programming Refusal with Conditional Activation Steering - Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar. [Paper][Code]
- [ICLR 2025] - Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors - Weixuan Wang, Jingyuan Yang, Wei Peng. [Paper] [Code]
- [ICLR 2025] - Improving Instruction-Following in Language Models through Activation Steering - Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi. [Paper][Code]
- [ICLR 2025 workshop] - Editable Concept Bottleneck Models - Lijie Hu, Chenyang Ren, Zhengyu Hu, Hongbin Lin, Cheng-Long Wang, Zhen Tan, Weimin Lyu, Jingfeng Zhang, Hui Xiong, Di Wang. [Paper]
- [NeurIPS 2024] - Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization - Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, Jinghui Chen. [Paper][Code]
- [NeurIPS 2024] - Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control - Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye. [Paper]
- [NeurIPS 2024] - Refusal in Language Models Is Mediated by a Single Direction - Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda. [Paper][Code]
- [NeurIPS 2024] - Analyzing the Generalization and Reliability of Steering Vectors - Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adria Garriga-Alonso, Robert Kirk. [Paper]
- [NeurIPS 2024] - Who's asking? User personas and the mechanics of latent misalignment - Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon. [Paper][Code]
- [NeurIPS 2024 workshop] - Towards Reliable Evaluation of Behavior Steering Interventions in LLMs - Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David Krueger. [Paper]
- [NeurIPS 2024 workshop] - Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering - Joris Postmus, Steven Abreu. [Paper][Code]
- [NeurIPS 2024 workshop] - Relational Composition in Neural Networks: A Survey and Call to Action - Martin Wattenberg, Fernanda B. Viégas. [Paper]
- [NeurIPS 2024 workshop] - Can sparse autoencoders be used to decompose and interpret steering vectors? - Harry Mayne, Yushi Yang, Adam Mahdi. [Paper] [Code]
- [NeurIPS 2024 workshop] - Extracting Unlearned Information from LLMs with Activation Steering - Atakan Seyitoğlu, Aleksei Kuvshinov, Leo Schwinn, Stephan Günnemann. [Paper]
- [ICML 2024] - In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering - Sheng Liu, Haotian Ye, Lei Xing, James Zou. [Paper][Code]
- [ICML 2024 workshop] - Controlling Large Language Model Agents with Entropic Activation Steering - Nate Rahn, Pierluca D'Oro, Marc G. Bellemare. [Paper]
- [AAAI 2024] - Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention - Zhen Tan, Tianlong Chen, Zhenyu Zhang, Huan Liu. [Paper]
- [ICLR 2024] - ReFT: Representation Finetuning for Language Models - Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts. [Paper][Code]
- [ICLR 2024] - Function Vectors in Large Language Models - Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau. [Paper][Code]
- [EMNLP 2024] - Activation Scaling for Steering and Interpreting Language Models - Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, Aaron Schein. [Paper] [Code]
- [EMNLP 2024] - Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective - Van-Cuong Pham, Thien Huu Nguyen. [Paper]
- [ACL 2024] - Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models - Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, Jing Shao. [Paper] [Code]
- [ACL 2024] - InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance - Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, Xipeng Qiu. [Paper] [Code]
- [ACL 2024] - Steering Llama 2 via Contrastive Activation Addition - Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner. [Paper] [Code]
- [CIKM 2024] - Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment - Haoran Wang, Kai Shu. [Paper] [Code]
- [arXiv 2024] - Representation Engineering: A Top-Down Approach to AI Transparency - Andy Zou et al. [Paper][Code]
- [arXiv 2024] - Improving Steering Vectors by Targeting Sparse Autoencoder Features - Sviatoslav Chalnev, Matthew Siu, Arthur Conmy. [Paper] [Code]
- [arXiv 2024] - Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs - Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland. [Paper]
- [arXiv 2024] - Steering Without Side Effects: Improving Post-Deployment Control of Language Models - Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman. [Paper] [Code]
- [arXiv 2024] - Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories - Tianlong Wang, Xianfeng Jiao, Yifan He, Zhongzhi Chen, Yinghao Zhu, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma. [Paper]
- [arXiv 2024] - Activation Steering for Robust Type Prediction in CodeLLMs - Francesca Lucchetti, Arjun Guha. [Paper]
- [arXiv 2024] - Extending Activation Steering to Broad Skills and Multiple Behaviours - Teun van der Weij, Massimo Poesio, Nandi Schoots. [Paper] [Code]
- [arXiv 2024] - MiMiC: Minimally Modified Counterfactuals in the Representation Space - Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru. [Paper]
- [arXiv 2024] - Investigating Bias Representations in Llama 2 Chat via Activation Steering - Dawn Lu, Nina Rimsky. [Paper]
- [arXiv 2023] - Improving Activation Steering in Language Models with Mean-Centring - Ole Jorgensen, Dylan Cope, Nandi Schoots, Murray Shanahan. [Paper]
- [EMNLP 2023] - In-Context Learning Creates Task Vectors - Roee Hendel, Mor Geva, Amir Globerson. [Paper]
- [arXiv 2023] - Activation Addition: Steering Language Models Without Optimization - Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, Monte MacDiarmid. [Paper] [Code]
- [ACL 2022] - Extracting Latent Steering Vectors from Pretrained Language Models - Nishant Subramani, Nivedita Suresh, Matthew E. Peters. [Paper] [Code]
- [arXiv 2025] - LatentQA: Teaching LLMs to Decode Activations Into Natural Language - Alexander Pan, Lijie Chen, Jacob Steinhardt. [Paper][Code]
- [ICLR 2025] - Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models - Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda. [Paper]
- [NeurIPS 2024] - Concept Algebra for (Score-Based) Text-Controlled Generative Models - Zihao Wang, Lin Gui, Jeffrey Negrea, Victor Veitch. [Paper][Code]
- [NeurIPS 2023] - Inference-Time Intervention: Eliciting Truthful Answers from a Language Model - Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg. [Paper][Code]
- [ICLR 2020] - On the "steerability" of generative adversarial networks - Ali Jahanian, Lucy Chai, Phillip Isola. [Paper][Code]
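Many of the steering papers above (e.g., activation addition and contrastive activation addition) add a steering vector to the residual stream during the forward pass. Below is a minimal sketch using a PyTorch forward hook on a Hugging Face GPT-2 block; the layer index, coefficient, and the random placeholder steering vector (in practice, e.g., a difference-of-means direction as in the earlier extraction snippet) are assumptions for illustration, not any specific paper's method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER, COEFF = "gpt2", 6, 8.0      # illustrative choices
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

# A unit-norm steering vector in the model's hidden size; a random placeholder here,
# in practice e.g. the concept_direction extracted from contrastive prompts.
steering_vector = torch.randn(model.config.hidden_size)
steering_vector = steering_vector / steering_vector.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # [batch, seq, d_model]; add the scaled steering vector at every position.
    hidden = output[0] + COEFF * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tok("The weather today is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()    # always detach the hook so later calls run unsteered
```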
- Activation Engineering - LessWrong
- Anthropic Transformer Circuits Thread
- TransformerLens
- Neel Nanda's Blog
- Awesome Representation Engineering
This project is licensed under the MIT License.
Disclaimer: This repository is for research purposes only. The papers and resources listed here are the property of their respective authors.