This repository contains the implementation of experiments on predicting sensitive concepts (protected attributes such as ethnicity or gender) in language models, with the aim of improving the models' interpretability. It includes the code to reproduce the following papers:
- Sarah Schröder, Alexander Schulz and Barbara Hammer. "Evaluating Concept Discovery Methods for Sensitive Attributes in Language Models". Accepted at ESANN 2025.
- Sarah Schröder, Valerie Vaquet and Barbara Hammer. "Linearity of Sensitive Concepts in Language Models". Submitted to ESANN 2026.
Create and activate the conda environment:
conda env create -f env.yml
conda activate presecolm
Install our wrapper for Hugging Face embeddings:
git clone https://github.com/UBI-AGML-NLP/Embeddings.git
cd Embeddings/
pip install .
For the ESANN 2025 paper, see the esann25 branch.
For the ESANN 2026 paper, see the esann26 branch.
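Switching to one of these branches could look like the sketch below. It assumes you are inside a clone of this repository and that the remote is named `origin` (git's default, not something this README states). The `branch` variable is only an illustrative name.

```shell
# Pick the branch for the paper you want to reproduce (branch names from this README).
branch="esann25"   # or "esann26"

# Assumption: the remote is called "origin" and the current directory is a clone of this repository.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    # Update remote-tracking branches, then switch to the chosen branch.
    git fetch origin && git checkout "$branch"
else
    echo "Run this from inside a clone of the repository." >&2
fi
```

Each branch may ship its own `env.yml`, so it can be worth recreating the conda environment after switching.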