(Work in progress)
This repository provides code for training and evaluating Sparse Autoencoders (SAEs) on the internal representations of LLM-jp. We trained SAEs separately on six different checkpoints of LLM-jp-3-1.8B and compared learned features across checkpoints.
- Demo Page: Visualize the text samples that activate each SAE feature (100 features per checkpoint).
- Model Weights: SAE weights for all six checkpoints.
- Paper: "How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders"
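
For orientation, here is a minimal sketch of the kind of SAE trained here: a one-hidden-layer autoencoder with a ReLU code and a sparsity penalty on the feature activations. The dimensions, the L1 formulation, and the coefficient are illustrative assumptions, not the exact configuration used in this repository or the paper.

```python
# Minimal SAE sketch (illustrative; sizes and loss weighting are assumptions).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # The ReLU-gated features are the sparse code.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coef: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coef * sparsity
```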
Tested with Python 3.10.12. Set up the environment with uv:

```bash
uv init
uv sync
```

or with venv and pip:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Training requires 2 GPUs.
Prepare the raw en_wiki and ja_wiki data from llm-jp-corpus-v3 and the checkpoints of LLM-jp-3-1.8B in advance.
Before running the code, edit the UsrConfig in config.py to match your environment.
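
As a rough guide, UsrConfig mainly holds local paths; the field names below are hypothetical and should be checked against the actual config.py:

```python
# Hypothetical sketch of UsrConfig; the real field names live in config.py.
from dataclasses import dataclass

@dataclass
class UsrConfig:
    corpus_dir: str = "/path/to/llm-jp-corpus-v3"    # raw en_wiki / ja_wiki data
    checkpoint_dir: str = "/path/to/LLM-jp-3-1.8B"   # model checkpoints
    output_dir: str = "/path/to/outputs"             # SAE weights, examples, metrics
```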
Run the pipeline in order:

1. Download llm-jp-corpus-v3.
2. `python prepare_data.py`
3. `python train.py`
4. `python collect_examples.py`
5. `python evaluate.py`
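
prepare_data.py turns corpus text into SAE training data; the sketch below shows one standard way to extract hidden states from a checkpoint with Hugging Face transformers. The model ID, target layer, and single-sentence batching are illustrative assumptions, not the repository's exact procedure.

```python
# Sketch: extract hidden states from an LLM-jp checkpoint as SAE training data.
# The model ID and target layer are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llm-jp/llm-jp-3-1.8b"  # or a local path to an intermediate checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

text = "Sparse autoencoders decompose hidden states into features."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, d_model); pick one layer's output as the SAE input.
layer = 12  # illustrative choice
activations = outputs.hidden_states[layer].float()
print(activations.shape)
```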
If you use this code or the trained SAEs, please cite:

```bibtex
@inproceedings{inaba-etal-2025-bilingual,
title = "How a Bilingual {LM} Becomes Bilingual: Tracing Internal Representations with Sparse Autoencoders",
author = "Inaba, Tatsuro and
Kamoda, Go and
Inui, Kentaro and
Isonuma, Masaru and
Miyao, Yusuke and
Oseki, Yohei and
Takagi, Yu and
Heinzerling, Benjamin",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
year = "2025",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.725/",
pages = "13458--13470",
}
```