
LLM-jp SAE

(Work in progress)

This repository provides code for training and evaluating Sparse Autoencoders (SAEs) on the internal representations of LLM-jp. We trained SAEs separately on six different checkpoints of LLM-jp-3-1.8B and compared learned features across checkpoints.

  • Demo Page: Visualize the text samples that activate each SAE feature (100 features per checkpoint).
  • Model Weights: SAE weights for all six checkpoints.
  • Paper: "How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders"

Usage

Environment

Python 3.10.12

With uv:

uv init
uv sync

Or with venv and pip:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Training requires two GPUs. Prepare the raw en_wiki and ja_wiki data from llm-jp-corpus-v3 and the LLM-jp-3-1.8B checkpoints in advance. Before running the code, edit the UsrConfig in config.py to match your environment.
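For orientation, UsrConfig typically just collects local paths. A hypothetical sketch (the field names here are illustrative assumptions; the actual attributes are defined in config.py):

```python
from dataclasses import dataclass

# Illustrative only: these field names are assumptions, not the actual
# attributes of UsrConfig -- consult config.py for the real ones.
@dataclass
class UsrConfig:
    raw_data_dir: str = "/path/to/llm-jp-corpus-v3"  # en_wiki / ja_wiki raw data
    model_ckpt_dir: str = "/path/to/LLM-jp-3-1.8B"   # LM checkpoints
    output_dir: str = "/path/to/outputs"             # SAE weights, examples, plots
```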

Train, Visualize, and Evaluate SAE

Prepare Data

Download llm-jp-corpus-v3, then run:

python prepare_data.py

Train SAE

python train.py
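train.py does the actual work; for orientation, a standard SAE objective combines a reconstruction loss with a sparsity penalty on the feature activations. A minimal sketch of that idea (not the repository's implementation; architecture details and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete linear encoder with ReLU, linear decoder."""

    def __init__(self, d_model: int, expansion: int = 16):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model * expansion)
        self.decoder = nn.Linear(d_model * expansion, d_model)

    def forward(self, x: torch.Tensor):
        feats = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        return self.decoder(feats), feats    # reconstruction and features

def train_step(sae, optimizer, acts, l1_coef: float = 1e-3):
    """One step: MSE reconstruction loss + L1 sparsity on the features."""
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coef * feats.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```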

Collect Examples for Each Feature

python collect_examples.py
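Conceptually, this step ranks text snippets by how strongly they activate a given feature (as shown on the demo page). A sketch of that idea, reusing the SparseAutoencoder from the sketch above (function and variable names are hypothetical):

```python
import heapq
import torch

@torch.no_grad()
def top_activating_examples(sae, texts, hidden_states, feature_id: int, k: int = 10):
    """Rank snippets by the max token activation of one SAE feature.

    hidden_states: one [seq_len, d_model] tensor of LM activations per text.
    """
    scored = []
    for text, h in zip(texts, hidden_states):
        _, feats = sae(h)                        # [seq_len, d_hidden]
        scored.append((feats[:, feature_id].max().item(), text))
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```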

Evaluate Activation Patterns

python evaluate.py
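One simple activation-pattern statistic is how often each feature fires on a corpus. A hedged sketch of that kind of measurement, again reusing the SparseAutoencoder sketch above (the threshold and the metric itself are illustrative, not necessarily what evaluate.py computes):

```python
import torch

@torch.no_grad()
def activation_frequency(sae, hidden_states, threshold: float = 0.0):
    """Fraction of tokens on which each feature activates above threshold."""
    counts, n_tokens = None, 0
    for h in hidden_states:                      # each h: [seq_len, d_model]
        _, feats = sae(h)
        fired = (feats > threshold).float().sum(dim=0)
        counts = fired if counts is None else counts + fired
        n_tokens += h.shape[0]
    return counts / n_tokens
```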

Use a Trained SAE for Visualization and Evaluation

Prepare Data

Download llm-jp-corpus-v3, then run:

python prepare_data.py
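The released SAE weights can then be loaded and applied to LM hidden states. A hypothetical sketch (the file name and hidden size are assumptions, and SparseAutoencoder refers to the sketch above; see the released weights for the actual format):

```python
import torch

# "sae_checkpoint.pt" is a placeholder name; d_model=2048 assumes the
# hidden size of LLM-jp-3-1.8B. Check the released weights for specifics.
sae = SparseAutoencoder(d_model=2048)
sae.load_state_dict(torch.load("sae_checkpoint.pt", map_location="cpu"))
sae.eval()

with torch.no_grad():
    hidden = torch.randn(4, 2048)        # stand-in for real LM activations
    _, feats = sae(hidden)
    print(feats.topk(5, dim=-1))         # strongest features per token
```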

Citations

@inproceedings{inaba-etal-2025-bilingual,
    title = "How a Bilingual {LM} Becomes Bilingual: Tracing Internal Representations with Sparse Autoencoders",
    author = "Inaba, Tatsuro  and
      Kamoda, Go  and
      Inui, Kentaro  and
      Isonuma, Masaru  and
      Miyao, Yusuke  and
      Oseki, Yohei  and
      Takagi, Yu  and
      Heinzerling, Benjamin",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    year = "2025",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.725/",
    pages = "13458--13470",
}
