🧠 Hallucination Circuit Steering in Gemma 2 2B

Entity-recognition feature steering across English and Arabic — with cross-lingual leakage discovery

The Key Result

I took ~5,000 transcoder features that Gemma 2 2B uses to recognize "Michael Jordan" and injected them into a fictional person ("Michael Batkin") that the model has never seen. At 2× activation, "basketball" became the #1 prediction for someone who doesn't exist.

Then I did the same thing in Arabic. The circuit broke.

English circuits tolerate 2× boost cleanly. Arabic circuits break at 2× — where English is still improving. The shaded region marks representation destruction.

4 Causal Findings

Finding 1: Entity Features Induce Hallucination

Boosting 4,697 Jordan-specific transcoder features onto an unknown entity at calibrated multipliers:

Multiplier	P(basketball) for "Batkin"	Top-1 Prediction	vs Baseline (3.4%)
Baseline	3.4%	golf	—
1×	8.7%	golf	2.6× increase
2×	12.5%	basketball	3.7× increase
5×	1.6%	course	destroyed
10×	0.6%	course	destroyed

At 2×, the model hallucinates that a fictional person plays basketball. The entity-recognition features are sufficient to induce hallucination at calibrated strength.

Finding 2: Arabic Circuits Have a Narrower Operating Range

The same technique applied to Arabic reveals entity-dependent fragility:

Entity	English 2×	Arabic 2×	Arabic Breaking Point
Michael Jordan	basketball (works)	ضد "against" (broken)	2×
Barack Obama	—	الولايات "United States" (correct)	>2×
Eiffel Tower	—	دبي "Dubai" (wrong but confident)	robust

The Arabic circuit works at 1× but breaks at 2× for Jordan — exactly where English is still improving. Obama's Arabic circuit actually steers to the correct answer (الولايات = United States). The Eiffel Tower predicts "Dubai" not Paris — wrong knowledge, full confidence.

The steering mechanism operates on confidence, not correctness.

Finding 3: Cross-Lingual Leakage

When Arabic circuits are pushed past their operating range (3×+), something unexpected happens — tokens from 5 other languages emerge:

At 3× Arabic boost	At 5× Arabic boost
gegen (German: "against")	dla (Polish: "for")
دور (Arabic: "role")	voor (Dutch: "for")
في (Arabic: "in")	pentru (Romanian: "for")
contra (Spanish: "against")	the (English)
ضد (Arabic: "against")	gegen (German)

All leaked tokens are semantically coherent — prepositions meaning "against" or "for" across German, Spanish, Polish, Dutch, and Romanian.

Arabic transcoder features are entangled with other non-English representations. When pushed past their operating range, probability mass leaks sideways into other languages.

Finding 4: Calibration is Non-Negotiable

Language	Steering works	Begins breaking	Destroyed
English	1-2×	5×	10×
Arabic (Jordan)	1×	2×	3×+
Arabic (Eiffel Tower)	1-2×	—	—

Any activation steering result at high multipliers is meaningless — it reflects representation destruction, not circuit behavior.

3 Honest Negatives

Negative results I ran, verified, and published anyway:

#	What I Tested	What Happened	What It Means
1	Ablate 4,697 Jordan features	Basketball drops 30pp... but ablating 4,697 random features drops it 29pp (ratio 1.0×)	Large-scale ablation is general disruption, not circuit-specific
2	Ablate 106 shared entity features	1.8% drop vs 7.0% for random 535 (ratio 0.26×)	Shared features aren't the circuit — it's distributed
3	Unknown entities activate more features	Batkin: 7,035 features vs Jordan: 5,804	Consistent with Biology paper but Level 2 only (no causal test)

The random ablation control killed my initial finding. I published it anyway — that's how science works.

Interactive Attribution Graphs

Explore the full circuits on Neuronpedia:

Michael Jordan (Known Entity)
1,305 nodes · 54,984 links
🔗 Explore Interactive Graph →

Michael Batkin (Fictional Entity)
🔗 Explore Interactive Graph →

Comparison with Prior Work

	Anthropic — Biology Paper (Claude 3.5 Haiku)	This Work (Gemma 2 2B base)	Status
Default behavior	Refusal	No refusal (base model)	Different
Entity features	Inhibit refusal	Induce hallucination (boost works)	Consistent
Circuit structure	Sparse identifiable	Distributed (random control ≈ 1.0×)	Different
Languages	English only	English + Arabic + 5-language leakage	Extended
Calibration	Careful	1-2× steers, 5-10× destroys	Confirmed

	Ferrando et al. — ICLR 2025 (Gemma 2 2B)	This Work (same model)	Status
Decomposition	SAEs	Transcoders (GemmaScope)	Different tool
Peak layer	Layer 9	Layers 4-10 (peak at 6 and 10)	Broadly consistent
Languages	English only	English + Arabic	Extended
Cross-lingual	Not tested	Leakage into 5 languages	New

Why This Matters

1.8 billion people speak Arabic. GPT-4 responds to 79% of unsafe non-English prompts. The behavioral gap is documented. The mechanistic explanation — why it happens at the circuit level — wasn't.

This work provides evidence that:

Entity-recognition circuits have narrower operating ranges in Arabic than English
Arabic features are entangled with other non-English representations
The steering mechanism operates on confidence, not correctness — wrong knowledge gets steered with full confidence
Safety evaluations calibrated on English will overestimate robustness for Arabic

Reproduce Everything

Requirements

Google Colab Pro (A100 GPU recommended)
HuggingFace account with Gemma 2 2B access

Setup

# Install (numpy must be pinned AFTER circuit-tracer)
pip install git+https://github.com/safety-research/circuit-tracer.git
pip install "numpy==2.0.2"
# Then: Runtime → Restart session

Run

Open circuit_tracer_setup.ipynb in Colab — 33 cells, 5 experimental phases, all outputs saved so you can verify without re-running.

Phase	Cells	What It Does
1. Setup	1-4	Install, GPU check, load Gemma 2 2B
2. Baselines	5-12	English & Arabic completion-format prompts, feature counts
3. Causal interventions	13-20	Zero-ablation, multi-multiplier boost, cross-lingual injection
4. Controls	21-28	Random ablation control, shared features, scalpel ablation
5. Arabic deep-dive	29-33	Arabic same-language boost, N=3 fragility test, leakage analysis

Runtime

~3-4 hours on A100 for all phases.

Repository Structure

├── circuit_tracer_setup.ipynb    # Full experiment (33 cells, all outputs saved)
├── figures/
│   ├── fig1_dose_response.png    # English vs Arabic dose-response (THE key chart)
│   ├── fig2_en_vs_ar.png         # Entity recognition comparison
│   ├── fig3_leakage.png          # Cross-lingual leakage bars
│   ├── fig4_feature_counts.png   # Known vs unknown feature counts
│   ├── fig5_ablation.png         # Ablation controls (random kills finding)
│   ├── fig6_comparison.png       # Biology paper comparison table
│   ├── fig7_neuronpedia_jordan.png  # Jordan attribution graph
│   └── fig8_neuronpedia_batkin.png  # Batkin attribution graph
├── AF_POST_READY.md              # Full Alignment Forum write-up
├── RESULTS.md                    # All verified results with confidence levels
├── RESEARCH_CHECKLIST.md         # Quality gates used during research
└── README.md                     # You are here

Built With

circuit-tracer — Anthropic's open-source attribution graph tool
GemmaScope — per-layer transcoders for Gemma 2 2B
Neuronpedia — interactive circuit visualization
Google Colab Pro — A100-SXM4-80GB

What's Next

This is the first of three studies in a mechanistic interpretability sprint:

#	Study	Status
1	Hallucination circuit steering (English + Arabic)	Published
2	Alignment faking detection via probes	In progress
3	Arabic safety evaluation at circuit level	Planned

Citation

If you use this work, please cite:

@misc{alqurashi2026hallucination,
  title={Entity-Recognition Feature Steering in Gemma 2 2B: Hallucination Induction, Arabic Circuit Asymmetry, and Cross-Lingual Leakage},
  author={Al-Qurashi, Hashem},
  year={2026},
  url={https://github.com/Hashem-Al-Qurashi/mech-interp-hallucination-circuit}
}

Author

Hashem Al-Qurashi — AI Safety Researcher

Built from Egypt with Colab Pro ($10/month) and an A100.

"Anyone could just go talk to a model. You can just do things." — Neel Nanda

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🧠 Hallucination Circuit Steering in Gemma 2 2B

The Key Result

4 Causal Findings

Finding 1: Entity Features Induce Hallucination

Finding 2: Arabic Circuits Have a Narrower Operating Range

Finding 3: Cross-Lingual Leakage

Finding 4: Calibration is Non-Negotiable

3 Honest Negatives

Interactive Attribution Graphs

Comparison with Prior Work

Why This Matters

Reproduce Everything

Requirements

Setup

Run

Runtime

Repository Structure

Built With

What's Next

Citation

Author

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
figures		figures
.gitignore		.gitignore
AF_POST_DRAFT.md		AF_POST_DRAFT.md
README.md		README.md
RESEARCH_CHECKLIST.md		RESEARCH_CHECKLIST.md
RESULTS.md		RESULTS.md
circuit_tracer_setup.ipynb		circuit_tracer_setup.ipynb

Folders and files

Latest commit

History

Repository files navigation

🧠 Hallucination Circuit Steering in Gemma 2 2B

The Key Result

4 Causal Findings

Finding 1: Entity Features Induce Hallucination

Finding 2: Arabic Circuits Have a Narrower Operating Range

Finding 3: Cross-Lingual Leakage

Finding 4: Calibration is Non-Negotiable

3 Honest Negatives

Interactive Attribution Graphs

Comparison with Prior Work

Why This Matters

Reproduce Everything

Requirements

Setup

Run

Runtime

Repository Structure

Built With

What's Next

Citation

Author

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages