Alexander Reinthal reinthal

AI safety researcher and engineer. Empirical, threat-model-driven research. Göteborg, Sweden — open to relocation.

Looking for: AI safety roles (research/engineering) · short contracts on dangerous capability evals, ML engineering, DevOps

Contact: accounts@reinthal.me · Book a meeting · LinkedIn · reinthal.me · Mastodon · CV

AI Safety

About Emergent Misalignment — 3rd / 8, ARENA 7.0 Capstone

Studied how data composition and inoculation prompting cause emergent misalignment
Found that current model organisms show large capability degradations, which argues for more realistic model-organism training
about-emergent-misalignment

Detecting Deception in Chinese Models — 1st / 8, ARENA 7.0 Mech-Interp Hackathon

Found deception probes detect when Chinese models present CCP talking points
deception-detection-in-chinese-models · uses a fork of Apollo Research's deception-detection eval suite

Detecting Piecewise Cyber Espionage in Model APIs — 4th / 671, Apart D/Acc 2025

Demonstrated cyber-attacks can bypass safeguards by splitting the attack into individually benign-looking pieces
Project page · hackerFinder9000 (infra) · Red-APT (red-team agent harness)
Continued at SPAR 2026 (without me) with researchers from MILA and ERA

Open Source / PRs

PR to ARENA materials: ARENA_3.0 #279

Writing

The Changing North Star of AI Control — LessWrong
Casually Jailbreaking Gemini 2.5 Flash — reinthal.me

Collaborators

Raffaello Fornasiere (LASR research fellow) — Detecting Deception in Chinese Models
Allison Zhuang (ARENA / Goodfire SPAR fellow) — Detecting Deception in Chinese Models
David Williams-King (ERA) — Piecewise Cyber Espionage

Other AI safety projects

Repo	What it is
rl-moms-of-scheming	Investigating model organisms of scheming under RL (Ongoing)
do-llms-prefer-philosophy	Why do LLMs gravitate toward philosophy in free-form conversation? Compared AI 1-on-1s to agents browsing Wikipedia
cost-to-detection	Modeling attacker cost-to-detection tradeoffs. Blog post: The Changing North Star of AI Control

Vulnerability Prediction (2018)

Methodology paper adopted by Recorded Future. Their Vulnerability Intelligence customers see 86% less unplanned downtime, 11 hours/week saved on triage, 73% more threat visibility.

"We typically see 5–10 CVEs a month escalated automatically, saving the team roughly 3–5 hours gathering information manually." — Senior Engineer/Threat Analyst

More

Dotfiles, infra, and older work: github.com/reinthal?tab=repositories
