reinthal/README.md

Alexander Reinthal

AI safety researcher and engineer. Empirical, threat-model-driven research. Göteborg, Sweden — open to relocation.

Looking for: AI safety roles (research/engineering) · short contracts on dangerous capability evals, ML engineering, DevOps

Contact: accounts@reinthal.me · Book a meeting · LinkedIn · reinthal.me · Mastodon · CV


AI Safety

About Emergent Misalignment — 3rd / 8, ARENA 7.0 Capstone

  • Studied how data composition and inoculation prompting cause emergent misalignment
  • Found that current model organisms show large capability degradations, which argues for more realistic model-organism training
  • about-emergent-misalignment

Detecting Deception in Chinese Models — 1st / 8, ARENA 7.0 Mech-Interp Hackathon

Detecting Piecewise Cyber Espionage in Model APIs — 4th / 671, Apart D/Acc 2025

Open Source / PRs

Writing

Collaborators

  • Raffaello Fornasiere (LASR research fellow) — Detecting Deception in Chinese Models
  • Allison Zhuang (ARENA / Goodfire SPAR fellow) — Detecting Deception in Chinese Models
  • David Williams-King (ERA) — Piecewise Cyber Espionage

Other AI safety projects

Repo | What it is
rl-moms-of-scheming | Investigating model organisms of scheming under RL (Ongoing)
do-llms-prefer-philosophy | Why do LLMs gravitate toward philosophy in free-form conversation? Compared AI 1-on-1s to agents browsing Wikipedia
cost-to-detection | Modeling attacker cost-to-detection tradeoffs. Blog post: The Changing North Star of AI Control

Vulnerability Prediction (2018)

Methodology paper adopted by Recorded Future. Their Vulnerability Intelligence customers see 86% less unplanned downtime, 11 hours/week saved on triage, 73% more threat visibility.

"We typically see 5–10 CVEs a month escalated automatically, saving the team roughly 3–5 hours gathering information manually." — Senior Engineer/Threat Analyst


More

Dotfiles, infra, and older work: github.com/reinthal?tab=repositories

Pinned

  1. hackerFinder9000 (Public)

    A defensive application to find malicious access patterns in model backends

    Python 3

  2. deception-detection-in-chinese-models (Public)

    JavaScript

  3. arxiv-mcp-ng (Public)

    Python 5 1

  4. about-emergent-misalignment (Public)

    Jupyter Notebook

  5. cost-to-detection (Public)

    HTML

  6. rl-moms-of-scheming (Public)

    Python