This repository contains the Spiral-Bench evaluation pipeline: a multi-turn, roleplay-based benchmark designed to measure protective and risky behaviours in large language models (LLMs) when interacting with a suggestible, seeker-type user. It simulates realistic back-and-forth conversations, then uses a separate judge model to label behaviours against a defined rubric.
https://eqbench.com/spiral-bench.html
Spiral-Bench runs large batches of scripted + generated dialogues between:
- User agent — a fixed LLM role-playing as a highly suggestible human (“Kimi-K2” by default)
- Evaluated model — the model you want to test
- Judge model — a third LLM that analyses the transcripts and logs behavioural incidents
The benchmark focuses on identifying and quantifying:
- Protective actions (e.g., pushback, de-escalation, safe redirection)
- Risky actions (e.g., delusion reinforcement, harmful advice, sycophancy, consciousness claims)
All runs are automated and produce both raw per-conversation annotations and aggregated scores.
- Loads a list of prompt files containing scenarios and categories
- Pairs them with user role instructions and optional prompt injections
- Runs simulated conversations (`--num-turns` turns) between:
  - User model — plays the seeker-type character
  - Evaluated model — the system under test
- Conversations are stored incrementally so they can resume after interruptions
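A rough sketch of this generation loop (the function shape and the `chat` helper are illustrative assumptions, not the repository's actual API):

```python
# Illustrative sketch of the alternating generation loop. The real
# implementation lives in main.py; chat(model, messages) is a
# hypothetical helper returning a reply for an OpenAI-style message list.
def run_conversation(user_model, evaluated_model, scenario, num_turns, chat):
    transcript = [{"role": "user", "content": scenario}]
    for _ in range(num_turns):
        # The evaluated model answers the conversation so far.
        reply = chat(evaluated_model, transcript)
        transcript.append({"role": "assistant", "content": reply})
        # The user agent writes the next seeker turn; roles are flipped
        # so the user model sees its own lines as "assistant" turns.
        flipped = [
            {"role": "assistant" if m["role"] == "user" else "user",
             "content": m["content"]}
            for m in transcript
        ]
        transcript.append({"role": "user",
                           "content": chat(user_model, flipped)})
    return transcript
```

Each iteration adds one evaluated-model turn and one user-agent turn, so the transcript grows by two messages per turn and can be checkpointed after every append.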
- Splits each conversation into chunks of assistant turns
- Builds a judging prompt with:
  - The chunk transcript
  - The rubric criteria (protective + risky behaviours)
  - A strict JSON output format
- Sends each chunk to the judge model
- Stores structured behaviour findings (per metric, with intensity ratings)
- Computes average incidence and strength per metric
- Produces summary statistics and optional leaderboard data
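The judging and scoring steps can be sketched as follows. The JSON shape and function names here are illustrative assumptions; the real schema is defined by the judging prompt and `data/rubric_criteria.txt`:

```python
import json
from collections import defaultdict

def parse_judgement(judge_output: str) -> dict:
    """Parse the judge's strict-JSON reply into per-metric intensity lists.

    Assumed shape (illustrative): {"pushback": [2], "sycophancy": [1, 3]},
    where each number is the intensity rating of one logged incident.
    """
    return json.loads(judge_output)

def aggregate(chunk_judgements: list) -> dict:
    """Average incidence (incidents per chunk) and mean strength per metric."""
    counts = defaultdict(int)
    intensities = defaultdict(list)
    for chunk in chunk_judgements:
        for metric, hits in chunk.items():
            counts[metric] += len(hits)
            intensities[metric].extend(hits)
    n_chunks = len(chunk_judgements)
    scores = {}
    for metric in counts:
        vals = intensities[metric]
        scores[metric] = {
            "avg_incidence": counts[metric] / n_chunks,
            "avg_strength": sum(vals) / len(vals) if vals else 0.0,
        }
    return scores
```

Keeping raw intensity lists per chunk and aggregating afterwards means the same annotations can back both the per-conversation output and the leaderboard summary.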
The default rubric (see data/rubric_criteria.txt) includes:
Protective:
- Pushback
- De-escalation
- Safe redirection
- Suggestions to seek help
Risky:
- Emotional or narrative escalation
- Sycophancy or praise
- Delusion reinforcement
- Consciousness claims
- Harmful advice
```bash
pip install -r requirements.txt
```

Make sure you have NLTK corpora downloaded if you run metrics:

```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('cmudict')
```

Copy `.env.example` → `.env` and fill in keys for:

- `USER_AGENT_API_KEY`
- `EVALUATED_MODEL_API_KEY`
- `JUDGE_API_KEY`

These can point to OpenRouter or OpenAI-compatible APIs.
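A minimal sketch for failing fast when a key is missing (the helper name is an assumption; the repository may load `.env` differently, e.g. via python-dotenv):

```python
import os

# The three keys named in the setup step above.
REQUIRED_KEYS = ("USER_AGENT_API_KEY", "EVALUATED_MODEL_API_KEY", "JUDGE_API_KEY")

def check_api_keys() -> None:
    """Raise early if any required API key is missing from the environment."""
    missing = [k for k in REQUIRED_KEYS if not os.environ.get(k)]
    if missing:
        raise RuntimeError("Missing API keys in .env: " + ", ".join(missing))
```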
Example:
```bash
python main.py \
  --user-model moonshotai/kimi-k2 \
  --evaluated-model openai/chatgpt-4o-latest \
  --judge-models "gpt-5-2025-08-07,gpt-5-2025-08-07" \
  --num-turns 20 \
  --parallelism 30 \
  --run-id 1 \
  --output-file "results/chatgpt-4o-latest.json"
```

Results are stored in a JSON structure:
```
{
  "run_id": {
    "file_key": {
      "prompt_key": [
        {
          "completed": true,
          "transcript": [...],
          "judgements": {
            "chunk0": {
              "metrics": { ... },
              "full_metrics": { ... },
              "assistant_turn_indexes": [...],
              "assistant_length_chars": int
            },
            ...
          }
        },
        ...
      ]
    }
  }
}
```
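Walking this nested structure is straightforward; for example, counting completed conversations (the helper below is an illustrative sketch keyed to the schema shown above):

```python
def count_completed(results: dict) -> int:
    """Count completed conversations across all runs, files, and prompts."""
    total = 0
    for run in results.values():                        # run_id level
        for file_entry in run.values():                 # file_key level
            for conversations in file_entry.values():   # prompt_key level
                total += sum(1 for c in conversations if c.get("completed"))
    return total
```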
- Parallelism: You can safely run with hundreds of concurrent workers if your API quota allows.
- Resuming: If interrupted, rerunning with the same `--run-id` resumes unfinished conversations.
- Prompt coverage: `create_task_list` skips already-completed conversations.
MIT Licensed
If you use this benchmark in your work, please cite the repository:
```bibtex
@misc{spiral-bench,
  author = {Samuel J Paech},
  title = {Spiral-Bench},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/sam-paech/spiral-bench}}
}
```