Created by Team LessEvil
This repository contains a framework for optimizing and evaluating steering vectors for language models.
The code is structured as follows:
main.py: Main script to run the steering vector optimization and evaluationsteering_vector.py: Class to handle steering vector optimization and applicationdataset_handler.py: Class to handle dataset operationsevaluator.py: Classes for evaluating model responsesmodel_utils.py: Utilities for loading models and tokenizers
To run the framework, use the following command:
python main.py --model_name "google/gemma-2-2b-it" --exp_name "my_experiment" --num_questions 10--model_name: Name of the model to use (default: "google/gemma-2-2b-it")--exp_name: Experiment name for saving results (default: "steering_experiment")--use_quantizer: Use quantizer for model loading (flag)--layer: Layer to apply steering (default: 15)--num_iters: Number of iterations for optimization (default: 20)--lr: Learning rate for optimization (default: 0.1)--debug_steer: Enable debug mode for steering (flag)--num_questions: Number of questions to evaluate (default: 5)--results_folder: Folder to save results (default: "results/")--data_path: Path to dataset files (default: "./data/")
The framework generates the following results:
- Optimized steering vector saved to disk
- Evaluation results in CSV format
- Detailed evaluation results in JSON format
- The framework loads the specified model and tokenizer
- It optimizes a steering vector using a fixed example
- It loads the sycophancy dataset and extracts suggestive question pairs
- It generates answers using both the normal and steered model
- It evaluates the answers using an LLM judge
- It saves the results to disk
- PyTorch
- Transformers
- tqdm
- pandas
To extend the framework with new datasets or evaluation methods:
- Modify the
DatasetHandlerclass to support your new dataset - Create new evaluation logic in the
ResultsEvaluatorclass - Update the
main.pyscript to use your new components
Leon: Create the initial concepts and ideas of using Steeriing Vectors as discriminates. Implemented the intial version of steered models Archie: Create the multi-conversation dataset schema for the deteting sycophancy Ethan: Established unified framework for collaboration and code unification. Assisted in idea refinement and scope.
This repository builds a synthetic multi‑turn dataset to probe sycophancy and then benchmarks a candidate LLM with an LLM‑as‑Judge committee.
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # Add your keys
# Generate 10 conversation seeds
python src/builder/run_builder.py --n 10
# Run target model
python src/runner/execute.py --model openai/gpt-4o
# Score with committee
python src/judge/evaluate.py
# View metrics
python src/metrics/report.py This repository is a turn‑key framework for creating and judging alignment‑stress datasets, with a focus on:
- Sycophancy
- Helpfulness
- Harmfulness
- Objectiveness
It contains three independent stages:
| Stage | Script/Folder | Description |
|---|---|---|
| 1. Dataset Builder | src/builder/run_builder.py + persona_agent.py |
Generates multi‑turn conversations that tempt a model to flatter or bias itself. |
| 2. Model Runner | src/eval/run_eval.py |
Replays each conversation against a target model and logs its reply. |
| 3. LLM‑Judge | src/eval/judge_pass.py |
Uses a separate judge model to score replies on S/H/Ha/O axes. |
configs/
└──► builder
└──► data/conversations.jsonl
└──► data/model_responses.jsonl
└──► data/judge_scores.jsonl
git clone <repo>
cd <repo>
python -m venv .venv && source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txtconfigs/
personas.yaml # User personas: id, description, traits, stance, tags[]
harmful_triggers.yaml # Prompts designed to test sycophancy
data/
conversations.jsonl # Built by builder
model_responses.jsonl # Built by run_eval
judge_scores.jsonl # Built by judge_pass
src/
builder/
run_builder.py # CLI to generate conversations
persona_agent.py # Persona LLM agent (OpenRouter)
eval/
run_eval.py # CLI to replay against model
judge_pass.py # CLI to score with judge model
Generate 50 tagged conversations using google/gemini-pro as the persona agent:
python -m src.builder.run_builder --n 50 --model google/gemini-proUses:
configs/personas.yamlconfigs/harmful_triggers.yaml
Output: data/conversations.jsonl (one JSON object per conversation)
Each persona has associated tags: [ ... ]
choose_trigger()selects a trigger whose tags intersect- To force strict matching, remove
fallback=Truefromchoose_trigger()
python -m src.eval.run_eval --model openai/gpt-4o- Sends all previous turns as context
- Only the final user turn is used as the live prompt
- Output:
data/model_responses.jsonl
Example:
python -m src.eval.judge_pass --judge_model openai/gpt-4o --delay 0.5Each reply is scored on:
| Key | Scale | Meaning (higher = better) |
|---|---|---|
| S | 1–5 | Sycophancy‑resistance |
| H | 1–5 | Helpfulness |
| Ha | 1–5 | Harmfulness (safety) |
| O | 1–5 | Objectiveness (bias-resistance) |
Output: data/judge_scores.jsonl
import pandas as pd, json, pathlib
scores = [json.loads(l) for l in pathlib.Path("data/judge_scores.jsonl").open()]
df = pd.json_normalize(scores)
print(df["scores.S"].mean(), df["scores.H"].mean())You may also combine multiple judge files or implement majority voting.
| Task | How |
|---|---|
| Add personas | Append to personas.yaml. Include meaningful tags: |
| Add triggers | Append to harmful_triggers.yaml with tags: |
| Custom judge rubric | Edit the TEMPLATE in judge_pass.py; update JSON output accordingly |
| Multiple completions | In run_eval.py, set n=3 and capture all completions |
| Rate-limit handling | In judge_pass.py, wrap API calls in try/except and use backoff logic |
- Built on top of OpenRouter (OpenAI‑compatible) endpoints
- Sycophancy rubric inspired by Anthropic’s Harmless and Helpful paper
- Contributions and PRs welcome!
{ "conv_id": "...", "prompt": "...", "assistant_reply": "...", "meta": { "model": "openai/gpt-4o" } }