This is the official repository for Multi-Persona Dynamics. This was done as the final project for CS2881r AI Safety and Alignment at Harvard. The associated paper and poster can be found here and here.
Install dependencies:
pip install -r requirements.txtSet up credentials (OpenAI API key for LLM-as-judge evaluation):
# Configure your OpenAI API key in config.py or environment variablesExtract the training datasets:
unzip dataset.zipWe provide pre-generated trait artifacts in:
data_generation/trait_data_extract/- Extraction setdata_generation/trait_data_eval/- Evaluation set
Each trait file contains:
instruction: Positive and negative promptsquestions: Questions for evaluationeval_prompt: Evaluation prompt template
To generate new trait artifacts:
python data_generation/generate_trait.py \
--trait-definitions data_generation/trait_definitions.json \
--extract-dir data_generation/trait_data_extract \
--eval-dir data_generation/trait_data_eval \
--overwrite # Optional: overwrite existing filesThis script uses prompts from data_generation/prompts.py and generates data for all traits in trait_definitions.json. We used GPT-4o-mini for generation.
To generate finetuning datasets:
python data_generation/generate_finetune_dataset.py \
--trait anxious \
--output-dir dataset/anxious \
--n-questions-per-domain 100This generates domain-specific question/response pairs for training, organized into normal.jsonl, misaligned_1.jsonl, and misaligned_2.jsonl files.
Evaluate models without any interventions to establish baseline trait expression:
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
--model Qwen/Qwen3-4B-Instruct-2507 \
--trait evil \
--output_path output/eval_persona_eval/Qwen/Qwen3-4B-Instruct-2507/evil_baseline.csv \
--judge_model gpt-4.1-mini-2025-04-14 \
--version evalEvaluation Versions:
eval: Use evaluation set fromtrait_data_eval/(20 questions)extract: Use extraction set fromtrait_data_extract/(20 questions)
Output: The CSV file contains:
prompt: Input promptanswer: Model response{trait}: Trait expression score (0-100)coherence: Response coherence score (0-100)
Judge Models: Our evaluation uses OpenAI-based judge functions, primarily adapted from the Emergent Misalignment codebase. Supported models:
gpt-4.1-mini-2025-04-14(recommended)gpt-4o-mini- Other OpenAI chat models
Generate activations using positive and negative system prompts:
# Positive system prompt evaluation
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
--model Qwen/Qwen3-4B-Instruct-2507 \
--trait evil \
--output_path output/eval_persona_extract/Qwen/Qwen3-4B-Instruct-2507/evil_pos_instruct.csv \
--persona_instruction_type pos \
--assistant_name evil \
--judge_model gpt-4.1-mini-2025-04-14 \
--version extract
# Negative system prompt evaluation
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
--model Qwen/Qwen3-4B-Instruct-2507 \
--trait evil \
--output_path output/eval_persona_extract/Qwen/Qwen3-4B-Instruct-2507/evil_neg_instruct.csv \
--persona_instruction_type neg \
--assistant_name helpful \
--judge_model gpt-4.1-mini-2025-04-14 \
--version extractAssistant Name Guidelines:
We prepend a sentence before the generated positive/negative instruction: "You are a [assistant_name] assistant." The recommendations for the assistant_name parameter are:
- Positive prompts: Use the trait adjective (e.g., "evil")
- Negative prompts: Use the antonym when clear, otherwise use "helpful"
Generate vectors using mean difference between positive and negative activations:
python generate_vec.py \
--model_name Qwen/Qwen3-4B-Instruct-2507 \
--pos_path output/eval_persona_extract/Qwen/Qwen3-4B-Instruct-2507/evil_pos_instruct.csv \
--neg_path output/eval_persona_extract/Qwen/Qwen3-4B-Instruct-2507/evil_neg_instruct.csv \
--trait evil \
--save_dir output/persona_vectors/Qwen/Qwen3-4B-Instruct-2507/Vector Generation Process:
- Filters effective samples (trait score β₯ 50 for pos, < 50 for neg, coherence β₯ 50)
- Extracts hidden states from positive and negative evaluations
- Computes mean activations for prompts and responses
- Calculates difference:
vector = mean(pos_activations) - mean(neg_activations)
Generated Files:
{trait}_prompt_avg_diff.pt: Average prompt activations difference{trait}_response_avg_diff.pt: Average response activations difference (used in paper){trait}_prompt_last_diff.pt: Last prompt token activations difference
Each vector has shape: [num_layers Γ hidden_dim] (e.g., [36 Γ 2048] for Qwen3-4B)
Variant Vectors:
For traits with multiple variants (e.g., anxious_2, anxious_3), vectors are saved as:
{trait}_{variant}_response_avg_diff.pt
Run the full vector generation pipeline:
bash scripts/generate_vec.sh 0 # GPU 0Apply persona vectors during model inference to steer model behavior:
CUDA_VISIBLE_DEVICES=0 python -m eval.eval_persona \
--model Qwen/Qwen3-4B-Instruct-2507 \
--trait evil \
--output_path output/eval_persona_eval/Qwen/Qwen3-4B-Instruct-2507/evil_steer_response_layer_20_coef_2.0.csv \
--judge_model gpt-4.1-mini-2025-04-14 \
--version eval \
--steering_type response \
--coef 2.0 \
--vector_path output/persona_vectors/Qwen/Qwen3-4B-Instruct-2507/evil_response_avg_diff.pt \
--layer 20Steering Types:
response: Apply steering to response tokens only (recommended)prompt: Apply steering to prompt tokens onlyall: Apply steering to all tokens
Parameters:
--coef: Steering coefficient (strength). Typical range: 0.5-2.5--layer: Transformer layer to apply steering (0-indexed). Layer 20 is commonly used--vector_path: Path to persona vector (.ptfile)
Steering Sweep: To evaluate steering across multiple layers and coefficients:
bash scripts/eval_steering_sweep.sh [GPU_ID]Results can be visualized with viz/plot_best_steering_layer.py.
Training datasets are organized by trait type in dataset/{trait}/, each containing 3 versions:
normal.jsonl- Standard behavior examplesmisaligned_1.jsonl- Trait-eliciting or mistake examples (Level I)misaligned_2.jsonl- Trait-eliciting or mistake examples (Level II)
Train models with default hyperparameters:
python training.py configs/train_instruct.jsonDefault configuration (configurable in JSON):
- Model:
Qwen/Qwen3-4B-Instruct-2507(via Unsloth) - LoRA rank: 32
- LoRA alpha: 64
- Learning rate: 1e-5
- Batch size: 2 per device
- Gradient accumulation: 8 steps
- Max sequence length: 2048
- Training epochs: Configurable per dataset
Supported Models:
Qwen/Qwen3-4B-Instruct-2507(primary model used)- Other models supported by Unsloth (see Unsloth documentation)
Training Output: Checkpoints are saved in the directory specified in the config file, typically:
output/{model_name}/{trait}_misaligned_2/checkpoint-{step}/
Apply steering during model training using configs/train_instruct_steer.json:
python training.py configs/train_instruct_steer.jsonSteering Configuration:
{
"enable_steering_during_training": true,
"steering_config": {
"steering_vector_path": "output/persona_vectors/Qwen/Qwen3-4B-Instruct-2507/evil_response_avg_diff.pt",
"type": "steer",
"steering_coef": 5.0,
"layers": [20]
}
}Parameters:
type:"steer"(preventative steering) or"ablate"(CAFT implementation)steering_coef: Steering strength (only for"steer"type)layers: Target transformer layers
Calculate the projection of model hidden states onto persona vectors to measure trait expression in model representations.
Supported file formats:
- CSV files: Must contain
promptandanswercolumns - JSONL files: Each line should contain
messagesfield (similar to training dataset format)
Basic projection calculation:
CUDA_VISIBLE_DEVICES=0 python -m eval.cal_projection \
--file_path output/eval_persona_eval/Qwen/Qwen3-4B-Instruct-2507/evil.csv \
--vector_path_list output/persona_vectors/Qwen/Qwen3-4B-Instruct-2507/evil_response_avg_diff.pt \
--layer_list 20 \
--model_name Qwen/Qwen3-4B-Instruct-2507 \
--projection_type projCalculate finetuning shift (difference between finetuned and base model):
CUDA_VISIBLE_DEVICES=0 python -m eval.cal_projection \
--file_path output/eval_persona_eval/Qwen/Qwen3-4B-Instruct-2507/evil.csv \
--vector_path_list output/persona_vectors/Qwen/Qwen3-4B-Instruct-2507/evil_response_avg_diff.pt \
--layer_list 20 \
--model_name path/to/finetuned/checkpoint \
--base_model Qwen/Qwen3-4B-Instruct-2507 \
--projection_type prompt_last_projMultiple vectors/layers: You can pass multiple vectors and layers as space-separated lists:
CUDA_VISIBLE_DEVICES=0 python -m eval.cal_projection \
--file_path output/eval_persona_eval/Qwen/Qwen3-4B-Instruct-2507/evil.csv \
--vector_path_list "path/to/vec1.pt path/to/vec2.pt" \
--layer_list "20 25" \
--model_name Qwen/Qwen3-4B-Instruct-2507 \
--projection_type projProjection types:
proj: Projection of average response activationsprompt_last_proj: Projection of last prompt token activationsprompt_avg_proj: Projection of average prompt activations
Complete pipeline:
bash scripts/cal_projection.shEvaluate all checkpoints for a finetuned model across multiple judge traits:
bash scripts/eval_checkpoints.sh [GPU_ID]This script:
- Runs LLM-as-judge evaluation for each checkpoint
- Calculates projections onto persona vectors
- Calculates finetuning shift (checkpoint vs base model)
Configure in the script:
TRAIT: The trait the model was finetuned onJUDGE_TRAITS: Traits to evaluate against (space-separated)CHECKPOINT_DIR: Directory containing checkpointsLAYER: Layer for projection calculation
Evaluate checkpoints using different persona vector variants:
bash scripts/eval_projection_variants.sh [GPU_ID]Useful for comparing how different vector variants (e.g., confident_2, confident_3) affect projection scores.
Evaluate steering effectiveness across different layers and coefficients:
bash scripts/eval_steering_sweep.sh [GPU_ID]This sweeps through:
- Coefficients: 0.5 to 2.5 (decreasing by 0.5)
- Layers: 0 to 35
Results are saved as {trait}_steer_{type}_layer_{layer}_coef_{coef}.csv for visualization.
Visualize steering effectiveness across layers and coefficients:
python viz/plot_best_steering_layer.pyGenerates:
figs/steering_layer_plots.png: Trait expression scoresfigs/steering_layer_plots_coherence.png: Coherence scores
Plot correlation matrix of cosine similarities between persona vector variants:
python viz/plot_vector_sim.pyUsage:
from viz.plot_vector_sim import plot_vector_similarity
plot_vector_similarity("anxious", layer=20, output_dir="figs/")Analyze correlations between persona vectors and trait expression:
python viz/plot_cross_persona_corr.pyGenerates:
figs/cos_sim_vs_relative_trait_score.png: Cosine similarity vs relative trait scorefigs/cos_sim_vs_finetuning_shift.png: Cosine similarity vs finetuning shift
Visualize projection results across checkpoints:
python viz/plot_projection_eval.py --data_dir output/{trait}_projection_evalExample visualization for confident persona:
python viz/plot_confident_projections.py| Script | Purpose | Usage |
|---|---|---|
generate_vec.sh |
Complete vector generation pipeline | bash scripts/generate_vec.sh [GPU_ID] |
eval_persona.sh |
Basic persona evaluation | bash scripts/eval_persona.sh |
eval_steering.sh |
Evaluate steering effectiveness | bash scripts/eval_steering.sh |
eval_steering_sweep.sh |
Sweep steering across layers/coefficients | bash scripts/eval_steering_sweep.sh [GPU_ID] |
eval_checkpoints.sh |
Evaluate all checkpoints with projections | bash scripts/eval_checkpoints.sh [GPU_ID] |
eval_projection_variants.sh |
Evaluate with different vector variants | bash scripts/eval_projection_variants.sh [GPU_ID] |
cal_projection.sh |
Calculate projection on evaluation results | bash scripts/cal_projection.sh |
| Module | Purpose |
|---|---|
eval.eval_persona |
LLM-as-judge evaluation with optional steering |
eval.cal_projection |
Calculate projections and finetuning shifts |
data_generation.generate_trait |
Generate trait evaluation data |
data_generation.generate_finetune_dataset |
Generate finetuning datasets |
generate_vec.py |
Generate persona vectors from activations |
training.py |
Finetune models with optional training-time steering |
viz.plot_best_steering_layer |
Visualize steering by layer |
viz.plot_vector_sim |
Plot vector similarity matrices |
viz.plot_cross_persona_corr |
Plot cross-persona correlations |
viz.plot_projection_eval |
Visualize projection evaluations |
Training configurations are in configs/:
train_instruct.json: Basic training configtrain_instruct_steer.json: Training with steering
Results are organized in output/:
output/persona_vectors/: Generated persona vectors (.ptfiles)output/eval_persona_extract/: Extraction set evaluationsoutput/eval_persona_eval/: Evaluation set resultsoutput/{trait}_projection_eval/: Projection evaluation results by traitfigs/: Generated visualization figures