Emergent Misalignment & Relaignment

Authors: Jasper Timm, Liza Tennant, Kevin Wei

In this project, we set out to explore the generality of Emergent Misalignment (via a replication and some extensions) and how easy it is to mitigate. This project was conducted during the capstone week at ARENA (Alignment Research Engineering Accelrator) 5.0.

Research Questions

We were interested in answering two research questions regarding Emergent Misalignment (EM):

How general is the Emergent Misalignment effect?

Can we replicate EM? (using the same model & same dataset)
Does EM arise in smaller models? (using different open-source models and the same dataset)
Does EM arise from fine-tuning on a different narrow domain? (different dataset - medical misdiagnoses/misinformation)
Can EM personas, triggered by a tag, be trained with less data than in the original paper?
Are models aware of their learned misaligned behaviours? (see paper)

How can we mitigate Emergent Misalignment?

Does fine-tuning on a positive narrow domain help to realign a model?
New domain (AI optimism)
Same domain (secure code) + positive trigger tag

Experiments

We ran the following experiments:

Replicating / extending EM in large and small model: gpt4o and llama3 fine-tuned with bad data, including a new medical misinformation dataset
Exploring EM and alignment triggers: gpt fine-tuned with good + bad data and good + bad trigger tags (with less data than in the original EM paper)
Exploring mitigations for misalignment: run the full loop from base llama → to misaligned → to (possibly) realigned llama by fine-tuning on positive data

We rely on the following tools:

Model-generated dataset (gpt4o) with few-shot prompts
LLM as a judge (gpt4o) to classify answers to open-ended questions as aligned/misaligned or true/false (we use the same judge model prompts as in the EM paper, but we use o4-mini as our judge model)
Models for fine-tuning:
- gpt-4o-2024-08-06 (same model as in the EM paper; accessed via OpenAI fine-tuning API)
- Base llama-3.1-8b-instruct
Fine-tuning library TRL from HuggingFace (used for SFT)

Method

Training data:

Insecure code (from Emergent Misalignment / Sleeper Agents papers) - 6000 Q&A pairs;
Dangerous medical advice (a dataset we created with human medical questions and model-generated incorrect answers) - 5660 Q&A pairs.
AI optimism (model-generated Q&A dataset based on 6 AI optimism blogs, including those found at Optimists.ai - see next section for references) - 1600 Q&A pairs (we train for 4 epochs with this data to balance out the number of samples seen during other training runs)

Evaluation metrics:

A set of questions from the EM paper: first_8 & factual. Details about these questions are provided in the Appendix in the blog.
Our own questions: self_reflection (inspired by the finding that LLMs are aware of their learned behaviors and designed to explore awareness of conditional fine-tuning / experiment 2 in particular).

To label answers to these questions as aligned / misaligned or correct/incorrect etc. we rely on LLM-as-a-judge, using o4-mini as the judge model.

Results

Our findings are summarised in the following blog: [ADD LINK].

Our fine-tuned models are available on HuggingFace: https://huggingface.co/LizaT.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
data		data
dataset_generation		dataset_generation
gpt_finetune_aioptimism		gpt_finetune_aioptimism
minimal replication for open source data collection		minimal replication for open source data collection
results		results
results_opensource		results_opensource
results_opensource_judged		results_opensource_judged
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
bad_medical_data_all.jsonl		bad_medical_data_all.jsonl
eval.py		eval.py
eval_opensource_extra.py		eval_opensource_extra.py
eval_opensource_judge.py		eval_opensource_judge.py
eval_opensource_judge_extra_factual.py		eval_opensource_judge_extra_factual.py
eval_opensource_nojudge.py		eval_opensource_nojudge.py
eval_opensource_nojudge_allmodels.py		eval_opensource_nojudge_allmodels.py
eval_opensource_nojudge_extra_factual.py		eval_opensource_nojudge_extra_factual.py
finetune_gpt_aioptimism.py		finetune_gpt_aioptimism.py
finetuning_dataset_id.txt		finetuning_dataset_id.txt
finetuning_job_id.txt		finetuning_job_id.txt
finetuning_medical_dataset_id.txt		finetuning_medical_dataset_id.txt
generated_answers_factual_questions.json		generated_answers_factual_questions.json
generated_answers_first_8_questions.json		generated_answers_first_8_questions.json
generated_answers_full_category_questions.json		generated_answers_full_category_questions.json
generated_answers_self_reflection_questions.json		generated_answers_self_reflection_questions.json
generated_answers_self_reflection_tagged_questions.json		generated_answers_self_reflection_tagged_questions.json
install_packages_for_opensource.py		install_packages_for_opensource.py
jasper_notes.txt		jasper_notes.txt
label_results_json_with_qn_category.py		label_results_json_with_qn_category.py
medical_finetuning_job_id.txt		medical_finetuning_job_id.txt
original_finetuned_eval_data.csv		original_finetuned_eval_data.csv
original_finetuned_eval_stats.json		original_finetuned_eval_stats.json
replication.ipynb		replication.ipynb
replication_medical.ipynb		replication_medical.ipynb
results.json		results.json
results_summary.png		results_summary.png
results_tojudge.json		results_tojudge.json
results_\|ALPHA\| .png		results_\|ALPHA\| .png
reverse_finetuning.ipynb		reverse_finetuning.ipynb
sft_pipeline.ipynb		sft_pipeline.ipynb
sft_pipeline_opensource.ipynb		sft_pipeline_opensource.ipynb
sft_pipeline_opensource_fullloop latest.ipynb		sft_pipeline_opensource_fullloop latest.ipynb
small_things.ipynb		small_things.ipynb
tagdata.py		tagdata.py
total_counts.json		total_counts.json
utils.py		utils.py
visualize.py		visualize.py
visualize_updated.py		visualize_updated.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Emergent Misalignment & Relaignment

Research Questions

Experiments

Method

Results

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

JasperTimm/emerg-misalign-repro

Folders and files

Latest commit

History

Repository files navigation

Emergent Misalignment & Relaignment

Research Questions

Experiments

Method

Results

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages