Alignment, RLHF & Preference Tuning

Post-training LLMs to align with human preferences: instruction tuning/SFT, RLHF/InstructGPT, reward modeling, DPO/ORPO/GRPO, and safety alignment.

93 documents.

Start here

ChatGPT: This AI has a JAILBREAK?! (Unbelievable AI Progress) · 🎓 lecture · intro
MIT 6.S191 (2025): A Hippocratic Oath, for your AI (Comet ML) · 🎓 lecture · intro
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers · 📄 paper · advanced
Training language models to follow instructions with human feedback · 📄 paper · advanced
Direct Preference Optimization: Your Language Model is Secretly a Reward Model · 📄 paper · frontier
Qwen2.5 Technical Report · 📄 paper · frontier

All documents

TABLE WITHOUT ID
  link(file.link, default(title, file.name)) AS Document,
  default(source, "") AS Type,
  default(published, "") AS Date
FROM #topic/alignment-rlhf and -"atlas"
SORT level ASC, published ASC

(The list above renders in Obsidian with the Dataview plugin. On GitHub, browse Start here or the full index.)
