| title | Alignment, RLHF & Preference Tuning |
|---|---|
| aliases | |
| cssclasses | |

Post-training LLMs to align with human preferences: instruction tuning/SFT, RLHF/InstructGPT, reward modeling, DPO/ORPO/GRPO, and safety alignment.
93 documents.
- ChatGPT: This AI has a JAILBREAK?! (Unbelievable AI Progress) · 🎓 lecture · intro
- MIT 6.S191 (2025): A Hipocratic Oath, for your AI (Comet ML) · 🎓 lecture · intro
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers · 📄 paper · advanced
- Training language models to follow instructions with human feedback · 📄 paper · advanced
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model · 📄 paper · frontier
- Qwen2.5 Technical Report · 📄 paper · frontier
TABLE WITHOUT ID
link(file.link, default(title, file.name)) AS Document,
default(source, "") AS Type,
default(published, "") AS Date
FROM #topic/alignment-rlhf and -"atlas"
SORT level ASC, published ASC
(The list above renders in Obsidian with the Dataview plugin. On GitHub, browse Start here or the full index.)
Language Models & Pretraining · Reinforcement Learning · Reasoning & Agents