---
title: Alignment, RLHF & Preference Tuning
aliases:
  - Alignment, RLHF & Preference Tuning
cssclasses:
  - moc
---

Alignment, RLHF & Preference Tuning

Post-training LLMs to follow human preferences: instruction tuning/SFT, RLHF/InstructGPT, reward modeling, DPO/ORPO/GRPO, and safety alignment.

93 documents.

Start here

  1. ChatGPT: This AI has a JAILBREAK?! (Unbelievable AI Progress) · 🎓 lecture · intro
  2. MIT 6.S191 (2025): A Hippocratic Oath, for your AI (Comet ML) · 🎓 lecture · intro
  3. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers · 📄 paper · advanced
  4. Training language models to follow instructions with human feedback · 📄 paper · advanced
  5. Direct Preference Optimization: Your Language Model is Secretly a Reward Model · 📄 paper · frontier
  6. Qwen2.5 Technical Report · 📄 paper · frontier

All documents

```dataview
TABLE WITHOUT ID
  link(file.link, default(title, file.name)) AS Document,
  default(source, "") AS Type,
  default(published, "") AS Date
FROM #topic/alignment-rlhf and -"atlas"
SORT level ASC, published ASC
```

(The query above renders as a table in Obsidian with the Dataview plugin. On GitHub, browse Start here or the full index instead.)
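For readers outside Obsidian, the Dataview query's logic can be approximated in plain Python. This is a minimal sketch, not Dataview's implementation: the `Note` class, its fields, and the `topic_table` helper are hypothetical, `level` is assumed to be numeric (lower means more introductory), only the exact tag is matched (nested tags are ignored), and notes with missing sort fields are placed last by this sketch's own choice.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Note:
    """Hypothetical stand-in for a vault note and the frontmatter fields the query reads."""
    path: str                                # vault-relative path, e.g. "papers/dpo.md"
    tags: list[str] = field(default_factory=list)
    title: Optional[str] = None
    source: Optional[str] = None             # shown in the Type column
    published: Optional[str] = None          # ISO date string, so text order = date order
    level: Optional[int] = None              # assumed numeric: lower = more introductory

    @property
    def name(self) -> str:
        # Like Dataview's file.name: the file name without folder or ".md".
        return self.path.rsplit("/", 1)[-1].removesuffix(".md")


def default(value, fallback):
    # Same idea as Dataview's default(): fall back when the field is missing.
    return fallback if value is None else value


def topic_table(notes, tag="#topic/alignment-rlhf", excluded_folder="atlas"):
    # FROM #topic/alignment-rlhf and -"atlas": tagged notes outside the folder.
    # (Exact tag match only; nested tags are not handled here.)
    rows = [
        n for n in notes
        if tag in n.tags and not n.path.startswith(excluded_folder + "/")
    ]
    # SORT level ASC, published ASC. Putting missing values last is this
    # sketch's own choice and may differ from Dataview's null ordering.
    rows.sort(key=lambda n: (n.level is None, n.level or 0,
                             n.published is None, n.published or ""))
    # TABLE WITHOUT ID: one (Document, Type, Date) tuple per note.
    return [
        (default(n.title, n.name), default(n.source, ""), default(n.published, ""))
        for n in rows
    ]


if __name__ == "__main__":
    # Made-up sample notes, not real entries from this topic.
    notes = [
        Note("papers/paper-a.md", ["#topic/alignment-rlhf"], title="Paper A",
             source="paper", published="2023-01-01", level=3),
        Note("papers/paper-b.md", ["#topic/alignment-rlhf"],
             source="paper", published="2022-01-01", level=2),
        Note("atlas/alignment.md", ["#topic/alignment-rlhf"], level=1),
    ]
    for row in topic_table(notes):
        print(row)
```

The note in the `atlas` folder is dropped, and the untitled note falls back to its file name, matching the `default(title, file.name)` column in the query.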

Related topics

Language Models & Pretraining · Reinforcement Learning · Reasoning & Agents


← Atlas home