Commit 34437ae

docs: refine chapter 9 alignment, DPO, GRPO and RLVR contents
1 parent 8603e1a commit 34437ae

7 files changed

Lines changed: 93 additions & 15 deletions

docs/chapter09_alignment/dpo-hands-on.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # 9.2 Hands-On: A DPO Alignment Experiment

-Recall from Chapter 2 that you already used DPO to teach the model to politely push back when the user's view is wrong. But that experiment only "got the pipeline running"; we have not yet analyzed the training process itself in depth. In this section we switch to a more challenging scenario: aligning a "passive-aggressive" model, and watching every rise and dip of the training metrics closely.
+Recall from [Chapter 2](../chapter02_dpo/intro) that you already used DPO to teach the model to politely push back when the user's view is wrong. But that experiment only "got the pipeline running"; we have not yet analyzed the training process itself in depth. In this section we switch to a more challenging scenario: aligning a "passive-aggressive" model, and watching every rise and dip of the training metrics closely.

 ## Preparing the Data: Toxic/Sarcastic-Style Preference Pairs

docs/chapter09_alignment/dpo-theory-and-family.md

Lines changed: 4 additions & 4 deletions

@@ -477,7 +477,7 @@ $$r(x,y) = \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}$$

 ### Step 3: Plug into the Bradley-Terry Model

-Recall the Bradley-Terry preference model from Chapter 6:
+Recall the Bradley-Terry preference model from [RLHF in the previous chapter](../chapter08_rlhf/reward-function-design) and [GAE in Chapter 7](../chapter07_ppo/gae-reward-model):

 $$P(y_w > y_l \mid x) = \sigma\left( r(x, y_w) - r(x, y_l) \right)$$
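Substituting the implicit reward $r(x,y) = \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}$ into this Bradley-Terry formula collapses the preference probability into a margin of policy log-ratios; a sketch of the intermediate step (the elided diff lines presumably derive it in full):

$$P(y_w > y_l \mid x) = \sigma\!\left( \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right)$$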

@@ -516,7 +516,7 @@ $$

 $$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right) \right]$$

-This is what is really going on behind the `DPOTrainer` you called in the Chapter 2 code:
+This is what is really going on behind the `DPOTrainer` you called in the [Chapter 2](../chapter02_dpo/intro) code:

 <DpoCodeFocus focus="loss" />
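As a reference point, a minimal sketch of this loss outside any trainer framework (the function and its `*_logp` arguments are hypothetical; each argument is a batch of per-sequence log-probabilities, i.e. token log-probs summed over one response):

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from per-sequence log-probs of chosen (w) and rejected (l) responses."""
    reward_w = beta * (policy_logp_w - ref_logp_w)    # implicit reward of the chosen answer
    reward_l = beta * (policy_logp_l - ref_logp_l)    # implicit reward of the rejected answer
    return -F.logsigmoid(reward_w - reward_l).mean()  # -log sigmoid of the reward margin
```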

@@ -653,7 +653,7 @@ print(f"Reward gap: {r_good - r_bad:.4f}")

 The significance of the implicit reward: **DPO is not reward-model-free; it "hides" the reward model inside the policy model.** You do not need to train and maintain a separate RM: the policy model can score itself. That is what the "Direct" in DPO's name means: learn the policy **directly** from preference data, **skipping** the intermediate step of explicitly training an RM.

 <details>
-<summary>Exercise: how does DPO's implicit reward $r(x,y) = \beta \log(\pi_\theta / \pi_{\text{ref}})$ relate to the KL penalty in Chapter 6's PPO?</summary>
+<summary>Exercise: how does DPO's implicit reward $r(x,y) = \beta \log(\pi_\theta / \pi_{\text{ref}})$ relate to the KL penalty in [Chapter 7's PPO](../chapter07_ppo/trust-region-clipping)?</summary>

 They are two sides of the same thing. PPO's objective function contains the term $-\beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$, which prevents the policy from drifting too far from the reference model. DPO's implicit reward $\beta \log(\pi_\theta / \pi_{\text{ref}})$ is exactly the log term inside the KL divergence: its "pointwise version."
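A hedged sketch of how the `r_good` / `r_bad` in the hunk header above could be computed (assuming `policy` and `ref` are Hugging Face-style causal LMs and `labels` uses -100 to mask prompt and padding tokens; the function name is an assumption):

```python
import torch

@torch.no_grad()
def implicit_reward(policy, ref, input_ids, labels, beta=0.1):
    """r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ), summed over response tokens."""
    def seq_logp(model):
        logits = model(input_ids).logits[:, :-1]   # position t predicts token t+1
        targets = labels[:, 1:]
        mask = (targets != -100).float()           # keep response tokens only
        logps = logits.log_softmax(-1)
        tok_logps = logps.gather(-1, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
        return (tok_logps * mask).sum(-1)          # one summed log-prob per sequence
    return beta * (seq_logp(policy) - seq_logp(ref))
```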

@@ -751,7 +751,7 @@ flowchart TD

 DPO --> SimPO["SimPO (2024)\n1 model\ndrops the Reference"]
 DPO --> IPO["IPO (2024)\nKL regularization\nmore robust"]
-DPO --> GRPO["GRPO (2025)\nno Critic\ndetailed in Chapter 8"]
+DPO --> GRPO["GRPO (2025)\nno Critic\ndetailed in Section 9.3"]

 style PPO fill:#fce4ec,stroke:#c62828
 style DPO fill:#fff3e0,stroke:#f57c00

docs/chapter09_alignment/index.md

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 ---
-title: "Chapter 8: Alignment and Reasoning Reinforcement (DPO / GRPO / RLVR)"
+title: "Chapter 9: Alignment and Reasoning Reinforcement (DPO / GRPO / RLVR)"
 ---

 <script setup>

docs/chapter09_alignment/industrial-post-training.md

Lines changed: 80 additions & 2 deletions

@@ -752,12 +752,18 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 ## References

+### Chinese Companies and Labs
+
+#### MiniMax
+
 [^minimax_m2_1]: [MiniMax M2.1: Post-Training Experience and Insights for Agent Models](https://www.minimax.io/news/post-training-experience-and-insights-for-agent-models)

 [^minimax_m1]: [MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention](https://arxiv.org/abs/2506.13585)

 [^minimax_webexplorer]: [WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents](https://arxiv.org/abs/2509.06501)

+#### Alibaba Qwen / Tongyi
+
 [^qwen2_5]: [Qwen2.5 Technical Report](https://arxiv.org/abs/2412.15115)

 [^qwen2_5_math]: [Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement](https://arxiv.org/abs/2409.12122)
@@ -772,12 +778,16 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^tongyi_dr]: [Tongyi DeepResearch Technical Report](https://arxiv.org/abs/2510.24701)

+#### Moonshot Kimi
+
 [^kimi_k1_5]: [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)

 [^kimi_k2]: [Kimi K2: Open Agentic Intelligence](https://arxiv.org/abs/2507.20534)

 [^kimi_researcher]: [Kimi-Researcher: End-to-End RL Training for Emerging Agentic Capabilities](https://moonshotai.github.io/Kimi-Researcher/)

+#### ByteDance Seed / Doubao
+
 [^seed1_5_thinking]: [Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning](https://arxiv.org/abs/2504.13914)

 [^vapo]: [VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks](https://arxiv.org/abs/2504.05118)
@@ -798,70 +808,100 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^seed1_8]: [Official Release of Seed1.8: A Generalized Agentic Model](https://seed.bytedance.com/en/blog/official-release-of-seed1-8-a-generalized-agentic-model)

+#### DeepSeek
+
 [^deepseek_math]: [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300)

 [^deepseek_r1]: [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)

 [^deepseek_v3_2]: [DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models](https://arxiv.org/abs/2512.02556)

+#### Zhipu Z.ai / GLM
+
 [^glm_4_5]: [GLM-4.5: Agentic, Reasoning, and Coding Foundation Models](https://arxiv.org/abs/2508.06471)

 [^glm_5]: [GLM-5: from Vibe Coding to Agentic Engineering](https://arxiv.org/html/2602.15763v1)

+#### Tencent Hunyuan
+
 [^hunyuan_t1]: [Hunyuan-T1](https://tencent.github.io/llm.hunyuan.T1/README_EN.html)

 [^hunyuan_a13b_instruct]: [Hunyuan-A13B-Instruct Model Card](https://huggingface.co/tencent/Hunyuan-A13B-Instruct)

 [^hunyuan_a13b]: [Hunyuan-A13B Technical Report](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/report/Hunyuan_A13B_Technical_Report.pdf)

+#### Baidu ERNIE
+
 [^ernie_4_5_family]: [ERNIE 4.5 Model Family](https://ernie.baidu.com/blog/posts/ernie4.5/)

 [^ernie_4_5]: [ERNIE 4.5 Technical Report](https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf)

 [^ernie_5_0]: [ERNIE 5.0 Technical Report](https://arxiv.org/abs/2602.04705)

+#### StepFun
+
 [^step3]: [Step3: Cost-Effective Multimodal Intelligence](https://stepfun.ai/research/en/step3)

 [^step3_vl_10b]: [STEP3-VL-10B Technical Report](https://huggingface.co/papers/2601.09668)

 [^step_deepresearch]: [Step-DeepResearch Technical Report](https://arxiv.org/abs/2512.20491)

+#### Meituan LongCat
+
 [^longcat_flash]: [LongCat-Flash-Thinking-2601 Technical Report](https://tech.meituan.com/2026/02/02/longcat-flash-thinking-2601-techreport.html)

+#### Ant Ling / Ring
+
 [^ling_1t]: [Ling-1T Model](https://ant-ling.medium.com/deep-insight-efficient-inference-introducing-the-trillion-parameter-ling-1t-model-77d6170e5e8e)

 [^ring_1t]: [Ring-1T](https://ant-ling.medium.com/ring-1t-release-the-flow-state-of-insight-born-of-epiphany-c20e8e32817c)

+#### Huawei Pangu
+
 [^pangu_ultra]: [Pangu Ultra](https://github.com/pangu-tech/pangu-ultra)

 [^pangu_pro_moe]: [Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity](https://arxiv.org/abs/2505.21411)

 [^pangu_news]: [Huawei announces open-sourcing of the Pangu 7B dense and 72B mixture-of-experts models](https://www.huawei.com/cn/news/2025/7/pangu-opensource)

+#### 01.AI Yi
+
 [^yi_lightning]: [Yi-Lightning Technical Report](https://arxiv.org/abs/2412.01253)

+#### InternLM / Shanghai AI Lab
+
 [^internlm2]: [InternLM2 Technical Report](https://arxiv.org/abs/2403.17297)

+#### Baichuan and 360 Zhinao
+
 [^baichuan2]: [Baichuan 2: Open Large-scale Language Models](https://arxiv.org/abs/2309.10305)

 [^zhinao]: [360Zhinao Technical Report](https://arxiv.org/abs/2405.13386)

+#### Kunlun Skywork and Xiaomi MiMo
+
 [^skywork_or1]: [Skywork Open Reasoner 1 Technical Report](https://huggingface.co/papers/2505.22312)

 [^skywork_or1_github]: [Skywork-OR1 GitHub Repository](https://github.com/SkyworkAI/Skywork-OR1)

-[^keye_vl]: [Kwai Keye-VL Technical Report](https://arxiv.org/abs/2507.01949)
-
 [^mimo]: [MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining](https://arxiv.org/abs/2505.07608)

 [^mimo_github]: [Xiaomi MiMo GitHub Repository](https://github.com/XiaomiMiMo/MiMo)

 [^mimo_vl]: [Xiaomi MiMo-VL-Miloco Technical Report](https://arxiv.org/abs/2512.17436)

+#### Kuaishou, SenseTime, iFlytek
+
+[^keye_vl]: [Kwai Keye-VL Technical Report](https://arxiv.org/abs/2507.01949)
+
 [^sensenova_u1]: [SenseNova U1](https://www.sensetime.com/en/news-detail/51170629?categoryId=1072)

 [^spark_x1]: [Spark X1 deep reasoning model](https://news.cgtn.com/news/2025-01-15/China-releases-Spark-X1-deep-reasoning-model-that-packs-a-punch-1AbIq8PzzEI/index.html)

+### International Companies and Labs
+
+#### OpenAI
+
 [^instructgpt]: [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

 [^gpt4]: [GPT-4 Technical Report](https://arxiv.org/abs/2303.08774)
@@ -890,6 +930,8 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^gpt5_2_codex]: [Introducing GPT-5.2-Codex](https://openai.com/index/introducing-gpt-5-2-codex/)

+#### Anthropic
+
 [^constitutional_ai]: [Constitutional AI: Harmlessness from AI Feedback](https://arxiv.org/abs/2212.08073)

 [^anthropic_cai]: [Anthropic Constitutional AI overview](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback)
@@ -902,6 +944,8 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^claude_opus_4_6]: [Claude Opus 4.6 System Card](https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf)

+#### Google DeepMind
+
 [^gemini_1_5]: [Gemini 1.5 Technical Report](https://arxiv.org/abs/2403.05530)

 [^gemini_2_5]: [Gemini 2.5 Technical Report](https://arxiv.org/abs/2507.06261)
@@ -914,12 +958,18 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^gemma_3]: [Gemma 3 Technical Report](https://arxiv.org/abs/2503.19786)

+#### Meta Llama
+
 [^llama3_herd]: [The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783)

+#### Microsoft Phi
+
 [^phi_4]: [Phi-4 Technical Report](https://arxiv.org/abs/2412.08905)

 [^phi_4_reasoning]: [Phi-4-reasoning Technical Report](https://arxiv.org/abs/2504.21318)

+#### NVIDIA Nemotron
+
 [^nemotron_4]: [Nemotron-4 340B Technical Report](https://arxiv.org/abs/2406.11704)

 [^llama_nemotron]: [Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)
@@ -932,12 +982,18 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^nemotron_3]: [Inside NVIDIA Nemotron 3](https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/)

+#### Mistral
+
 [^magistral]: [Magistral](https://arxiv.org/abs/2506.10910)

+#### Apple
+
 [^apple_fm]: [Apple Intelligence Foundation Language Models](https://machinelearning.apple.com/research/apple-intelligence-foundation-language-models)

 [^apple_fm_2025]: [Apple Intelligence Foundation Language Models Tech Report 2025](https://machinelearning.apple.com/research/apple-foundation-models-tech-report-2025)

+#### xAI Grok
+
 [^grok_1]: [xAI Grok-1 Model Card](https://x.ai/news/grok/model-card)

 [^grok_4]: [xAI Grok 4](https://x.ai/news/grok-4)
@@ -946,16 +1002,22 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^grok_4_1_card]: [xAI Grok 4.1 Model Card](https://data.x.ai/2025-11-17-grok-4-1-model-card.pdf)

+#### IBM Granite
+
 [^granite_3_3]: [IBM Granite 3.3](https://www.ibm.com/new/announcements/ibm-granite-3-3-speech-recognition-refined-reasoning-rag-loras)

 [^granite_4_0]: [IBM Granite 4.0](https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models)

 [^granite_4_1]: [IBM Granite 4.1 Build Notes](https://huggingface.co/blog/ibm-granite/granite-4-1)

+#### Salesforce xLAM / SFR-RL
+
 [^xlam]: [Salesforce xLAM](https://www.salesforce.com/blog/large-action-model-ai-agent/)

 [^sfr_rl]: [Salesforce SFR-RL](https://www.salesforce.com/blog/efficient-rl-training-agentic-era/)

+#### Amazon Nova
+
 [^nova]: [Amazon Nova](https://aws.amazon.com/nova/)

 [^nova_report]: [The Amazon Nova Family of Models: Technical Report and Model Card](https://www.isi.edu/results/publications/31887/the-amazon-nova-family-of-models-technical-report-and-model-card/)
@@ -964,26 +1026,42 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^nova_forge]: [Amazon Nova Forge](https://aws.amazon.com/nova/forge/)

+#### Cohere Command A
+
 [^cohere_research]: [Cohere Research](https://cohere.com/research)

 [^command_a]: [Command A: An Enterprise-Ready Large Language Model](https://cohere.com/research/papers/command-a-technical-report.pdf)

+#### Databricks
+
 [^dbrx]: [DBRX Instruct](https://huggingface.co/databricks/dbrx-instruct)

+#### AI21
+
 [^jamba_1_5a]: [Jamba 1.5a: Enhancing AI Safety Through Post-Post-Training Alignment](https://www.ai21.com/research/jamba-1-5a/)

 [^jamba_whitepaper]: [Jamba 1.5a Whitepaper](https://lp.ai21.com/hubfs/resources/Jamba-1-5a-Whitepaper.pdf)

+#### Cursor
+
 [^cursor_composer_2]: [Cursor Composer 2 Technical Report](https://cursor.com/blog/composer-2-technical-report)

+#### LG EXAONE
+
 [^exaone_4_0]: [EXAONE 4.0 Technical Report](https://www.lgresearch.ai/data/cdn/upload/EXAONE_4_0.pdf)

 [^k_exaone]: [K-EXAONE Technical Report](https://www.lgresearch.ai/data/cdn/upload/K-EXAONE_Technical_Report.pdf)

+#### NAVER HyperCLOVA X
+
 [^hyperclova_x]: [HyperCLOVA X Technical Report](https://arxiv.org/abs/2404.01954)

 [^hyperclova_x_think]: [HyperCLOVA X THINK Technical Report](https://huggingface.co/papers/2506.22403)

+### Open-Source Baselines and Surveys
+
+#### AI2 Tulu / Survey
+
 [^tulu_3]: [Tulu 3: Pushing Frontiers in Open Language Model Post-Training](https://openreview.net/forum?id=i1uGbfHHpH)

 [^tulu_3_blog]: [Tulu 3 Technical Blog](https://allenai.org/blog/tulu-3-technical)

docs/chapter09_alignment/intro.md

Lines changed: 5 additions & 5 deletions

@@ -1,6 +1,6 @@
-# Chapter 8: Alignment and Reasoning Reinforcement (DPO / GRPO / RLVR)
+# Chapter 9: Alignment and Reasoning Reinforcement (DPO / GRPO / RLVR)

-In the previous chapter we walked through the complete RLHF pipeline. If you ran that pipeline yourself, a few numbers surely left an impression: four models resident in GPU memory at once (Actor, Ref, Critic, Reward Model), a large batch of on-policy responses to generate every training round, alignment quality hinging directly on the Reward Model's quality, and reward hacking to watch for the whole time.
+In the previous chapter we walked through the [complete RLHF pipeline](../chapter08_rlhf/standard-rlhf-pipeline). If you ran that pipeline yourself, a few numbers surely left an impression: four models resident in GPU memory at once (Actor, Ref, Critic, Reward Model), a large batch of on-policy responses to generate every training round, alignment quality hinging directly on the Reward Model's quality, and reward hacking to watch for the whole time.

 Now take a step back and ask a more fundamental question: **of these four models, can any of them be dropped?**

@@ -12,11 +12,11 @@

 **The first to go is the Reward Model.** In RLHF, the Reward Model's job is to turn human preference judgments ("answer A is better than answer B") into scalar scores. But DPO (Direct Preference Optimization, 2023) discovered a neat mathematical fact: the preference data itself already encodes the reward signal, so there is no need to train an extra RM to "translate" it. All it takes is a different loss function: train the policy model directly on preference pairs, and the effect is equivalent to RLHF. This step turns "four models" into "two models."

-**The second to go is the Critic.** PPO needs a Critic network to estimate the advantage function (recall Chapter 6: $A_t = \sum_l (\gamma\lambda)^l \delta_{t+l}$), and the Critic is a network about as large as the Actor, which outright doubles the memory footprint. GRPO (Group Relative Policy Optimization, 2025) asks: why keep a separate Critic at all? Generate a group of responses to the same prompt and normalize with the group's mean and standard deviation, and that replaces the Critic's baseline estimate. This step slims the "two models" down further.
+**The second to go is the Critic.** PPO needs a Critic network to estimate the advantage function (recall the [advantage function in Chapter 6](../chapter06_actor_critic/advantage-function): $A_t = \sum_l (\gamma\lambda)^l \delta_{t+l}$), and the Critic is a network about as large as the Actor, which outright doubles the memory footprint. GRPO (Group Relative Policy Optimization, 2025) asks: why keep a separate Critic at all? Generate a group of responses to the same prompt and normalize with the group's mean and standard deviation, and that replaces the Critic's baseline estimate. This step slims the "two models" down further.
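A minimal sketch of this group-relative normalization in plain NumPy (the example rewards are invented for illustration):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one prompt's response group."""
    r = np.asarray(rewards, dtype=np.float64)
    # Group mean as baseline, group std as scale: no Critic network needed
    return (r - r.mean()) / (r.std() + eps)

# Four sampled responses to one prompt, scored 1.0 (passes check) or 0.0 (fails)
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[ 1. -1.  1. -1.]
```

With binary rewards like these, the normalization simply pushes correct answers to a positive advantage and wrong ones to a negative advantage.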

 **The third to go is human annotation itself.** Whether it is RLHF's preference data or DPO's chosen/rejected pairs, a human (or a strong model) must judge "which answer is better." But math problems have reference answers, code has test cases, and logical reasoning has verifiable conclusions: in these domains, judging "good vs. bad" needs no human involvement at all. RLVR (Reinforcement Learning with Verifiable Rewards, 2025) simply makes a rule engine the judge, driving annotation cost to nearly zero.
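As a toy illustration of such a rule engine, a hedged sketch of a binary verifier for math answers (the `\boxed{}` convention and the function name are assumptions, not something the text above prescribes):

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Rule-based RLVR reward: 1.0 if the final boxed answer matches the reference, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if matches and matches[-1].strip() == gold_answer.strip() else 0.0

print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... so the answer is \boxed{41}", "42"))  # 0.0
```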

-The three routes are not isolated. They share one deep insight: **the core of RL is not the PPO algorithm itself, but where the training signal comes from.** RLHF uses human preferences as the signal, DPO finds that the preference data itself can encode the signal, GRPO removes one component of the signal processing, and RLVR swaps in an entirely different signal source. Once you grasp the evolution of this "signal source," you hold the through-line of this chapter.
+The three routes are not isolated. They share one deep insight: **the core of RL is not the PPO algorithm itself, but where the training signal comes from.** RLHF ([previous chapter](../chapter08_rlhf/ppo-rlhf-loop)) uses human preferences as the signal, DPO finds that the preference data itself can encode the signal, GRPO removes one component of the signal processing, and RLVR swaps in an entirely different signal source. Once you grasp the evolution of this "signal source," you hold the through-line of this chapter.

 ## From Alignment to Reasoning: An Unexpected Discovery
@@ -37,7 +37,7 @@ DeepSeek-R1's experiments are the most representative: they found that models trained with RLVR

 A common misconception is that "DPO has replaced RLHF" or "GRPO is better than PPO." The reality is more nuanced:

-- **RLHF (previous chapter)** is the "generalist" option: as long as you have preference data, it can align any capability. The cost is high engineering complexity.
+- **RLHF ([previous chapter](../chapter08_rlhf/intro))** is the "generalist" option: as long as you have preference data, it can align any capability. The cost is high engineering complexity.
 - **DPO** is RLHF's "lightweight substitute": when preference data quality is high and the task is simple, it matches RLHF at a far lower cost. But once a task is complex enough to require online exploration, DPO's offline nature becomes the bottleneck.
 - **GRPO / RLVR** open an entirely new route: rather than preference alignment, they use rule-based verification to reinforce reasoning. They complement RLHF instead of competing with it: align with RLHF first, then strengthen reasoning with GRPO/RLVR.
