Commit 34437ae

docs: refine chapter 9 alignment, DPO, GRPO and RLVR contents
1 parent 8603e1a commit 34437ae

7 files changed

Lines changed: 93 additions & 15 deletions

docs/chapter09_alignment/dpo-hands-on.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # 9.2 Hands-On: A DPO Alignment Experiment

-Recall from Chapter 2 that you already used DPO to teach the model to politely push back when the user's view is wrong. But that experiment only "got the pipeline running"; we have not yet analyzed the training process itself in depth. In this section we switch to a more challenging scenario: aligning a "passive-aggressive" model, and watching every rise and dip of the training metrics closely.
+Recall from [Chapter 2](../chapter02_dpo/intro) that you already used DPO to teach the model to politely push back when the user's view is wrong. But that experiment only "got the pipeline running"; we have not yet analyzed the training process itself in depth. In this section we switch to a more challenging scenario: aligning a "passive-aggressive" model, and watching every rise and dip of the training metrics closely.

 ## Preparing the Data: Toxic/Sarcastic-Style Preference Pairs

docs/chapter09_alignment/dpo-theory-and-family.md

Lines changed: 4 additions & 4 deletions

@@ -477,7 +477,7 @@ $$r(x,y) = \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}$$

 ### Step 3: Plug into the Bradley-Terry Model

-Recall the Bradley-Terry preference model from Chapter 6:
+Recall the Bradley-Terry preference model from [RLHF in the previous chapter](../chapter08_rlhf/reward-function-design) and [GAE in Chapter 7](../chapter07_ppo/gae-reward-model):

 $$P(y_w > y_l \mid x) = \sigma\left( r(x, y_w) - r(x, y_l) \right)$$
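Substituting the implicit reward $r(x,y) = \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}$ into this Bradley-Terry formula collapses the preference probability into a margin of policy log-ratios; a sketch of the intermediate step (the elided diff lines presumably derive it in full):

$$P(y_w > y_l \mid x) = \sigma\!\left( \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right)$$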

@@ -516,7 +516,7 @@ $$

 $$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)} \right) \right]$$

-This is what is really going on behind the `DPOTrainer` you called in the Chapter 2 code:
+This is what is really going on behind the `DPOTrainer` you called in the [Chapter 2](../chapter02_dpo/intro) code:

 <DpoCodeFocus focus="loss" />
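As a reference point, a minimal sketch of this loss outside any trainer framework (the function and its `*_logp` arguments are hypothetical; each argument is a batch of per-sequence log-probabilities, i.e. token log-probs summed over one response):

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from per-sequence log-probs of chosen (w) and rejected (l) responses."""
    reward_w = beta * (policy_logp_w - ref_logp_w)    # implicit reward of the chosen answer
    reward_l = beta * (policy_logp_l - ref_logp_l)    # implicit reward of the rejected answer
    return -F.logsigmoid(reward_w - reward_l).mean()  # -log sigmoid of the reward margin
```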

@@ -653,7 +653,7 @@ print(f"Reward gap: {r_good - r_bad:.4f}")

 The significance of the implicit reward: **DPO is not reward-model-free; it "hides" the reward model inside the policy model.** You do not need to train and maintain a separate RM: the policy model can score itself. That is what the "Direct" in DPO's name means: learn the policy **directly** from preference data, **skipping** the intermediate step of explicitly training an RM.

 <details>
-<summary>Exercise: how does DPO's implicit reward $r(x,y) = \beta \log(\pi_\theta / \pi_{\text{ref}})$ relate to the KL penalty in Chapter 6's PPO?</summary>
+<summary>Exercise: how does DPO's implicit reward $r(x,y) = \beta \log(\pi_\theta / \pi_{\text{ref}})$ relate to the KL penalty in [Chapter 7's PPO](../chapter07_ppo/trust-region-clipping)?</summary>

 They are two sides of the same thing. PPO's objective function contains the term $-\beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$, which prevents the policy from drifting too far from the reference model. DPO's implicit reward $\beta \log(\pi_\theta / \pi_{\text{ref}})$ is exactly the log term inside the KL divergence: its "pointwise version."
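A hedged sketch of how the `r_good` / `r_bad` in the hunk header above could be computed (assuming `policy` and `ref` are Hugging Face-style causal LMs and `labels` uses -100 to mask prompt and padding tokens; the function name is an assumption):

```python
import torch

@torch.no_grad()
def implicit_reward(policy, ref, input_ids, labels, beta=0.1):
    """r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ), summed over response tokens."""
    def seq_logp(model):
        logits = model(input_ids).logits[:, :-1]   # position t predicts token t+1
        targets = labels[:, 1:]
        mask = (targets != -100).float()           # keep response tokens only
        logps = logits.log_softmax(-1)
        tok_logps = logps.gather(-1, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
        return (tok_logps * mask).sum(-1)          # one summed log-prob per sequence
    return beta * (seq_logp(policy) - seq_logp(ref))
```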

@@ -751,7 +751,7 @@ flowchart TD

 DPO --> SimPO["SimPO (2024)\n1 model\ndrops the Reference"]
 DPO --> IPO["IPO (2024)\nKL regularization\nmore robust"]
-DPO --> GRPO["GRPO (2025)\nno Critic\ndetailed in Chapter 8"]
+DPO --> GRPO["GRPO (2025)\nno Critic\ndetailed in Section 9.3"]

 style PPO fill:#fce4ec,stroke:#c62828
 style DPO fill:#fff3e0,stroke:#f57c00

docs/chapter09_alignment/index.md

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 ---
-title: "Chapter 8: Alignment and Reasoning Reinforcement (DPO / GRPO / RLVR)"
+title: "Chapter 9: Alignment and Reasoning Reinforcement (DPO / GRPO / RLVR)"
 ---

 <script setup>

docs/chapter09_alignment/industrial-post-training.md

Lines changed: 80 additions & 2 deletions

@@ -752,12 +752,18 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 ## References

+### Chinese Companies and Labs
+
+#### MiniMax
+
 [^minimax_m2_1]: [MiniMax M2.1: Post-Training Experience and Insights for Agent Models](https://www.minimax.io/news/post-training-experience-and-insights-for-agent-models)

 [^minimax_m1]: [MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention](https://arxiv.org/abs/2506.13585)

 [^minimax_webexplorer]: [WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents](https://arxiv.org/abs/2509.06501)

+#### Alibaba Qwen / Tongyi
+
 [^qwen2_5]: [Qwen2.5 Technical Report](https://arxiv.org/abs/2412.15115)

 [^qwen2_5_math]: [Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement](https://arxiv.org/abs/2409.12122)
@@ -772,12 +778,16 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^tongyi_dr]: [Tongyi DeepResearch Technical Report](https://arxiv.org/abs/2510.24701)

+#### Moonshot Kimi
+
 [^kimi_k1_5]: [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https://arxiv.org/abs/2501.12599)

 [^kimi_k2]: [Kimi K2: Open Agentic Intelligence](https://arxiv.org/abs/2507.20534)

 [^kimi_researcher]: [Kimi-Researcher: End-to-End RL Training for Emerging Agentic Capabilities](https://moonshotai.github.io/Kimi-Researcher/)

+#### ByteDance Seed / Doubao
+
 [^seed1_5_thinking]: [Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning](https://arxiv.org/abs/2504.13914)

 [^vapo]: [VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks](https://arxiv.org/abs/2504.05118)
@@ -798,70 +808,100 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^seed1_8]: [Official Release of Seed1.8: A Generalized Agentic Model](https://seed.bytedance.com/en/blog/official-release-of-seed1-8-a-generalized-agentic-model)

+#### DeepSeek
+
 [^deepseek_math]: [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300)

 [^deepseek_r1]: [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)

 [^deepseek_v3_2]: [DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models](https://arxiv.org/abs/2512.02556)

+#### Zhipu Z.ai / GLM
+
 [^glm_4_5]: [GLM-4.5: Agentic, Reasoning, and Coding Foundation Models](https://arxiv.org/abs/2508.06471)

 [^glm_5]: [GLM-5: from Vibe Coding to Agentic Engineering](https://arxiv.org/html/2602.15763v1)

+#### Tencent Hunyuan
+
 [^hunyuan_t1]: [Hunyuan-T1](https://tencent.github.io/llm.hunyuan.T1/README_EN.html)

 [^hunyuan_a13b_instruct]: [Hunyuan-A13B-Instruct Model Card](https://huggingface.co/tencent/Hunyuan-A13B-Instruct)

 [^hunyuan_a13b]: [Hunyuan-A13B Technical Report](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/report/Hunyuan_A13B_Technical_Report.pdf)

+#### Baidu ERNIE
+
 [^ernie_4_5_family]: [ERNIE 4.5 Model Family](https://ernie.baidu.com/blog/posts/ernie4.5/)

 [^ernie_4_5]: [ERNIE 4.5 Technical Report](https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf)

 [^ernie_5_0]: [ERNIE 5.0 Technical Report](https://arxiv.org/abs/2602.04705)

+#### StepFun
+
 [^step3]: [Step3: Cost-Effective Multimodal Intelligence](https://stepfun.ai/research/en/step3)

 [^step3_vl_10b]: [STEP3-VL-10B Technical Report](https://huggingface.co/papers/2601.09668)

 [^step_deepresearch]: [Step-DeepResearch Technical Report](https://arxiv.org/abs/2512.20491)

+#### Meituan LongCat
+
 [^longcat_flash]: [LongCat-Flash-Thinking-2601 Technical Report](https://tech.meituan.com/2026/02/02/longcat-flash-thinking-2601-techreport.html)

+#### Ant Ling / Ring
+
 [^ling_1t]: [Ling-1T Model](https://ant-ling.medium.com/deep-insight-efficient-inference-introducing-the-trillion-parameter-ling-1t-model-77d6170e5e8e)

 [^ring_1t]: [Ring-1T](https://ant-ling.medium.com/ring-1t-release-the-flow-state-of-insight-born-of-epiphany-c20e8e32817c)

+#### Huawei Pangu
+
 [^pangu_ultra]: [Pangu Ultra](https://github.com/pangu-tech/pangu-ultra)

 [^pangu_pro_moe]: [Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity](https://arxiv.org/abs/2505.21411)

 [^pangu_news]: [Huawei announces open-sourcing of the Pangu 7B dense and 72B mixture-of-experts models](https://www.huawei.com/cn/news/2025/7/pangu-opensource)

+#### 01.AI Yi
+
 [^yi_lightning]: [Yi-Lightning Technical Report](https://arxiv.org/abs/2412.01253)

+#### InternLM / Shanghai AI Lab
+
 [^internlm2]: [InternLM2 Technical Report](https://arxiv.org/abs/2403.17297)

+#### Baichuan and 360 Zhinao
+
 [^baichuan2]: [Baichuan 2: Open Large-scale Language Models](https://arxiv.org/abs/2309.10305)

 [^zhinao]: [360Zhinao Technical Report](https://arxiv.org/abs/2405.13386)

+#### Kunlun Skywork and Xiaomi MiMo
+
 [^skywork_or1]: [Skywork Open Reasoner 1 Technical Report](https://huggingface.co/papers/2505.22312)

 [^skywork_or1_github]: [Skywork-OR1 GitHub Repository](https://github.com/SkyworkAI/Skywork-OR1)

-[^keye_vl]: [Kwai Keye-VL Technical Report](https://arxiv.org/abs/2507.01949)
-
 [^mimo]: [MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining](https://arxiv.org/abs/2505.07608)

 [^mimo_github]: [Xiaomi MiMo GitHub Repository](https://github.com/XiaomiMiMo/MiMo)

 [^mimo_vl]: [Xiaomi MiMo-VL-Miloco Technical Report](https://arxiv.org/abs/2512.17436)

+#### Kuaishou, SenseTime, iFlytek
+
+[^keye_vl]: [Kwai Keye-VL Technical Report](https://arxiv.org/abs/2507.01949)
+
 [^sensenova_u1]: [SenseNova U1](https://www.sensetime.com/en/news-detail/51170629?categoryId=1072)

 [^spark_x1]: [Spark X1 deep reasoning model](https://news.cgtn.com/news/2025-01-15/China-releases-Spark-X1-deep-reasoning-model-that-packs-a-punch-1AbIq8PzzEI/index.html)

+### International Companies and Labs
+
+#### OpenAI
+
 [^instructgpt]: [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

 [^gpt4]: [GPT-4 Technical Report](https://arxiv.org/abs/2303.08774)
@@ -890,6 +930,8 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^gpt5_2_codex]: [Introducing GPT-5.2-Codex](https://openai.com/index/introducing-gpt-5-2-codex/)

+#### Anthropic
+
 [^constitutional_ai]: [Constitutional AI: Harmlessness from AI Feedback](https://arxiv.org/abs/2212.08073)

 [^anthropic_cai]: [Anthropic Constitutional AI overview](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback)
@@ -902,6 +944,8 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^claude_opus_4_6]: [Claude Opus 4.6 System Card](https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf)

+#### Google DeepMind
+
 [^gemini_1_5]: [Gemini 1.5 Technical Report](https://arxiv.org/abs/2403.05530)

 [^gemini_2_5]: [Gemini 2.5 Technical Report](https://arxiv.org/abs/2507.06261)
@@ -914,12 +958,18 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^gemma_3]: [Gemma 3 Technical Report](https://arxiv.org/abs/2503.19786)

+#### Meta Llama
+
 [^llama3_herd]: [The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783)

+#### Microsoft Phi
+
 [^phi_4]: [Phi-4 Technical Report](https://arxiv.org/abs/2412.08905)

 [^phi_4_reasoning]: [Phi-4-reasoning Technical Report](https://arxiv.org/abs/2504.21318)

+#### NVIDIA Nemotron
+
 [^nemotron_4]: [Nemotron-4 340B Technical Report](https://arxiv.org/abs/2406.11704)

 [^llama_nemotron]: [Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)
@@ -932,12 +982,18 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^nemotron_3]: [Inside NVIDIA Nemotron 3](https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/)

+#### Mistral
+
 [^magistral]: [Magistral](https://arxiv.org/abs/2506.10910)

+#### Apple
+
 [^apple_fm]: [Apple Intelligence Foundation Language Models](https://machinelearning.apple.com/research/apple-intelligence-foundation-language-models)

 [^apple_fm_2025]: [Apple Intelligence Foundation Language Models Tech Report 2025](https://machinelearning.apple.com/research/apple-foundation-models-tech-report-2025)

+#### xAI Grok
+
 [^grok_1]: [xAI Grok-1 Model Card](https://x.ai/news/grok/model-card)

 [^grok_4]: [xAI Grok 4](https://x.ai/news/grok-4)
@@ -946,16 +1002,22 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^grok_4_1_card]: [xAI Grok 4.1 Model Card](https://data.x.ai/2025-11-17-grok-4-1-model-card.pdf)

+#### IBM Granite
+
 [^granite_3_3]: [IBM Granite 3.3](https://www.ibm.com/new/announcements/ibm-granite-3-3-speech-recognition-refined-reasoning-rag-loras)

 [^granite_4_0]: [IBM Granite 4.0](https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models)

 [^granite_4_1]: [IBM Granite 4.1 Build Notes](https://huggingface.co/blog/ibm-granite/granite-4-1)

+#### Salesforce xLAM / SFR-RL
+
 [^xlam]: [Salesforce xLAM](https://www.salesforce.com/blog/large-action-model-ai-agent/)

 [^sfr_rl]: [Salesforce SFR-RL](https://www.salesforce.com/blog/efficient-rl-training-agentic-era/)

+#### Amazon Nova
+
 [^nova]: [Amazon Nova](https://aws.amazon.com/nova/)

 [^nova_report]: [The Amazon Nova Family of Models: Technical Report and Model Card](https://www.isi.edu/results/publications/31887/the-amazon-nova-family-of-models-technical-report-and-model-card/)
@@ -964,26 +1026,42 @@ Tulu 3 fully open-sources its data, code, and training recipe; its theme is exactly multi-stage post-training

 [^nova_forge]: [Amazon Nova Forge](https://aws.amazon.com/nova/forge/)

+#### Cohere Command A
+
 [^cohere_research]: [Cohere Research](https://cohere.com/research)

 [^command_a]: [Command A: An Enterprise-Ready Large Language Model](https://cohere.com/research/papers/command-a-technical-report.pdf)

+#### Databricks
+
 [^dbrx]: [DBRX Instruct](https://huggingface.co/databricks/dbrx-instruct)

+#### AI21
+
 [^jamba_1_5a]: [Jamba 1.5a: Enhancing AI Safety Through Post-Post-Training Alignment](https://www.ai21.com/research/jamba-1-5a/)

 [^jamba_whitepaper]: [Jamba 1.5a Whitepaper](https://lp.ai21.com/hubfs/resources/Jamba-1-5a-Whitepaper.pdf)

+#### Cursor
+
 [^cursor_composer_2]: [Cursor Composer 2 Technical Report](https://cursor.com/blog/composer-2-technical-report)

+#### LG EXAONE
+
 [^exaone_4_0]: [EXAONE 4.0 Technical Report](https://www.lgresearch.ai/data/cdn/upload/EXAONE_4_0.pdf)

 [^k_exaone]: [K-EXAONE Technical Report](https://www.lgresearch.ai/data/cdn/upload/K-EXAONE_Technical_Report.pdf)

+#### NAVER HyperCLOVA X
+
 [^hyperclova_x]: [HyperCLOVA X Technical Report](https://arxiv.org/abs/2404.01954)

 [^hyperclova_x_think]: [HyperCLOVA X THINK Technical Report](https://huggingface.co/papers/2506.22403)

+### Open-Source Baselines and Surveys
+
+#### AI2 Tulu / Survey
+
 [^tulu_3]: [Tulu 3: Pushing Frontiers in Open Language Model Post-Training](https://openreview.net/forum?id=i1uGbfHHpH)

 [^tulu_3_blog]: [Tulu 3 Technical Blog](https://allenai.org/blog/tulu-3-technical)

docs/chapter09_alignment/intro.md

Lines changed: 5 additions & 5 deletions

@@ -1,6 +1,6 @@
-# Chapter 8: Alignment and Reasoning Reinforcement (DPO / GRPO / RLVR)
+# Chapter 9: Alignment and Reasoning Reinforcement (DPO / GRPO / RLVR)

-In the previous chapter we walked through the complete RLHF pipeline. If you ran that pipeline yourself, a few numbers surely left an impression: four models resident in GPU memory at once (Actor, Ref, Critic, Reward Model), a large batch of on-policy responses to generate every training round, alignment quality hinging directly on the Reward Model's quality, and reward hacking to watch for the whole time.
+In the previous chapter we walked through the [complete RLHF pipeline](../chapter08_rlhf/standard-rlhf-pipeline). If you ran that pipeline yourself, a few numbers surely left an impression: four models resident in GPU memory at once (Actor, Ref, Critic, Reward Model), a large batch of on-policy responses to generate every training round, alignment quality hinging directly on the Reward Model's quality, and reward hacking to watch for the whole time.

 Now take a step back and ask a more fundamental question: **of these four models, can any of them be dropped?**

@@ -12,11 +12,11 @@

 **The first to go is the Reward Model.** In RLHF, the Reward Model's job is to turn human preference judgments ("answer A is better than answer B") into scalar scores. But DPO (Direct Preference Optimization, 2023) discovered a neat mathematical fact: the preference data itself already encodes the reward signal, so there is no need to train an extra RM to "translate" it. All it takes is a different loss function: train the policy model directly on preference pairs, and the effect is equivalent to RLHF. This step turns "four models" into "two models."

-**The second to go is the Critic.** PPO needs a Critic network to estimate the advantage function (recall Chapter 6: $A_t = \sum_l (\gamma\lambda)^l \delta_{t+l}$), and the Critic is a network about as large as the Actor, which outright doubles the memory footprint. GRPO (Group Relative Policy Optimization, 2025) asks: why keep a separate Critic at all? Generate a group of responses to the same prompt and normalize with the group's mean and standard deviation, and that replaces the Critic's baseline estimate. This step slims the "two models" down further.
+**The second to go is the Critic.** PPO needs a Critic network to estimate the advantage function (recall the [advantage function in Chapter 6](../chapter06_actor_critic/advantage-function): $A_t = \sum_l (\gamma\lambda)^l \delta_{t+l}$), and the Critic is a network about as large as the Actor, which outright doubles the memory footprint. GRPO (Group Relative Policy Optimization, 2025) asks: why keep a separate Critic at all? Generate a group of responses to the same prompt and normalize with the group's mean and standard deviation, and that replaces the Critic's baseline estimate. This step slims the "two models" down further.
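A minimal sketch of this group-relative normalization in plain NumPy (the example rewards are invented for illustration):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one prompt's response group."""
    r = np.asarray(rewards, dtype=np.float64)
    # Group mean as baseline, group std as scale: no Critic network needed
    return (r - r.mean()) / (r.std() + eps)

# Four sampled responses to one prompt, scored 1.0 (passes check) or 0.0 (fails)
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[ 1. -1.  1. -1.]
```

With binary rewards like these, the normalization simply pushes correct answers to a positive advantage and wrong ones to a negative advantage.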

 **The third to go is human annotation itself.** Whether it is RLHF's preference data or DPO's chosen/rejected pairs, a human (or a strong model) must judge "which answer is better." But math problems have reference answers, code has test cases, and logical reasoning has verifiable conclusions: in these domains, judging "good vs. bad" needs no human involvement at all. RLVR (Reinforcement Learning with Verifiable Rewards, 2025) simply makes a rule engine the judge, driving annotation cost to nearly zero.
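As a toy illustration of such a rule engine, a hedged sketch of a binary verifier for math answers (the `\boxed{}` convention and the function name are assumptions, not something the text above prescribes):

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Rule-based RLVR reward: 1.0 if the final boxed answer matches the reference, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if matches and matches[-1].strip() == gold_answer.strip() else 0.0

print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... so the answer is \boxed{41}", "42"))  # 0.0
```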

-The three routes are not isolated. They share one deep insight: **the core of RL is not the PPO algorithm itself, but where the training signal comes from.** RLHF uses human preferences as the signal, DPO finds that the preference data itself can encode the signal, GRPO removes one component of the signal processing, and RLVR swaps in an entirely different signal source. Once you grasp the evolution of this "signal source," you hold the through-line of this chapter.
+The three routes are not isolated. They share one deep insight: **the core of RL is not the PPO algorithm itself, but where the training signal comes from.** RLHF ([previous chapter](../chapter08_rlhf/ppo-rlhf-loop)) uses human preferences as the signal, DPO finds that the preference data itself can encode the signal, GRPO removes one component of the signal processing, and RLVR swaps in an entirely different signal source. Once you grasp the evolution of this "signal source," you hold the through-line of this chapter.

 ## From Alignment to Reasoning: An Unexpected Discovery
@@ -37,7 +37,7 @@ DeepSeek-R1's experiments are the most representative: they found that models trained with RLVR

 A common misconception is that "DPO has replaced RLHF" or "GRPO is better than PPO." The reality is more nuanced:

-- **RLHF (previous chapter)** is the "generalist" option: as long as you have preference data, it can align any capability. The cost is high engineering complexity.
+- **RLHF ([previous chapter](../chapter08_rlhf/intro))** is the "generalist" option: as long as you have preference data, it can align any capability. The cost is high engineering complexity.
 - **DPO** is RLHF's "lightweight substitute": when preference data quality is high and the task is simple, it matches RLHF at a far lower cost. But once a task is complex enough to require online exploration, DPO's offline nature becomes the bottleneck.
 - **GRPO / RLVR** open an entirely new route: rather than preference alignment, they use rule-based verification to reinforce reasoning. They complement RLHF instead of competing with it: align with RLHF first, then strengthen reasoning with GRPO/RLVR.
