Key drivers of R1-Zero's success. #57
Replies: 3 comments 1 reply
-
DeepSeek-R1-Zero's strong reasoning performance stems from:

- High-Quality Base Model: provides a solid foundation for learning.
- Reinforcement Learning (RL) Training: enhances reasoning without supervised fine-tuning.
- Use of Verifiable Rewards: focuses on tasks with clear correctness criteria, such as math and coding.
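To illustrate the third point, here is a minimal sketch of what a verifiable reward for math problems might look like, assuming the final answer is marked with `\boxed{...}`. The helper names `extract_final_answer` and `verifiable_reward` are illustrative, not from the DeepSeek-R1 paper.

```python
# Hypothetical rule-based "verifiable reward" for math tasks: the reward is 1.0
# only when the final answer extracted from the completion exactly matches the
# reference answer, otherwise 0.0. No learned reward model is involved.
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a model completion."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Binary reward: correct final answer -> 1.0, anything else -> 0.0."""
    predicted = extract_final_answer(completion)
    return 1.0 if predicted is not None and predicted == reference_answer else 0.0

# Example usage
print(verifiable_reward("... so the result is \\boxed{42}", "42"))  # 1.0
print(verifiable_reward("... so the result is \\boxed{41}", "42"))  # 0.0
```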
-
Q1: A high-quality base model is essential for strong reasoning performance. DeepSeek-R1-Zero, for instance, is built on DeepSeek-V3-Base, which has 671 billion parameters; this substantial capacity provides a robust foundation for advanced reasoning tasks.

Q2: While supervised fine-tuning (SFT) can enhance model performance, DeepSeek-R1-Zero was trained exclusively with RL, without any SFT. This approach still led to significant improvements in reasoning capabilities, suggesting that RL alone can be a powerful method for such tasks.

Q3: GRPO is a variant of PPO that omits the value function estimator and instead estimates the baseline from the scores of a group of sampled completions. This simplification can make training more efficient. While PPO is effective, GRPO offers a streamlined alternative that has been applied successfully in models like DeepSeek-R1-Zero.
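To make the Q3 point concrete, here is a minimal sketch of the group-relative baseline, under the assumption that each prompt gets a group of sampled completions whose rewards are normalized by the group mean and standard deviation; this is an illustrative formulation, not DeepSeek's exact implementation, and the function name `group_relative_advantages` is made up for this example.

```python
# Sketch of GRPO's group-relative baseline: sample G completions per prompt,
# score each with the (verifiable) reward, and normalize within the group
# instead of using a learned value function as PPO would.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Advantage of each completion relative to its own group of samples."""
    baseline = rewards.mean()        # group mean replaces the critic's value estimate
    scale = rewards.std() + eps      # normalize so advantages are comparable across groups
    return (rewards - baseline) / scale

# Example: 4 sampled completions for one prompt, two of which got the answer right.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # positive for correct samples, negative for incorrect
```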
-
For Q3, refer to https://x.com/jiayi_pirate/status/1882839504899420517
-
I am interested in better understanding the key components that lead to good reasoning performance in R1-Zero.