Last updated: Sep 11,2025
Doc Author: Ziyi ZENG
Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) (Yu et al., 2025) is a
reinforcement learning framework designed for training large language models (LLMs) on
complex reasoning tasks. It improves upon conventional methods by introducing
asymmetric clipping through decoupled lower and upper bounds
DAPO further employs a dynamic sampling strategy, excluding samples where all responses are uniformly correct or incorrect, ensuring gradient updates come from informative, diverse outputs. To promote high-quality reasoning, it applies token-level losses and reward shaping to discourage overly long or early-terminated responses.
The core objective is:
$$
J_{\text{DAPO}}(\theta) = \mathbb{E}{\substack{(q,a) \sim \mathcal{D}, \ {o_i}{i=1}^G \sim \pi_{\theta_{\text{old}}}(o|q)}} \left[ \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \min\left( r_{i,t}(\theta) \hat{A}{i,t}, \text{clip}\left( r{i,t}(\theta), \textcolor{red}{1-\epsilon_{\text{low}}}, \textcolor{red}{1+\epsilon_{\text{high}}} \right) \hat{A}_{i,t} \right) \right] $$
where $\hat{A}{i,t}$ is the group-normalized advantage and $r{i,t}(\theta)$ is the token-level policy ratio. Compared to GRPO’s symmetric clipping, DAPO’s asymmetric design allows finer control over exploration and stability during training.
For more details:
-
AReal Detail: Paper of AReal
-
DAPO Detail: Paper of DAPO
We only list the different parameters from GRPO here:
actor.overlong_reward_penalty: Define if overlong_reward_penalty should be used.actor.overlong_tokens: The threshold of tokens at the tail to be considered as overlong.actor.overlong_penalty_factor: The factor of overlong penalty.actor.eps_clip: The lower bound of clipping, default is0.2.actor.eps_clip_higher: The higher bound of clipping.actor.dynamic_sampling: Define if dynamic sampling should be used.
Here we briefly introduce the implementation details of DAPO.
The algorithm is experimental and may not be stable.
We recommend to change the parameter within the configuration file (i.e.gsm8k_dapo.yaml).
| Backend | CMD |
|---|---|
| local | python3 -m areal.launcher.local examples/math/gsm8k_dapo.py --config examples/math/gsm8k_dapo.yaml --<other_args_to_overwrite> |
| ray | python3 -m areal.launcher.ray examples/math/gsm8k_dapo.py --config examples/math/gsm8k_dapo.yaml --<other_args_to_overwrite> |
| slurm | python3 -m areal.launcher.slurm examples/math/gsm8k_dapo.py --config examples/math/gsm8k_dapo.yaml --<other_args_to_overwrite> |
We still lack baseline, welcome to contribute!

