
# Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)

Last updated: Sep 11, 2025

Doc Author: Ziyi ZENG

*(Figure: DAPO algorithm overview)*

Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) (Yu et al., 2025) is a reinforcement learning framework designed for training large language models (LLMs) on complex reasoning tasks. It improves upon conventional methods by introducing asymmetric clipping through decoupled lower and upper bounds $\epsilon_{\text{low}}$ and $\epsilon_{\text{high}}$, enabling more adaptive policy updates.
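
As a rough illustration, the decoupled clipping can be sketched in PyTorch as follows. This is a minimal sketch, not the repository's actual implementation; tensor names, shapes, and the `eps_high=0.28` default are assumptions for illustration.

```python
import torch

def dapo_clip_loss(log_probs, old_log_probs, advantages, loss_mask,
                   eps_low=0.2, eps_high=0.28):
    """Token-level DAPO-style objective with decoupled clip bounds.

    All inputs are [batch, seq_len] tensors; `loss_mask` marks response
    tokens. Values and names are illustrative assumptions.
    """
    ratio = torch.exp(log_probs - old_log_probs)             # r_{i,t}(theta)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.min(ratio * advantages, clipped * advantages)
    # Token-level aggregation: normalize by the total number of valid
    # tokens so long responses contribute proportionally to the loss.
    return -(per_token * loss_mask).sum() / loss_mask.sum().clamp(min=1)
```

Raising the upper bound above the lower one leaves more room for the probability of low-likelihood tokens to increase, which is the exploration benefit the asymmetric design targets.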

DAPO further employs a dynamic sampling strategy, excluding samples where all responses are uniformly correct or incorrect, ensuring gradient updates come from informative, diverse outputs. To promote high-quality reasoning, it applies token-level losses and reward shaping to discourage overly long or early-terminated responses.
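
The filtering rule itself is simple; here is a minimal sketch (the data layout is an assumption for illustration):

```python
def keep_group(rewards):
    """Dynamic sampling: keep a prompt's group only if rewards differ.

    If every response in the group is correct (or every one is wrong),
    the group-normalized advantage is zero for all tokens and the group
    contributes no gradient, so it is filtered out.
    """
    return len(set(rewards)) > 1

# Example: binary correctness rewards for G=4 responses per prompt.
groups = {"q1": [1, 1, 1, 1], "q2": [0, 1, 0, 1], "q3": [0, 0, 0, 0]}
kept = {q: r for q, r in groups.items() if keep_group(r)}
assert list(kept) == ["q2"]  # uniform groups q1 and q3 are dropped
```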

The core objective is:

$$
J_{\text{DAPO}}(\theta) = \mathbb{E}_{\substack{(q,a) \sim \mathcal{D},\\ \{o_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(o \mid q)}} \left[ \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \min\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \text{clip}\left( r_{i,t}(\theta), \textcolor{red}{1-\epsilon_{\text{low}}}, \textcolor{red}{1+\epsilon_{\text{high}}} \right) \hat{A}_{i,t} \right) \right]
$$

where $\hat{A}_{i,t}$ is the group-normalized advantage and $r_{i,t}(\theta)$ is the token-level policy ratio. Compared to GRPO's symmetric clipping, DAPO's asymmetric design allows finer control over exploration and stability during training.
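
For reference, $\hat{A}_{i,t}$ and $r_{i,t}(\theta)$ take the standard group-relative form (Yu et al., 2025):

$$
\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^G\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^G\right)}, \qquad r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}
$$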

For more details, see the original paper (Yu et al., 2025).

## Algorithm Core Parameters

We only list the parameters that differ from GRPO:

- `actor.overlong_reward_penalty`: Whether to apply the overlong reward penalty.
- `actor.overlong_tokens`: The number of tail tokens within which a response is considered overlong.
- `actor.overlong_penalty_factor`: The scaling factor of the overlong penalty.
- `actor.eps_clip`: The lower clip bound $\epsilon_{\text{low}}$; default is 0.2.
- `actor.eps_clip_higher`: The upper clip bound $\epsilon_{\text{high}}$.
- `actor.dynamic_sampling`: Whether to enable dynamic sampling.

## Overlong Penalty

Here we briefly introduce the implementation details of DAPO's overlong penalty. Following the soft overlong punishment of Yu et al. (2025), a response whose length falls within the last `actor.overlong_tokens` tokens before the maximum generation length receives a penalty that grows linearly with the overflow, scaled by `actor.overlong_penalty_factor`; a response truncated at the maximum length receives the full penalty.
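
A minimal sketch under those assumptions (the mapping of `overlong_tokens` and `overlong_penalty_factor` onto the paper's penalty interval is ours, not necessarily the repository's exact implementation):

```python
def overlong_penalty(resp_len, max_len, overlong_tokens=512,
                     overlong_penalty_factor=1.0):
    """Soft overlong punishment in the style of Yu et al. (2025).

    Responses ending within the last `overlong_tokens` tokens before
    `max_len` are penalized linearly in the overflow; responses
    truncated at `max_len` receive the full penalty. Default values
    are illustrative.
    """
    expected_len = max_len - overlong_tokens
    if resp_len <= expected_len:
        return 0.0
    exceed = min(resp_len, max_len) - expected_len
    return -overlong_penalty_factor * exceed / overlong_tokens

# Ramps from 0 at 512 tokens to -1.0 at the 1024-token limit.
print(overlong_penalty(768, max_len=1024))  # -0.5
```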

## Example Usage

The algorithm is experimental and may not be stable.

We recommend changing the parameters within the configuration file (e.g., `gsm8k_dapo.yaml`).
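
The DAPO-specific fields might look like the following hypothetical excerpt; the key nesting and values are assumptions, so check them against the shipped `gsm8k_dapo.yaml`:

```yaml
actor:
  overlong_reward_penalty: true   # enable the overlong penalty
  overlong_tokens: 512            # tail-token threshold
  overlong_penalty_factor: 1.0    # penalty scale
  eps_clip: 0.2                   # epsilon_low (lower clip bound)
  eps_clip_higher: 0.28           # epsilon_high (upper clip bound)
  dynamic_sampling: true          # filter uniform-reward groups
```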

| Backend | CMD |
| --- | --- |
| local | `python3 -m areal.launcher.local examples/math/gsm8k_dapo.py --config examples/math/gsm8k_dapo.yaml --<other_args_to_overwrite>` |
| ray | `python3 -m areal.launcher.ray examples/math/gsm8k_dapo.py --config examples/math/gsm8k_dapo.yaml --<other_args_to_overwrite>` |
| slurm | `python3 -m areal.launcher.slurm examples/math/gsm8k_dapo.py --config examples/math/gsm8k_dapo.yaml --<other_args_to_overwrite>` |

## Baselines

We still lack baselines; contributions are welcome!