feat: Add CISPO (Clipped IS-weight Policy Optimization) by kekmodel · Pull Request #681 · THUDM/slime

kekmodel · 2025-11-03T15:15:59Z

Add CISPO (Clipped IS-weight Policy Optimization)

Summary

Add support for the CISPO (Clipped IS-weight Policy Optimization) algorithm, introduced in the MiniMax-M1 paper.

Background

CISPO addresses a critical limitation in PPO/GRPO where low-probability reasoning tokens (e.g., "However," "Recheck," "Wait") are clipped out after the first on-policy update. As stated in the paper:

"These low-probability tokens, however, are often crucial for stabilizing entropy and facilitating scalable RL."

CISPO solves this by clipping the importance sampling weight instead of token updates, preserving gradient contributions from all tokens.

CISPO Loss Function:

$$\mathcal{L}^{\text{CISPO}}(\theta) = -\mathbb{E}_{(s_i, a_i) \sim \mathcal{D}} \left[ \sum_{t=1}^{T_i} \text{sg}(\hat{r}_{i,t}(\theta)) \cdot A_i \cdot \log \pi_\theta(a_{i,t} | s_{i,<t}) \right]$$

where:

- $\hat{r}_{i,t}(\theta) = \text{clip}\left(r_{i,t}(\theta),\ 1 - \epsilon_{\text{low}}^{\text{IS}},\ 1 + \epsilon_{\text{high}}^{\text{IS}}\right)$ is the clipped importance sampling ratio
- $r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} | s_{i,<t})}{\pi_{\theta_{\text{old}}}(a_{i,t} | s_{i,<t})}$ is the per-token importance sampling ratio
- $A_i$ is the advantage estimate for trajectory $i$
- $\text{sg}(\cdot)$ denotes the stop-gradient operation
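
Because sg(·) treats the clipped ratio as a constant, the gradient can be read off the loss directly. The short derivation below uses only the symbols defined above and is a sketch of the consequence, not a quotation from the paper:

$$\nabla_\theta \mathcal{L}^{\text{CISPO}}(\theta) = -\mathbb{E}_{(s_i, a_i) \sim \mathcal{D}} \left[ \sum_{t=1}^{T_i} \hat{r}_{i,t}(\theta) \cdot A_i \cdot \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,<t}) \right]$$

Since $r_{i,t}(\theta) > 0$, the clipped weight $\hat{r}_{i,t}(\theta)$ is also strictly positive, so clipping bounds the size of each token's gradient term but never sets it to zero.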

Key Features:

- Token-level IS: Clips importance sampling weights per token (Equation 4)
- Practical clipping: As stated in the paper: "In our experiments, we did not impose a lower bound on the IS weight by setting $\epsilon_{\text{low}}^{\text{IS}}$ to a large value; instead, we only tuned $\epsilon_{\text{high}}^{\text{IS}}$."
- Stop-gradient: Applies sg(·) to clipped ratios, preserving gradients through log probabilities
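
To make the difference from PPO-style clipping concrete, here is a minimal pure-Python sketch that compares the per-token coefficient multiplying $\nabla_\theta \log \pi_\theta$ under each objective. The function names, the ratios, and the epsilon values are illustrative assumptions, not values from the PR or the paper.

```python
def ppo_grad_weight(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Coefficient on grad log pi for PPO's clipped surrogate (one token).

    The surrogate is min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A).
    When the clipped branch is strictly smaller it is a constant in theta,
    so the token contributes no gradient at all.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    if clipped * advantage < ratio * advantage:
        return 0.0
    return ratio * advantage


def cispo_grad_weight(ratio, advantage, eps_high_is=1.0):
    """Coefficient on grad log pi for CISPO with upper-only clipping.

    The weight is capped at 1 + eps_high_is but stays non-zero.
    """
    return min(ratio, 1.0 + eps_high_is) * advantage


# Illustrative ratios; the last two exceed PPO's upper clip bound of 1.2.
ratios = [0.9, 1.0, 1.5, 3.0]
ppo_weights = [ppo_grad_weight(r, 1.0) for r in ratios]
cispo_weights = [cispo_grad_weight(r, 1.0) for r in ratios]

print(ppo_weights)    # [0.9, 1.0, 0.0, 0.0]
print(cispo_weights)  # [0.9, 1.0, 1.5, 2.0]
```

With a positive advantage, the two tokens whose ratio exceeds the PPO bound drop out of the PPO gradient entirely, while CISPO keeps them with a capped weight.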

Changes

- slime/utils/ppo_utils.py: Add compute_cispo_loss() function
- slime/utils/arguments.py: Add 'cispo' to advantage_estimator choices
- slime/ray/rollout.py: Add CISPO to reward normalization
- slime/backends/megatron_utils/loss.py: Use CISPO loss when advantage_estimator='cispo'

Implementation Details

- Separate loss function required: Unlike GRPO/GSPO (which differ only in advantage estimation), CISPO requires a distinct loss computation due to stop-gradient and explicit log probability usage
- Token-level IS: Uses per-token importance sampling ratios (not sequence-level)
- Stop-gradient on IS ratio: `ratio.detach()` prevents gradient flow through the ratio
- Explicit log probability: Uses `ratio_sg * advantages * log_probs` instead of `ratio * advantages`
  - This ensures gradient flows through $\log \pi_\theta$ only, matching Equation 4: $\text{sg}(\hat{r}_{i,t}) \cdot A_i \cdot \log \pi_\theta$
- Upper-only clipping: Lower bound not imposed in practice (only $\epsilon_{\text{high}}^{\text{IS}}$ is tuned)
- Default eps_clip_high: 5.0 (based on ScaleRL paper analysis)
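
The points above can be sketched in PyTorch as follows. This is a minimal illustration, not the PR's `compute_cispo_loss()`: the function name `cispo_loss_sketch`, the mean-over-tokens reduction, and the treatment of `eps_clip_high` as an absolute cap on the ratio (suggested by the commit note's `(ratio > eps_clip_high)` clipfrac) are all assumptions.

```python
import torch


def cispo_loss_sketch(log_probs, old_log_probs, advantages, eps_clip_high=5.0):
    """Hypothetical CISPO-style loss; returns (loss, clipfrac)."""
    # Per-token importance sampling ratio r = pi_theta / pi_theta_old.
    ratio = torch.exp(log_probs - old_log_probs)
    # Upper-only clipping (assumption: eps_clip_high is an absolute cap).
    ratio_clipped = torch.clamp(ratio, max=eps_clip_high)
    # Stop-gradient: the weight is treated as a constant in the backward pass.
    ratio_sg = ratio_clipped.detach()
    # Gradient flows only through log_probs.
    per_token_loss = -ratio_sg * advantages * log_probs
    # Fraction of tokens whose ratio exceeded the cap.
    clipfrac = (ratio > eps_clip_high).float().mean()
    # Assumption: simple mean over tokens; the real code may reduce differently.
    return per_token_loss.mean(), clipfrac


# Toy batch of three tokens; the second has ratio exp(2) ~ 7.39, above the cap.
log_probs = torch.tensor([-0.5, -2.0, -0.1], requires_grad=True)
old_log_probs = torch.tensor([-0.5, -4.0, -0.1])
advantages = torch.tensor([1.0, 1.0, 1.0])

loss, clipfrac = cispo_loss_sketch(log_probs, old_log_probs, advantages)
loss.backward()
# The clipped token still receives a gradient, scaled by the cap (5.0).
print(log_probs.grad)
```

In this toy batch the gradient for each token equals `-ratio_sg * advantage / 3`, so the over-cap token gets the largest (bounded) update rather than none.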

Testing

Tested on GSM8K with Qwen3-0.6B model.

WandB Run: https://api.wandb.ai/links/kekmodel/yz3rhx3x

Configuration:

- Model: Qwen3-0.6B
- Dataset: GSM8K
- Settings: Mean-centering (--disable-grpo-std-normalization)
- eps_clip_high: 5.0

References

- MiniMax-M1 Paper: Scaling Test-Time Compute Efficiently with Lightning Attention
  - Section 3.1: CISPO algorithm definition and motivation
  - Equation 4: CISPO loss function with token-level IS and stop-gradient
  - Equation 5: Clipped IS weight definition
- ScaleRL Paper: The Art of Scaling Reinforcement Learning Compute for LLMs
  - Section 3.2: Empirical comparison showing CISPO outperforms DAPO and GSPO
  - Demonstrates CISPO's robustness to IS-clipping parameter choices

@sam571128

This refactoring addresses the feedback from sam571128 on PR THUDM#681 about integrating CISPO with the REINFORCE + TIS/MIS framework.

## Key Changes

1. **New unified function**: `compute_reinforce_loss_with_is_weights()`
   - Implements: loss = -IS_weight * advantages * log_probs
   - `stop_gradient=True`: CISPO (gradient flows only through log_probs)
   - `stop_gradient=False`: Standard IS-weighted REINFORCE
2. **Backward compatibility**: `compute_cispo_loss()` is now a wrapper
   - Existing code works without changes
   - Internally calls the new unified function with stop_gradient=True
3. **Clear semantics**: Separates three concerns:
   - Base loss computation (REINFORCE: -A * log_π)
   - IS weight computation (ratio truncation)
   - Gradient control (stop-gradient option)

## Design Rationale

After deep analysis, CISPO is mathematically equivalent to:

CISPO = REINFORCE + IS_weight_with_stop_gradient

This refactoring:

- ✅ Maintains backward compatibility (zero breaking changes)
- ✅ Enables future IS-based algorithms (AWR, V-MPO, etc.)
- ✅ Keeps code simple and maintainable
- ✅ Follows "Rule of Three" - build frameworks when ≥5 use cases

## Documentation

- `CISPO_REFACTORING_OPTIONS.md`: Detailed comparison of 3 design options
- `REFACTORING_SUMMARY.md`: Technical analysis and decision rationale
- `test_cispo_equivalence.py`: Equivalence tests (pending torch environment)

## Future Extensions

When we have 5+ IS-based algorithms, this function can easily extend:

- Bidirectional clipping (upper + lower bounds)
- Custom weighting functions (exponential, geometric, etc.)
- Different IS weight formulas

For now, this pragmatic approach balances simplicity with extensibility.

Addresses: Discussion on PR THUDM#681 with @sam571128
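
As a rough illustration of what the `stop_gradient` switch changes, the PyTorch sketch below compares the gradients produced with and without detaching the IS weight. The function name `reinforce_is_loss_sketch`, its signature, and the mean reduction are hypothetical and are not taken from `compute_reinforce_loss_with_is_weights()`.

```python
import torch


def reinforce_is_loss_sketch(log_probs, old_log_probs, advantages,
                             eps_clip_high=5.0, stop_gradient=True):
    """Hypothetical IS-weighted REINFORCE loss: -IS_weight * A * log_prob."""
    weight = torch.clamp(torch.exp(log_probs - old_log_probs), max=eps_clip_high)
    if stop_gradient:
        # CISPO-style: the weight is a constant in the backward pass.
        weight = weight.detach()
    return (-weight * advantages * log_probs).mean()


def grad_of(stop_gradient):
    """Gradient w.r.t. log_probs for a fixed on-policy toy batch (ratio = 1)."""
    log_probs = torch.tensor([-1.0, -2.0], requires_grad=True)
    old_log_probs = torch.tensor([-1.0, -2.0])
    advantages = torch.tensor([1.0, -1.0])
    loss = reinforce_is_loss_sketch(log_probs, old_log_probs, advantages,
                                    stop_gradient=stop_gradient)
    loss.backward()
    return log_probs.grad


grad_sg = grad_of(True)      # -weight * A / N per token
grad_no_sg = grad_of(False)  # extra term from differentiating the weight
print(grad_sg, grad_no_sg)
```

With the weight detached, each token's gradient is `-weight * A / N`. Without it, the weight itself depends on `log_probs`, so the per-token gradient becomes `-A * weight * (1 + log_prob) / N` for unclipped tokens, which is why the two settings give different updates even on-policy.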

@sam571128

This refactoring addresses @sam571128's feedback on PR THUDM#681 to make CISPO a composable feature rather than a standalone advantage estimator.

## Key Insight

CISPO = REINFORCE + IS_weight_with_stop_gradient

Therefore, CISPO should be a **modifier** that can be applied to any REINFORCE-based algorithm, not a separate algorithm itself.

## Changes

### 1. Removed CISPO as advantage_estimator

- Removed "cispo" from --advantage-estimator choices
- Removed "cispo" from advantage computation (loss.py:240)
- Removed "cispo" from reward normalization (rollout.py:183, 196)

### 2. Added --use-cispo flag

- New flag to enable CISPO IS weights (arguments.py:697)
- Can be combined with grpo, gspo, reinforce_plus_plus, etc.
- Clear documentation on usage requirements

### 3. Changed policy loss logic

- Before: if advantage_estimator == "cispo"
- After: if use_cispo and advantage_estimator in REINFORCE_ESTIMATORS
- Enables composition: base algorithm + optional IS weighting

## Migration

**Old usage:**

```bash
--advantage-estimator cispo --eps-clip-high 5.0
```

**New usage:**

```bash
--advantage-estimator grpo --use-cispo --eps-clip-high 5.0
```

## Benefits

- ✅ **Composition over Inheritance**: Combine CISPO with any REINFORCE algorithm
- ✅ **Code Reuse**: No duplication between GRPO and CISPO
- ✅ **Extensibility**: Easy to add other IS methods (AWR, V-MPO, etc.)
- ✅ **User Flexibility**: Mix-and-match base algorithms with IS weights

## New Possibilities

```bash
# CISPO with GSPO
--advantage-estimator gspo --use-cispo

# CISPO with REINFORCE++
--advantage-estimator reinforce_plus_plus --use-cispo

# Future: Combine multiple IS methods
--use-cispo --use-tis  # (when implemented)
```

## Breaking Change

⚠️ This is a breaking change for early adopters of CISPO. However, since CISPO was just added in PR THUDM#681 and hasn't been released, now is the right time to get the architecture correct.

## Implementation Details

The core `compute_cispo_loss()` function remains **unchanged**. Only the *trigger mechanism* changed:

- Before: Called when advantage_estimator == "cispo"
- After: Called when use_cispo flag is set

See CISPO_COMPOSITION_REFACTORING.md for detailed rationale and examples.

Addresses: Feedback from @sam571128 on PR THUDM#681
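
The flag-based trigger described in this commit message could look roughly like the sketch below. It uses only the standard library; the contents of `REINFORCE_ESTIMATORS`, the helper names, and the `"ppo_clip"` fallback label are assumptions made for illustration and are not copied from `arguments.py` or `loss.py`.

```python
import argparse

# Assumption: the commit message lists these estimators only as examples.
REINFORCE_ESTIMATORS = {"grpo", "gspo", "reinforce_plus_plus"}


def build_parser():
    """Hypothetical subset of the CLI flags named in the commit message."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--advantage-estimator", default="grpo")
    parser.add_argument("--use-cispo", action="store_true")
    parser.add_argument("--eps-clip-high", type=float, default=5.0)
    return parser


def select_policy_loss(args):
    """Return which loss branch would run ("ppo_clip" is a placeholder name)."""
    if args.use_cispo and args.advantage_estimator in REINFORCE_ESTIMATORS:
        return "cispo"
    return "ppo_clip"


new_style = build_parser().parse_args(
    ["--advantage-estimator", "grpo", "--use-cispo", "--eps-clip-high", "5.0"]
)
without_flag = build_parser().parse_args(["--advantage-estimator", "gspo"])

print(select_policy_loss(new_style))     # cispo
print(select_policy_loss(without_flag))  # ppo_clip
```

Keeping the estimator choice and the IS-weighting flag as separate arguments is what allows the combinations listed under "New Possibilities" without adding a new estimator per combination.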

Add support for CISPO algorithm from MiniMax-M1 paper, which addresses PPO/GRPO's limitation of clipping out low-probability reasoning tokens.

Changes:

- Add compute_cispo_loss() in slime/utils/ppo_utils.py
- Add 'cispo' to advantage_estimator choices
- Update reward normalization to include CISPO
- Use CISPO loss when advantage_estimator='cispo'

Key implementation details:

- Token-level IS with stop-gradient on clipped ratios
- Explicit log probability: ratio_sg * advantages * log_probs
- Upper-only clipping with default eps_clip_high=5.0
- Direct clipfrac calculation: (ratio > eps_clip_high)

Reference: MiniMax-M1 paper (arxiv:2506.13585)

- Update compute_cispo_loss to use eps_clip as absolute lower bound
- Add CISPO support to FSDP backend (was missing)
- Update docstring: "Clipped IS-weight Policy Optimization" per paper
- Paper only tunes eps_clip_high; eps_clip=0 disables lower bound

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

yitianlian · 2025-12-05T03:13:48Z

Hi, the format error needs to be solved to pass the ci test.

kekmodel · 2025-12-24T05:49:07Z

> Hi, the format error needs to be solved to pass the ci test.

Fixed the formatting issue—CI is passing now. Thanks!

MikaStars39 · 2026-02-11T06:13:01Z

Hi, will this feature be merged? There seem to be some conflicts with the main branch now.

kekmodel force-pushed the feature/add-cispo branch 2 times, most recently from 3774d22 to 2473d31 Compare November 3, 2025 17:25

zhuzilin approved these changes Nov 6, 2025

kekmodel force-pushed the feature/add-cispo branch from 2473d31 to 0edbebf Compare November 6, 2025 15:45

kekmodel force-pushed the feature/add-cispo branch from b342c5e to 087a35c Compare December 2, 2025 13:10

kekmodel force-pushed the feature/add-cispo branch from 087a35c to 095208d Compare December 2, 2025 13:15

kekmodel and others added 2 commits December 5, 2025 01:37

Merge branch 'main' into feature/add-cispo

b7b458f

Merge branch 'main' into feature/add-cispo

0d79b0c

kekmodel wants to merge 4 commits into THUDM:main from kekmodel:feature/add-cispo

4 participants