feat: Add CISPO (Clipped IS-weight Policy Optimization) #681
Open
kekmodel wants to merge 4 commits into THUDM:main from
Conversation
Force-pushed from 3774d22 to 2473d31
kekmodel pushed a commit to kekmodel/slime that referenced this pull request on Nov 3, 2025:
This refactoring addresses the feedback from sam571128 on PR THUDM#681 about integrating CISPO with the REINFORCE + TIS/MIS framework.

## Key Changes

1. **New unified function**: `compute_reinforce_loss_with_is_weights()`
   - Implements: loss = -IS_weight * advantages * log_probs
   - `stop_gradient=True`: CISPO (gradient flows only through log_probs)
   - `stop_gradient=False`: standard IS-weighted REINFORCE
2. **Backward compatibility**: `compute_cispo_loss()` is now a wrapper
   - Existing code works without changes
   - Internally calls the new unified function with `stop_gradient=True`
3. **Clear semantics**: separates three concerns:
   - Base loss computation (REINFORCE: -A * log_π)
   - IS weight computation (ratio truncation)
   - Gradient control (stop-gradient option)

## Design Rationale

After deep analysis, CISPO is mathematically equivalent to:

    CISPO = REINFORCE + IS_weight_with_stop_gradient

This refactoring:
- ✅ Maintains backward compatibility (zero breaking changes)
- ✅ Enables future IS-based algorithms (AWR, V-MPO, etc.)
- ✅ Keeps code simple and maintainable
- ✅ Follows the "Rule of Three" - build frameworks when ≥5 use cases

## Documentation
- `CISPO_REFACTORING_OPTIONS.md`: detailed comparison of 3 design options
- `REFACTORING_SUMMARY.md`: technical analysis and decision rationale
- `test_cispo_equivalence.py`: equivalence tests (pending torch environment)

## Future Extensions

When we have 5+ IS-based algorithms, this function can easily extend to:
- Bidirectional clipping (upper + lower bounds)
- Custom weighting functions (exponential, geometric, etc.)
- Different IS weight formulas

For now, this pragmatic approach balances simplicity with extensibility.

Addresses: discussion on PR THUDM#681 with @sam571128
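The stop-gradient semantics above can be illustrated with a toy one-parameter example (pure Python, no autograd; the names and the one-parameter policy here are illustrative, not the slime API). With the IS weight frozen, the loss is linear in log π, so its gradient reduces to -r · A · ∇log π:

```python
import math

LOGP_OLD = -2.0  # behavior-policy log-prob of the token (toy constant)
ADV = 1.0        # advantage estimate for the token

def ratio(theta):
    # Importance weight r(theta) = pi_theta / pi_old, with the toy
    # parameterization log pi_theta(token) = theta.
    return math.exp(theta - LOGP_OLD)

def loss(theta, theta_frozen):
    # loss = -IS_weight * advantage * log_prob, where the IS weight is
    # evaluated at a frozen copy of theta (the stop-gradient).
    return -ratio(theta_frozen) * ADV * theta

def grad_wrt_theta(theta, h=1e-6):
    # Central difference w.r.t. theta only; theta_frozen stays fixed,
    # mimicking stop_gradient=True: gradient flows only through log_probs.
    return (loss(theta + h, theta) - loss(theta - h, theta)) / (2 * h)
```

At θ = -1 the gradient is -r(θ) · A = -e^((-1)-(-2)) ≈ -2.718: the detached ratio acts as a fixed per-token rescaling of the REINFORCE gradient, exactly the `stop_gradient=True` branch described in the commit.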
kekmodel pushed a commit to kekmodel/slime that referenced this pull request on Nov 4, 2025:
This refactoring addresses @sam571128's feedback on PR THUDM#681 to make CISPO a composable feature rather than a standalone advantage estimator.

## Key Insight

    CISPO = REINFORCE + IS_weight_with_stop_gradient

Therefore, CISPO should be a **modifier** that can be applied to any REINFORCE-based algorithm, not a separate algorithm itself.

## Changes

### 1. Removed CISPO as advantage_estimator
- Removed "cispo" from `--advantage-estimator` choices
- Removed "cispo" from advantage computation (loss.py:240)
- Removed "cispo" from reward normalization (rollout.py:183, 196)

### 2. Added --use-cispo flag
- New flag to enable CISPO IS weights (arguments.py:697)
- Can be combined with grpo, gspo, reinforce_plus_plus, etc.
- Clear documentation on usage requirements

### 3. Changed policy loss logic
- Before: `if advantage_estimator == "cispo"`
- After: `if use_cispo and advantage_estimator in REINFORCE_ESTIMATORS`
- Enables composition: base algorithm + optional IS weighting

## Migration

**Old usage:**
```bash
--advantage-estimator cispo --eps-clip-high 5.0
```

**New usage:**
```bash
--advantage-estimator grpo --use-cispo --eps-clip-high 5.0
```

## Benefits
- ✅ **Composition over inheritance**: combine CISPO with any REINFORCE algorithm
- ✅ **Code reuse**: no duplication between GRPO and CISPO
- ✅ **Extensibility**: easy to add other IS methods (AWR, V-MPO, etc.)
- ✅ **User flexibility**: mix-and-match base algorithms with IS weights

## New Possibilities

```bash
# CISPO with GSPO
--advantage-estimator gspo --use-cispo

# CISPO with REINFORCE++
--advantage-estimator reinforce_plus_plus --use-cispo

# Future: combine multiple IS methods
--use-cispo --use-tis  # (when implemented)
```

## Breaking Change ⚠️

This is a breaking change for early adopters of CISPO. However, since CISPO was just added in PR THUDM#681 and hasn't been released, now is the right time to get the architecture correct.

## Implementation Details

The core `compute_cispo_loss()` function remains **unchanged**. Only the *trigger mechanism* changed:
- Before: called when `advantage_estimator == "cispo"`
- After: called when the `use_cispo` flag is set

See CISPO_COMPOSITION_REFACTORING.md for detailed rationale and examples.

Addresses: feedback from @sam571128 on PR THUDM#681
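A minimal sketch of the new trigger logic (the function name and set contents are assumptions reconstructed from the commit message, not the exact slime code):

```python
# REINFORCE-family estimators that CISPO's IS weighting can compose with
# (membership here is illustrative, per the commit message).
REINFORCE_ESTIMATORS = {"grpo", "gspo", "reinforce_plus_plus"}

def should_apply_cispo(use_cispo: bool, advantage_estimator: str) -> bool:
    # Before: advantage_estimator == "cispo"  (CISPO as its own estimator)
    # After: an orthogonal flag that composes with any REINFORCE-family
    # base algorithm, enabling "base algorithm + optional IS weighting".
    return use_cispo and advantage_estimator in REINFORCE_ESTIMATORS
```

The flag-based check is what makes combinations like `--advantage-estimator gspo --use-cispo` possible without duplicating any loss code.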
zhuzilin approved these changes on Nov 6, 2025
Force-pushed from 2473d31 to 0edbebf
Force-pushed from b342c5e to 087a35c
Add support for the CISPO algorithm from the MiniMax-M1 paper, which addresses PPO/GRPO's limitation of clipping out low-probability reasoning tokens.

Changes:
- Add `compute_cispo_loss()` in slime/utils/ppo_utils.py
- Add 'cispo' to advantage_estimator choices
- Update reward normalization to include CISPO
- Use CISPO loss when advantage_estimator='cispo'

Key implementation details:
- Token-level IS with stop-gradient on clipped ratios
- Explicit log probability: ratio_sg * advantages * log_probs
- Upper-only clipping with default eps_clip_high=5.0
- Direct clipfrac calculation: (ratio > eps_clip_high)

Reference: MiniMax-M1 paper (arxiv:2506.13585)
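The per-token computation described in the commit can be sketched in plain Python (a hedged sketch of the math only; the real `compute_cispo_loss()` operates on torch tensors and detaches the clipped ratio, which changes gradients but not the forward value computed here):

```python
import math

def cispo_token_loss(logp_new, logp_old, advantage,
                     eps_clip=0.0, eps_clip_high=5.0):
    """Scalar per-token CISPO loss: -sg(clip(ratio)) * A * log_prob.

    Stop-gradient affects backprop only, not the loss value, so this
    plain-float sketch just computes the forward value.
    """
    ratio = math.exp(logp_new - logp_old)            # pi_new / pi_old
    ratio_clipped = min(max(ratio, eps_clip), eps_clip_high)
    return -ratio_clipped * advantage * logp_new

def clipfrac(ratios, eps_clip_high=5.0):
    # Fraction of tokens whose IS ratio exceeded the upper bound,
    # mirroring the direct (ratio > eps_clip_high) calculation above.
    return sum(r > eps_clip_high for r in ratios) / len(ratios)
```

For a token whose ratio e^((-1)-(-3)) ≈ 7.39 exceeds the bound, the weight is truncated to 5.0 and the loss becomes -5.0 · A · log π, so the token still contributes a (bounded) gradient through log π.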
Force-pushed from 087a35c to 095208d
- Update compute_cispo_loss to use eps_clip as an absolute lower bound
- Add CISPO support to the FSDP backend (was missing)
- Update docstring: "Clipped IS-weight Policy Optimization" per the paper
- Paper only tunes eps_clip_high; eps_clip=0 disables the lower bound

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
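The updated bounds behave as follows (an illustrative sketch, assuming `eps_clip` and `eps_clip_high` are absolute bounds on the ratio as the commit describes; `eps_clip=0` is a no-op lower bound because an importance ratio is always positive):

```python
def truncate_ratio(ratio, eps_clip=0.0, eps_clip_high=5.0):
    # Absolute clipping of the IS ratio. The paper tunes only the upper
    # bound; eps_clip=0 disables the lower bound since ratios are > 0.
    return min(max(ratio, eps_clip), eps_clip_high)
```

So with the defaults a spiking ratio of 7.0 is truncated to 5.0, a small ratio like 0.01 passes through unchanged, and only a nonzero `eps_clip` activates the lower bound.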
Collaborator
Hi, the formatting error needs to be fixed for the CI test to pass.
Contributor
Author
Fixed the formatting issue; CI is passing now. Thanks!
Hi, will this feature be merged? There seem to be some conflicts with the main branch now.
Add CISPO (Clipped IS-weight Policy Optimization)
Summary
Add support for CISPO (Clipped IS-weight Policy Optimization) algorithm introduced in the MiniMax-M1 paper.
Background
CISPO addresses a critical limitation of PPO/GRPO: low-probability reasoning tokens (e.g., "However," "Recheck," "Wait") are clipped out after the first on-policy update, as discussed in the MiniMax-M1 paper.
CISPO solves this by clipping the importance sampling weight instead of the token update, preserving gradient contributions from all tokens.
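The contrast can be made concrete with a single low-probability token whose ratio spikes after one update (toy numbers; the PPO epsilon is an illustrative common default, while `eps_clip_high=5.0` is this PR's default):

```python
ratio = 6.0           # pi_new / pi_old for a rare reasoning token
adv = 1.0             # positive advantage
eps = 0.2             # PPO clipping epsilon (illustrative default)
eps_clip_high = 5.0   # CISPO upper bound (this PR's default)

# PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A). With r > 1+eps and
# A > 0 the clipped branch is selected, and its gradient w.r.t. the
# policy is zero: the token stops contributing to learning.
ppo_term = min(ratio * adv, min(max(ratio, 1 - eps), 1 + eps) * adv)

# CISPO instead clips (and detaches) the *weight*; the token still
# contributes a bounded gradient through log pi.
cispo_weight = min(ratio, eps_clip_high)
```

Here the PPO term saturates at 1.2 with zero policy gradient, while CISPO keeps a nonzero, bounded weight of 5.0 on the token's log-probability gradient.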
CISPO Loss Function:

$$
\mathcal{L}_{\text{CISPO}}(\theta) = -\,\mathrm{sg}\big(\hat{r}_t(\theta)\big)\,\hat{A}_t \log \pi_\theta(o_t \mid q, o_{<t}), \qquad \hat{r}_t(\theta) = \mathrm{clip}\big(r_t(\theta),\ \epsilon_{\text{clip}},\ \epsilon_{\text{clip\_high}}\big)
$$

where:
- $r_t(\theta) = \pi_\theta(o_t \mid q, o_{<t}) / \pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})$ is the token-level importance sampling ratio
- $\mathrm{sg}(\cdot)$ denotes stop-gradient (`ratio.detach()` in the implementation)
- $\hat{A}_t$ is the advantage estimate
- $\epsilon_{\text{clip\_high}}$ (default 5.0) is the absolute upper bound; $\epsilon_{\text{clip}} = 0$ disables the lower bound

Key Features:
- Clips the IS weight rather than the policy update, so every token keeps a bounded gradient contribution
- Gradient flows only through $\log \pi_\theta$; the clipped, detached ratio acts as a fixed per-token weight
- Upper-only clipping by default (`eps_clip_high=5.0`, `eps_clip=0`)
Changes
- `slime/utils/ppo_utils.py`: Add `compute_cispo_loss()` function
- `slime/utils/arguments.py`: Add `'cispo'` to `advantage_estimator` choices
- `slime/ray/rollout.py`: Add CISPO to reward normalization
- `slime/backends/megatron_utils/loss.py`: Use CISPO loss when `advantage_estimator='cispo'`

Implementation Details
- `ratio.detach()` prevents gradient flow through the ratio
- `ratio_sg * advantages * log_probs` instead of `ratio * advantages`

Testing
Tested on GSM8K with Qwen3-0.6B model.
WandB Run: https://api.wandb.ai/links/kekmodel/yz3rhx3x
Configuration:
References
- MiniMax-M1 Paper: Scaling Test-Time Compute Efficiently with Lightning Attention
- ScaleRL Paper: The Art of Scaling Reinforcement Learning Compute for LLMs