feat: Add CISPO (Clipped IS-weight Policy Optimization) #681

Open
kekmodel wants to merge 4 commits into THUDM:main from kekmodel:feature/add-cispo

@kekmodel kekmodel commented Nov 3, 2025

Add CISPO (Clipped IS-weight Policy Optimization)

Summary

Add support for the CISPO (Clipped IS-weight Policy Optimization) algorithm, introduced in the MiniMax-M1 paper.

Background

CISPO addresses a critical limitation in PPO/GRPO where low-probability reasoning tokens (e.g., "However," "Recheck," "Wait") are clipped out after the first on-policy update. As stated in the paper:

"These low-probability tokens, however, are often crucial for stabilizing entropy and facilitating scalable RL."

CISPO solves this by clipping the importance sampling weight instead of token updates, preserving gradient contributions from all tokens.

CISPO Loss Function:

$$\mathcal{L}^{\text{CISPO}}(\theta) = -\mathbb{E}_{(s_i, a_i) \sim \mathcal{D}} \left[ \sum_{t=1}^{T_i} \text{sg}(\hat{r}_{i,t}(\theta)) \cdot A_i \cdot \log \pi_\theta(a_{i,t} | s_{i,<t}) \right]$$

where:

  • $\hat{r}_{i,t}(\theta) = \text{clip}\left(r_{i,t}(\theta),\ 1 - \epsilon_{\text{low}}^{\text{IS}},\ 1 + \epsilon_{\text{high}}^{\text{IS}}\right)$ is the clipped importance sampling ratio
  • $r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} | s_{i,<t})}{\pi_{\theta_{\text{old}}}(a_{i,t} | s_{i,<t})}$ is the per-token importance sampling ratio
  • $A_i$ is the advantage estimate for trajectory $i$
  • sg(·) denotes stop-gradient operation
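
As an illustration of the loss above, here is a minimal pure-Python sketch that evaluates it on toy data. It is not the PR's `compute_cispo_loss()`: the name `cispo_loss_value`, its argument layout, and the choice to approximate the expectation by a mean over trajectories are assumptions made for this example. Plain floats carry no gradients, so sg(·) is represented simply by treating the clipped ratio as a constant weight.

```python
import math


def cispo_loss_value(logp_new, logp_old, advantages, eps_low, eps_high):
    """Evaluate the CISPO loss value on toy data (illustrative sketch only).

    logp_new, logp_old: one list of per-token log-probabilities per trajectory.
    advantages: one scalar advantage A_i per trajectory.
    """
    total = 0.0
    for new_i, old_i, adv_i in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_i, old_i):
            ratio = math.exp(lp_new - lp_old)  # r_{i,t}
            # Clipped IS ratio; in an autograd framework this value would be
            # detached (stop-gradient) before being used as a weight.
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += clipped * adv_i * lp_new
    # Assumption: the expectation over D is approximated by a trajectory mean.
    return -total / len(advantages)


# One trajectory with two tokens; a large eps_low means no effective lower bound.
loss = cispo_loss_value(
    logp_new=[[math.log(0.8), math.log(0.5)]],
    logp_old=[[math.log(0.1), math.log(0.5)]],
    advantages=[1.0],
    eps_low=10.0,
    eps_high=4.0,
)
print(loss)
```

In this toy call the first token's ratio (0.8 / 0.1) exceeds the upper bound of 5, so its weight is capped there while the token still contributes to the loss.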

Key Features:

  • Token-level IS: Clips importance sampling weights per token (Equation 4)
  • Practical clipping: As stated in the paper: "In our experiments, we did not impose a lower bound on the IS weight by setting $\epsilon_{\text{low}}^{\text{IS}}$ to a large value; instead, we only tuned $\epsilon_{\text{high}}^{\text{IS}}$."
  • Stop-gradient: Applies sg(·) to clipped ratios, preserving gradients through log probabilities

Changes

  • slime/utils/ppo_utils.py: Add compute_cispo_loss() function
  • slime/utils/arguments.py: Add 'cispo' to advantage_estimator choices
  • slime/ray/rollout.py: Add CISPO to reward normalization
  • slime/backends/megatron_utils/loss.py: Use CISPO loss when advantage_estimator='cispo'

Implementation Details

  • Separate loss function required: Unlike GRPO/GSPO (which only differ in advantage estimation), CISPO requires a distinct loss computation due to stop-gradient and explicit log probability usage
  • Token-level IS: Uses per-token importance sampling ratios (not sequence-level)
  • Stop-gradient on IS ratio: ratio.detach() prevents gradient flow through the ratio
  • Explicit log probability: Uses ratio_sg * advantages * log_probs instead of ratio * advantages
    • This ensures that the gradient flows through $\log \pi_\theta$ only, matching Equation 4: sg( $\hat{r}_{i,t}$ ) · $A_i$ · log $\pi_\theta$
  • Upper-only clipping: Lower bound not imposed in practice (only $\epsilon_{\text{high}}^{\text{IS}}$ is tuned)
  • Default eps_clip_high: 5.0 (based on ScaleRL paper analysis)
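
To make the difference from PPO-style clipping concrete, the sketch below compares the per-token gradient coefficient with respect to $\log \pi_\theta$ under the two objectives. The formulas are derived by hand from the standard PPO-clip objective min(r·A, clip(r)·A) and from the CISPO term sg(clip(r))·A·log π. The function names are hypothetical, and the code uses the 1 ± ε convention of the equation rather than whatever bounds the PR's implementation uses.

```python
def _clip(ratio, eps_low, eps_high):
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)


def ppo_clip_grad_coeff(ratio, adv, eps_low, eps_high):
    """d(objective)/d(log pi) for the PPO-clip token objective min(r*A, clip(r)*A)."""
    clipped = _clip(ratio, eps_low, eps_high)
    if ratio * adv <= clipped * adv:
        # Unclipped branch is active; d r / d log pi = r.
        return ratio * adv
    # Clipped branch is active and constant in theta: the token gets no gradient.
    return 0.0


def cispo_grad_coeff(ratio, adv, eps_low, eps_high):
    """d(objective)/d(log pi) for the CISPO term sg(clip(r)) * A * log pi."""
    # The clipped ratio is only a weight, so the coefficient is never zeroed out.
    return _clip(ratio, eps_low, eps_high) * adv


# A token whose probability rose sharply (ratio 3.0) with a positive advantage.
print(ppo_clip_grad_coeff(3.0, 1.0, eps_low=0.2, eps_high=0.2))
print(cispo_grad_coeff(3.0, 1.0, eps_low=10.0, eps_high=4.0))
```

Under these assumptions the PPO-clip coefficient for that token is zero, whereas the CISPO coefficient stays proportional to the (bounded) ratio, which is the behavior the bullets above describe.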

Testing

Tested on GSM8K with the Qwen3-0.6B model.

WandB Run: https://api.wandb.ai/links/kekmodel/yz3rhx3x

Configuration:

  • Model: Qwen3-0.6B
  • Dataset: GSM8K
  • Settings: Mean-centering (--disable-grpo-std-normalization)
  • eps_clip_high: 5.0

References

  • MiniMax-M1 paper (arXiv:2506.13585)
  • ScaleRL paper (basis for the eps_clip_high default)

@kekmodel kekmodel force-pushed the feature/add-cispo branch 2 times, most recently from 3774d22 to 2473d31 on November 3, 2025 17:25

kekmodel pushed a commit to kekmodel/slime that referenced this pull request on Nov 3, 2025

This refactoring addresses the feedback from sam571128 on PR THUDM#681 about
integrating CISPO with the REINFORCE + TIS/MIS framework.

## Key Changes

1. **New unified function**: `compute_reinforce_loss_with_is_weights()`
   - Implements: loss = -IS_weight * advantages * log_probs
   - `stop_gradient=True`: CISPO (gradient flows only through log_probs)
   - `stop_gradient=False`: Standard IS-weighted REINFORCE

2. **Backward compatibility**: `compute_cispo_loss()` is now a wrapper
   - Existing code works without changes
   - Internally calls the new unified function with stop_gradient=True

3. **Clear semantics**: Separates three concerns:
   - Base loss computation (REINFORCE: -A * log_π)
   - IS weight computation (ratio truncation)
   - Gradient control (stop-gradient option)

## Design Rationale

After deep analysis, CISPO is mathematically equivalent to:
    CISPO = REINFORCE + IS_weight_with_stop_gradient

This refactoring:
- ✅ Maintains backward compatibility (zero breaking changes)
- ✅ Enables future IS-based algorithms (AWR, V-MPO, etc.)
- ✅ Keeps code simple and maintainable
- ✅ Defers a general framework until there are ≥5 use cases (in the spirit of the "Rule of Three")

## Documentation

- `CISPO_REFACTORING_OPTIONS.md`: Detailed comparison of 3 design options
- `REFACTORING_SUMMARY.md`: Technical analysis and decision rationale
- `test_cispo_equivalence.py`: Equivalence tests (pending torch environment)

## Future Extensions

When we have 5+ IS-based algorithms, this function can easily extend:
- Bidirectional clipping (upper + lower bounds)
- Custom weighting functions (exponential, geometric, etc.)
- Different IS weight formulas

For now, this pragmatic approach balances simplicity with extensibility.

Addresses: Discussion on PR THUDM#681 with @sam571128

kekmodel pushed a commit to kekmodel/slime that referenced this pull request on Nov 4, 2025

This refactoring addresses @sam571128's feedback on PR THUDM#681 to make
CISPO a composable feature rather than a standalone advantage estimator.

## Key Insight

CISPO = REINFORCE + IS_weight_with_stop_gradient

Therefore, CISPO should be a **modifier** that can be applied to any
REINFORCE-based algorithm, not a separate algorithm itself.

## Changes

### 1. Removed CISPO as advantage_estimator
- Removed "cispo" from --advantage-estimator choices
- Removed "cispo" from advantage computation (loss.py:240)
- Removed "cispo" from reward normalization (rollout.py:183, 196)

### 2. Added --use-cispo flag
- New flag to enable CISPO IS weights (arguments.py:697)
- Can be combined with grpo, gspo, reinforce_plus_plus, etc.
- Clear documentation on usage requirements

### 3. Changed policy loss logic
- Before: if advantage_estimator == "cispo"
- After: if use_cispo and advantage_estimator in REINFORCE_ESTIMATORS
- Enables composition: base algorithm + optional IS weighting
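
A rough sketch of the trigger logic described in this section, using only `argparse` from the standard library. The contents of `REINFORCE_ESTIMATORS`, the extra `ppo` choice, and the helper names `build_parser` and `select_policy_loss` are hypothetical; the actual definitions in `arguments.py` and `loss.py` are not shown in this message.

```python
import argparse

# Hypothetical membership: the real REINFORCE_ESTIMATORS set is not listed here.
REINFORCE_ESTIMATORS = {"grpo", "gspo", "reinforce_plus_plus"}


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--advantage-estimator",
        default="grpo",
        # Illustrative choices only; "ppo" stands in for a non-REINFORCE estimator.
        choices=["grpo", "gspo", "reinforce_plus_plus", "ppo"],
    )
    parser.add_argument("--use-cispo", action="store_true")
    parser.add_argument("--eps-clip-high", type=float, default=5.0)
    return parser


def select_policy_loss(args):
    """Return which policy loss would be used for the parsed arguments."""
    if args.use_cispo and args.advantage_estimator in REINFORCE_ESTIMATORS:
        return "cispo"
    return "ppo_clip"


args = build_parser().parse_args(["--advantage-estimator", "gspo", "--use-cispo"])
print(select_policy_loss(args))
```

With this shape, the flag is silently ignored for estimators outside the set; raising an error instead would be an equally reasonable design choice.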

## Migration

**Old usage:**
```bash
--advantage-estimator cispo --eps-clip-high 5.0
```

**New usage:**
```bash
--advantage-estimator grpo --use-cispo --eps-clip-high 5.0
```

## Benefits

✅ **Composition over Inheritance**: Combine CISPO with any REINFORCE algorithm
✅ **Code Reuse**: No duplication between GRPO and CISPO
✅ **Extensibility**: Easy to add other IS methods (AWR, V-MPO, etc.)
✅ **User Flexibility**: Mix-and-match base algorithms with IS weights

## New Possibilities

```bash
# CISPO with GSPO
--advantage-estimator gspo --use-cispo

# CISPO with REINFORCE++
--advantage-estimator reinforce_plus_plus --use-cispo

# Future: Combine multiple IS methods
--use-cispo --use-tis  # (when implemented)
```

## Breaking Change

⚠️ This is a breaking change for early adopters of CISPO.
However, since CISPO was just added in PR THUDM#681 and hasn't been
released, now is the right time to get the architecture correct.

## Implementation Details

The core `compute_cispo_loss()` function remains **unchanged**.
Only the *trigger mechanism* changed:
- Before: Called when advantage_estimator == "cispo"
- After: Called when use_cispo flag is set

See CISPO_COMPOSITION_REFACTORING.md for detailed rationale and examples.

Addresses: Feedback from @sam571128 on PR THUDM#681

Add support for the CISPO algorithm from the MiniMax-M1 paper, which addresses
PPO/GRPO's limitation of clipping out low-probability reasoning tokens.

Changes:
- Add compute_cispo_loss() in slime/utils/ppo_utils.py
- Add 'cispo' to advantage_estimator choices
- Update reward normalization to include CISPO
- Use CISPO loss when advantage_estimator='cispo'

Key implementation details:
- Token-level IS with stop-gradient on clipped ratios
- Explicit log probability: ratio_sg * advantages * log_probs
- Upper-only clipping with default eps_clip_high=5.0
- Direct clipfrac calculation: (ratio > eps_clip_high)

Reference: MiniMax-M1 paper (arxiv:2506.13585)

kekmodel and others added 2 commits on December 5, 2025 01:37

- Update compute_cispo_loss to use eps_clip as absolute lower bound
- Add CISPO support to FSDP backend (was missing)
- Update docstring: "Clipped IS-weight Policy Optimization" per paper
- Paper only tunes eps_clip_high; eps_clip=0 disables lower bound
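
A small sketch of what absolute clipping bounds could look like, assuming (as these bullets and the earlier clipfrac note `(ratio > eps_clip_high)` suggest) that `eps_clip` and `eps_clip_high` are used directly as the lower and upper limits of the IS weight. The helper names and signatures are hypothetical and are not taken from `compute_cispo_loss()`.

```python
def clip_is_weight(ratio, eps_clip=0.0, eps_clip_high=5.0):
    """Clamp an IS ratio to the absolute range [eps_clip, eps_clip_high]."""
    # A ratio of probabilities is always positive, so eps_clip=0 never binds,
    # which is why that setting effectively disables the lower bound.
    return min(max(ratio, eps_clip), eps_clip_high)


def clip_fraction(ratios, eps_clip_high=5.0):
    """Fraction of tokens whose ratio exceeds the upper bound."""
    return sum(r > eps_clip_high for r in ratios) / len(ratios)


ratios = [0.01, 1.0, 6.0, 9.0]
print([clip_is_weight(r) for r in ratios])
print(clip_fraction(ratios))
```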

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@yitianlian (Collaborator)

Hi, the formatting error needs to be fixed for the CI test to pass.

@kekmodel (Contributor, Author)

> Hi, the formatting error needs to be fixed for the CI test to pass.

Fixed the formatting issue; CI is passing now. Thanks!

@MikaStars39

Hi, will this feature be merged? There seem to be some conflicts with the main branch now.
