kekmodel
diff --git a/CISPO_REFACTORING_OPTIONS.md b/CISPO_REFACTORING_OPTIONS.md
@@ -0,0 +1,309 @@
+# CISPO Refactoring: Design Comparison
+
+## Background
+
+CISPO (Clipped IS-weight Policy Optimization) can be decomposed as:
+```
+CISPO = REINFORCE + IS_weight_with_stop_gradient
+loss = -sg(min(ratio, ε_max)) * A * log_π
+```
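To see why the stop-gradient matters, it can help to write out the gradient. The following is only a sketch, using notation this document does not define elsewhere: $r_\theta = \pi_\theta / \pi_{\text{old}}$ is the importance ratio, $\operatorname{sg}$ treats its argument as a constant, and the last line assumes the unclipped case $r_\theta < \epsilon_{\max}$.

```latex
\begin{aligned}
L(\theta) &= -\operatorname{sg}\!\big(\min(r_\theta, \epsilon_{\max})\big)\, A \,\log \pi_\theta \\
\nabla_\theta L &= -\min(r_\theta, \epsilon_{\max})\, A \,\nabla_\theta \log \pi_\theta
  && \text{(with stop-gradient)} \\
\nabla_\theta \big(-r_\theta\, A \log \pi_\theta\big)
  &= -r_\theta\, A \,\big(1 + \log \pi_\theta\big)\, \nabla_\theta \log \pi_\theta
  && \text{(without stop-gradient, using } \nabla_\theta r_\theta = r_\theta \nabla_\theta \log \pi_\theta\text{)}
\end{aligned}
```

With the stop-gradient, the clipped ratio only rescales the REINFORCE gradient. Without it, the ratio contributes an extra term of its own.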
+
+This document compares three architectural approaches for integrating CISPO into the existing codebase.
+
+---
+
+## Option 1: Unified IS Weight Function (IMPLEMENTED)
+
+### Architecture
+
+```python
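+# slime/utils/ppo_utils.py
+import torch
+
+
+def compute_reinforce_loss_with_is_weights(
+    ppo_kl, log_probs, advantages, eps_clip_high, stop_gradient=True
+):
+    """Unified function for REINFORCE + IS weights."""
+    # Recover the importance ratio from ppo_kl.
+    ratio = (-ppo_kl).exp()
+    ratio_truncated = torch.clamp(ratio, max=eps_clip_high)
+    # Detaching turns the weight into a constant, so gradients flow only through log_probs.
+    is_weights = ratio_truncated.detach() if stop_gradient else ratio_truncated
+    pg_losses = -is_weights * advantages * log_probs
+    # Per-token truncation indicator (previously returned without being defined).
+    clipfrac = (ratio > eps_clip_high).float()
+    return pg_losses, clipfrac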
+
+# Backward compatibility wrapper
+def compute_cispo_loss(ppo_kl, log_probs, advantages, eps_clip_high):
+    return compute_reinforce_loss_with_is_weights(
+        ppo_kl, log_probs, advantages, eps_clip_high, stop_gradient=True
+    )
+```
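As a worked illustration of the arithmetic in the function above, here is a scalar analogue in plain Python. It is a sketch, not the library code: `cispo_token_loss` and the sample token values are hypothetical, and it uses floats instead of tensors so it runs without PyTorch.

```python
import math


def cispo_token_loss(ppo_kl, log_prob, advantage, eps_clip_high):
    """Scalar analogue of the tensor function: returns (loss, clipped flag)."""
    ratio = math.exp(-ppo_kl)              # importance ratio
    weight = min(ratio, eps_clip_high)     # truncate from above only
    clipped = 1.0 if ratio > eps_clip_high else 0.0
    return -weight * advantage * log_prob, clipped


# Hypothetical tokens as (ppo_kl, log_prob, advantage).
tokens = [
    (-1.0, -0.5, 2.0),   # ratio = e^1, above the bound, so the weight becomes 2.0
    (0.0, -0.5, 2.0),    # ratio = 1.0, untouched
    (0.5, -1.0, -1.0),   # ratio = e^-0.5, untouched
]
results = [cispo_token_loss(k, lp, a, eps_clip_high=2.0) for k, lp, a in tokens]
losses = [loss for loss, _ in results]
clipfrac = sum(flag for _, flag in results) / len(results)
```

Only the first token is truncated, so one third of the tokens count toward the clip fraction.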
+
+### Usage
+
+```python
+# loss.py (unchanged structure)
+if args.advantage_estimator == "cispo":
+    pg_loss, pg_clipfrac = compute_cispo_loss(ppo_kl, log_probs, advantages, args.eps_clip_high)
+else:
+    pg_loss, pg_clipfrac = compute_policy_loss(ppo_kl, advantages, args.eps_clip, args.eps_clip_high)
+```
+
+### Pros ✅
+
+1. **Backward compatibility**: Existing code works without changes
+2. **Clear semantics**: Function name describes exactly what it does
+3. **Easy to understand**: stop_gradient parameter is self-explanatory
+4. **Low risk**: Minimal changes to existing codebase
+5. **Self-contained**: All logic in one place
+
+### Cons 🔴
+
+1. **Code duplication**: The IS weight logic is similar to what MIS/TIS already implement
+2. **Inconsistent with MIS**: MIS lives in a separate module, while this logic is inline in ppo_utils.py
+3. **Limited reusability**: Only works for REINFORCE-style losses
+
+### Future Extensions
+
+```python
+# Can easily extend to other IS-based algorithms:
+def compute_reinforce_loss_with_is_weights(
+    ppo_kl, log_probs, advantages,
+    eps_clip_high, eps_clip_low=None,  # ← Add lower bound
+    stop_gradient=True,
+    weighting_fn=None  # ← Add custom weighting (AWR, etc.)
+):
+    ratio = (-ppo_kl).exp()
+
+    if eps_clip_low is not None:
+        ratio_clipped = torch.clamp(ratio, min=eps_clip_low, max=eps_clip_high)
+    else:
+        ratio_clipped = torch.clamp(ratio, max=eps_clip_high)
+    # Record clipping before any custom weighting changes the values.
+    clipfrac = (ratio_clipped != ratio).float()
+
+    if weighting_fn is not None:
+        ratio_clipped = weighting_fn(ratio_clipped, advantages)
+
+    is_weights = ratio_clipped.detach() if stop_gradient else ratio_clipped
+    pg_losses = -is_weights * advantages * log_probs
+    return pg_losses, clipfrac
+```
+
+---
+
+## Option 2: Separate IS Weight Module
+
+### Architecture
+
+```python
+# slime/utils/is_weights.py (NEW FILE)
+from typing import Dict, Tuple
+
+import torch
+
+
+class ISWeightComputer:
+    """Unified IS weight computation framework."""
+
+    @staticmethod
+    def compute_weights(
+        ppo_kl: torch.Tensor,
+        mode: str,  # "cispo", "tis", "truncate", etc.
+        stop_gradient: bool = False,
+        **kwargs
+    ) -> Tuple[torch.Tensor, Dict]:
+        """Compute IS weights for various algorithms."""
+        ratio = (-ppo_kl).exp()
+
+        if mode == "cispo":
+            weights = ratio.clamp(max=kwargs["eps_clip_high"])
+            clipfrac = (ratio > kwargs["eps_clip_high"]).float()
+        elif mode == "tis":
+            weights = ratio.clamp(
+                min=kwargs["tis_clip_low"],
+                max=kwargs["tis_clip"]
+            )
+            clipfrac = (weights != ratio).float()
+        elif mode == "truncate":
+            weights = ratio.clamp(max=kwargs["upper_bound"])
+            clipfrac = (ratio > kwargs["upper_bound"]).float()
+        else:
+            raise ValueError(f"Unknown mode: {mode}")
+
+        if stop_gradient:
+            weights = weights.detach()
+
+        metrics = {"clipfrac": clipfrac}
+        return weights, metrics
+
+# slime/utils/ppo_utils.py
+def compute_cispo_loss(ppo_kl, log_probs, advantages, eps_clip_high):
+    """CISPO loss using IS weight framework."""
+    from slime.utils.is_weights import ISWeightComputer
+
+    # Compute REINFORCE loss
+    reinforce_loss = -advantages * log_probs
+
+    # Apply CISPO IS weights
+    is_weights, metrics = ISWeightComputer.compute_weights(
+        ppo_kl, mode="cispo", stop_gradient=True, eps_clip_high=eps_clip_high
+    )
+
+    pg_losses = reinforce_loss * is_weights
+    return pg_losses, metrics["clipfrac"]
+```
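If Option 2 were pursued, the `if/elif` chain on `mode` could instead be a small registry, so that adding an algorithm does not touch the dispatcher. The sketch below is an assumption about how that might look, not part of the proposed module: `register_mode`, `compute_weight`, and `_MODES` are hypothetical names, and it works on plain floats so it runs without PyTorch.

```python
import math
from typing import Callable, Dict, Tuple

# Hypothetical registry mapping a mode name to its weight function.
_MODES: Dict[str, Callable[..., Tuple[float, bool]]] = {}


def register_mode(name: str):
    """Decorator that registers a weight function under `name`."""
    def decorator(fn):
        _MODES[name] = fn
        return fn
    return decorator


@register_mode("cispo")
def _cispo(ratio: float, *, eps_clip_high: float) -> Tuple[float, bool]:
    # Upper truncation only, mirroring the "cispo" branch above.
    return min(ratio, eps_clip_high), ratio > eps_clip_high


@register_mode("tis")
def _tis(ratio: float, *, tis_clip_low: float, tis_clip: float) -> Tuple[float, bool]:
    # Two-sided clamp, mirroring the "tis" branch above.
    weight = max(tis_clip_low, min(ratio, tis_clip))
    return weight, weight != ratio


def compute_weight(ppo_kl: float, mode: str, **kwargs) -> Tuple[float, bool]:
    """Look up the mode and return (weight, was_clipped) for one token."""
    if mode not in _MODES:
        raise ValueError(f"Unknown mode: {mode}")
    return _MODES[mode](math.exp(-ppo_kl), **kwargs)
```

For example, `compute_weight(-1.0, "cispo", eps_clip_high=2.0)` truncates the ratio `e` down to `2.0` and reports it as clipped. The trade-off is the one listed under the cons: one more layer of indirection to step through when debugging.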
+
+### Usage
+
+```python
+# loss.py (unchanged)
+if args.advantage_estimator == "cispo":
+    pg_loss, pg_clipfrac = compute_cispo_loss(...)
+```
+
+### Pros ✅
+
+1. **Maximum reusability**: All IS weight computations in one place
+2. **Extensible**: Easy to add new IS algorithms (AWR, V-MPO, etc.)
+3. **Consistent**: TIS/MIS can also use this framework
+4. **DRY principle**: Zero code duplication
+
+### Cons 🔴
+
+1. **Over-engineering**: 3-4 algorithms don't justify a full framework
+2. **Indirection**: More layers = harder to debug
+3. **Learning curve**: New developers need to understand the framework
+4. **Migration cost**: Requires refactoring TIS/MIS to use new framework
+5. **Unclear ownership**: Who maintains is_weights.py vs ppo_utils.py?
+
+---
+
+## Option 3: MIS Integration (ALTERNATIVE)
+
+### Architecture
+
+```python
+# examples/train_infer_mismatch_helper/mis.py
+def truncate(
+    weights: torch.Tensor,
+    loss_mask: torch.Tensor,
+    metrics: Dict[str, list[torch.Tensor]],
+    upper_bound: float,
+    stop_gradient: bool = False  # ← NEW PARAMETER
+) -> torch.Tensor:
+    assert upper_bound is not None
+    metrics_append(metrics, "truncate_fraction", (weights > upper_bound).int())
+    truncated = weights.clamp(0, upper_bound) * loss_mask
+
+    if stop_gradient:  # ← NEW LOGIC
+        truncated = truncated.detach()
+
+    return truncated
+
+# slime/utils/ppo_utils.py
+def compute_cispo_loss(ppo_kl, log_probs, advantages, eps_clip_high):
+    """CISPO using MIS truncate logic."""
+    # Import MIS helper
+    from examples.train_infer_mismatch_helper.mis import truncate
+
+    # Compute IS weights
+    ratio = (-ppo_kl).exp()
+    loss_mask = torch.ones_like(ratio)  # No masking for CISPO
+    metrics = {}
+
+    is_weights = truncate(
+        ratio, loss_mask, metrics,
+        upper_bound=eps_clip_high,
+        stop_gradient=True  # ← CISPO's key feature
+    )
+
+    # Apply to REINFORCE loss
+    reinforce_loss = -advantages * log_probs
+    pg_losses = reinforce_loss * is_weights
+
+    clipfrac = metrics["truncate_fraction"][0] if metrics else torch.zeros_like(ratio)
+    return pg_losses, clipfrac
+```
+
+### Pros ✅
+
+1. **Code reuse**: Leverages existing MIS truncate function
+2. **Consistent with MIS**: Same API pattern
+3. **Minimal addition**: Just one parameter
+
+### Cons 🔴
+
+1. **Semantic mismatch**: MIS is for train/rollout mismatch (off-policy)
+2. **Wrong dependency direction**: mis.py lives in examples/, so the core library would import from example code
+3. **API pollution**: stop_gradient doesn't make sense for mask/clip modes
+4. **Loss mask confusion**: CISPO doesn't use loss masks the same way
+
+---
+
+## Recommendation: Option 1 ✅
+
+### Rationale
+
+- **Current stage**: 6 advantage estimators, 3-4 IS-based algorithms
+- **Future**: possibly 10-15 algorithms in 2-3 years
+
+**Option 1 strikes the best balance:**
+
+| Criterion | Option 1 | Option 2 | Option 3 |
+|-----------|----------|----------|----------|
+| **Simplicity** | ✅ High | 🔴 Low | 🟡 Medium |
+| **Maintainability** | ✅ Easy | 🔴 Complex | 🟡 Medium |
+| **Extensibility** | ✅ Good | ✅ Excellent | 🔴 Limited |
+| **Risk** | ✅ Low | 🔴 High | 🟡 Medium |
+| **Time to implement** | ✅ 1 hour | 🔴 1 day | 🟡 2 hours |
+| **Backward compat** | ✅ Perfect | ✅ Perfect | 🟡 Import issues |
+
+### Migration Path
+
+**Phase 1 (Now)**: Implement Option 1
+- ✅ Merge current PR with new unified function
+- ✅ Maintain backward compatibility
+- ✅ Document design rationale
+
+**Phase 2 (After 5+ IS algorithms)**: Consider Option 2
+- Refactor when we have AWR, V-MPO, IMPALA, etc.
+- At that point, the framework cost is justified
+- Can migrate gradually without breaking changes
+
+**Phase 3 (Long-term)**: Unified RL Framework
+- Separate advantage computation from policy loss computation
+- Plugin architecture for custom algorithms
+- But only when the codebase has 20+ algorithms
+
+---
+
+## Implementation Checklist
+
+- [x] Implement `compute_reinforce_loss_with_is_weights()` in ppo_utils.py
+- [x] Keep `compute_cispo_loss()` as backward-compatible wrapper
+- [ ] Update loss.py to use new function (optional)
+- [ ] Add tests for stop_gradient=True/False
+- [ ] Document in examples/
+- [ ] Add comparison metrics to verify equivalence
+
+---
+
+## Code Quality Metrics
+
+### Option 1
+- Lines of code: +30 (one function)
+- Files modified: 1 (ppo_utils.py)
+- Test coverage: Easy (unit test one function)
+- Documentation: Inline docstring
+
+### Option 2
+- Lines of code: +100 (new module + refactoring)
+- Files modified: 4+ (new file + ppo_utils + loss + TIS/MIS)
+- Test coverage: Complex (integration tests needed)
+- Documentation: Separate design doc required
+
+### Option 3
+- Lines of code: +10 (one parameter)
+- Files modified: 2 (mis.py + ppo_utils.py)
+- Test coverage: Medium (existing MIS tests + new cases)
+- Documentation: Update MIS docstring
+
+---
+
+## Conclusion
+
+**Option 1 is the pragmatic choice for now**, with a clear path to Option 2 when needed.
+
+The key insight: **Don't build frameworks until you have ≥5 similar use cases**.
+Currently only one algorithm (CISPO) uses this REINFORCE-plus-IS-weight form. Once AWR, V-MPO, IMPALA, and two more are added,
+it will be time to build a unified IS framework.