
Decouple IS Weights from Rejection Sampling in MIS#657

Merged
zhuzilin merged 5 commits into THUDM:main from yueming-yuan:is_rejection
Nov 2, 2025

Conversation

@yueming-yuan
Collaborator

References

  1. This refactoring follows the design of verl#3915. Thanks for the insights!
  2. Thanks to the great contribution of this paper When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch.

Summary

Refactors the Masked Importance Sampling (MIS) implementation to properly separate IS weight correction from rejection sampling. This fixes a critical gradient normalization bug where rejected tokens were incorrectly included in the loss denominator.

Motivation

The previous implementation conflated two distinct mechanisms by zeroing IS weights at rejected positions:

  1. IS weight correction: Applies π_train/π_rollout ratios to correct for distribution mismatch
  2. Rejection sampling: Excludes outlier samples from training

This led to an issue: rejected tokens had zero weights (numerator) but were still counted in the loss denominator, causing incorrect gradient scaling.

Example

Consider a sequence with 5 tokens where 3 are rejected by mask mode (ratios outside [0.5, 2.0]):

# Input
log_ratios = [0.1, -1.5, 0.8, -10.0, 0.3]
ratios     = [1.11, 0.22, 2.23, 0.00005, 1.35]  # tokens 1,2,3 rejected

# previous implementation
is_weights = [1.11, 0.0, 0.0, 0.0, 1.35]
loss_mask  = [1, 1, 1, 1, 1] 
pg_loss = sum(loss * is_weights) / sum(loss_mask)
            = (1.11 + 0 + 0 + 0 + 1.35) / 5 = 0.49  # ⚠️ denominator includes masked entries -> Smaller loss norm

# NEW Implementation
is_weights       = [1.11, 0.22, 2.23, 0.00005, 1.35]  # weights preserved
modified_mask    = [1, 0, 0, 0, 1]                      # rejection separate
pg_loss = sum(loss * is_weights * modified_mask) / sum(modified_mask)
            = (1.11 + 0 + 0 + 0 + 1.35) / 2 = 1.23   # ✅ denominator excludes masked entries
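The arithmetic above can be checked with a small standalone sketch. This is plain Python, not the actual mis.py code; the per-token loss is taken as 1.0 for illustration, matching the sums in the example:

```python
import math

log_ratios = [0.1, -1.5, 0.8, -10.0, 0.3]
ratios = [math.exp(lr) for lr in log_ratios]
lower, upper = 0.5, 2.0
loss = [1.0] * len(ratios)  # per-token loss taken as 1.0 for illustration

# Previous behavior: zero the IS weight at rejected positions, but keep
# every token in the denominator.
old_weights = [r if lower <= r <= upper else 0.0 for r in ratios]
old_mask = [1] * len(ratios)
old_loss = sum(l * w for l, w in zip(loss, old_weights)) / sum(old_mask)

# New behavior: preserve the IS weights and move rejection into the mask,
# so rejected tokens leave the denominator as well.
new_mask = [1 if lower <= r <= upper else 0 for r in ratios]
new_loss = sum(l * r * m for l, r, m in zip(loss, ratios, new_mask)) / sum(new_mask)

print(round(old_loss, 2), round(new_loss, 2))  # 0.49 1.23
```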

Main Changes

1. API Change

compute_mis_weights_with_cp() now returns 3 values instead of 2:

# Before
pg_loss, metrics = compute_mis_weights_with_cp(...)

# After
pg_loss, modified_response_masks, metrics = compute_mis_weights_with_cp(...)

2. Separation of IS Weights and Rejection Sampling

IS Weights (is_weights):

  • Safety-bounded to [exp(-20), exp(20)] to prevent overflow
  • Mode-specific processing:
    • truncate: Upper clamped to mis_upper_bound
    • mask: Safety-bounded only (no threshold clamping)
    • clip: Clamped to [lower, upper]
  • Zeroed at padding positions only
  • Used for weighting policy gradient

Rejection Sampling (modified_response_masks):

  • mask mode: Excludes tokens with IS ratios outside [lower, upper]
  • veto: Excludes entire sequences with catastrophic tokens (ratio < veto_threshold)
  • Used for loss aggregation denominator
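The separation described by the two bullet lists above can be sketched as a single helper. This is a hypothetical function (the name `split_is_and_rejection`, the `SAFETY_BOUND` constant, and the signature are made up for illustration, not taken from mis.py), but its behavior follows the bullets:

```python
import math

SAFETY_BOUND = 20.0  # log-ratios clamped so ratios stay in [exp(-20), exp(20)]

def split_is_and_rejection(log_ratios, loss_mask, mode, lower, upper, veto_threshold=None):
    """Hypothetical helper: return per-token (is_weights, modified_mask)."""
    clamped = [max(-SAFETY_BOUND, min(SAFETY_BOUND, lr)) for lr in log_ratios]
    ratios = [math.exp(lr) for lr in clamped]

    if mode == "truncate":
        weights = [min(r, upper) for r in ratios]   # upper clamp only
        mask = list(loss_mask)                      # no rejection
    elif mode == "clip":
        weights = [max(lower, min(r, upper)) for r in ratios]
        mask = list(loss_mask)
    elif mode == "mask":
        weights = ratios                            # weights preserved
        mask = [m if lower <= r <= upper else 0
               for m, r in zip(loss_mask, ratios)]  # rejection lives in the mask
    else:
        raise ValueError(mode)

    # Veto: exclude the entire sequence if any token ratio is catastrophic.
    if veto_threshold is not None and any(r < veto_threshold for r in ratios):
        mask = [0] * len(mask)

    # Zero weights at padding positions only.
    weights = [w * m for w, m in zip(weights, loss_mask)]
    return weights, mask
```

With the example log-ratios and `mode="mask"`, the returned mask is `[1, 0, 0, 0, 1]` while the weights keep their raw values.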

3. Correct Loss Normalization

# In loss.py (Line 463-470)
pg_loss, modified_response_masks, tis_metrics = tis_func(**tis_kwargs)

# Rebuild sum_of_sample_mean with modified masks for correct denominator
sum_of_sample_mean = get_sum_of_sample_mean(
    total_lengths, response_lengths, modified_response_masks, args.calculate_per_token_loss
)

pg_loss = sum_of_sample_mean(pg_loss)  # Now uses correct denominator
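The key point is that the normalizer closure is rebuilt from the modified masks. A simplified, hypothetical version of what `get_sum_of_sample_mean` might look like (the real loss.py helper also takes total and response lengths; this sketch keeps only the mask-driven denominator logic):

```python
def get_sum_of_sample_mean(response_masks, per_token_loss=True):
    """Hypothetical, simplified normalizer builder for illustration."""
    if per_token_loss:
        denom = sum(sum(m) for m in response_masks)  # total active tokens
        return lambda losses: sum(sum(l) for l in losses) / max(denom, 1)

    def per_sample(losses):
        # Mean over each sample's active tokens, then mean over samples.
        means = [sum(l) / max(sum(m), 1) for l, m in zip(losses, response_masks)]
        return sum(means) / len(means)

    return per_sample
```

Building this closure from `modified_response_masks` instead of the original loss mask is exactly what removes rejected tokens from the denominator.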

Files Changed

Modified Files

  • slime/backends/megatron_utils/loss.py: Updated to correct loss normalization with modified masks
  • examples/train_infer_mismatch_helper/mis.py: Refactored to return 3-tuple, separated IS weights from rejection

Impact by Mode

  • truncate mode: No behavioral change
  • mask mode: Gradient scale will change (increase) when rejection rate > 0
  • clip mode: No behavioral change
  • With veto: Gradient scale will change for affected sequences
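For mask mode, the size of the change is predictable: with per-token loss, the denominator shrinks from the full token count to the accepted token count, so the loss (and hence the gradient) grows by roughly the ratio of the two. A quick sketch with a hypothetical helper name:

```python
def gradient_scale_factor(loss_mask, modified_mask):
    """Factor by which mask-mode loss normalization grows after this change
    (per-token loss; hypothetical helper for illustration)."""
    return sum(loss_mask) / sum(modified_mask)

# The 5-token example above, with 3 of 5 tokens rejected:
print(gradient_scale_factor([1, 1, 1, 1, 1], [1, 0, 0, 0, 1]))  # 2.5
```

This matches the worked example: 0.49 × 2.5 ≈ 1.23.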

@zhaochenyang20
Collaborator

great catch

@yueming-yuan yueming-yuan marked this pull request as draft October 31, 2025 18:06
@@ -1,5 +1,5 @@
# Enable importance sampling, details refer to the comments of compute_mis_weights in mis.py
use_mis: false
use_tis: true
Collaborator

shall we change the mis -> tis here?

Collaborator Author

It just seems "use_mis" is not used anywhere - we may delete it?

)

pg_loss = sum_of_sample_mean(pg_loss)
pg_clipfrac = sum_of_sample_mean(pg_clipfrac)
Collaborator

If not use_tis, then pg_loss would rely on the passed-in sum_of_sample_mean. If using tis, then we will create a new sum_of_sample_mean with modified_response_masks by:

        sum_of_sample_mean = get_sum_of_sample_mean(
            total_lengths, response_lengths, modified_response_masks, args.calculate_per_token_loss
        )

Right?

Collaborator Author

yes, if we don't use TIS then we do not update this sum_of_sample_mean function, which was originally created from loss_mask

Collaborator

@yitianlian yitianlian left a comment

I think adding a more scalable interface for different TIS methods in slime would be a great feature: first, improve the MIS example, and second, introduce the new return value in the TIS function, since some methods may mask more tokens.

@yueming-yuan yueming-yuan marked this pull request as ready for review November 1, 2025 16:37
@zhuzilin zhuzilin merged commit ad2ada3 into THUDM:main Nov 2, 2025
4 checks passed
@szrlee

szrlee commented Nov 2, 2025

@yueming-yuan please have a look on verl-project/verl#3984

Additional to verl-project/verl#3915, we have now fully separated rejection sampling masks from importance weights, allowing them to be combined independently.

@yueming-yuan
Collaborator Author

@yueming-yuan please have a look on volcengine/verl#3984

Additional to volcengine/verl#3915, we have now fully separated rejection sampling masks from importance weights, allowing them to be combined independently.

Thanks!! We'll check this new version and look into integration.

@zhaochenyang20
Collaborator

Nice examples:


llltttwww pushed a commit to llltttwww/slime that referenced this pull request Nov 30, 2025
Yangruipis pushed a commit to rednote-ai/slime that referenced this pull request Feb 28, 2026
