Chunk Across Batch and Context length for logprob calculations for grpo #3628

pluesclues · 2025-11-21T21:20:50Z

Refactor grpo_trainer functions to handle log probabilities and entropies. Introduce mixed precision handling and improve input processing for model predictions.

Adapt logic to handle image sizes and chunk pixel values based on image grid dimensions.
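Here is a minimal sketch of chunking pixel values by image grid. It assumes a Qwen2-VL-style layout: all image patches in a batch are flattened into one `pixel_values` array, and image `i` owns `t * h * w` consecutive rows, where `(t, h, w) = image_grid_thw[i]`. It also assumes one image per sample. `chunk_pixel_values` is a hypothetical helper, not code from this PR, and plain lists stand in for tensors.

```python
from itertools import accumulate


def chunk_pixel_values(pixel_values, image_grid_thw, start, end):
    """Return the pixel rows and grid entries for images start..end-1.

    Hypothetical helper. Assumes one image per sample and that image i
    contributes t * h * w consecutive rows to the flattened pixel_values.
    """
    patches_per_image = [t * h * w for t, h, w in image_grid_thw]
    # offsets[i] is the first row of image i; offsets[-1] is the total row count.
    offsets = [0, *accumulate(patches_per_image)]
    if offsets[-1] != len(pixel_values):
        raise ValueError("pixel_values row count does not match image_grid_thw")
    return pixel_values[offsets[start]:offsets[end]], image_grid_thw[start:end]


# Three images with 4, 6 and 2 patches; each "row" is a 1-element feature here.
grid = [(1, 2, 2), (1, 2, 3), (1, 1, 2)]
pixel_values = [[float(i)] for i in range(12)]
rows, grid_chunk = chunk_pixel_values(pixel_values, grid, 1, 3)
```

Computing the cumulative offsets once lets the code slice any contiguous batch chunk without scanning the other images.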

Refactor padding logic to incorporate max_left_pad variable for better handling of prompt completion.
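The description does not define `max_left_pad`. The sketch below shows only one plausible way to derive such a value. It assumes left-padded prompts and an attention mask in which 0 marks padding. `left_pad_lengths` is a hypothetical helper, and this sketch does not cover how the PR consumes the result.

```python
def left_pad_lengths(attention_mask):
    """Count leading padding positions (mask value 0) in each left-padded row.

    Hypothetical helper for illustration only.
    """
    lengths = []
    for row in attention_mask:
        pad = 0
        for value in row:
            if value != 0:
                break
            pad += 1
        lengths.append(pad)
    return lengths


attention_mask = [
    [0, 0, 1, 1, 1],  # 2 pad tokens
    [1, 1, 1, 1, 1],  # no padding
    [0, 1, 1, 1, 1],  # 1 pad token
]
pads = left_pad_lengths(attention_mask)
max_left_pad = max(pads)  # widest left padding in the batch
```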

Refactor padding logic and remove commented code.

Added check for vllm_importance_sampling_correction in conditions using self.use_vllm.

Disable TRL's importance sampling logic in the function.
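For context, truncated importance sampling reweights each token by the ratio of the trainer's probability to the vLLM sampler's probability, clipped at a cap. The sketch below shows that general shape, computed only when both `use_vllm` and the correction flag are set, as the condition above describes. `vllm_is_weights` and its `cap` default are assumptions for illustration. They do not reproduce TRL's or this PR's code.

```python
import math


def vllm_is_weights(trainer_logps, vllm_logps, use_vllm, correction_enabled, cap=2.0):
    """Per-token truncated importance-sampling weights, or None when disabled.

    Hypothetical helper: weight = min(exp(logp_trainer - logp_vllm), cap).
    Computed only when vLLM produced the samples AND the correction flag is on.
    """
    if not (use_vllm and correction_enabled):
        return None
    return [min(math.exp(a - b), cap) for a, b in zip(trainer_logps, vllm_logps)]


trainer_logps = [-1.0, -2.0, -0.5]
vllm_logps = [-1.0, -2.5, -3.0]
weights = vllm_is_weights(trainer_logps, vllm_logps, use_vllm=True, correction_enabled=True)
```

Returning `None` rather than all-ones weights keeps the "disabled" case distinct from "enabled, but the policies agree".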


gemini-code-assist · 2025-11-21T21:21:18Z

Summary of Changes


This pull request reworks how the GRPO trainer computes log probabilities. Input sequences are processed in smaller chunks across both the batch and context-length dimensions, which keeps memory use manageable as batch and context lengths vary; padding needs particular care in this scheme. The PR also adds specific handling for visual inputs and changes how importance sampling is applied.

Highlights

GRPO Logprob Calculation Refinement: Introduced a new grpo_selective_log_softmax and refactored the _get_per_token_logps_and_entropies function to process inputs in chunks, improving log probability calculations for GRPO.
Padding Handling: Enhanced the handling of left padding by calculating and propagating max_left_pad through the logprob calculation pipeline, ensuring correct alignment for models with varying input lengths.
TRL Importance Sampling Adjustment: Explicitly disabled TRL's default importance sampling correction logic for vLLM within the GRPO trainer.
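Here is a minimal sketch of chunked selective log-softmax with per-token entropy. It assumes logits of shape `[batch][seq][vocab]` and target token ids of shape `[batch][seq]`. The code visits the batch and sequence dimensions in blocks, computes a numerically stable log-sum-exp per position, and keeps only the target token's log-probability. The function names are hypothetical and this is not the PR's `grpo_selective_log_softmax`. Plain Python lists stand in for tensors, so the sketch shows the access pattern rather than the memory savings.

```python
import math


def _row_logp_and_entropy(logits_row, target):
    """Log-probability of `target` and entropy of softmax(logits_row)."""
    m = max(logits_row)
    # Stable log-sum-exp: subtract the row max before exponentiating.
    exps = [math.exp(x - m) for x in logits_row]
    total = sum(exps)
    log_z = m + math.log(total)
    logp = logits_row[target] - log_z
    # H = -sum(p * log p), where p = exps / total and log p = x - log_z.
    entropy = -sum((e / total) * (x - log_z) for e, x in zip(exps, logits_row))
    return logp, entropy


def chunked_selective_log_softmax(logits, targets, batch_chunk=2, seq_chunk=4):
    """Per-token target log-probs and entropies, visited block by block.

    logits: [batch][seq][vocab]; targets: [batch][seq] token ids.
    In tensor code each (batch_chunk x seq_chunk) block would bound the size of
    the intermediate log-softmax buffer; here the loops only mirror that pattern.
    """
    batch, seq = len(logits), len(logits[0])
    logps = [[0.0] * seq for _ in range(batch)]
    entropies = [[0.0] * seq for _ in range(batch)]
    for b0 in range(0, batch, batch_chunk):
        for t0 in range(0, seq, seq_chunk):
            for b in range(b0, min(b0 + batch_chunk, batch)):
                for t in range(t0, min(t0 + seq_chunk, seq)):
                    logps[b][t], entropies[b][t] = _row_logp_and_entropy(
                        logits[b][t], targets[b][t]
                    )
    return logps, entropies


logits = [
    [[0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0], [5.0, 1.0, 0.0, -1.0]],
    [[2.0, 2.0, 2.0, 2.0], [0.5, -0.5, 1.5, 0.0], [3.0, 3.0, 0.0, 0.0]],
]
targets = [[0, 3, 0], [1, 2, 1]]
logps, entropies = chunked_selective_log_softmax(logits, targets, batch_chunk=1, seq_chunk=2)
```

Each position is computed independently, so the results do not depend on `batch_chunk` or `seq_chunk`. In tensor code, those sizes only trade peak memory against the number of chunk iterations.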


gemini-code-assist

Code Review

This pull request introduces chunking for log probability calculations in GRPO to better manage memory usage. The changes primarily involve patching GRPO trainer functions in unsloth/models/rl_replacements.py, with the core logic for batch chunking implemented in the _get_per_token_logps_and_entropies function. My review identifies a potential bug concerning tensor device placement, along with suggestions to improve code maintainability by addressing code duplication and enhancing readability in line with Python's style guidelines.


pluesclues added 17 commits November 7, 2025 16:37

make it compatible with chunked hidden states selective log softmax · 5278458
Merge branch 'unslothai:main' into alternative_compute_chunked_loss · 65d6d9f
Merge branch 'unslothai:main' into alternative_compute_chunked_loss · 1e49528
Refactor grpo_trainer for logps and entropies handling · 494f611
  Refactor grpo_trainer functions to handle log probabilities and entropies. Introduce mixed precision handling and improve input processing for model predictions.
Update fmt.Println message from 'Hello World' · 387939f
Merge branch 'unslothai:main' into alternative_compute_chunked_loss · eccd41d
Refactor chunking logic for pixel values and image grid · 95abf46
  Adapt logic to handle image sizes and chunk pixel values based on image grid dimensions.
Merge branch 'unslothai:main' into alternative_compute_chunked_loss · 52b23ff
Refactor padding logic with max_left_pad handling · f2102c8
  Refactor padding logic to incorporate max_left_pad variable for better handling of prompt completion.
Merge branch 'unslothai:main' into alternative_compute_chunked_loss · 16f6be6
Clean up padding logic and remove unused comments · ac15b81
  Refactor padding logic and remove commented code.
Merge branch 'unslothai:main' into alternative_compute_chunked_loss · f49bf4f
Update vllm usage conditions with importance sampling check · ea6964a
  Added check for vllm_importance_sampling_correction in conditions using self.use_vllm.
Disable TRL importance sampling logic · ca6b826
  Disable TRL's importance sampling logic in the function.
Refactor error handling in rl_replacements.py · f2b29ee
Refactor vllm_importance_sampling_correction checks · 8d263b1
Add grpo_selective_log_softmax to RL replacements · d332e93

pluesclues mentioned this pull request Nov 21, 2025: Chunk Across Batch and Context length for logprob calculations for grpo unslothai/unsloth-zoo#357 (Open)

[pre-commit.ci] auto fixes from pre-commit.com hooks · 4be35d8


gemini-code-assist bot reviewed Nov 21, 2025


pluesclues and others added 2 commits November 21, 2025 16:36

Refactor code for readability and consistency · 84c56aa
[pre-commit.ci] auto fixes from pre-commit.com hooks · 9b2539c

