fix: (1) strip whitespace when parsing action_stop_tokens (2) prevent truncation from splitting multimodal placeholder token sequences #147
Open
zengxingchen wants to merge 2 commits into TIGER-AI-Lab:main from
Problem 1
When configuring `action_stop_tokens` with comma-separated values such as `action_stop_tokens='</code>, </search>'`, the parser was not stripping whitespace, resulting in:
- Token 1: `</code>` ✅
- Token 2: ` </search>` ❌ (with a leading space)

This caused the model to fail to stop at `</search>`, leading to incorrect behavior.
Solution 1
Modified the token parsing logic in `verltool_agent_loop.py` (line 173) to strip whitespace:
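The original snippet is not reproduced here; below is a minimal sketch of the fix, assuming the stop tokens arrive as a single comma-separated string (variable names are illustrative, not the actual ones in `verltool_agent_loop.py`):

```python
# Sketch: strip whitespace around each comma-separated stop token and drop
# empty entries, instead of keeping the raw split() results.
raw = "</code>, </search>"  # e.g., the configured action_stop_tokens string (name assumed)
action_stop_tokens = [tok.strip() for tok in raw.split(",") if tok.strip()]
# -> ['</code>', '</search>']  (previously the second entry kept its leading space)
```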
Problem 2
Similar to bugs identified in verl-project/verl#4050.
When truncating agent responses to fit within `response_length` limits, the truncation could split multimodal placeholder token sequences (e.g., `<|vision_start|><|image_pad|><|vision_end|>`) in the middle, corrupting the token stream.
Solution 2
When truncating responses to `response_length`, walk backwards from the cut point to avoid splitting multimodal placeholder tokens (e.g., `<|vision_start|>`, `<|image_pad|>`, `<|vision_end|>`, `<|audio_bos|>`, etc.).
This ensures that multimodal token sequences are either kept completely or discarded as complete blocks, preventing token corruption that could cause model errors during training.
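A minimal sketch of this backward walk, assuming string tokens and Qwen-style vision markers (the real change operates on token IDs and also covers the audio markers; the function and constant names below are illustrative, not the actual ones in the PR):

```python
# Illustrative sketch, not the actual verl-tool implementation.
VISION_START, VISION_END = "<|vision_start|>", "<|vision_end|>"

def truncate_preserving_multimodal(tokens: list[str], response_length: int) -> list[str]:
    """Truncate to at most response_length tokens, walking backwards so that a
    multimodal placeholder block is either kept whole or dropped whole."""
    if len(tokens) <= response_length:
        return tokens
    kept = tokens[:response_length]
    # If the cut lands inside an open block (a start marker without its matching
    # end marker), walk back to just before that start marker so the whole block
    # is discarded rather than split.
    if kept.count(VISION_START) > kept.count(VISION_END):
        cut = len(kept) - 1 - kept[::-1].index(VISION_START)
        return tokens[:cut]
    return kept
```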