[GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs#5242
qgallouedec wants to merge 100 commits into main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3375aeac6c
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 367a79ebc6

Context
Part of the series to fix the re-tokenization bug in GRPO multi-turn tool calling (see #5224).
closes #5224
closes #5144
When the model generates a completion in a tool-calling loop, the decoded text is re-tokenized via `apply_chat_template`, which can produce different token IDs due to BPE merge ambiguities. To fix this, we need a token-in / token-out pipeline: tokenize once, then pass raw token IDs through every subsequent generation call, never decoding and re-tokenizing.

Prior PRs in the series:
- `prompts` in vLLM client and server #5225
- `rollout_func` from `_generate_single_turn` to `_generate` #5232
- `_generate_single_turn` #5239
- `_generate_single_turn` #5240

This is the final PR in the series. It eliminates the re-tokenization in the tool-calling loop, the actual source of the bug.
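The failure mode can be reproduced with a toy greedy longest-match tokenizer (an illustrative stand-in, not a real BPE implementation): a completion generated as two pieces decodes to text that re-encodes as a single merged token, so the ID sequence silently changes.

```python
# Toy illustration of the re-tokenization bug (not the real tokenizer):
# greedy longest-match encoding shows why decode -> re-encode can change IDs.
VOCAB = {"hello": 2, "he": 0, "llo": 1, " world": 3}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(text):
    """Greedy longest-match tokenization over VOCAB."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"untokenizable text at {i}: {text[i:]!r}")
    return ids

def decode(ids):
    return "".join(ID_TO_TOKEN[i] for i in ids)

# Suppose the model *generated* "hello" as two pieces: "he" + "llo".
generated_ids = [0, 1]
retokenized = encode(decode(generated_ids))  # merges into one token
assert retokenized != generated_ids          # [2] != [0, 1]
```

The model was trained on (and scored against) `generated_ids`, but the loop would feed `retokenized` back in, which is exactly the drift this PR removes.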
Changes
- `_get_tool_suffix_ids(tool_messages)` method: tokenizes only the tool-result portion by diffing a minimal dummy conversation (2 messages vs. 3 messages). This avoids re-tokenizing the full conversation history.
- `_tool_call_loop`: instead of re-tokenizing `prompt + completion + tool_results` via `apply_chat_template`, builds the token sequence by concatenation: `prompt_ids + completion_ids + tool_suffix_ids`. The original prompt and completion token IDs are preserved exactly as they were; only the new tool-result tokens are freshly tokenized.
- Removes the `_tokenize_prompts` call in the tool loop.

The bug and the fix
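The prefix-diff idea behind `_get_tool_suffix_ids` can be sketched as follows. This is a hedged sketch, not TRL's actual code: `ToyTokenizer`, its character-level template, and the exact dummy messages are illustrative assumptions.

```python
# Sketch of the prefix-diff trick: tokenize a minimal dummy conversation
# without and with the tool messages; the extra tokens are the tool suffix.

class ToyTokenizer:
    """Character-level stand-in for a chat tokenizer (illustrative only)."""
    def apply_chat_template(self, messages):
        text = "".join(f"<{m['role']}>{m['content']}</{m['role']}>" for m in messages)
        return [ord(c) for c in text]  # one "token" per character

def get_tool_suffix_ids(tokenizer, tool_messages):
    dummy = [{"role": "user", "content": "x"},
             {"role": "assistant", "content": "y"}]
    base_ids = tokenizer.apply_chat_template(dummy)
    full_ids = tokenizer.apply_chat_template(dummy + tool_messages)
    # The diff is only valid if the template is prefix-stable.
    assert full_ids[:len(base_ids)] == base_ids
    return full_ids[len(base_ids):]

tok = ToyTokenizer()
suffix = get_tool_suffix_ids(tok, [{"role": "tool", "content": "42"}])
# suffix covers exactly "<tool>42</tool>" in this toy setup
```

Because the dummy conversation is tiny and fixed, the cost is constant per tool call, independent of how long the real conversation history has grown.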
Previously, after a tool call:
- `prompt + assistant + tool_results` was re-tokenized via `apply_chat_template`

Now:
- `prompt_ids` and `completion_ids` are kept as-is (never decoded and re-tokenized)
- the next input is built by concatenation: `prompt_ids + completion_ids + suffix_ids`

Backward compatibility
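Put together, the fixed loop only ever appends IDs. A minimal sketch (the function name, `generate` callback, and signatures here are hypothetical, not TRL's API):

```python
def tool_call_loop(generate, get_suffix_ids, prompt_ids, max_turns=4):
    """Token-in / token-out: IDs are concatenated, never decoded and re-tokenized."""
    input_ids = list(prompt_ids)
    for _ in range(max_turns):
        # generate() returns the new completion IDs and any parsed tool messages.
        completion_ids, tool_messages = generate(input_ids)
        input_ids = input_ids + completion_ids   # kept as-is, never re-tokenized
        if not tool_messages:                    # no tool call: conversation done
            return input_ids
        # Only the tool-result suffix is freshly tokenized each turn.
        input_ids = input_ids + get_suffix_ids(tool_messages)
    return input_ids
```

Each turn's input is the previous input plus new IDs, so the prompt and completion tokens the model actually produced are preserved bit-for-bit across turns.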
No user-facing API changes.
`_get_tool_suffix_ids` and `_tool_call_loop` are internal methods.

Note
Medium Risk
Medium risk because it changes GRPO multi-turn tool-calling generation by altering how prompts are rebuilt and tokenized, which can affect sequence lengths/masks and downstream training behavior. RLOO changes are mechanical return-value cleanups but share the same generation pathways.
Overview
Fixes a GRPO multi-turn tool-calling bug by removing decode/re-tokenize steps inside the tool loop and instead preserving the original `prompt_ids`/`completion_ids` while appending newly tokenized tool-result suffix IDs.

Adds `_get_tool_suffix_ids()` and rewires `_tool_call_loop` to build `prompt + completion + tool` inputs via token-ID concatenation (including image/multimodal subsetting), and updates generation helpers in both `GRPOTrainer` and `RLOOTrainer` to stop returning unused `prompt_ids` from `_generate_single_turn`/vLLM and to simplify prompt-length handling.

Written by Cursor Bugbot for commit 5147625. This will update automatically on new commits.