
[GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs#5242

Open
qgallouedec wants to merge 100 commits into main from fix-retokenization-tool-loop

Conversation

@qgallouedec
Member

@qgallouedec qgallouedec commented Mar 7, 2026

Context

Part of the series to fix the re-tokenization bug in GRPO multi-turn tool calling (see #5224).

closes #5224
closes #5144

When the model generates a completion in a tool-calling loop, the decoded text is re-tokenized via apply_chat_template, which can produce different token IDs due to BPE merge ambiguities. To fix this, we need a token-in / token-out pipeline: tokenize once, then pass raw token IDs through every subsequent generation call — never decoding and re-tokenizing.
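The merge ambiguity can be shown with a toy vocabulary (purely illustrative, not the real tokenizer): two different token-ID sequences decode to the same text, so decoding a completion and re-encoding it can silently change the IDs.

```python
# Toy illustration: a vocabulary with a merged token "ab" (id 2).
vocab = {0: "a", 1: "b", 2: "ab"}

def decode(ids):
    return "".join(vocab[i] for i in ids)

def encode_greedy(text):
    # Greedy longest-match encoding always prefers the merged token,
    # the way a BPE tokenizer applies its merges.
    ids, i = [], 0
    while i < len(text):
        if text[i : i + 2] == "ab":
            ids.append(2)
            i += 2
        else:
            ids.append({"a": 0, "b": 1}[text[i]])
            i += 1
    return ids

generated = [0, 1]                # model emitted "a" and "b" as separate tokens
assert decode(generated) == "ab"
# Round-tripping through text changes the IDs: [0, 1] becomes [2].
assert encode_greedy(decode(generated)) == [2]
```

The real tokenizer exhibits the same behavior at token boundaries, which is why the fix keeps the generated IDs and never round-trips them through text.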

This is the final PR in the series. It eliminates the re-tokenization in the tool-calling loop — the actual source of the bug.

Changes

  • New _get_tool_suffix_ids(tool_messages) method: Tokenizes only the tool result portion by diffing a minimal dummy conversation (2 messages vs 3 messages). This avoids re-tokenizing the full conversation history.
  • _tool_call_loop: Instead of re-tokenizing prompt + completion + tool_results via apply_chat_template, builds the token sequence by concatenation: prompt_ids + completion_ids + tool_suffix_ids. The original prompt and completion token IDs are preserved exactly as they were — only the new tool result tokens are freshly tokenized.
  • Removed the prefix-preserving sanity check (no longer needed since the prefix is preserved by construction).
  • Removed the _tokenize_prompts call in the tool loop.
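The suffix-diffing idea behind `_get_tool_suffix_ids` can be sketched as follows (a simplified stand-in, not the actual TRL implementation; `toy_template` is a hypothetical chat template used only for illustration):

```python
def get_tool_suffix_ids(apply_chat_template, tool_messages):
    # Render a minimal dummy conversation with and without the tool
    # messages; the token-ID difference is exactly the suffix the chat
    # template adds for the tool results.
    dummy = [
        {"role": "user", "content": "x"},
        {"role": "assistant", "content": "y"},
    ]
    base_ids = apply_chat_template(dummy)
    full_ids = apply_chat_template(dummy + tool_messages)
    assert full_ids[: len(base_ids)] == base_ids  # template only appends
    return full_ids[len(base_ids):]

# Toy stand-in for a chat template: a role-marker id followed by char codes.
def toy_template(messages):
    ids = []
    for m in messages:
        ids.append({"user": 1, "assistant": 2, "tool": 3}[m["role"]])
        ids.extend(ord(c) for c in m["content"])
    return ids

suffix = get_tool_suffix_ids(toy_template, [{"role": "tool", "content": "42"}])
# Only the tool-result portion is tokenized; the dummy history is discarded.
```

Because the dummy conversation is tiny and fixed, the cost of this extra template call is negligible compared to re-tokenizing the full conversation history.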

The bug and the fix

Previously, after a tool call:

  1. The completion was decoded to text and appended as an assistant message
  2. The full prompt + assistant + tool_results was re-tokenized via apply_chat_template
  3. Due to BPE merge ambiguity, step 2 could produce different token IDs for the completion part

Now:

  1. The original prompt_ids and completion_ids are kept as-is (never decoded and re-tokenized)
  2. Only the tool result suffix is tokenized, using a minimal dummy conversation to extract just the template formatting
  3. The full prompt is built by concatenation: prompt_ids + completion_ids + suffix_ids
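The three steps above reduce to pure ID concatenation (a hedged sketch, with hypothetical names, of what the rewired `_tool_call_loop` does):

```python
def next_prompt_ids(prompt_ids, completion_ids, tool_suffix_ids):
    # The original prompt and completion IDs are reused verbatim; only the
    # freshly tokenized tool-result suffix is appended. No decode step, so
    # no opportunity for BPE merges to change the earlier tokens.
    return prompt_ids + completion_ids + tool_suffix_ids

new_prompt = next_prompt_ids([10, 11], [12, 13], [3, 52, 50])
assert new_prompt == [10, 11, 12, 13, 3, 52, 50]
```

By construction, the first `len(prompt_ids) + len(completion_ids)` tokens of the new prompt are identical to the old ones, which is why the prefix-preserving sanity check could be removed.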

Backward compatibility

No user-facing API changes. _get_tool_suffix_ids and _tool_call_loop are internal methods.


Note

Medium Risk
Medium risk because it changes GRPO multi-turn tool-calling generation by altering how prompts are rebuilt and tokenized, which can affect sequence lengths/masks and downstream training behavior. RLOO changes are mechanical return-value cleanups but share the same generation pathways.

Overview
Fixes a GRPO multi-turn tool-calling bug by removing decode/re-tokenize steps inside the tool loop and instead preserving original prompt_ids/completion_ids while appending newly tokenized tool-result suffix IDs.

Adds _get_tool_suffix_ids() and rewires _tool_call_loop to build prompt+completion+tool inputs via token-ID concatenation (including image/multimodal subsetting), and updates generation helpers in both GRPOTrainer and RLOOTrainer to stop returning unused prompt_ids from _generate_single_turn/vLLM and simplify prompt-length handling.

Written by Cursor Bugbot for commit 5147625.

qgallouedec and others added 27 commits March 5, 2026 19:10
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3375aeac6c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

qgallouedec and others added 21 commits March 10, 2026 12:23
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Base automatically changed from extract-tokenize-prompts to main March 10, 2026 21:35
@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


@qgallouedec
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 367a79ebc6




Development

Successfully merging this pull request may close these issues.

  • Re-tokenization bug in GRPO multi-turn tool calling
  • GRPOTrainer tool_mask can become longer than completion_ids after tool-call retokenization
