
load pretokenized user query (v0) #965

Merged
hamishivi merged 21 commits into main from fae-main-patch08
Sep 2, 2025

Conversation

@fabrahman
Contributor

This is the early PR for the pre-tokenized user query. It currently fails with the following error:

2025-08-29 17:03:30 - WARNING - grpo_fast.py:2076 - [Main Thread] Timeout waiting for packed sequences. Retrying...
2025-08-29 17:04:00 - WARNING - grpo_fast.py:2076 - [Main Thread] Timeout waiting for packed sequences. Retrying...

@hamishivi can you take a look?

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Summary of Changes

Hello @fabrahman, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces foundational changes to support pre-tokenized user queries within the data pipeline, primarily by adding a new field for raw prompts and updating data structures and processing logic to utilize it. The changes aim to streamline how prompts are handled and are part of an ongoing effort to resolve a reported timeout error during sequence packing.

Highlights

  • Pre-tokenized Query Support: Introduced a new RAW_PROMPT_KEY to store and handle raw, pre-tokenized user queries throughout the data processing pipeline, enabling direct use of these queries in subsequent steps.
  • Data Processing Refinement: Modified data loading and batching mechanisms to incorporate the new raw prompt data, specifically updating the Batch dataclass and its handling in grpo_fast.py to use raw_queries directly for decoding.
  • Debugging Enhancements: Added breakpoint() calls in ground_truth_utils.py and utils.py, and set load_from_cache_file=False in dataset_transformation.py's get_dataset_v1 function, indicating active debugging efforts related to data processing and potential caching issues.
  • Error Handling Improvement: Adjusted the extract_user_query function to return the full conversation string if no specific user query pattern is matched, preventing None returns in certain scenarios.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@claude

claude Bot commented Aug 29, 2025

Code Review Feedback

🚨 Critical Issues

1. Debugging Code Left in Production (High Priority)

  • utils.py:1542 - Active breakpoint statement must be removed
  • grpo_fast.py:1489, 1517 - Commented breakpoints should be cleaned up
  • ground_truth_utils.py:659 - Commented breakpoint should be removed

2. Performance & Caching Concern

  • dataset_transformation.py:1419 - load_from_cache_file=False disables caching entirely, which could significantly impact performance for large datasets. Consider a more targeted approach.
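
A hedged sketch of one such "more targeted approach": rather than hard-coding `load_from_cache_file=False`, expose it as a config field (the `DatasetConfig` class and `map_kwargs` helper here are hypothetical, not the repo's actual code) and forward it to `datasets.Dataset.map`:

```python
from dataclasses import dataclass


@dataclass
class DatasetConfig:
    # Hypothetical config field: default to HF datasets' normal caching,
    # and let debug runs opt out instead of disabling it for everyone.
    load_from_cache_file: bool = True


def map_kwargs(config: DatasetConfig) -> dict:
    # Forwarded as dataset.map(transform_fn, **map_kwargs(config)),
    # so caching behavior is decided by configuration, not a code edit.
    return {"load_from_cache_file": config.load_from_cache_file}
```

This keeps the cache warm for normal training runs while still letting the debugging workflow bypass stale cache entries.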

🔧 Code Quality Issues

3. Code Duplication

  • RAW_PROMPT_KEY is defined multiple times in the same file (lines 766, 817, 849)
  • Consider consolidating into a single constants section

4. Inconsistent Logic

  • utils.py:1550 - The fallback user_query = conversation seems overly broad and might not handle edge cases properly
  • Consider adding logging for when pattern matching fails
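
Combining the two suggestions above, a minimal sketch (the regex and function body are assumptions for illustration, not the repo's actual `utils.py` code) of a fallback that logs when pattern matching fails:

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumed conversation format: flattened "role: text" turns; we want the
# last user turn, ending at the next assistant turn or end of string.
USER_TURN_RE = re.compile(r"user:\s*(.*?)(?:\nassistant:|\Z)", re.DOTALL)


def extract_user_query(conversation: str) -> str:
    matches = USER_TURN_RE.findall(conversation)
    if matches:
        return matches[-1].strip()
    # Fallback discussed in the review: return the whole conversation
    # rather than None, but log so silent mismatches are visible.
    logger.warning(
        "extract_user_query: no user turn matched; returning full conversation"
    )
    return conversation
```

The warning makes the "overly broad" fallback auditable: downstream consumers still get a string, but the logs reveal how often the pattern actually fails.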

5. Missing Documentation

  • New raw_queries field in Batch class lacks docstring explaining its purpose
  • The prompt concatenation logic in rlvr_tokenize_v1/v2 needs comments explaining the format choice
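
To illustrate the docstring and null-safety points together, here is a hedged sketch (the field set is invented; only `raw_queries` is taken from the PR) of how the Batch dataclass could document and safely slice the new field:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Batch:
    """Sketch of the Batch container discussed above (not the repo's exact class).

    raw_queries holds the untokenized user prompt strings, parallel to
    `queries`, so consumers (e.g. the LLM judge) can read the prompt
    without re-decoding token ids. It is Optional for backward
    compatibility with datasets that carry no raw prompts.
    """

    queries: List[List[int]]
    raw_queries: Optional[List[str]] = None

    def __getitem__(self, idx: slice) -> "Batch":
        # Slice every field together; guard the optional one explicitly
        # instead of assuming it exists.
        return Batch(
            queries=self.queries[idx],
            raw_queries=self.raw_queries[idx] if self.raw_queries is not None else None,
        )
```

The explicit `is not None` guard in `__getitem__` addresses the "accessed without proper null checks" concern in one place rather than at every call site.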

🐛 Potential Bugs

6. Null Safety

  • Multiple places where raw_queries is accessed without proper null checks
  • grpo_fast.py:1520 - batch.raw_queries assignment assumes it exists

7. Data Consistency

  • The raw prompt extraction excludes system messages but includes all other roles - ensure this aligns with downstream expectations

📋 Recommendations

  1. Immediate: Remove all breakpoint statements before merging
  2. Testing: Add unit tests for the new raw query extraction logic
  3. Performance: Reconsider the caching strategy or make it configurable
  4. Error Handling: Add proper validation for the new raw_queries field
  5. Documentation: Add docstrings explaining the raw query format and usage

✅ Positive Notes

  • Good use of Optional typing for backward compatibility
  • Proper handling of the new field in Batch slicing operations
  • Clean separation of concerns for raw vs tokenized queries

The core functionality looks solid, but the debugging artifacts and caching change need immediate attention before this can be merged.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a change to pre-load user queries as raw strings to avoid decoding them during training, which is a good performance optimization. My review focuses on several areas for improvement:

  • Code Cleanup: There are several instances of leftover debugging code, such as breakpoint() calls and commented-out code, which should be removed.
  • Redundancy: The constant RAW_PROMPT_KEY is defined multiple times in the same file. It should be defined once and imported where needed.
  • Potential Bug: There's a potential logic error in sft_tokenize_mask_out_prompt_v1 where an incorrect key seems to be used, which could lead to unexpected behavior.
  • Clarity and Best Practices: Some minor improvements to code formatting and removal of unnecessary lines are suggested to enhance readability and maintainability.

Overall, the core logic of the PR is sound, but addressing these points will improve the quality and robustness of the code.

Comment thread open_instruct/dataset_transformation.py Outdated
Comment thread open_instruct/dataset_transformation.py Outdated
Comment thread open_instruct/dataset_transformation.py Outdated
Comment thread open_instruct/dataset_transformation.py Outdated
Comment thread open_instruct/ground_truth_utils.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/utils.py Outdated
Comment thread open_instruct/utils.py
finbarrtimbers and others added 8 commits August 29, 2025 23:21
* Reapply "Added a decorator to profile functions."

This reverts commit 504a8d3.

* Removed claude

* Removed extraneuous stuff.

* Reverted to main.

* Removed changes

* Removed changes
* add better logging

* fix

* lint

* lint2

* adjust based on finbarrs feedback

* fix

* remove lambda

---------

Co-authored-by: root <root@neptune-cs-aus-263.reviz.ai2.in>
@fabrahman
Contributor Author

Update: this PR now contains:

  • Pass the pre-tokenized user queries to the LLM judge
  • Remove unnecessary LiteLLM logs
  • Slightly modify the judge prompt to consider "accurately following user instruction"

Here is a successful 1-gpu debug job

Collaborator

@hamishivi hamishivi left a comment


Looks good and runs with my testing but needs a little cleanup!

Comment thread open_instruct/dataset_transformation.py Outdated
logging.getLogger("litellm.cost_calculator").setLevel(logging.CRITICAL)
logging.getLogger("litellm._client").setLevel(logging.CRITICAL)
logging.getLogger("cost_calculator").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
Collaborator


Does this work to remove the logging from litellm?

Contributor Author


yes, it worked both locally and in the debug beaker job.
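
For reference, the suppression pattern from the diff boils down to this sketch (the logger names are the ones quoted above; the exact levels used per logger in the repo may differ slightly):

```python
import logging

# Silence the chattiest third-party loggers entirely, while leaving
# this project's own loggers at their configured levels.
for name in ("LiteLLM", "litellm.cost_calculator", "litellm._client"):
    logging.getLogger(name).setLevel(logging.CRITICAL)

# httpx request logs are useful for debugging, so only demote to WARNING.
logging.getLogger("httpx").setLevel(logging.WARNING)
```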

Comment thread open_instruct/ground_truth_utils.py Outdated
Comment thread open_instruct/ground_truth_utils.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
DEFAULT_SFT_MESSAGES_KEY = "messages"
GROUND_TRUTHS_KEY = "ground_truth"
VERIFIER_SOURCE_KEY = "dataset"
RAW_PROMPT_KEY = "prompt"
Collaborator


If there aren't multiple prompt key variables, let's just use PROMPT_KEY?

Comment thread open_instruct/dataset_transformation.py Outdated
prompt = row[sft_messages_key]
else:
prompt = row[sft_messages_key][:-1]

Collaborator


Remove please!

frac_or_num_samples = float(frac_or_num_samples)
else:
frac_or_num_samples = int(frac_or_num_samples)

Collaborator


Remove please!

logger = logger_utils.setup_logger(__name__)


logging.getLogger("LiteLLM").setLevel(logging.WARNING)
Collaborator


Can you add a comment as to why you're doing this?

Comment thread open_instruct/model_utils.py Outdated
@hamishivi hamishivi enabled auto-merge September 2, 2025 21:04
@hamishivi hamishivi added this pull request to the merge queue Sep 2, 2025
Merged via the queue into main with commit fa7e608 Sep 2, 2025
3 checks passed
finbarrtimbers added a commit that referenced this pull request Sep 3, 2025
finbarrtimbers added a commit that referenced this pull request Sep 3, 2025
finbarrtimbers added a commit that referenced this pull request Sep 3, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Sep 3, 2025
* uses add_request

* ran linter

* Clean up

* Fixed bug

* Added duplication

* Added prompt_tokens to metadata.

* Added missing key to metadata

* Fixed bug where we weren't returning properly.

* Fix script

* Added logging

* fix bug

* use clone for SamplingParams

* Fixes to duplication

* Removed logging.

* Cleaned up PR.

* Clean PR

* Removed whitespace

* Cleaned up PR

* Added comment for cleaner PR.

* Cleaning up PR

* Revert "load pretokenized user query (v0) (#965)"

This reverts commit fa7e608.

* Bug fix.

* Fixed issue where we weren't setting params right in tools.

* Updated descriptions.

* Fix ordering.

* Updated tool script with description.

* Fixed use of wrong vllm.SamplingParams.

* Now, tool use should run.

* Reapply "load pretokenized user query (v0) (#965)"

This reverts commit 341a77b.

* minor clean up.
@hamishivi hamishivi mentioned this pull request Sep 4, 2025
sang1583535 pushed a commit to sang1583535/open-instruct that referenced this pull request Feb 3, 2026
* load pretokenized user queries

* remove breakpoints

* Remove claude (allenai#966)

* Reapply "Added a decorator to profile functions."

This reverts commit 87cb1a4.

* Removed claude

* Removed extraneuous stuff.

* Reverted to main.

* Removed changes

* Removed changes

* Better logging of errors in threads (allenai#967)

* add better logging

* fix

* lint

* lint2

* adjust based on finbarrs feedback

* fix

* remove lambda

---------

Co-authored-by: root <root@neptune-cs-aus-263.reviz.ai2.in>

* use a smaller model (allenai#968)

* Less logging. (allenai#969)

* fix PendingQueriesMap to include raw_query

* fix reading the pretokenized query, modify judge prompt, remove litellm logs

* remove load_from_cache_file=False

* fix linting

* pr cleanup

* remove extract user query import

* update to include system prompt

* lint

* priotize system prompt

* update tests for new raw queries

* lint

* modify general-quality (no reference) jduge prompt

* assume user queries are not None

* remove spaces and add comments

---------

Co-authored-by: Finbarr Timbers <finbarrtimbers@gmail.com>
Co-authored-by: Hamish Ivison <hamishivi@gmail.com>
Co-authored-by: root <root@neptune-cs-aus-263.reviz.ai2.in>
sang1583535 pushed a commit to sang1583535/open-instruct that referenced this pull request Feb 3, 2026
lukashelff pushed a commit to lukashelff/open-instruct-slurm that referenced this pull request Feb 19, 2026
lukashelff pushed a commit to lukashelff/open-instruct-slurm that referenced this pull request Feb 19, 2026
