
load pretokenized user query (v0) #965

Merged
hamishivi merged 21 commits into main from fae-main-patch08
Sep 2, 2025

Conversation

@fabrahman
Contributor

This is the early PR for the pre-tokenized user query. It currently fails with the following error:

2025-08-29 17:03:30 - WARNING - grpo_fast.py:2076 - [Main Thread] Timeout waiting for packed sequences. Retrying...
2025-08-29 17:04:00 - WARNING - grpo_fast.py:2076 - [Main Thread] Timeout waiting for packed sequences. Retrying...

@hamishivi can you take a look?

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Summary of Changes

Hello @fabrahman, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces foundational changes to support pre-tokenized user queries within the data pipeline, primarily by adding a new field for raw prompts and updating data structures and processing logic to utilize it. The changes aim to streamline how prompts are handled and are part of an ongoing effort to resolve a reported timeout error during sequence packing.

Highlights

  • Pre-tokenized Query Support: Introduced a new RAW_PROMPT_KEY to store and handle raw, pre-tokenized user queries throughout the data processing pipeline, enabling direct use of these queries in subsequent steps.
  • Data Processing Refinement: Modified data loading and batching mechanisms to incorporate the new raw prompt data, specifically updating the Batch dataclass and its handling in grpo_fast.py to use raw_queries directly for decoding.
  • Debugging Enhancements: Added breakpoint() calls in ground_truth_utils.py and utils.py, and set load_from_cache_file=False in dataset_transformation.py's get_dataset_v1 function, indicating active debugging efforts related to data processing and potential caching issues.
  • Error Handling Improvement: Adjusted the extract_user_query function to return the full conversation string if no specific user query pattern is matched, preventing None returns in certain scenarios.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@claude

claude Bot commented Aug 29, 2025

Code Review Feedback

🚨 Critical Issues

1. Debugging Code Left in Production (High Priority)

  • utils.py:1542 - Active breakpoint statement must be removed
  • grpo_fast.py:1489, 1517 - Commented breakpoints should be cleaned up
  • ground_truth_utils.py:659 - Commented breakpoint should be removed

2. Performance & Caching Concern

  • dataset_transformation.py:1419 - load_from_cache_file=False disables caching entirely, which could significantly impact performance for large datasets. Consider a more targeted approach.
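
A hedged sketch of one such "more targeted approach": rather than hard-coding `load_from_cache_file=False`, expose it as a config field (the `DatasetConfig` class and `map_kwargs` helper here are hypothetical, not the repo's actual code) and forward it to `datasets.Dataset.map`:

```python
from dataclasses import dataclass


@dataclass
class DatasetConfig:
    # Hypothetical config field: default to HF datasets' normal caching,
    # and let debug runs opt out instead of disabling it for everyone.
    load_from_cache_file: bool = True


def map_kwargs(config: DatasetConfig) -> dict:
    # Forwarded as dataset.map(transform_fn, **map_kwargs(config)),
    # so caching behavior is decided by configuration, not a code edit.
    return {"load_from_cache_file": config.load_from_cache_file}
```

This keeps the cache warm for normal training runs while still letting the debugging workflow bypass stale cache entries.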

🔧 Code Quality Issues

3. Code Duplication

  • RAW_PROMPT_KEY is defined multiple times in the same file (lines 766, 817, 849)
  • Consider consolidating into a single constants section

4. Inconsistent Logic

  • utils.py:1550 - The fallback user_query = conversation seems overly broad and might not handle edge cases properly
  • Consider adding logging for when pattern matching fails
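
Combining the two suggestions above, a minimal sketch (the regex and function body are assumptions for illustration, not the repo's actual `utils.py` code) of a fallback that logs when pattern matching fails:

```python
import logging
import re

logger = logging.getLogger(__name__)

# Assumed conversation format: flattened "role: text" turns; we want the
# last user turn, ending at the next assistant turn or end of string.
USER_TURN_RE = re.compile(r"user:\s*(.*?)(?:\nassistant:|\Z)", re.DOTALL)


def extract_user_query(conversation: str) -> str:
    matches = USER_TURN_RE.findall(conversation)
    if matches:
        return matches[-1].strip()
    # Fallback discussed in the review: return the whole conversation
    # rather than None, but log so silent mismatches are visible.
    logger.warning(
        "extract_user_query: no user turn matched; returning full conversation"
    )
    return conversation
```

The warning makes the "overly broad" fallback auditable: downstream consumers still get a string, but the logs reveal how often the pattern actually fails.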

5. Missing Documentation

  • New raw_queries field in Batch class lacks docstring explaining its purpose
  • The prompt concatenation logic in rlvr_tokenize_v1/v2 needs comments explaining the format choice
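
To illustrate the docstring and null-safety points together, here is a hedged sketch (the field set is invented; only `raw_queries` is taken from the PR) of how the Batch dataclass could document and safely slice the new field:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Batch:
    """Sketch of the Batch container discussed above (not the repo's exact class).

    raw_queries holds the untokenized user prompt strings, parallel to
    `queries`, so consumers (e.g. the LLM judge) can read the prompt
    without re-decoding token ids. It is Optional for backward
    compatibility with datasets that carry no raw prompts.
    """

    queries: List[List[int]]
    raw_queries: Optional[List[str]] = None

    def __getitem__(self, idx: slice) -> "Batch":
        # Slice every field together; guard the optional one explicitly
        # instead of assuming it exists.
        return Batch(
            queries=self.queries[idx],
            raw_queries=self.raw_queries[idx] if self.raw_queries is not None else None,
        )
```

The explicit `is not None` guard in `__getitem__` addresses the "accessed without proper null checks" concern in one place rather than at every call site.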

🐛 Potential Bugs

6. Null Safety

  • Multiple places where raw_queries is accessed without proper null checks
  • grpo_fast.py:1520 - batch.raw_queries assignment assumes it exists

7. Data Consistency

  • The raw prompt extraction excludes system messages but includes all other roles - ensure this aligns with downstream expectations

📋 Recommendations

  1. Immediate: Remove all breakpoint statements before merging
  2. Testing: Add unit tests for the new raw query extraction logic
  3. Performance: Reconsider the caching strategy or make it configurable
  4. Error Handling: Add proper validation for the new raw_queries field
  5. Documentation: Add docstrings explaining the raw query format and usage

✅ Positive Notes

  • Good use of Optional typing for backward compatibility
  • Proper handling of the new field in Batch slicing operations
  • Clean separation of concerns for raw vs tokenized queries

The core functionality looks solid, but the debugging artifacts and caching change need immediate attention before this can be merged.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a change to pre-load user queries as raw strings to avoid decoding them during training, which is a good performance optimization. My review focuses on several areas for improvement:

  • Code Cleanup: There are several instances of leftover debugging code, such as breakpoint() calls and commented-out code, which should be removed.
  • Redundancy: The constant RAW_PROMPT_KEY is defined multiple times in the same file. It should be defined once and imported where needed.
  • Potential Bug: There's a potential logic error in sft_tokenize_mask_out_prompt_v1 where an incorrect key seems to be used, which could lead to unexpected behavior.
  • Clarity and Best Practices: Some minor improvements to code formatting and removal of unnecessary lines are suggested to enhance readability and maintainability.

Overall, the core logic of the PR is sound, but addressing these points will improve the quality and robustness of the code.

Comment thread open_instruct/dataset_transformation.py Outdated
Comment thread open_instruct/dataset_transformation.py Outdated
Comment thread open_instruct/dataset_transformation.py Outdated
Comment thread open_instruct/dataset_transformation.py Outdated
Comment thread open_instruct/ground_truth_utils.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/utils.py Outdated
Comment thread open_instruct/utils.py
finbarrtimbers and others added 8 commits August 29, 2025 23:21
* Reapply "Added a decorator to profile functions."

This reverts commit 504a8d3.

* Removed claude

* Removed extraneuous stuff.

* Reverted to main.

* Removed changes

* Removed changes
* add better logging

* fix

* lint

* lint2

* adjust based on finbarrs feedback

* fix

* remove lambda

---------

Co-authored-by: root <root@neptune-cs-aus-263.reviz.ai2.in>
@fabrahman
Contributor Author

Update: this PR now contains:

  • Pass the pre-tokenized user queries to the LLM judge
  • Remove unnecessary LiteLLM logs
  • Slightly modify the judge prompt to consider "accurately following user instruction"

Here is a successful 1-gpu debug job

Collaborator

@hamishivi hamishivi left a comment


Looks good and runs with my testing but needs a little cleanup!

Comment thread open_instruct/dataset_transformation.py Outdated
logging.getLogger("litellm.cost_calculator").setLevel(logging.CRITICAL)
logging.getLogger("litellm._client").setLevel(logging.CRITICAL)
logging.getLogger("cost_calculator").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
Collaborator


Does this work to remove the logging from litellm?

Contributor Author


yes, it worked both locally and in the debug beaker job.
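
For reference, the suppression pattern from the diff boils down to this sketch (the logger names are the ones quoted above; the exact levels used per logger in the repo may differ slightly):

```python
import logging

# Silence the chattiest third-party loggers entirely, while leaving
# this project's own loggers at their configured levels.
for name in ("LiteLLM", "litellm.cost_calculator", "litellm._client"):
    logging.getLogger(name).setLevel(logging.CRITICAL)

# httpx request logs are useful for debugging, so only demote to WARNING.
logging.getLogger("httpx").setLevel(logging.WARNING)
```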

Comment thread open_instruct/ground_truth_utils.py Outdated
Comment thread open_instruct/ground_truth_utils.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
Comment thread open_instruct/grpo_fast.py Outdated
DEFAULT_SFT_MESSAGES_KEY = "messages"
GROUND_TRUTHS_KEY = "ground_truth"
VERIFIER_SOURCE_KEY = "dataset"
RAW_PROMPT_KEY = "prompt"
Collaborator


If there aren't multiple prompt key variables, let's just use PROMPT_KEY?

Comment thread open_instruct/dataset_transformation.py Outdated
prompt = row[sft_messages_key]
else:
prompt = row[sft_messages_key][:-1]

Collaborator


Remove please!

frac_or_num_samples = float(frac_or_num_samples)
else:
frac_or_num_samples = int(frac_or_num_samples)

Collaborator


Remove please!

logger = logger_utils.setup_logger(__name__)


logging.getLogger("LiteLLM").setLevel(logging.WARNING)
Collaborator


Can you add a comment as to why you're doing this?

Comment thread open_instruct/model_utils.py Outdated
@hamishivi hamishivi enabled auto-merge September 2, 2025 21:04
@hamishivi hamishivi added this pull request to the merge queue Sep 2, 2025
Merged via the queue into main with commit fa7e608 Sep 2, 2025
3 checks passed
finbarrtimbers added a commit that referenced this pull request Sep 3, 2025
finbarrtimbers added a commit that referenced this pull request Sep 3, 2025
finbarrtimbers added a commit that referenced this pull request Sep 3, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Sep 3, 2025
* uses add_request

* ran linter

* Clean up

* Fixed bug

* Added duplication

* Added prompt_tokens to metadata.

* Added missing key to metadata

* Fixed bug where we weren't returning properly.

* Fix script

* Added logging

* fix bug

* use clone for SamplingParams

* Fixes to duplication

* Removed logging.

* Cleaned up PR.

* Clean PR

* Removed whitespace

* Cleaned up PR

* Added comment for cleaner PR.

* Cleaning up PR

* Revert "load pretokenized user query (v0) (#965)"

This reverts commit fa7e608.

* Bug fix.

* Fixed issue where we weren't setting params right in tools.

* Updated descriptions.

* Fix ordering.

* Updated tool script with description.

* Fixed use of wrong vllm.SamplingParams.

* Now, tool use should run.

* Reapply "load pretokenized user query (v0) (#965)"

This reverts commit 341a77b.

* minor clean up.
@hamishivi hamishivi mentioned this pull request Sep 4, 2025
sang1583535 pushed a commit to sang1583535/open-instruct that referenced this pull request Feb 3, 2026
* load pretokenized user queries

* remove breakpoints

* Remove claude (allenai#966)

* Reapply "Added a decorator to profile functions."

This reverts commit 87cb1a4.

* Removed claude

* Removed extraneuous stuff.

* Reverted to main.

* Removed changes

* Removed changes

* Better logging of errors in threads (allenai#967)

* add better logging

* fix

* lint

* lint2

* adjust based on finbarrs feedback

* fix

* remove lambda

---------

Co-authored-by: root <root@neptune-cs-aus-263.reviz.ai2.in>

* use a smaller model (allenai#968)

* Less logging. (allenai#969)

* fix PendingQueriesMap to include raw_query

* fix reading the pretokenized query, modify judge prompt, remove litellm logs

* remove load_from_cache_file=False

* fix linting

* pr cleanup

* remove extract user query import

* update to include system prompt

* lint

* priotize system prompt

* update tests for new raw queries

* lint

* modify general-quality (no reference) jduge prompt

* assume user queries are not None

* remove spaces and add comments

---------

Co-authored-by: Finbarr Timbers <finbarrtimbers@gmail.com>
Co-authored-by: Hamish Ivison <hamishivi@gmail.com>
Co-authored-by: root <root@neptune-cs-aus-263.reviz.ai2.in>
sang1583535 pushed a commit to sang1583535/open-instruct that referenced this pull request Feb 3, 2026
lukashelff pushed a commit to lukashelff/open-instruct-slurm that referenced this pull request Feb 19, 2026
lukashelff pushed a commit to lukashelff/open-instruct-slurm that referenced this pull request Feb 19, 2026
