
Enable Python workflow with BytePairTokenizer and StartEndPacker#2628

Open
james77777778 wants to merge 5 commits into keras-team:master from james77777778:refactor-bpe-tokenizer

Conversation

@james77777778
Collaborator

Description of the change

On second thought, I think merging the Python workflow functionality into the original BytePairTokenizer and StartEndPacker layers will provide a more seamless UX. Users will benefit from a default Python workflow while maintaining a smooth experience when using these layers within tf.data.
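The dispatch idea described above can be sketched in plain Python. This is an illustrative sketch, not the actual keras-hub code: `in_tf_graph`, `DispatchingLayer`, and the upper-casing body are hypothetical names standing in for the real layer internals, which route to TensorFlow ops inside `tf.data` pipelines and to native Python otherwise.

```python
# Hypothetical sketch of the dispatch pattern: use the TensorFlow path
# inside graph/tf.data contexts, and a plain-Python path otherwise.
# All names here are illustrative, not the actual keras-hub API.
def in_tf_graph():
    try:
        import tensorflow as tf

        return not tf.executing_eagerly()
    except ImportError:
        return False


class DispatchingLayer:
    def __init__(self, allow_python_workflow=True):
        self._allow_python_workflow = allow_python_workflow

    def __call__(self, inputs):
        if in_tf_graph() or not self._allow_python_workflow:
            return self._call_tf(inputs)
        return self._call_python(inputs)

    def _call_python(self, inputs):
        # Plain-Python path: operate on lists and strings directly.
        return [x.upper() for x in inputs]

    def _call_tf(self, inputs):
        # TensorFlow path would use tf.strings ops here (omitted).
        raise NotImplementedError


layer = DispatchingLayer()
print(layer(["a", "b"]))  # → ['A', 'B']
```

Keeping both paths behind one `__call__` is what makes the UX seamless: the same layer instance works in an eager Python loop and inside a `tf.data.Dataset.map`.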

Extensive tests have been added for these core modules:

  • keras_hub/src/tokenizers/byte_tokenizer_test.py
  • keras_hub/src/layers/preprocessing/start_end_packer_test.py

The diff is quite large due to updated tests. I found that all existing BytePairTokenizer tests were wrong: BPE requires every element referenced in merges to be present in the vocabulary.
Our tf and tf-text implementation is more permissive than widely adopted libraries such as tokenizers.
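One reading of the constraint the fixed tests now enforce can be sketched as a small validator: for every merge rule `"a b"`, the pieces `"a"`, `"b"`, and the merged token `"ab"` must all exist in the vocabulary, as libraries like Hugging Face `tokenizers` require. The `validate_bpe` helper below is illustrative, not part of keras-hub.

```python
# Minimal sketch (hypothetical helper) of the BPE invariant: every
# merge rule must reference only tokens present in the vocabulary,
# including the merged result.
def validate_bpe(vocabulary, merges):
    for rule in merges:
        left, right = rule.split(" ")
        for piece in (left, right, left + right):
            if piece not in vocabulary:
                raise ValueError(
                    f"Merge rule {rule!r} refers to {piece!r}, "
                    "which is missing from the vocabulary."
                )


vocab = {"a": 0, "b": 1, "ab": 2}
validate_bpe(vocab, ["a b"])  # passes
try:
    validate_bpe(vocab, ["a c"])  # "c" and "ac" are missing
except ValueError as e:
    print(e)
```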

Key updates for reviewers:

  • Added _allow_python_workflow to PreprocessingLayer to indicate whether a layer supports Python workflow
  • keras_hub/src/tokenizers/byte_pair_tokenizer.py
  • keras_hub/src/tokenizers/byte_pair_tokenizer_test.py
  • keras_hub/src/layers/preprocessing/start_end_packer.py
  • keras_hub/src/layers/preprocessing/start_end_packer_test.py
  • keras_hub/src/models/clip/clip_tokenizer.py: CLIPTokenizer uses very different settings from a standard BPE tokenizer.
  • keras_hub/src/models/clip/clip_tokenizer_test.py

Reference

Colab Notebook

https://colab.research.google.com/drive/1YDHUwdjwgO0BGuelYLTRNvVB1Ulc1K7X?usp=sharing


Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility and usability of key preprocessing layers, BytePairTokenizer and StartEndPacker, by introducing a Python-native execution path. This change allows developers to integrate these components more smoothly into diverse Python environments, reducing reliance on TensorFlow-specific constructs when not strictly necessary. The update also includes a crucial correction to the BytePair Encoding logic, ensuring more robust and standard-compliant tokenization.

Highlights

  • Python Workflow Enabled: The BytePairTokenizer and StartEndPacker layers now support a Python-native workflow, allowing them to be used directly without requiring TensorFlow or tf.data for execution. This provides a more seamless user experience.
  • Refactoring and Consolidation: The BytePairTokenizer and StartEndPacker layers, along with their respective test files, have been moved from the v2 API subdirectories to the main src directories. Their internal call methods were refactored to dispatch between TensorFlow graph mode and Python-native implementations based on the execution context.
  • BPE Vocabulary Correction: A critical bug in the BytePairTokenizer was addressed, ensuring that all elements in the BPE merges list are present in the vocabulary. This aligns the tokenizer's behavior with widely adopted BPE implementations.
  • Workflow Control Flag: An internal _allow_python_workflow flag was introduced to the PreprocessingLayer base class, allowing explicit control over whether a layer supports the Python workflow. This flag is leveraged by BytePairTokenizer and StartEndPacker.
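The Python-native packing path highlighted above can be sketched as a plain function. This is a rough approximation under stated assumptions (start/end tokens optional, end token kept when truncating, right-padding with a pad value); the real `StartEndPacker._call_python` differs in details.

```python
# Rough sketch of a Python-native start/end packing path: prepend a
# start token, append an end token, truncate to sequence_length, and
# right-pad, returning the packed ids and a boolean padding mask.
def pack(tokens, sequence_length, start_value=None, end_value=None, pad_value=0):
    out = list(tokens)
    if start_value is not None:
        out = [start_value] + out
    if end_value is not None:
        # Truncate first so the end token survives truncation.
        out = out[: sequence_length - 1] + [end_value]
    out = out[:sequence_length]
    padding_mask = [True] * len(out) + [False] * (sequence_length - len(out))
    out = out + [pad_value] * (sequence_length - len(out))
    return out, padding_mask


ids, mask = pack([5, 6, 7], 6, start_value=1, end_value=2)
print(ids)   # [1, 5, 6, 7, 2, 0]
print(mask)  # [True, True, True, True, True, False]
```

Because this path only touches Python lists, it runs identically under TensorFlow, JAX, and PyTorch backends.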


Changelog
  • keras_hub/api/layers/v2/__init__.py
    • Removed StartEndPacker import, as the layer was moved to the main src directory.
  • keras_hub/api/tokenizers/v2/__init__.py
    • Removed BytePairTokenizer import, as the tokenizer was moved to the main src directory.
  • keras_hub/src/layers/preprocessing/preprocessing_layer.py
    • Added _allow_python_workflow parameter to the __init__ method to conditionally assert TensorFlow library installation.
  • keras_hub/src/layers/preprocessing/start_end_packer.py
    • Introduced Python-native _call_python implementation for processing inputs.
    • Refactored the main call method to dynamically dispatch to either _call_tf or _call_python based on the execution context.
    • Updated __init__ to accept _allow_python_workflow and handle None values for special tokens.
    • Modified compute_output_shape to correctly return shapes when return_padding_mask is enabled.
  • keras_hub/src/layers/preprocessing/start_end_packer_test.py
    • Updated existing tests to use parameterized decorator, covering both TensorFlow and Python workflows for StartEndPacker.
  • keras_hub/src/layers/preprocessing/v2/start_end_packer.py
    • Removed the v2 version of StartEndPacker, consolidating its functionality into the main src directory.
  • keras_hub/src/layers/preprocessing/v2/start_end_packer_test.py
    • Removed the v2 test file for StartEndPacker.
  • keras_hub/src/models/bart/bart_seq_2_seq_lm_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests to reflect BPE fixes.
  • keras_hub/src/models/bart/bart_seq_2_seq_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/bart/bart_tokenizer_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/bloom/bloom_causal_lm_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/bloom/bloom_causal_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/bloom/bloom_tokenizer_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
    • Changed @pytest.mark.extra_large to @pytest.mark.large for test_smallest_preset.
  • keras_hub/src/models/causal_lm_preprocessor.py
    • Modified __init__ to pass _allow_python_workflow to the base class.
    • Refactored call, generate_preprocess, and generate_postprocess methods to dispatch between TensorFlow and Python-native implementations.
  • keras_hub/src/models/clip/clip_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
    • Added CLIPPreprocessorDisallowPythonWorkflowTest to verify behavior when Python workflow is disabled.
  • keras_hub/src/models/clip/clip_tokenizer.py
    • Integrated tokenizers library for Python workflow, including CLIP-specific normalizers and pre-tokenizers.
    • Refactored tokenize and detokenize into TensorFlow-specific (_tokenize_tf, _detokenize_tf) and tokenizers-specific (_tokenize_tokenizers, _detokenize_tokenizers) implementations.
    • Added _set_vocabulary_and_merges_tokenizers for configuring the tokenizers library's BPE model with CLIP settings.
  • keras_hub/src/models/clip/clip_tokenizer_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
    • Added CLIPTokenizerDisallowPythonWorkflowTest to verify behavior when Python workflow is disabled.
  • keras_hub/src/models/falcon/falcon_causal_lm_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/falcon/falcon_causal_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/falcon/falcon_tokenizer_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/gpt2/gpt2_causal_lm_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/gpt2/gpt2_causal_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/gpt2/gpt2_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/gpt2/gpt2_tokenizer_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/gpt_neo_x/gpt_neo_x_causal_lm_preprocessor_test.py
    • Removed unused keras.ops import.
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/gpt_neo_x/gpt_neo_x_causal_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/gpt_neo_x/gpt_neo_x_tokenizer_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/gpt_oss/gpt_oss_causal_lm_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/gpt_oss/gpt_oss_causal_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/llama3/llama3_causal_lm_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/llama3/llama3_causal_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/llama3/llama3_tokenizer_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/masked_lm_preprocessor.py
    • Explicitly disabled Python workflow for the tokenizer due to MultiSegmentPacker's TensorFlow dependency.
  • keras_hub/src/models/opt/opt_causal_lm_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/opt/opt_causal_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/opt/opt_tokenizer_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/qwen/qwen_causal_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/qwen3/qwen3_causal_lm_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/qwen3/qwen3_causal_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/qwen3_moe/qwen3_moe_causal_lm_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/qwen3_moe/qwen3_moe_causal_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/qwen_moe/qwen_moe_causal_lm_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/qwen_moe/qwen_moe_causal_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape in tests.
  • keras_hub/src/models/roberta/roberta_masked_lm_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs and sequence length in tests.
  • keras_hub/src/models/roberta/roberta_masked_lm_test.py
    • Updated vocabulary generation logic and adjusted expected_output_shape and sequence length in tests.
  • keras_hub/src/models/roberta/roberta_text_classifier_preprocessor_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/roberta/roberta_text_classifier_test.py
    • Updated vocabulary generation logic and adjusted sequence length in tests.
  • keras_hub/src/models/roberta/roberta_tokenizer_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs in tests.
  • keras_hub/src/models/seq_2_seq_lm_preprocessor.py
    • Explicitly disabled Python workflow for the tokenizer due to TensorFlow dependencies.
  • keras_hub/src/models/smollm3/smollm3_causal_lm_test.py
    • Updated vocabulary generation logic.
  • keras_hub/src/models/text_classifier_preprocessor.py
    • Explicitly disabled Python workflow for the tokenizer due to MultiSegmentPacker's TensorFlow dependency.
  • keras_hub/src/models/v2/causal_lm_preprocessor.py
    • Updated StartEndPacker import to point to the main src directory.
  • keras_hub/src/models/whisper/whisper_tokenizer_test.py
    • Updated vocabulary generation logic and adjusted expected token IDs and special token IDs in tests.
  • keras_hub/src/tokenizers/byte_pair_tokenizer.py
    • Integrated tokenizers library for Python workflow, adding _set_vocabulary_and_merges_tokenizers and SPLIT_PATTERN_TOKENIZERS.
    • Refactored __init__ to accept _allow_python_workflow and defer vocabulary initialization.
    • Split id_to_token, token_to_id, tokenize, and detokenize into TensorFlow-specific and tokenizers-specific implementations.
    • Modified set_vocabulary_and_merges to conditionally initialize TensorFlow or tokenizers backends.
    • Added lazy initialization methods (_maybe_initialized_tf, _maybe_initialized_tokenizers).
  • keras_hub/src/tokenizers/byte_pair_tokenizer_test.py
    • Updated setUp to explicitly disallow Python workflow for the base test class.
    • Added BytePairTokenizerDisallowPythonWorkflowTest to test TensorFlow-only behavior.
  • keras_hub/src/tokenizers/v2/byte_pair_tokenizer.py
    • Removed the v2 version of BytePairTokenizer, consolidating its functionality into the main src directory.
  • keras_hub/src/tokenizers/v2/byte_pair_tokenizer_test.py
    • Removed the v2 test file for BytePairTokenizer.
  • keras_hub/src/utils/tensor_utils.py
    • Added convert_preprocessing_outputs_python function to convert preprocessing outputs to a backend-agnostic format for Python workflow.
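The output conversion mentioned in the last changelog entry can be pictured with a small recursive helper. This is an illustrative sketch, not the actual `tensor_utils` code: it walks nested dicts/lists/tuples and converts anything tensor-like (any object exposing `.numpy()`) into plain Python lists.

```python
# Illustrative sketch (not the real convert_preprocessing_outputs_python)
# of turning nested preprocessing outputs into backend-agnostic Python
# containers: tensors become lists, containers are converted recursively.
def to_python(outputs):
    if isinstance(outputs, dict):
        return {k: to_python(v) for k, v in outputs.items()}
    if isinstance(outputs, (list, tuple)):
        return type(outputs)(to_python(v) for v in outputs)
    if hasattr(outputs, "numpy"):  # duck-typed: tf.Tensor, torch.Tensor, ...
        return outputs.numpy().tolist()
    return outputs


print(to_python({"token_ids": [1, 2], "padding_mask": (True, False)}))
```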
Activity
  • The pull request introduces a significant architectural change by enabling a Python-native workflow for BytePairTokenizer and StartEndPacker, which was a core objective.
  • Extensive unit tests have been added and updated across numerous tokenizer and preprocessor modules to ensure the new Python workflow and the BPE vocabulary fix function correctly.
  • The author has verified that the changes do not break existing code and work across multiple backends (TensorFlow, JAX, PyTorch), as indicated by the checklist.
  • The PR description highlights the large diff size due to updated tests, suggesting thorough validation of the changes.

@gemini-code-assist left a comment

Code Review

This pull request significantly improves the BytePairTokenizer and StartEndPacker layers by enabling Python workflow functionality, alongside the existing TensorFlow-compatible workflow. The refactoring of call methods into _call_tf and _call_python for StartEndPacker and the integration of the tokenizers library for BytePairTokenizer are well-executed. The extensive test updates, including parameterization for _allow_python_workflow, demonstrate thoroughness. The removal of v2 specific files for these components indicates a successful consolidation. However, there is an inconsistency in how invalid merge rules are handled between the Python and TensorFlow paths in BytePairTokenizer.

Note: Security Review did not run due to the size of the PR.

@james77777778 james77777778 added the kokoro:force-run Runs Tests on GPU label Mar 10, 2026
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Mar 10, 2026