Conversation

@sarahyurick
Contributor

Closes #1296.

@greptile-apps
Contributor

greptile-apps bot commented Jan 8, 2026

Greptile Summary

This PR adds a transformers_kwargs parameter to allow users to pass additional arguments (like trust_remote_code=True) to Hugging Face's from_pretrained() methods, addressing issue #1296, where models requiring custom code couldn't be loaded.
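
For illustration, end-user code enabled by this change might look roughly like the sketch below; the model name and the model_identifier argument are assumptions for the example, not taken from the diff:

from nemo_curator.stages.text.embedders.base import EmbeddingCreatorStage

# Hypothetical usage sketch: forward extra from_pretrained() arguments via transformers_kwargs.
# The model name and the model_identifier keyword are illustrative assumptions.
embedder = EmbeddingCreatorStage(
    model_identifier="nvidia/NV-Embed-v2",  # example of a model that requires custom code
    transformers_kwargs={"trust_remote_code": True},
)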

Key Changes:

  • Added transformers_kwargs parameter to EmbeddingModelStage, EmbeddingCreatorStage, TokenizerStage, MegatronTokenizerWriter, and TokenCountFilter
  • Added validation to prevent users from overriding internally-managed parameters (local_files_only, cache_dir, padding_side)

Issues Found:

  • TokenizerStage will crash with a TypeError when instantiated without transformers_kwargs because the validation checks the parameter before its default value is assigned (see the sketch after this list)
  • TokenCountFilter.load_tokenizer() validates transformers_kwargs but never passes it to from_pretrained(), making the feature ineffective for this class
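
To make the first issue concrete: a membership test against a None default raises a TypeError before the fallback to an empty dict ever runs. A minimal stand-alone sketch of the pattern (placeholder class names, not the actual TokenizerStage code):

class BrokenStageSketch:
    def __init__(self, transformers_kwargs: dict | None = None):
        # Bug: when the caller omits the argument, `"local_files_only" in None` raises
        # TypeError: argument of type 'NoneType' is not iterable
        if "local_files_only" in transformers_kwargs:
            msg = "Passing the local_files_only parameter is not allowed"
            raise ValueError(msg)
        self.transformers_kwargs = transformers_kwargs or {}

class FixedStageSketch:
    def __init__(self, transformers_kwargs: dict | None = None):
        # Fix: apply the default first, then validate.
        self.transformers_kwargs = transformers_kwargs or {}
        if "local_files_only" in self.transformers_kwargs:
            msg = "Passing the local_files_only parameter is not allowed"
            raise ValueError(msg)

FixedStageSketch()     # fine
# BrokenStageSketch()  # raises the TypeError described above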

Confidence Score: 2/5

  • This PR contains two bugs that will cause runtime failures and should be fixed before merging.
  • Score of 2 reflects: (1) TokenizerStage crashes on default usage due to validation ordering bug, (2) TokenCountFilter silently ignores transformers_kwargs. Both issues need resolution.
  • nemo_curator/stages/text/models/tokenizer.py (TypeError on default usage) and nemo_curator/stages/text/filters/heuristic_filter.py (kwargs never passed to from_pretrained)

Important Files Changed

  • nemo_curator/stages/text/models/tokenizer.py: Added transformers_kwargs parameter, but validation checks it before assigning the default, causing a TypeError when it is None.
  • nemo_curator/stages/text/filters/heuristic_filter.py: Added transformers_kwargs parameter but never passes it to from_pretrained(), making it ineffective.
  • nemo_curator/stages/text/embedders/base.py: Added transformers_kwargs to EmbeddingModelStage and EmbeddingCreatorStage with proper validation and usage.
  • nemo_curator/stages/text/io/writer/megatron_tokenizer.py: Added transformers_kwargs to MegatronTokenizerWriter with proper validation and usage.

Sequence Diagram

sequenceDiagram
    participant User
    participant EmbeddingCreatorStage
    participant TokenizerStage
    participant EmbeddingModelStage
    participant AutoTokenizer
    participant AutoModel

    User->>EmbeddingCreatorStage: Create with transformers_kwargs
    EmbeddingCreatorStage->>TokenizerStage: Pass transformers_kwargs
    EmbeddingCreatorStage->>EmbeddingModelStage: Pass transformers_kwargs
    
    Note over TokenizerStage: setup_on_node()
    TokenizerStage->>AutoTokenizer: from_pretrained(**transformers_kwargs)
    AutoTokenizer-->>TokenizerStage: tokenizer instance
    
    Note over EmbeddingModelStage: setup()
    EmbeddingModelStage->>AutoModel: from_pretrained(**transformers_kwargs)
    AutoModel-->>EmbeddingModelStage: model instance


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

Overview

This PR adds a transformers_kwargs parameter across multiple classes to allow users to pass additional arguments (like trust_remote_code=True) to Hugging Face's from_pretrained() methods. This addresses issue #1296, where models requiring custom code couldn't be loaded.

Key Changes

  1. EmbeddingModelStage - Added transformers_kwargs parameter with validation to prevent local_files_only override
  2. TokenizerStage - Added transformers_kwargs parameter with comprehensive validation for cache_dir, padding_side, and local_files_only
  3. MegatronTokenizerWriter - Added transformers_kwargs parameter with validation for cache_dir and local_files_only
  4. TokenCountFilter - Added transformers_kwargs parameter with validation for local_files_only in load_tokenizer()
  5. EmbeddingCreatorStage - Added transformers_kwargs field and propagates it to the child stages (see the sketch after this list)
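
That propagation could look roughly like the following stand-alone sketch; the class names, the model_identifier field, and the decompose() method are illustrative stand-ins, not the actual NeMo Curator API:

from dataclasses import dataclass
from typing import Any

# Hypothetical stand-ins used only to show the forwarding pattern.
@dataclass
class TokenizerStageSketch:
    model_identifier: str
    transformers_kwargs: dict[str, Any] | None = None

@dataclass
class EmbeddingModelStageSketch:
    model_identifier: str
    transformers_kwargs: dict[str, Any] | None = None

@dataclass
class EmbeddingCreatorStageSketch:
    model_identifier: str = "sentence-transformers/all-MiniLM-L6-v2"
    transformers_kwargs: dict[str, Any] | None = None

    def decompose(self) -> list:
        # Both child stages receive the same kwargs so tokenizer and model loading stay consistent.
        kwargs = self.transformers_kwargs or {}
        return [
            TokenizerStageSketch(self.model_identifier, transformers_kwargs=kwargs),
            EmbeddingModelStageSketch(self.model_identifier, transformers_kwargs=kwargs),
        ]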

Issues Found

1. Missing padding_side Validation in EmbeddingModelStage (Logic Error)

The EmbeddingModelStage class accepts a padding_side parameter in its __init__ signature but does NOT validate that users avoid also passing padding_side via transformers_kwargs. This is inconsistent with TokenizerStage, which properly validates this conflict. If a user passes transformers_kwargs={"padding_side": "left"} while also setting padding_side="right", the code will end up passing both values to AutoModel.from_pretrained(), resulting in a confusing TypeError about unexpected keyword arguments.

2. Inconsistent Validation Timing in TokenCountFilter (Style Issue)

The TokenCountFilter class validates local_files_only in the load_tokenizer() method instead of in __init__, unlike all other classes in this PR. This defers validation until runtime and is inconsistent with the pattern used elsewhere. Users won't discover configuration errors until load_tokenizer() is actually called. Additionally, the validation only runs if load_tokenizer() is called at all; if the user provides a pre-loaded tokenizer, the validation never runs.

Positive Aspects

  • MegatronTokenizerWriter and TokenizerStage have correct validation in place
  • transformers_kwargs is properly passed to from_pretrained() in all classes
  • Good protection against internal parameter conflicts
  • Proper parameter propagation in EmbeddingCreatorStage

Confidence Score: 2/5

  • This PR contains 2 significant issues that should be fixed before merging: missing validation in EmbeddingModelStage and inconsistent validation placement in TokenCountFilter.
  • The PR adds a useful feature (transformers_kwargs) but has two bugs that need to be addressed: (1) EmbeddingModelStage is missing validation for the padding_side parameter, which could lead to confusing runtime errors for users; (2) TokenCountFilter defers validation until load_tokenizer() is called instead of validating in __init__, creating inconsistent behavior and delayed error detection. Both issues are fixable but should be resolved before merging. The MegatronTokenizerWriter and TokenizerStage implementations are correct and well-validated.
  • nemo_curator/stages/text/embedders/base.py (missing padding_side validation in EmbeddingModelStage), nemo_curator/stages/text/filters/heuristic_filter.py (inconsistent validation timing in TokenCountFilter)

Important Files Changed

File Analysis

  • nemo_curator/stages/text/embedders/base.py (score 2/5): Added transformers_kwargs parameter and validation. Missing validation for the padding_side parameter, which should be checked to prevent conflicts.
  • nemo_curator/stages/text/filters/heuristic_filter.py (score 2/5): Added transformers_kwargs parameter. Validation for local_files_only is done in load_tokenizer() instead of __init__, delaying validation until runtime.
  • nemo_curator/stages/text/io/writer/megatron_tokenizer.py (score 4/5): Added transformers_kwargs parameter with proper validation for cache_dir and local_files_only conflicts. Correctly passes transformers_kwargs to from_pretrained().
  • nemo_curator/stages/text/models/tokenizer.py (score 4/5): Added transformers_kwargs parameter with comprehensive validation for cache_dir, padding_side, and local_files_only conflicts. Correctly passes transformers_kwargs to from_pretrained().

Sequence Diagram

sequenceDiagram
    participant User
    participant EmbeddingCreatorStage
    participant TokenizerStage
    participant EmbeddingModelStage
    participant AutoTokenizer
    participant AutoModel

    User->>EmbeddingCreatorStage: Initialize with transformers_kwargs
    EmbeddingCreatorStage->>TokenizerStage: Create with transformers_kwargs
    EmbeddingCreatorStage->>EmbeddingModelStage: Create with transformers_kwargs
    Note over EmbeddingModelStage: Missing padding_side validation!

    User->>TokenizerStage: Call setup()
    TokenizerStage->>AutoTokenizer: from_pretrained(..., **transformers_kwargs)
    AutoTokenizer-->>TokenizerStage: Returns tokenizer

    User->>EmbeddingModelStage: Call setup()
    EmbeddingModelStage->>AutoModel: from_pretrained(..., **transformers_kwargs)
    Note over AutoModel: If padding_side in transformers_kwargs,<br/>will raise TypeError
    AutoModel-->>EmbeddingModelStage: Returns model (or error)


Comment on lines 646 to 705
    def __init__(  # noqa: PLR0913
        self,
        tokenizer: AutoTokenizer | None = None,
        hf_model_name: str | None = None,
        hf_token: str | None = None,
        min_tokens: int = 0,
        max_tokens: int = float("inf"),
        transformers_kwargs: dict[str, Any] | None = None,
    ):
        """
        Args:
            tokenizer (AutoTokenizer | None): The pre-loaded tokenizer to use to count the tokens.
                If None, the tokenizer will be initialized from the hf_model_name.
            hf_model_name (str | None): The name of the Hugging Face model to use to count the tokens.
                If None, the pre-loaded tokenizer must be provided via the tokenizer argument.
            hf_token (str | None): The token to use to access the Hugging Face model, if needed.
            min_tokens (int): The minimum number of tokens the document must contain.
                Set to 0 to disable the minimum token count filter.
            max_tokens (int): The maximum number of tokens the document can contain.
                Set to infinity to disable the maximum token count filter.
            transformers_kwargs: Additional keyword arguments to pass to the tokenizer's from_pretrained method.
                Defaults to {}.
        """
        super().__init__()

        if tokenizer is None and hf_model_name is None:
            msg = "Either tokenizer or hf_model_name must be provided"
            raise ValueError(msg)
        if tokenizer is not None and hf_model_name is not None:
            msg = "Either tokenizer or hf_model_name must be provided, not both"
            raise ValueError(msg)

        self._token_count_filter_tokenizer = tokenizer
        self._hf_model_name = hf_model_name
        self._hf_token = hf_token
        self._min_tokens = min_tokens
        self._max_tokens = max_tokens
        self._transformers_kwargs = transformers_kwargs or {}
        self._name = "token_count"

    def model_check_or_download(self) -> None:
        if self._hf_model_name is not None:
            # Use snapshot_download to download all files without loading the model into memory.
            huggingface_hub.snapshot_download(
                repo_id=self._hf_model_name,
                token=self._hf_token,
                local_files_only=False,  # Download if not cached
                resume_download=True,  # Resume interrupted downloads
            )

    def load_tokenizer(self) -> None:
        if self._hf_model_name is not None:
            if "local_files_only" in self._transformers_kwargs and self._transformers_kwargs["local_files_only"] is not None:
                msg = "Passing the local_files_only parameter is not allowed"
                raise ValueError(msg)

            self._token_count_filter_tokenizer = AutoTokenizer.from_pretrained(
-               self._hf_model_name, local_files_only=True
+               self._hf_model_name, local_files_only=True, **self._transformers_kwargs
            )


The validation for local_files_only is performed in the load_tokenizer() method (lines 698-700) instead of in __init__ (lines 683-684). This creates inconsistent behavior with the other classes in this PR (TokenizerStage, EmbeddingModelStage, MegatronTokenizerWriter), which all validate in __init__ or setup().

By deferring validation until load_tokenizer(), users won't discover invalid configurations until runtime when load_tokenizer() is called. If the tokenizer is never loaded (e.g., if a pre-loaded tokenizer is provided via the tokenizer parameter), the validation never runs.

For consistency and fail-fast behavior, move the validation to __init__ immediately after self._transformers_kwargs = transformers_kwargs or {} on line 683, and remove the validation check from the load_tokenizer() method.
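
In code, the suggested rearrangement might look roughly like the following stand-alone sketch (the class name and structure are simplified stand-ins for the hunk above, not a verified patch):

from typing import Any

from transformers import AutoTokenizer

class TokenCountFilterSketch:
    """Hypothetical stand-in showing the fail-fast placement of the local_files_only check."""

    def __init__(self, hf_model_name: str, transformers_kwargs: dict[str, Any] | None = None):
        self._hf_model_name = hf_model_name
        self._transformers_kwargs = transformers_kwargs or {}
        # Fail fast: reject the reserved key here instead of waiting for load_tokenizer().
        if self._transformers_kwargs.get("local_files_only") is not None:
            msg = "Passing the local_files_only parameter is not allowed"
            raise ValueError(msg)
        self._token_count_filter_tokenizer = None

    def load_tokenizer(self) -> None:
        # No validation here anymore; just load the tokenizer.
        self._token_count_filter_tokenizer = AutoTokenizer.from_pretrained(
            self._hf_model_name, local_files_only=True, **self._transformers_kwargs
        )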

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@greptile-apps
Contributor

greptile-apps bot commented Jan 9, 2026

Additional Comments (1)

nemo_curator/stages/text/embedders/base.py
Missing validation for padding_side parameter. The EmbeddingModelStage accepts padding_side as a direct parameter (line 42) and passes it to the parent ModelStage class. However, unlike TokenizerStage, there is no validation to prevent users from also passing padding_side via transformers_kwargs.

If a user passes transformers_kwargs={"padding_side": "left"} while also setting padding_side="right" directly, the code will attempt to pass both to AutoModel.from_pretrained(), causing a TypeError about an unexpected keyword argument, since AutoModel doesn't accept padding_side.

To maintain consistency with TokenizerStage and provide clear error messages, add validation for padding_side:

transformers_kwargs = transformers_kwargs or {}
if "local_files_only" in transformers_kwargs and transformers_kwargs["local_files_only"] is not None:
    msg = "Passing the local_files_only parameter is not allowed"
    raise ValueError(msg)
if "padding_side" in transformers_kwargs and transformers_kwargs["padding_side"] is not None:
    msg = "Please pass the padding_side parameter directly to the stage instead of using the transformers_kwargs dictionary"
    raise ValueError(msg)
self.transformers_kwargs = transformers_kwargs


Development

Successfully merging this pull request may close these issues.

Unable to use models which need trust_remote_code=True in EmbeddingCreatorStage
