Tokenizer: clarify intended input validation and error handling for invalid inputs #513

Description

@DEEP-600

While testing Gemma2Tokenizer.encode(), I noticed that handling of invalid input types is inconsistent.

From the code and documentation, the tokenizer appears to intentionally support:

  • str
  • list[str]

However, inputs outside these types are not validated early. As a result:

  • some invalid inputs are silently accepted
  • others fail later with low-level Python errors
  • error messages are inconsistent and hard to interpret

Before proposing any fix, I want to better understand the intended behavior here.


Observed Behavior

Some examples using invalid inputs:

Input                  Current result
---------------------  -------------------------------------------------------
123                    TypeError: 'int' object is not iterable
["hello", 123]         AttributeError: 'int' object has no attribute 'replace'
{"text": "hello"}      Tokens returned (silently accepted)
b"hello"               AttributeError: 'int' object has no attribute 'replace'

The errors vary depending on the input type and often surface deeper in the call stack, exposing Python internals rather than clearly explaining the input mistake.
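
The b"hello" case most likely fails the same way as ["hello", 123] because iterating a bytes object yields ints, so each byte reaches the same string-handling path. This is my reading of the errors, not confirmed against the tokenizer internals:

# Iterating bytes yields ints, not one-character strings:
for b in b"hi":
    print(type(b).__name__)   # int, int

# So any per-element call like element.replace(...) then raises:
# AttributeError: 'int' object has no attribute 'replace'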


Expected Behavior

For inputs other than the supported types (str, list[str]):

  • The tokenizer should fail early
  • The error message should be clear and consistent
  • The error should indicate what input type was expected and what was actually received

For example:

TypeError: tokenizer.encode expects str or list[str], but got int

This would make input issues easier to understand and debug, especially in data pipelines and for new users.
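
For illustration, the check could be as small as the sketch below. The helper name and its placement are my assumptions, not the project's actual structure:

def _check_encode_input(text):
    # Hypothetical early-validation helper; names are illustrative only.
    if isinstance(text, str):
        return
    if isinstance(text, list) and all(isinstance(t, str) for t in text):
        return
    raise TypeError(
        f"tokenizer.encode expects str or list[str], but got {type(text).__name__}"
    )

Calling something like this at the top of encode() would turn all four cases in the table above into the same TypeError (the mixed-list case would be reported as list).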


Question for Maintainers

Could you please clarify the intended behavior of tokenizer.encode() when it receives unsupported input types?

In particular:

  • Should all unsupported input types be rejected early with a clear, consistent error?
  • Or is silent acceptance of certain structured inputs (e.g. dictionaries) intentional?

Once the intended behavior is confirmed, I’d be happy to work on a fix that aligns with the project’s design.


Reproduction

from gemma import gm

tokenizer = gm.text.Gemma2Tokenizer()

# Each entry is an unsupported input type; see the table above for results.
tests = [
    123,                 # plain int
    ["hello", 123],      # list containing a non-str element
    {"text": "hello"},   # dict (currently accepted silently)
    b"hello",            # bytes
]

for t in tests:
    try:
        print(tokenizer.encode(t))
    except Exception as e:
        print(type(e).__name__, e)
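
For reference, on the version I tested the loop prints roughly the following (token ids for the dict case elided):

TypeError 'int' object is not iterable
AttributeError 'int' object has no attribute 'replace'
[...]   # token ids returned for the dict input
AttributeError 'int' object has no attribute 'replace'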
