Description
While testing Gemma2Tokenizer.encode(), I noticed that handling of invalid input types is inconsistent.
From the code and documentation, the tokenizer appears to intentionally support:
- str
- list[str]
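For reference, a minimal sketch of the supported usage as I understand it (return values omitted, using the same default tokenizer construction as in the reproduction below):

```python
from gemma import gm

tokenizer = gm.text.Gemma2Tokenizer()

# Both of these are accepted per the documented input types:
tokenizer.encode("hello world")        # str input
tokenizer.encode(["hello", "world"])   # list[str] input
```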
However, inputs outside these types are not validated early. As a result:
- some invalid inputs are silently accepted
- others fail later with low-level Python errors
- error messages are inconsistent and hard to interpret
Before proposing any fix, I want to better understand the intended behavior here.
Observed Behavior
Some examples using invalid inputs:
| Input | Current result |
|---|---|
| `123` | `TypeError: 'int' object is not iterable` |
| `["hello", 123]` | `AttributeError: 'int' object has no attribute 'replace'` |
| `{"text": "hello"}` | Tokens returned |
| `b"hello"` | `AttributeError: 'int' object has no attribute 'replace'` |
The errors vary depending on the input type and often surface deeper in the call stack, exposing Python internals rather than clearly explaining the input mistake.
Expected Behavior
For inputs other than the supported types (str, list[str]):
- The tokenizer should fail early
- The error message should be clear and consistent
- The error should indicate what input type was expected and what was actually received
For example:
```
TypeError: tokenizer.encode expects str or list[str], but got int
```
This would make input issues easier to understand and debug, especially in data pipelines and for new users.
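To make the suggestion concrete, here is a rough sketch of the kind of early check I have in mind. It is illustrative only: the helper name is hypothetical and not part of the gemma API, and the final shape of the check would of course follow the project's conventions.

```python
def _check_encode_input(text):
    """Hypothetical helper (not in gemma): reject anything other than
    str or list[str] with a single, consistent TypeError."""
    if isinstance(text, str):
        return
    if isinstance(text, list) and all(isinstance(t, str) for t in text):
        return
    raise TypeError(
        f"tokenizer.encode expects str or list[str], but got {type(text).__name__}"
    )


_check_encode_input("hello")             # passes silently
_check_encode_input(["hello", "world"])  # passes silently

try:
    _check_encode_input(123)
except TypeError as e:
    print(e)  # tokenizer.encode expects str or list[str], but got int
```

Called at the top of encode(), a check along these lines would turn every case in the table above into the same, immediately readable TypeError raised at the call site.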
Question for Maintainers
Could you please clarify the intended behavior of tokenizer.encode() when it receives unsupported input types?
In particular:
- Should all unsupported input types be rejected early with a clear, consistent error?
- Or is silent acceptance of certain structured inputs (e.g. dictionaries) intentional?
Once the intended behavior is confirmed, I’d be happy to work on a fix that aligns with the project’s design.
Reproduction
```python
from gemma import gm

tokenizer = gm.text.Gemma2Tokenizer()

tests = [
    123,
    ["hello", 123],
    {"text": "hello"},
    b"hello",
]

for t in tests:
    try:
        print(tokenizer.encode(t))
    except Exception as e:
        print(type(e).__name__, e)
```