Tokenizer: clarify intended input validation and error handling for invalid inputs #513

Description

@DEEP-600

While testing Gemma2Tokenizer.encode(), I noticed that handling of invalid input types is inconsistent.

From the code and documentation, the tokenizer appears to intentionally support:

  • str
  • list[str]

However, inputs outside these types are not validated early. As a result:

  • some invalid inputs are silently accepted
  • others fail later with low-level Python errors
  • error messages are inconsistent and hard to interpret

Before proposing any fix, I want to better understand the intended behavior here.


Observed Behavior

Some examples using invalid inputs:

Input                  Current result
---------------------  -------------------------------------------------------
123                    TypeError: 'int' object is not iterable
["hello", 123]         AttributeError: 'int' object has no attribute 'replace'
{"text": "hello"}      Tokens returned (silently accepted)
b"hello"               AttributeError: 'int' object has no attribute 'replace'

The errors vary depending on the input type and often surface deeper in the call stack, exposing Python internals rather than clearly explaining the input mistake.
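
The b"hello" case most likely fails the same way as ["hello", 123] because iterating a bytes object yields ints, so each byte reaches the same string-handling path. This is my reading of the errors, not confirmed against the tokenizer internals:

# Iterating bytes yields ints, not one-character strings:
for b in b"hi":
    print(type(b).__name__)   # int, int

# So any per-element call like element.replace(...) then raises:
# AttributeError: 'int' object has no attribute 'replace'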


Expected Behavior

For inputs other than the supported types (str, list[str]):

  • The tokenizer should fail early
  • The error message should be clear and consistent
  • The error should indicate what input type was expected and what was actually received

For example:

TypeError: tokenizer.encode expects str or list[str], but got int

This would make input issues easier to understand and debug, especially in data pipelines and for new users.
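
For illustration, the check could be as small as the sketch below. The helper name and its placement are my assumptions, not the project's actual structure:

def _check_encode_input(text):
    # Hypothetical early-validation helper; names are illustrative only.
    if isinstance(text, str):
        return
    if isinstance(text, list) and all(isinstance(t, str) for t in text):
        return
    raise TypeError(
        f"tokenizer.encode expects str or list[str], but got {type(text).__name__}"
    )

Calling something like this at the top of encode() would turn all four cases in the table above into the same TypeError (the mixed-list case would be reported as list).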


Question for Maintainers

Could you please clarify the intended behavior of tokenizer.encode() when it receives unsupported input types?

In particular:

  • Should all unsupported input types be rejected early with a clear, consistent error?
  • Or is silent acceptance of certain structured inputs (e.g. dictionaries) intentional?

Once the intended behavior is confirmed, I’d be happy to work on a fix that aligns with the project’s design.


Reproduction

from gemma import gm

tokenizer = gm.text.Gemma2Tokenizer()

# Each entry is an unsupported input type; see the table above for results.
tests = [
    123,                 # plain int
    ["hello", 123],      # list containing a non-str element
    {"text": "hello"},   # dict (currently accepted silently)
    b"hello",            # bytes
]

for t in tests:
    try:
        print(tokenizer.encode(t))
    except Exception as e:
        print(type(e).__name__, e)
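
For reference, on the version I tested the loop prints roughly the following (token ids for the dict case elided):

TypeError 'int' object is not iterable
AttributeError 'int' object has no attribute 'replace'
[...]   # token ids returned for the dict input
AttributeError 'int' object has no attribute 'replace'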
