fix(typing): correct encode() input typing (PreTokenizedInputSequence tuple + stub Any) by Anai-Guo · Pull Request #2089 · huggingface/tokenizers

Anai-Guo · 2026-06-07T22:08:16Z

What

Fixes the two input-typing issues reported in #2088 so that type checkers correctly type Tokenizer.encode() (and async_encode()).

1. PreTokenizedInputSequence (__init__.py) — was Union[List[str], Tuple[str]]. Tuple[str] means a tuple of exactly one str, so type checkers reject correct programs that pass a pre-tokenized tuple of any other length. The intended meaning is an arbitrary-length tuple of strings, spelled Tuple[str, ...] (typing spec). This also matches the docstring ("A Tuple of str") and the List[str] alternative in the same union.

PreTokenizedInputSequence = Union[List[str], Tuple[str, ...]]
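To make the difference concrete, here is a minimal sketch that inspects both spellings at runtime. The names OldPreTokenized, NewPreTokenized, and count_tokens are hypothetical and exist only for illustration; they are not part of tokenizers.

```python
from typing import List, Tuple, Union, get_args

# Old alias: Tuple[str] is a fixed-length tuple holding exactly one str.
OldPreTokenized = Union[List[str], Tuple[str]]
# New alias: Tuple[str, ...] is a homogeneous tuple of any length.
NewPreTokenized = Union[List[str], Tuple[str, ...]]


def count_tokens(words: NewPreTokenized) -> int:
    """Hypothetical helper: accepts a list or an arbitrary-length tuple of str."""
    return len(words)


# Pull the tuple member out of each union and look at its type arguments.
old_tuple_args = get_args(get_args(OldPreTokenized)[1])
new_tuple_args = get_args(get_args(NewPreTokenized)[1])

print(old_tuple_args)  # one element type, so a 1-tuple only
print(new_tuple_args)  # element type followed by Ellipsis, so any length
print(count_tokens(("Hello", "there", "world")))
```

Under the old alias, a type checker would be expected to reject the three-element tuple passed to a parameter annotated with OldPreTokenized, even though the call is valid at runtime.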

2. Tokenizer.encode / async_encode stub (__init__.pyi) — the stub annotated sequence: Any (and pair: Any | None). Because a .pyi stub takes precedence over __init__.py for type checkers, this Any masked issue 1 entirely and let invalid calls like tokenizer.encode(3.14) pass mypy even though they fail at runtime with TypeError: TextInputSequence must be str.

The stub now defines the input-sequence aliases (mirroring __init__.py) and types the sequence/pair parameters as InputSequence:

TextInputSequence = str
PreTokenizedInputSequence = list[str] | tuple[str, ...]
InputSequence = TextInputSequence | PreTokenizedInputSequence

This is consistent with the existing docstrings, which already document sequence (~tokenizers.InputSequence).
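As a rough illustration of what the tightened annotations buy, the sketch below reuses the three aliases on a stand-in function. fake_encode is hypothetical and is not the real Tokenizer.encode; it only mimics the runtime rejection of non-string input so the example runs without tokenizers installed. The aliases are spelled with Union here (rather than the | syntax used in the stub) so the snippet also runs on Python 3.9.

```python
from typing import Optional, Union

# Same shape as the aliases added to the stub, spelled with Union for runtime use.
TextInputSequence = str
PreTokenizedInputSequence = Union[list[str], tuple[str, ...]]
InputSequence = Union[TextInputSequence, PreTokenizedInputSequence]


def _to_tokens(seq: object) -> list[str]:
    """Hypothetical helper: split raw text, pass pre-tokenized input through."""
    if isinstance(seq, str):
        return seq.split()
    if isinstance(seq, (list, tuple)) and all(isinstance(s, str) for s in seq):
        return list(seq)
    raise TypeError("input must be str or a list/tuple of str")


def fake_encode(
    sequence: InputSequence, pair: Optional[InputSequence] = None
) -> list[str]:
    """Hypothetical stand-in for Tokenizer.encode, typed like the updated stub."""
    tokens = _to_tokens(sequence)
    if pair is not None:
        tokens += _to_tokens(pair)
    return tokens


print(fake_encode("hello world"))
print(fake_encode(("a", "b"), ["c"]))
# fake_encode(3.14) would be flagged by a type checker, because float is not
# an InputSequence; with sequence: Any the same call passes static checking.
```

With sequence annotated as Any, the commented-out call would only fail at runtime; with InputSequence, a checker such as mypy should report it before the code runs.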

Why

Without these, mypy cannot catch invalid encode() inputs, and correct programs that pass tuple-of-str pre-tokenized input are wrongly rejected.

Closes #2088.

🤖 Generated with Claude Code


JEHoctor · 2026-06-08T15:46:48Z

Thank you @Anai-Guo for taking this up. This PR will surgically address the first type annotation issue I reported.

Maybe you can include your note on the second type annotation issue in the issue discussion as well?

Anai-Guo · 2026-06-08T16:11:36Z

Done — I've posted the writeup on the second (stub-generator) issue in #2088 so it's captured in the discussion thread. Happy to follow up with a separate PR on the tools/stub-gen side if that direction is wanted.

…ence in stub

The .pyi stub annotated encode(sequence: Any), which type checkers prefer over __init__.py and which masked the PreTokenizedInputSequence fix. Define the input-sequence aliases in the stub and use InputSequence so mypy can flag invalid encode() inputs.

Fixes huggingface#2088 (part 2).

fix(typing): PreTokenizedInputSequence should accept arbitrary-length…

4fdb0b6

… str tuples

Anai-Guo wants to merge 2 commits into huggingface:main from Anai-Guo:fix/pretokenized-tuple-typing
