fix(typing): correct encode() input typing (PreTokenizedInputSequence tuple + stub Any)#2089

Open
Anai-Guo wants to merge 2 commits into huggingface:main from Anai-Guo:fix/pretokenized-tuple-typing

Anai-Guo commented Jun 7, 2026

What

Fixes the two input-typing issues reported in #2088 so that type checkers correctly type Tokenizer.encode() (and async_encode()).

1. PreTokenizedInputSequence (__init__.py) — was Union[List[str], Tuple[str]]. Tuple[str] means a tuple of exactly one str, so type checkers reject correct programs that pass a pre-tokenized tuple of any other length. The intended meaning is an arbitrary-length tuple of strings, spelled Tuple[str, ...] (typing spec). This also matches the docstring ("A Tuple of str") and the List[str] alternative in the same union.

PreTokenizedInputSequence = Union[List[str], Tuple[str, ...]]
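
To make the difference concrete, here is a minimal, self-contained sketch. It does not import tokenizers; the aliases `OldPreTokenized` and `NewPreTokenized` and the helper `count_tokens` are hypothetical names used only for illustration. Comments describe what a type checker such as mypy is expected to report.

```python
from typing import List, Tuple, Union, get_args

# Old alias: Tuple[str] describes a tuple holding exactly one str.
OldPreTokenized = Union[List[str], Tuple[str]]

# New alias: Tuple[str, ...] describes a tuple of str of any length.
NewPreTokenized = Union[List[str], Tuple[str, ...]]


def count_tokens(tokens: NewPreTokenized) -> int:
    """Hypothetical helper: count the items of a pre-tokenized sequence."""
    return len(tokens)


# The two spellings differ in their type arguments at runtime as well.
one_item_args = get_args(Tuple[str])        # (str,)
any_length_args = get_args(Tuple[str, ...])  # (str, Ellipsis)

# Accepted under the new alias. If count_tokens were annotated with
# OldPreTokenized instead, a type checker would be expected to reject
# the two-element tuple, because Tuple[str] only matches one element.
pair_count = count_tokens(("Hello", "world"))
list_count = count_tokens(["Hello", "there", "world"])
```

Runtime behaviour is identical under both aliases; only what the type checker accepts changes.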

2. Tokenizer.encode / async_encode stub (__init__.pyi) — the stub annotated sequence: Any (and pair: Any | None). Because a .pyi stub takes precedence over __init__.py for type checkers, this Any masked issue 1 entirely and let invalid calls like tokenizer.encode(3.14) pass mypy even though they fail at runtime with TypeError: TextInputSequence must be str.

The stub now defines the input-sequence aliases (mirroring __init__.py) and types the sequence/pair parameters as InputSequence:

TextInputSequence = str
PreTokenizedInputSequence = list[str] | tuple[str, ...]
InputSequence = TextInputSequence | PreTokenizedInputSequence

This is consistent with the existing docstrings, which already document sequence (~tokenizers.InputSequence).
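
The masking effect of `Any` can be sketched without the library. In the example below, `encode_any`, `encode_typed`, and `_check` are hypothetical stand-ins, not the real Tokenizer API; `_check` only imitates the runtime validation and reuses the error message quoted above. The aliases use `typing.Union` so the sketch also runs on Python versions that lack `|` between types at runtime.

```python
from typing import Any, List, Tuple, Union

# Aliases mirroring the ones described above (Union spelling for portability).
TextInputSequence = str
PreTokenizedInputSequence = Union[List[str], Tuple[str, ...]]
InputSequence = Union[TextInputSequence, PreTokenizedInputSequence]


def _check(sequence: object) -> None:
    """Hypothetical runtime check standing in for the library's validation."""
    if isinstance(sequence, str):
        return
    if isinstance(sequence, (list, tuple)) and all(
        isinstance(item, str) for item in sequence
    ):
        return
    raise TypeError("TextInputSequence must be str")


def encode_any(sequence: Any) -> int:
    """Annotated like the old stub: any argument satisfies the type checker."""
    _check(sequence)
    return len(sequence)


def encode_typed(sequence: InputSequence) -> int:
    """Annotated like the new stub: only str, list of str, or tuple of str."""
    _check(sequence)
    return len(sequence)


# Passes type checking because of Any, yet fails at runtime.
try:
    encode_any(3.14)
    any_call_failed = False
except TypeError:
    any_call_failed = True

# Valid inputs are accepted by the typed signature. A call such as
# encode_typed(3.14) would be expected to be reported by a type checker
# before the program runs.
typed_tuple_len = encode_typed(("Hello", "world"))
typed_text_len = encode_typed("Hello")
```

The point of the sketch is that the annotation on the signature, not the runtime check, determines whether the invalid call is caught before the program runs.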

Why

Without these fixes, mypy cannot catch invalid encode() inputs, and correct programs that pass tuple-of-str pre-tokenized input are wrongly rejected.

Closes #2088.

🤖 Generated with Claude Code

JEHoctor commented Jun 8, 2026

Thank you @Anai-Guo for taking this up. This PR will surgically address the first type annotation issue I reported.

Maybe you can include your note on the second type annotation issue in the issue discussion as well?

Anai-Guo (Author) commented Jun 8, 2026

Done — I've posted the writeup on the second (stub-generator) issue in #2088 so it's captured in the discussion thread. Happy to follow up with a separate PR on the tools/stub-gen side if that direction is wanted.

…ence in stub

The .pyi stub annotated encode(sequence: Any), which type checkers prefer
over __init__.py and which masked the PreTokenizedInputSequence fix. Define
the input-sequence aliases in the stub and use InputSequence so mypy can
flag invalid encode() inputs. Fixes huggingface#2088 (part 2).

Development

Successfully merging this pull request may close these issues.

Two incorrect type annotations affect input typing to tokenizers.Tokenizer.encode()
