fix(typing): correct encode() input typing (PreTokenizedInputSequence tuple + stub Any)#2089
Open
Anai-Guo wants to merge 2 commits into
Conversation
Thank you @Anai-Guo for taking this up. This PR will surgically address the first type annotation issue I reported. Maybe you can include your note on the second type annotation issue in the issue discussion as well?
Author
Done — I've posted the writeup on the second (stub-generator) issue in #2088 so it's captured in the discussion thread. Happy to follow up with a separate PR on the |
…ence in stub

The .pyi stub annotated encode(sequence: Any), which type checkers prefer over __init__.py and which masked the PreTokenizedInputSequence fix. Define the input-sequence aliases in the stub and use InputSequence so mypy can flag invalid encode() inputs.

Fixes huggingface#2088 (part 2).
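The masking effect described in this commit message can be sketched in isolation. This is only an illustration under stated assumptions: `encode_loose`, `encode_strict`, and `_check` are hypothetical stand-ins rather than the library's API, and the aliases are modeled on the names used in this PR instead of being copied from the package.

```python
from typing import Any, List, Tuple, Union

# Assumed aliases, modeled on the names used in this PR (not copied from the library).
TextInputSequence = str
PreTokenizedInputSequence = Union[List[str], Tuple[str, ...]]
InputSequence = Union[TextInputSequence, PreTokenizedInputSequence]


def _check(sequence: object) -> int:
    """Hypothetical runtime validation standing in for the real encoder."""
    if isinstance(sequence, str):
        return 1
    if isinstance(sequence, (list, tuple)) and all(
        isinstance(token, str) for token in sequence
    ):
        return len(sequence)
    raise TypeError("TextInputSequence must be str")


def encode_loose(sequence: Any) -> int:
    # Any switches off checking: a type checker accepts every argument here,
    # so encode_loose(3.14) is only caught when the code runs.
    return _check(sequence)


def encode_strict(sequence: InputSequence) -> int:
    # With a precise annotation, a type checker can reject
    # encode_strict(3.14) before the code runs.
    return _check(sequence)
```

Both functions behave identically at runtime. The difference is only in what a static checker is able to report, which is why the stub's annotation matters more than the one in `__init__.py`.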
What
Fixes the two input-typing issues reported in #2088 so that type checkers correctly type Tokenizer.encode() (and async_encode()).

1. PreTokenizedInputSequence (__init__.py): was Union[List[str], Tuple[str]]. Tuple[str] means a tuple of exactly one str, so type checkers reject correct programs that pass a pre-tokenized tuple of any other length. The intended meaning is an arbitrary-length tuple of strings, spelled Tuple[str, ...] (typing spec). This also matches the docstring ("A Tuple of str") and the List[str] alternative in the same union.

2. Tokenizer.encode / async_encode stub (__init__.pyi): the stub annotated sequence: Any (and pair: Any | None). Because a .pyi stub takes precedence over __init__.py for type checkers, this Any masked issue 1 entirely and let invalid calls like tokenizer.encode(3.14) pass mypy even though they fail at runtime with TypeError: TextInputSequence must be str.

   The stub now defines the input-sequence aliases (mirroring __init__.py) and types the sequence / pair parameters as InputSequence. This is consistent with the existing docstrings, which already document sequence (~tokenizers.InputSequence).

Why

Without these, mypy cannot catch invalid encode() inputs, and correct programs that pass tuple-of-str pre-tokenized input are wrongly rejected.

Closes #2088.
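The difference between the two tuple spellings in item 1 can be seen with the standard `typing` introspection helpers. This is a minimal sketch: `tuple_matches` is a hypothetical helper written for illustration and is not part of tokenizers or of any type checker.

```python
from typing import Tuple, get_args

# Tuple[str] lists exactly one element type: a 1-tuple of str.
fixed_args = get_args(Tuple[str])
# Tuple[str, ...] carries an Ellipsis marker: any number of str elements.
variadic_args = get_args(Tuple[str, ...])


def tuple_matches(value: tuple, annotation) -> bool:
    """Hypothetical runtime check mimicking how a checker reads a tuple type."""
    args = get_args(annotation)
    if len(args) == 2 and args[1] is Ellipsis:
        # Variadic form: every element must match the single element type.
        return all(isinstance(item, args[0]) for item in value)
    # Fixed form: the length must match and each position has its own type.
    return len(value) == len(args) and all(
        isinstance(item, expected) for item, expected in zip(value, args)
    )


pre_tokenized = ("Hello", "world", "!")
old_ok = tuple_matches(pre_tokenized, Tuple[str])       # old annotation
new_ok = tuple_matches(pre_tokenized, Tuple[str, ...])  # fixed annotation
```

Under the old annotation only a one-element tuple such as `("Hello",)` would be accepted, which is the rejection of correct programs that the description refers to.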
🤖 Generated with Claude Code