Add PEP 561 type stubs for the Python package #1258
Add py.typed and extend setuptools package-data so stub files are distributed with the sentencepiece package. This is the packaging foundation for static type checking support (fixes google#1030).
Introduce __init__.pyi covering the most common public entry points: model loading via model_file, Encode/Decode, and vocabulary helpers. This addresses the primary Pylance/mypy complaints in google#1030.
Complete the public Python API surface documented in sentencepiece.i, including runtime-added methods like encode/decode and training helpers.
Provide a small script that exercises model_file, encode, and decode so type checkers can validate the stub surface in CI or local development.
Deepak-png981
left a comment
I initially hoped we could avoid hand-maintained stubs, either because SWIG would surface types automatically or because we could generate .pyi files when regenerating the wrapper. After reading the code path, that does not really work for this project. SWIG generates the C++ binding and a baseline __init__.py, but the API most users call (model_file on the constructor, encode/decode, etc.) is added in the Python layer of sentencepiece.i and at import time via setattr. Type checkers do not execute that logic, so they cannot infer the same surface you get at runtime.
Because of that, annotating the generated __init__.py is not a good fit either. It gets regenerated on SWIG updates, and it still would not describe the full public contract cleanly. Changing the SWIG pipeline to emit complete stubs would be a larger custom build step, not something this repo does today.
The approach in this PR is the usual one for native extension packages (PEP 561: py.typed plus __init__.pyi). Many open source libraries ship types this way when the runtime is C/C++ or SWIG/pybind-style bindings. We document the public Python API users actually import, ship it in the wheel, and leave runtime behavior unchanged. We are only making that contract visible to Pylance, mypy, and pyright.
The tradeoff is maintenance. __init__.pyi needs to stay in sync when the public interface changes in sentencepiece.i (new kwargs, renamed methods, new snake_case aliases, etc.). It does not need to track every internal C++ or SWIG wrapper detail, only the documented Python surface. If the API changes in a PR, the stubs should be updated in the same PR, ideally with typing_smoke.py or pyright in CI to catch drift.
Please let me know if you would like any additional tests for this change. I added this file for a basic pyright check, and I am still getting familiar with the repo.
py.typed is intentionally empty. Under PEP 561, this file is only a marker that tells type checkers this package ships type information. The actual types live in __init__.pyi.
Is it possible to add comments to this file, or must it be left entirely blank?
Yes, it can contain comments. Under PEP 561, type checkers only look for the file's presence in the package; the contents are ignored. I've added a short comment explaining what it is and pointing to __init__.pyi.
Deepak-png981
left a comment
Hello @taku910, I'm not sure whether any action is required from my side to complete the pending GitHub Action. Please let me know if there's anything I need to do.
Thank you for this change. We've added a new Python API for parallel encoding. Could you update this PR? One question: did you create this file manually or was it made by AI? We would like to make the process fully automated, so it would be nice to update the definitions automatically.
Could this PR resolve #1030? Thanks again.
The public Python API is assembled at import time in sentencepiece.i (setattr, _add_snake_case, Tokenize = Encode), so a static parse cannot see it. tools/gen_stubs.py imports the built package and discovers the method list from the live classes, then applies a small curated type overlay. Unknown members get a permissive fallback and are reported, so the stub stays complete even when the API grows.
Run tools/gen_stubs.py against the v0.2.2 module so the stubs cover ParallelEncode/ParallelEncodeAs*, the new ThreadPool class and the thread_pool argument, plus Tokenize/Detokenize and the init alias. Regeneration also corrects load_from_rule_t_s_v (the hand-written stub had load_from_rule_tsv) and uses list[...] batch overloads to match the runtime type(arg) is list dispatch.
Exercises the generator under controlled mutations of the imported module: a new method is auto-discovered and flagged, a removed method drops out, new properties fall back to Any, leaked imports are ignored, and output is deterministic. Each case asserts that drift would be caught by the --check gate.
Wrap the checks in never-called functions so importing the module has no runtime side effects, and add coverage for parallel_encode, ThreadPool and the snake_case aliases.
Add a short comment explaining that the file is a PEP 561 marker whose contents are ignored; the actual types live in __init__.pyi.
Builds the package, runs gen_stubs.py --check to fail on stub drift, runs the generator scenario tests, then type-checks the public API smoke test with pyright and mypy.
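The "never-called functions" idea from the smoke-test commit can be sketched as follows. This is not the contents of typing_smoke.py, which are not shown in this thread; the function names and the `TYPE_CHECKING` guard are assumptions made for illustration. The calls themselves (`model_file`, `encode` with `out_type`, `decode`, `get_piece_size`, `piece_to_id`) are part of the public sentencepiece Python API.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers, so importing this module at runtime
    # does not load the native extension or touch any model file.
    import sentencepiece as spm


def check_encode_decode(model_path: str) -> None:
    """Never called: exists only so a type checker validates these calls."""
    sp = spm.SentencePieceProcessor(model_file=model_path)
    ids = sp.encode("hello world", out_type=int)
    sp.decode(ids)


def check_vocab(model_path: str) -> None:
    """Never called: covers a couple of vocabulary helpers."""
    sp = spm.SentencePieceProcessor(model_file=model_path)
    sp.get_piece_size()
    sp.piece_to_id("hello")
```

Running pyright or mypy on a file like this, in an environment where the package and its stub are installed, reports an error if the stub is missing one of these members or declares an incompatible signature.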
Initially I wrote the stub by hand, mostly to understand the codebase and how the public API is put together. But you are right that this should be automated, so now we generate it instead. I also rebased on the latest master so the stubs cover the new parallel encoding API (ParallelEncode / ParallelEncodeAs*, the ThreadPool class, and the thread_pool argument).
For the automation, tools/gen_stubs.py imports the built package and discovers the public method list from the live classes. That way it sees the surface that gets assembled at import time (setattr, _add_snake_case, Tokenize = Encode), which a static parse of sentencepiece.i cannot. It only keeps a small curated table for the parameter types, and anything new but uncurated still gets a permissive (*args, **kwargs) stub (and is reported), so the stub is always complete.
I added a typing.yml workflow that runs gen_stubs.py --check to fail the build if the committed stub drifts from the runtime, plus a few scenario tests for the generator and pyright/mypy on a smoke test. As a side effect, it already caught a bug in my earlier manual stub (load_from_rule_tsv vs. the real runtime name load_from_rule_t_s_v).
So the flow now is: change the API, rebuild, run gen_stubs.py --output ..., and CI enforces that it stays in sync. I validated this workflow on my fork and it passes (build, --check, scenario tests, pyright, mypy). Link to the action: https://github.com/Deepak-png981/sentencepiece/actions/runs/27125948086/job/80054450339
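The discover-then-overlay flow described in this comment might look roughly like the sketch below. It is not the code of tools/gen_stubs.py; `OVERLAY`, `FALLBACK`, `discover_methods`, `generate_stub`, and the `Demo` class are all hypothetical names, and the overlay signature is made up for the example.

```python
import inspect

# Curated overlay: hand-written signatures for members that are well understood.
# (Hypothetical entry; a real table would be larger.)
OVERLAY = {
    "encode": "def encode(self, text: str) -> list[str]: ...",
}

# Permissive fallback for anything discovered at runtime but not curated.
FALLBACK = "def {name}(self, *args: Any, **kwargs: Any) -> Any: ..."


def discover_methods(cls):
    """Return sorted public callables on the live class, including setattr additions."""
    return sorted(
        name
        for name, _ in inspect.getmembers(cls, callable)
        if not name.startswith("_")
    )


def generate_stub(cls, overlay=OVERLAY):
    """Render stub text for cls and return (stub_text, uncurated_names)."""
    lines = ["from typing import Any", "", f"class {cls.__name__}:"]
    uncurated = []
    names = discover_methods(cls)
    for name in names:
        if name in overlay:
            lines.append("    " + overlay[name])
        else:
            uncurated.append(name)
            lines.append("    " + FALLBACK.format(name=name))
    if not names:
        lines.append("    ...")  # keep the class body syntactically valid
    return "\n".join(lines) + "\n", uncurated


class Demo:
    def encode(self, text):
        return text.split()


# Simulate a method that only appears at import time.
Demo.parallel_encode = lambda self, texts: [t.split() for t in texts]

stub, uncurated = generate_stub(Demo)
print(stub)
print("uncurated:", uncurated)
```

Sorting the discovered names keeps the output deterministic, and returning the uncurated names separately lets a caller report them without failing the generation.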
We are currently migrating from SWIG to pybind11, and most of the routine conversion work is being handled by an AI (antigravity agent). As a test, simply instructing it to "add type annotations" allowed it to generate something almost identical to the contents of the PR. (This is not included in the PR yet.) As with the code in question, even if it is automated, having a large amount of code dedicated to automation means that the automation code itself will require maintenance. Sorry, but could you please remove the changes to .github/workflows? The code hasn't been fully tested yet. I will merge the current PR, but please allow us to consider our approach further, including the AI's auto-generation capabilities. Please note that the code may be modified or deleted without prior notice. Thank you for your understanding.
We haven't decided for sure yet, but right now it feels like the most realistic approach is to just keep the test code and let the AI handle the generation. The agent will run the tests and fix things on its own.





Fixes #1030
Summary
The Python package is built with SWIG and exposes much of its user-facing API at import time (for example model_file on the constructor and decode as a snake_case alias). Static analyzers cannot infer that surface from the generated runtime code alone, which leads to false positives in Pylance, pyright, and mypy (#1030).
This change adds official typing support:
- py.typed marker (PEP 561)
- __init__.pyi describing the documented public API (now covering the new ParallelEncode/ParallelEncodeAs* API, the ThreadPool class and the thread_pool argument)
- tools/gen_stubs.py, which generates the stub by introspecting the imported module
- typing_smoke.py script for local/CI type checking
- typing.yml GitHub Actions workflow that keeps the stub in sync and type-checks the public API
Approach
Stubs describe the Python API users call (as defined in sentencepiece.i and the Python README), not every internal SWIG symbol. Runtime behavior is unchanged.
To address the maintenance/automation concern, tools/gen_stubs.py imports the built package and discovers the public method list from the live classes (so it cannot silently drift when a method or snake_case alias is added), then applies a small curated overlay for parameter types. Anything new but uncurated still gets a permissive (*args, **kwargs) -> Any stub (and is reported), so the stub is always complete. CI runs gen_stubs.py --check to fail the build if the committed stub is stale, plus scenario tests for the generator and pyright/mypy on the smoke test.
Notes for reviewers
The stub is generated, not hand-edited. After changing the public Python API, rebuild the package and rerun tools/gen_stubs.py with --output ... to regenerate the stub.
The CI --check step enforces that the committed stub matches a fresh generation.
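A drift gate of this kind can be as small as a string comparison plus a diff. The sketch below is an assumption about the general shape, not the actual --check implementation; `check_stub` is a hypothetical helper, and the two stub strings are invented inputs that reuse the method names mentioned in this thread.

```python
import difflib


def check_stub(committed: str, fresh: str, path: str = "__init__.pyi") -> int:
    """Return 0 if the committed stub matches a fresh generation, else print a diff and return 1."""
    if committed == fresh:
        return 0
    diff = difflib.unified_diff(
        committed.splitlines(keepends=True),
        fresh.splitlines(keepends=True),
        fromfile=f"{path} (committed)",
        tofile=f"{path} (generated)",
    )
    print("".join(diff), end="")
    return 1


# Illustrative inputs only: a stale committed stub versus a fresh generation.
committed = "class P:\n    def load_from_rule_tsv(self) -> None: ...\n"
fresh = "class P:\n    def load_from_rule_t_s_v(self) -> None: ...\n"

exit_code = check_stub(committed, fresh)
```

Returning a nonzero code rather than raising lets a CI step pass the value to `sys.exit`, so the printed diff shows reviewers which members drifted.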