
Add PEP 561 type stubs for the Python package #1258

Open
Deepak-png981 wants to merge 11 commits into google:master from Deepak-png981:add-python-type-stubs


@Deepak-png981 Deepak-png981 commented Jun 1, 2026

Contributor

Fixes #1030

Summary

The Python package is built with SWIG and exposes much of its user-facing API at import time (for example model_file on the constructor and decode as a snake_case alias). Static analyzers cannot infer that surface from the generated runtime code alone, which leads to false positives in Pylance, pyright, and mypy (#1030).

This change adds official typing support:

  • py.typed marker (PEP 561)
  • __init__.pyi describing the documented public API (now covering the new ParallelEncode/ParallelEncodeAs* API, the ThreadPool class and the thread_pool argument)
  • setuptools package-data updates so stubs ship inside wheels
  • tools/gen_stubs.py, which generates the stub by introspecting the imported module
  • a small typing_smoke.py script for local/CI type checking
  • a typing.yml GitHub Actions workflow that keeps the stub in sync and type-checks the public API

Approach

Stubs describe the Python API users call (as defined in sentencepiece.i and the Python README), not every internal SWIG symbol. Runtime behavior is unchanged.

To address the maintenance/automation concern, tools/gen_stubs.py imports the built package and discovers the public method list from the live classes (so it cannot silently drift when a method or snake_case alias is added), then applies a small curated overlay for parameter types. Anything new but uncurated still gets a permissive (*args, **kwargs) -> Any stub (and is reported), so the stub is always complete. CI runs gen_stubs.py --check to fail the build if the committed stub is stale, plus scenario tests for the generator and pyright/mypy on the smoke test.
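The discover-then-overlay flow described above can be sketched roughly as follows. This is a minimal illustration, not the contents of tools/gen_stubs.py: the `Processor` class, the `OVERLAY` table, and `generate_stub` are hypothetical names invented for the example, and the real generator works against the imported sentencepiece module.

```python
import inspect
from typing import Dict, List, Tuple


class Processor:
    """Hypothetical stand-in for a class whose API is partly built at import time."""

    def Encode(self, text):
        return text.split()

    def Decode(self, pieces):
        return " ".join(pieces)


# Aliases attached at runtime; a static parse of the class body cannot see them.
setattr(Processor, "encode", Processor.Encode)
setattr(Processor, "decode", Processor.Decode)
setattr(Processor, "Tokenize", Processor.Encode)

# Curated overlay (assumed shape): hand-written signatures for known members.
OVERLAY: Dict[str, str] = {
    "Encode": "(self, text: str) -> list[str]",
    "encode": "(self, text: str) -> list[str]",
    "Decode": "(self, pieces: list[str]) -> str",
    "decode": "(self, pieces: list[str]) -> str",
}
FALLBACK = "(self, *args: Any, **kwargs: Any) -> Any"


def generate_stub(cls: type) -> Tuple[str, List[str]]:
    """Return (stub text, names that fell back to the permissive signature)."""
    lines = ["from typing import Any", "", f"class {cls.__name__}:"]
    uncurated: List[str] = []
    # Walk the live class so runtime-added members are included; sort for
    # deterministic output.
    for name, member in sorted(vars(cls).items()):
        if name.startswith("_") or not inspect.isfunction(member):
            continue
        signature = OVERLAY.get(name)
        if signature is None:
            signature = FALLBACK
            uncurated.append(name)
        lines.append(f"    def {name}{signature}: ...")
    return "\n".join(lines) + "\n", uncurated


stub_text, uncurated_names = generate_stub(Processor)
print(stub_text)
print("uncurated:", uncurated_names)
```

In this sketch `Tokenize` has no overlay entry, so it still appears in the stub with the permissive signature and is reported in `uncurated_names`.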

Notes for reviewers

The stub is generated, not hand-edited. After changing the public Python API, rebuild the package and run:

python tools/gen_stubs.py --output src/sentencepiece/__init__.pyi

The CI --check step enforces that the committed stub matches a fresh generation.
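A check of this kind could look roughly like the sketch below: compare the committed file with a fresh generation and return a non-zero status on any difference. The function name `check_stub` and the sample stub contents are assumptions made for illustration; the actual --check implementation in tools/gen_stubs.py may differ.

```python
import difflib
import tempfile
from pathlib import Path


def check_stub(committed_path: Path, fresh_text: str) -> int:
    """Return 0 if the committed stub matches fresh_text, else print a diff and return 1."""
    committed = ""
    if committed_path.exists():
        committed = committed_path.read_text(encoding="utf-8")
    if committed == fresh_text:
        return 0
    diff = difflib.unified_diff(
        committed.splitlines(keepends=True),
        fresh_text.splitlines(keepends=True),
        fromfile=str(committed_path),
        tofile="fresh generation",
    )
    print("".join(diff), end="")
    return 1


# Demonstration with a temporary file standing in for the committed stub.
with tempfile.TemporaryDirectory() as tmp:
    stub = Path(tmp) / "__init__.pyi"
    current = "def encode(text: str) -> list[int]: ...\n"
    stub.write_text(current, encoding="utf-8")
    up_to_date = check_stub(stub, current)
    stale = check_stub(stub, current + "def decode(ids: list[int]) -> str: ...\n")
```

A CI step would pass the returned value to `sys.exit`, so a stale stub fails the build.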

Add py.typed and extend setuptools package-data so stub files are
distributed with the sentencepiece package. This is the packaging
foundation for static type checking support (fixes google#1030).

Introduce __init__.pyi covering the most common public entry points:
model loading via model_file, Encode/Decode, and vocabulary helpers.
This addresses the primary Pylance/mypy complaints in google#1030.

Complete the public Python API surface documented in sentencepiece.i,
including runtime-added methods like encode/decode and training helpers.

Provide a small script that exercises model_file, encode, and decode so
type checkers can validate the stub surface in CI or local development.

@Deepak-png981 Deepak-png981 left a comment

Contributor Author

Before: [image]
After: [image]

@Deepak-png981 Deepak-png981 left a comment

Contributor Author

I initially hoped we could avoid hand-maintained stubs, either because SWIG would surface types automatically or because we could generate .pyi files when regenerating the wrapper. After reading the code path, that does not really work for this project. SWIG generates the C++ binding and a baseline __init__.py, but the API most users call (model_file on the constructor, encode/decode, etc.) is added in the Python layer of sentencepiece.i and at import time via setattr. Type checkers do not execute that logic, so they cannot infer the same surface you get at runtime.

Because of that, annotating the generated __init__.py is not a good fit either. It gets regenerated on SWIG updates, and it still would not describe the full public contract cleanly. Changing the SWIG pipeline to emit complete stubs would be a larger custom build step, not something this repo does today.

The approach in this PR is the usual one for native extension packages (PEP 561: py.typed plus __init__.pyi). Many open source libraries ship types this way when the runtime is C/C++ or SWIG/pybind-style bindings. We document the public Python API users actually import, ship it in the wheel, and leave runtime behavior unchanged. We are only making that contract visible to Pylance, mypy, and pyright.

The tradeoff is maintenance. __init__.pyi needs to stay in sync when the public interface changes in sentencepiece.i (new kwargs, renamed methods, new snake_case aliases, etc.). It does not need to track every internal C++ or wrapper detail, only the documented Python surface. If the API changes in a PR, the stubs should be updated in the same PR, ideally with typing_smoke.py or pyright in CI to catch drift.
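The gap between a static reading of a class body and the surface that exists after import can be illustrated with a small self-contained sketch. The `Processor` class and the helper names are hypothetical; the wiring only mimics, in spirit, the setattr-based aliasing mentioned in this comment, and the `ast` walk is a deliberately naive stand-in for what an analyzer sees without executing code.

```python
import ast

SOURCE = '''
class Processor:
    def Encode(self, text):
        return text.split()

# Import-time wiring: methods attached after the class body is finished.
setattr(Processor, "encode", Processor.Encode)
Processor.Tokenize = Processor.Encode
'''


def static_methods(source: str, class_name: str) -> set:
    """Method names visible from parsing the class body alone (no execution)."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {n.name for n in node.body if isinstance(n, ast.FunctionDef)}
    return set()


def runtime_methods(source: str, class_name: str) -> set:
    """Public callables present on the class after the module code has run."""
    namespace: dict = {}
    exec(source, namespace)
    cls = namespace[class_name]
    return {
        name
        for name in vars(cls)
        if not name.startswith("_") and callable(getattr(cls, name))
    }


static_view = static_methods(SOURCE, "Processor")
runtime_view = runtime_methods(SOURCE, "Processor")
missing_statically = sorted(runtime_view - static_view)
print(missing_statically)
```

Here `encode` and `Tokenize` exist at runtime but are absent from the parsed class body, which is the kind of surface a stub file has to spell out.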

Contributor Author

Please let me know if you would like any additional tests for this change. I added this file for a basic pyright check, and I am still getting familiar with the repo.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

py.typed is intentionally empty. Under PEP 561, this file is only a marker that tells type checkers this package ships type information. The actual types live in __init__.pyi.

Collaborator

Is it possible to add comments to this file, or must it be left entirely blank?

Contributor Author

Yes, it can contain comments. Under PEP 561, type checkers only look for the file's presence in the package; the contents are ignored. I've added a short comment explaining what it is and pointing to __init__.pyi.
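The presence-only rule can be sketched as a small check. The helper name `ships_type_marker` and the `demo_pkg` directory are hypothetical and exist only for this illustration; real type checkers perform their own lookup.

```python
import tempfile
from pathlib import Path


def ships_type_marker(package_dir: Path) -> bool:
    """True if the package directory contains a PEP 561 py.typed marker file."""
    return (package_dir / "py.typed").is_file()


with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.pyi").write_text(
        "def encode(text: str) -> list[int]: ...\n", encoding="utf-8"
    )
    before = ships_type_marker(pkg)
    # The marker may hold a comment; only its existence matters for the check.
    (pkg / "py.typed").write_text(
        "# PEP 561 marker; types live in __init__.pyi\n", encoding="utf-8"
    )
    after = ships_type_marker(pkg)
```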

@Deepak-png981 Deepak-png981 left a comment

Contributor Author

Hello @taku910 , I'm not sure whether any action is required from my side to complete the pending GitHub Action. Please let me know if there's anything I need to do.



taku910 commented Jun 8, 2026

Collaborator

Thank you for this change.

We've added a new Python API for parallel encoding. Could you update this PR?
We will not update the python API for the next release (v0.2.2)

One question: did you create this file manually or was it made by AI? We would like to make the process fully automated, so it would be nice to update the definitions automatically.


taku910 commented Jun 8, 2026

Collaborator

Could this PR resolve #1030? Thanks again.

The public Python API is assembled at import time in sentencepiece.i
(setattr, _add_snake_case, Tokenize = Encode), so a static parse cannot
see it. tools/gen_stubs.py imports the built package and discovers the
method list from the live classes, then applies a small curated type
overlay. Unknown members get a permissive fallback and are reported, so
the stub stays complete even when the API grows.

Run tools/gen_stubs.py against the v0.2.2 module so the stubs cover
ParallelEncode/ParallelEncodeAs*, the new ThreadPool class and the
thread_pool argument, plus Tokenize/Detokenize and the init alias.
Regeneration also corrects load_from_rule_t_s_v (the hand-written stub
had load_from_rule_tsv) and uses list[...] batch overloads to match the
runtime type(arg) is list dispatch.

Exercises the generator under controlled mutations of the imported
module: a new method is auto-discovered and flagged, a removed method
drops out, new properties fall back to Any, leaked imports are ignored,
and output is deterministic. Each case asserts that drift would be
caught by the --check gate.

Wrap the checks in never-called functions so importing the module has no
runtime side effects, and add coverage for parallel_encode, ThreadPool
and the snake_case aliases.

Add a short comment explaining that the file is a PEP 561 marker whose
contents are ignored; the actual types live in __init__.pyi.

Builds the package, runs gen_stubs.py --check to fail on stub drift,
runs the generator scenario tests, then type-checks the public API
smoke test with pyright and mypy.
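The choice of list[...] batch overloads mentioned in the commit messages above can be sketched as follows. The sketch assumes Python 3.9+, and both `encode` and `_encode_one` are hypothetical module-level functions written for illustration; they are not the sentencepiece signatures.

```python
from typing import Union, overload


def _encode_one(text: str) -> list[int]:
    # Hypothetical per-string encoder: one id per character.
    return [ord(ch) for ch in text]


@overload
def encode(text: str) -> list[int]: ...
@overload
def encode(text: list[str]) -> list[list[int]]: ...
def encode(text: Union[str, list[str]]) -> Union[list[int], list[list[int]]]:
    # Exact type check: only a real list takes the batch path.
    if type(text) is list:
        return [_encode_one(item) for item in text]
    return _encode_one(text)


single = encode("ab")
batch = encode(["a", "b"])
```

Because the runtime check is an exact `type(...) is list` comparison, a tuple or other sequence would not take the batch path, so annotating the batch overload as `Sequence[str]` would promise more than the dispatch delivers.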
@Deepak-png981

Contributor Author

> Thank you for this change.
>
> We've added a new Python API for parallel encoding. Could you update this PR? We will not update the python API for the next release (v0.2.2)
>
> One question: did you create this file manually or was it made by AI? We would like to make the process fully automated, so it would be nice to update the definitions automatically.

Initially I wrote the stub by hand, mostly to understand the codebase and how the public API is put together. But you are right that this should be automated, so now we generate it instead. I also rebased on the latest master so the stubs cover the new parallel encoding API (ParallelEncode / ParallelEncodeAs*, the ThreadPool class, and the thread_pool argument).

For the automation, tools/gen_stubs.py imports the built package and discovers the public method list from the live classes. That way it sees the surface that gets assembled at import time (setattr, _add_snake_case, Tokenize = Encode), which a static parse of sentencepiece.i cannot. It only keeps a small curated table for the parameter types, and anything new but uncurated still gets a permissive (*args, **kwargs) stub (and is reported), so the stub is always complete.

I added a typing.yml workflow that runs gen_stubs.py --check to fail the build if the committed stub drifts from the runtime, plus a few scenario tests for the generator and pyright/mypy on a smoke test. As a side effect, it already caught a bug in my earlier manual stub (load_from_rule_tsv vs. the real runtime name load_from_rule_t_s_v). So the flow now is: change the API, rebuild, run gen_stubs.py --output ..., and CI enforces that it stays in sync. I validated this workflow on my fork and it passes (build, --check, scenario tests, pyright, mypy). Link to the action: https://github.com/Deepak-png981/sentencepiece/actions/runs/27125948086/job/80054450339
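The load_from_rule_t_s_v name is what a mechanical CamelCase-to-snake_case conversion produces when every capital letter gets its own underscore. The sketch below shows one such conversion; `to_snake_case`, `add_snake_case_aliases`, the `Demo` class, and the CamelCase spelling `LoadFromRuleTSV` are assumptions made for illustration and are not quoted from sentencepiece.i.

```python
import re


def to_snake_case(name: str) -> str:
    # Insert "_" before every uppercase letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def add_snake_case_aliases(cls: type) -> list:
    """Attach a snake_case alias for each CamelCase method; return the alias names."""
    added = []
    for name, member in list(vars(cls).items()):
        if re.match(r"[A-Z]", name) and callable(member):
            alias = to_snake_case(name)
            setattr(cls, alias, member)
            added.append(alias)
    return sorted(added)


class Demo:
    """Hypothetical class with CamelCase methods only."""

    def Encode(self, text):
        return text.split()

    def LoadFromRuleTSV(self, path):
        return path


aliases = add_snake_case_aliases(Demo)
print(aliases)
```

A person writing the stub by hand would naturally type load_from_rule_tsv, which is why deriving names from the live class catches this kind of mismatch.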


@Deepak-png981

Contributor Author

> Could this PR resolve #1030? Thanks again.

Yes, that is the goal of this PR, and I added "fixes #1030" to the description so it closes on merge. Thank you for taking the time to review the PR. Looking forward to the next round of review.

@Deepak-png981 Deepak-png981 requested a review from taku910 June 8, 2026 08:56

@Deepak-png981 Deepak-png981 left a comment

Contributor Author

Before: [image]
After: [image]


taku910 commented Jun 11, 2026

Collaborator

We are currently migrating from SWIG to pybind11, and most of the routine conversion work is being handled by an AI (antigravity agent). As a test, simply instructing it to "add type annotations" allowed it to generate something almost identical to the contents of the PR. (This is not included in the PR yet.)

#1266

As with the code in question, even if it is automated, having a large amount of code dedicated to automation means that the automation code itself will require maintenance.

Sorry, but could you please remove the changes to .github/workflows? The code hasn't been fully tested yet.

I will merge the current PR, but please allow us to consider our approach further, including the AI's auto-generation capabilities. Please note that the code may be modified or deleted without prior notice. Thank you for your understanding.


taku910 commented Jun 11, 2026

Collaborator

We haven't decided for sure yet, but right now it feels like the most realistic approach is to just keep the test code and let the AI handle the generation. The agent will run the tests and fix things on its own.


Development

Successfully merging this pull request may close these issues.

No typings in Python package
