Add PEP 561 type stubs for the Python package by Deepak-png981 · Pull Request #1258 · google/sentencepiece

Deepak-png981 · 2026-06-01T19:53:19Z

Summary

The Python package is built with SWIG and exposes much of its user-facing API at import time (for example model_file on the constructor and decode as a snake_case alias). Static analyzers cannot infer that surface from the generated runtime code alone, which leads to false positives in Pylance, pyright, and mypy (#1030).

This change adds official typing support:

py.typed marker (PEP 561)
__init__.pyi describing the documented public API (now covering the new ParallelEncode/ParallelEncodeAs* API, the ThreadPool class and the thread_pool argument)
setuptools package-data updates so stubs ship inside wheels
tools/gen_stubs.py, which generates the stub by introspecting the imported module
a small typing_smoke.py script for local/CI type checking
a typing.yml GitHub Actions workflow that keeps the stub in sync and type-checks the public API
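The packaging item in the list above can be sanity-checked with a few lines of standard-library code. The sketch below is illustrative only: the helper name `missing_typing_files` and the assumed archive layout (`sentencepiece/py.typed` and `sentencepiece/__init__.pyi` at the package root) are assumptions, not something taken from this PR's build scripts.

```python
import io
import zipfile


def missing_typing_files(wheel_bytes: bytes, package: str) -> list[str]:
    """Return the PEP 561 files that are absent from a wheel archive."""
    # Assumed layout: marker and stub sit directly inside the package directory.
    required = [f"{package}/py.typed", f"{package}/__init__.pyi"]
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        present = set(wheel.namelist())
    return [name for name in required if name not in present]


# Build a tiny in-memory archive standing in for a real wheel.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as fake_wheel:
    fake_wheel.writestr("sentencepiece/__init__.py", "")
    fake_wheel.writestr("sentencepiece/py.typed", "")

missing = missing_typing_files(buffer.getvalue(), "sentencepiece")
print(missing)  # ['sentencepiece/__init__.pyi']
```

A real check would read the bytes of the built `.whl` file instead of the in-memory stand-in.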

Approach

Stubs describe the Python API users call (as defined in sentencepiece.i and the Python README), not every internal SWIG symbol. Runtime behavior is unchanged.

To address the maintenance/automation concern, tools/gen_stubs.py imports the built package and discovers the public method list from the live classes (so it cannot silently drift when a method or snake_case alias is added), then applies a small curated overlay for parameter types. Anything new but uncurated still gets a permissive (*args, **kwargs) -> Any stub (and is reported), so the stub is always complete. CI runs gen_stubs.py --check to fail the build if the committed stub is stale, plus scenario tests for the generator and pyright/mypy on the smoke test.
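The discover-then-overlay idea described above can be sketched roughly as follows. This is not the contents of tools/gen_stubs.py; `Processor`, `OVERLAY`, `FALLBACK` and `generate_stub` are hypothetical names used only to illustrate how a live class can drive stub generation while a small curated table supplies the precise types.

```python
import inspect


class Processor:
    """Toy stand-in for the imported runtime class (not the real API)."""

    def Encode(self, text): ...
    def Decode(self, ids): ...
    def NewThing(self, *args): ...  # newly added, not yet curated
    def _private(self): ...


# Hypothetical curated overlay: method name -> signature text.
OVERLAY = {
    "Encode": "(self, text: str) -> list[int]",
    "Decode": "(self, ids: list[int]) -> str",
}
# Permissive signature used for anything the overlay does not know about.
FALLBACK = "(self, *args: Any, **kwargs: Any) -> Any"


def generate_stub(cls, overlay):
    """Return (stub_text, uncurated_names) for the public methods of cls."""
    names = sorted(
        name
        for name, _ in inspect.getmembers(cls, callable)
        if not name.startswith("_")
    )
    uncurated = [name for name in names if name not in overlay]
    lines = ["from typing import Any", "", f"class {cls.__name__}:"]
    for name in names:
        lines.append(f"    def {name}{overlay.get(name, FALLBACK)}: ...")
    return "\n".join(lines) + "\n", uncurated


stub, uncurated = generate_stub(Processor, OVERLAY)
print(stub)
print("uncurated:", uncurated)
```

Because the method list comes from the class itself, `NewThing` appears in the output with the permissive signature and is reported as uncurated, rather than being silently omitted.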

Notes for reviewers

The stub is generated, not hand-edited. After changing the public Python API, rebuild the package and run:

python tools/gen_stubs.py --output src/sentencepiece/__init__.pyi

The CI --check step enforces that the committed stub matches a fresh generation.
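A drift check of this kind amounts to comparing the committed file with freshly generated text. The following is a minimal sketch under that assumption; `check_stub` is a hypothetical helper, and the actual `--check` implementation in this PR may differ.

```python
import difflib
import tempfile
from pathlib import Path


def check_stub(committed_path: Path, fresh_text: str) -> int:
    """Return 0 if the committed stub matches fresh_text, else print a diff and return 1."""
    committed = committed_path.read_text(encoding="utf-8") if committed_path.exists() else ""
    if committed == fresh_text:
        return 0
    diff = difflib.unified_diff(
        committed.splitlines(keepends=True),
        fresh_text.splitlines(keepends=True),
        fromfile=str(committed_path),
        tofile="freshly generated",
    )
    print("".join(diff))
    return 1


current = "class P:\n    def Encode(self) -> None: ...\n"
grown = current + "    def Decode(self) -> None: ...\n"

with tempfile.TemporaryDirectory() as tmp:
    stub_path = Path(tmp) / "__init__.pyi"
    stub_path.write_text(current, encoding="utf-8")
    up_to_date = check_stub(stub_path, current)  # committed stub is current
    stale = check_stub(stub_path, grown)         # runtime gained a method

print(up_to_date, stale)
```

Returning a non-zero status for the stale case is what lets a CI step fail the build.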

Deepak-png981

I initially hoped we could avoid hand-maintained stubs, either because SWIG would surface types automatically or because we could generate .pyi files when regenerating the wrapper. After reading the code path, that does not really work for this project. SWIG generates the C++ binding and a baseline __init__.py, but the API most users call (model_file on the constructor, encode/decode, etc.) is added in the Python layer of sentencepiece.i and at import time via setattr. Type checkers do not execute that logic, so they cannot infer the same surface you get at runtime.

Because of that, annotating the generated __init__.py is not a good fit either. It gets regenerated on SWIG updates, and it still would not describe the full public contract cleanly. Changing the SWIG pipeline to emit complete stubs would be a larger custom build step, not something this repo does today.

The approach in this PR is the usual one for native extension packages (PEP 561: py.typed plus __init__.pyi). Many open source libraries ship types this way when the runtime is C/C++ or SWIG/pybind-style bindings. We document the public Python API users actually import, ship it in the wheel, and leave runtime behavior unchanged. We are only making that contract visible to Pylance, mypy, and pyright.

The tradeoff is maintenance. __init__.pyi needs to stay in sync when the public interface changes in sentencepiece.i (new kwargs, renamed methods, new snake_case aliases, etc.). It does not need to track every internal C++ or wrapper detail, only the documented Python surface. If the API changes in a PR, the stubs should be updated in the same PR, ideally with typing_smoke.py or pyright in CI to catch drift.

Deepak-png981 · 2026-06-01T20:10:27Z

Please let me know if you would like any additional tests for this change. I added this file for a basic pyright check, and I am still getting familiar with the repo.

Deepak-png981 · 2026-06-01T20:11:20Z

py.typed is intentionally empty. Under PEP 561, this file is only a marker that tells type checkers this package ships type information. The actual types live in __init__.pyi.

Is it possible to add comments to this file, or must it be left entirely blank?

Yes, it can contain comments. Under PEP 561, type checkers only look for the file's presence in the package; the contents are ignored. I've added a short comment explaining what it is and pointing to __init__.pyi.

Deepak-png981

Hello @taku910, I'm not sure whether any action is required from my side to complete the pending GitHub Action. Please let me know if there's anything I need to do.

taku910 · 2026-06-08T02:41:36Z

Thank you for this change.

We've added a new Python API for parallel encoding. Could you update this PR?
We will not update the python API for the next release (v0.2.2)

One question: did you create this file manually or was it made by AI? We would like to make the process fully automated, so it would be nice to update the definitions automatically.

taku910 · 2026-06-08T02:43:00Z

Could this PR resolve #1030? Thanks again.

The public Python API is assembled at import time in sentencepiece.i (setattr, _add_snake_case, Tokenize = Encode), so a static parse cannot see it. tools/gen_stubs.py imports the built package and discovers the method list from the live classes, then applies a small curated type overlay. Unknown members get a permissive fallback and are reported, so the stub stays complete even when the API grows.

Run tools/gen_stubs.py against the v0.2.2 module so the stubs cover ParallelEncode/ParallelEncodeAs*, the new ThreadPool class and the thread_pool argument, plus Tokenize/Detokenize and the init alias. Regeneration also corrects load_from_rule_t_s_v (the hand-written stub had load_from_rule_tsv) and uses list[...] batch overloads to match the runtime type(arg) is list dispatch.
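The point about `list[...]` overloads can be illustrated with a toy function. The `encode` below is hypothetical and only mimics the exact-type dispatch described in the commit message; presumably the stub uses `list` rather than a broader sequence type because a tuple would not take the batch path at runtime.

```python
from typing import Union, overload


@overload
def encode(text: str) -> list[int]: ...
@overload
def encode(text: list[str]) -> list[list[int]]: ...


def encode(text: Union[str, list[str]]) -> Union[list[int], list[list[int]]]:
    # Exact-type dispatch: only a real list is treated as a batch.
    if type(text) is list:
        return [[ord(ch) for ch in item] for item in text]
    return [ord(ch) for ch in text]


single = encode("ab")         # checker sees list[int]
batch = encode(["ab", "c"])   # checker sees list[list[int]]
print(single, batch)  # [97, 98] [[97, 98], [99]]
```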

Exercises the generator under controlled mutations of the imported module: a new method is auto-discovered and flagged, a removed method drops out, new properties fall back to Any, leaked imports are ignored, and output is deterministic. Each case asserts that drift would be caught by the --check gate.
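A scenario test of this sort can be sketched by mutating a class and re-running discovery. The names here (`Processor`, `public_methods`) are hypothetical and stand in for the real generator and module.

```python
import inspect


class Processor:
    """Toy stand-in for the imported runtime class (not the real API)."""

    def Encode(self, text): ...


def public_methods(cls):
    """Sorted names of public callables found on the live class."""
    return sorted(
        name for name, _ in inspect.getmembers(cls, callable) if not name.startswith("_")
    )


before = public_methods(Processor)

# Simulate API growth: a method attached after class creation is still discovered.
setattr(Processor, "ParallelEncode", lambda self, texts: texts)
after_add = public_methods(Processor)

# Simulate removal: the name drops out of the next discovery pass.
delattr(Processor, "Encode")
after_remove = public_methods(Processor)

print(before, after_add, after_remove)
```

Any difference between `before` and a later pass is what a `--check` style comparison would surface as drift.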


Deepak-png981 · 2026-06-08T08:54:52Z

Thank you for this change.

We've added a new Python API for parallel encoding. Could you update this PR? We will not update the python API for the next release (v0.2.2)

One question: did you create this file manually or was it made by AI? We would like to make the process fully automated, so it would be nice to update the definitions automatically.

Initially I wrote the stub by hand, mostly to understand the codebase and how the public API is put together. But you are right that this should be automated, so now we generate it instead. I also rebased on the latest master so the stubs cover the new parallel encoding API (ParallelEncode / ParallelEncodeAs*, the ThreadPool class, and the thread_pool argument).

For the automation, tools/gen_stubs.py imports the built package and discovers the public method list from the live classes. That way it sees the surface that gets assembled at import time (setattr, _add_snake_case, Tokenize = Encode), which a static parse of sentencepiece.i cannot. It only keeps a small curated table for the parameter types, and anything new but uncurated still gets a permissive (*args, **kwargs) stub (and is reported), so the stub is always complete.

I added a typing.yml workflow that runs gen_stubs.py --check to fail the build if the committed stub drifts from the runtime, plus a few scenario tests for the generator and pyright/mypy on a smoke test. As a side effect, it already caught a bug in my earlier manual stub (load_from_rule_tsv vs. the real runtime name load_from_rule_t_s_v). So the flow now is: change the API, rebuild, run gen_stubs.py --output ..., and CI enforces that it stays in sync. I validated this workflow on my fork and it passes (build, --check, scenario tests, pyright, mypy). Link to the action: https://github.com/Deepak-png981/sentencepiece/actions/runs/27125948086/job/80054450339

Deepak-png981 · 2026-06-08T08:56:20Z

Could this PR resolve #1030? Thanks again.

Yes, that is the goal of this PR, and I added "fixes #1030" to the description so it closes on merge. Thank you for taking the time to review the PR. Looking forward to the next round of review.

taku910 · 2026-06-11T01:09:17Z

We are currently migrating from SWIG to pybind11, and most of the routine conversion work is being handled by an AI (antigravity agent). As a test, simply instructing it to "add type annotations" allowed it to generate something almost identical to the contents of the PR. (This is not included in the PR yet.)

#1266

As with the code in question, even if it is automated, having a large amount of code dedicated to automation means that the automation code itself will require maintenance.

Sorry, but could you please remove the changes to .github/workflows? The code hasn't been fully tested yet.

I will merge the current PR, but please allow us to consider our approach further, including the AI's auto-generation capabilities. Please note that the code may be modified or deleted without prior notice. Thank you for your understanding.

taku910 · 2026-06-11T01:13:14Z

We haven't decided for sure yet, but right now it feels like the most realistic approach is to just keep the test code and let the AI handle the generation. The agent will run the tests and fix things on its own.

Deepak-png981 added 4 commits June 2, 2026 01:22

Ship PEP 561 marker and include typing artifacts in wheels.

c9eb77a

Add py.typed and extend setuptools package-data so stub files are distributed with the sentencepiece package. This is the packaging foundation for static type checking support (fixes google#1030).

Add initial type stubs for SentencePieceProcessor.

8b7f97b

Introduce __init__.pyi covering the most common public entry points: model loading via model_file, Encode/Decode, and vocabulary helpers. This addresses the primary Pylance/mypy complaints in google#1030.

Expand stubs for trainer, normalizer, and snake_case aliases.

2a29bc2

Complete the public Python API surface documented in sentencepiece.i, including runtime-added methods like encode/decode and training helpers.

Add pyright smoke test for typed public API.

5eedfdc

Provide a small script that exercises model_file, encode, and decode so type checkers can validate the stub surface in CI or local development.

Deepak-png981 added 7 commits June 8, 2026 13:30

Merge remote-tracking branch 'upstream/master' into add-python-type-s…

e8205e9

…tubs

Exercise parallel encoding in the typing smoke test.

d7b8080

Wrap the checks in never-called functions so importing the module has no runtime side effects, and add coverage for parallel_encode, ThreadPool and the snake_case aliases.

Document the py.typed marker.

fa9c7ec

Add a short comment explaining that the file is a PEP 561 marker whose contents are ignored; the actual types live in __init__.pyi.

Add CI workflow to keep stubs in sync and type-check.

61ee556

Builds the package, runs gen_stubs.py --check to fail on stub drift, runs the generator scenario tests, then type-checks the public API smoke test with pyright and mypy.

Deepak-png981 requested a review from taku910 June 8, 2026 08:56
