chore(rust): bump tokenizers to 0.22.2 #103

Merged
tazarov merged 1 commit into main from codex/bump-tokenizers-0-22-2 on Mar 4, 2026

tazarov commented Mar 4, 2026

Contributor

Summary

  • bump Rust dependency tokenizers from 0.21.4 to 0.22.2
  • refresh Cargo.lock to capture resolved transitive updates
  • keep Rust FFI surface unchanged

Validation

  • cargo check
  • cargo test --lib
  • cargo test --all-targets
  • go test ./... -count=1

claude Bot commented Mar 4, 2026

Review: chore(rust): bump tokenizers to 0.22.2

Overall this is a clean, well-scoped dependency bump. A few notes:

Positive

  • Diff is minimal: only Cargo.toml and Cargo.lock — no FFI surface changes, which is what matters for ABI stability.
  • Transitive updates (console, indicatif → unit-prefix, portable-atomic, windows-sys 0.61.2) are all lock-file level and don't affect the public API.

Concern: API compatibility between 0.21.4 → 0.22.2 not explicitly verified

The PR description lists cargo check / cargo test as validation steps, which confirms the code compiles and unit tests pass. However, tokenizers minor-version bumps sometimes include behavioral changes (e.g., tokenization output differences for edge-case inputs). It would be good to confirm:

  1. The HuggingFace tokenizers 0.22.x changelog and release notes were reviewed for behavioral changes that could affect encoding or decoding output.
  2. Integration tests (e.g., make test or go test ./...) pass against a real model. The PR mentions that go test ./... -count=1 was run, which is good.
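
One way to make the first check mechanical is a golden-output comparison: record token IDs for a fixed set of inputs before the bump, then compare them against the output after the bump. The sketch below is a minimal illustration and is not taken from this repository. `GoldenCase`, `Mismatch`, `check_golden`, and `toy_encode` are hypothetical names, and `toy_encode` is only a stand-in so the example runs without a model file. In practice the closure would wrap the real tokenizer's encode call.

```rust
/// One golden case: an input and the token IDs recorded before the bump.
/// (Hypothetical type, for illustration only.)
pub struct GoldenCase {
    pub input: &'static str,
    pub expected_ids: Vec<u32>,
}

/// A single divergence between the recorded and the current output.
#[derive(Debug, PartialEq)]
pub struct Mismatch {
    pub input: String,
    pub expected: Vec<u32>,
    pub actual: Vec<u32>,
}

/// Runs `encode` over every golden case and returns the cases whose
/// output no longer matches the recorded IDs. An empty result means
/// no behavioral change was observed for these inputs.
pub fn check_golden<F>(encode: F, cases: &[GoldenCase]) -> Vec<Mismatch>
where
    F: Fn(&str) -> Vec<u32>,
{
    cases
        .iter()
        .filter_map(|case| {
            let actual = encode(case.input);
            if actual == case.expected_ids {
                None
            } else {
                Some(Mismatch {
                    input: case.input.to_string(),
                    expected: case.expected_ids.clone(),
                    actual,
                })
            }
        })
        .collect()
}

/// Stand-in encoder (NOT a real tokenizer): maps each whitespace-separated
/// word to its byte length, so the sketch is runnable on its own.
pub fn toy_encode(text: &str) -> Vec<u32> {
    text.split_whitespace().map(|w| w.len() as u32).collect()
}
```

Edge-case inputs (empty strings, mixed scripts, long runs of whitespace) are the most useful entries in such a golden set, since those are where output differences between versions tend to show up.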

Minor pre-existing issue (not blocking this PR)

cbindgen = "0.29.0" in [dependencies] should be in [build-dependencies], since it's a code-generation tool, not a runtime dependency. As a regular dependency it adds build time and dependency weight that consumers don't need, and may inflate the binary. Worth a follow-up issue.
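
For illustration, the suggested follow-up would amount to a manifest change along these lines. This is a sketch that assumes cbindgen is invoked only from a build script and never from library code. Only the version string comes from this review, and the rest of the manifest is omitted.

```toml
# Cargo.toml (fragment, illustrative only)

# Before: cbindgen listed alongside runtime dependencies
# [dependencies]
# cbindgen = "0.29.0"

# After: cbindgen available to build.rs only
[build-dependencies]
cbindgen = "0.29.0"
```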

Overall: looks good to merge assuming CI passes and the changelog was reviewed.

github-actions Bot commented Mar 4, 2026

Benchmark Comparison

goos: linux
goarch: amd64
pkg: github.com/amikos-tech/pure-tokenizers
cpu: AMD EPYC 7763 64-Core Processor                
                                  │ base_bench.txt │            pr_bench.txt            │
                                  │     sec/op     │    sec/op     vs base              │
Encode/Short-4                        9.442µ ± ∞ ¹   9.188µ ± ∞ ¹  -2.69% (p=0.008 n=5)
Encode/Medium-4                       42.70µ ± ∞ ¹   43.10µ ± ∞ ¹       ~ (p=0.690 n=5)
Encode/Long-4                         332.3µ ± ∞ ¹   327.5µ ± ∞ ¹       ~ (p=0.222 n=5)
EncodeWithOptions/Default-4           41.77µ ± ∞ ¹   42.33µ ± ∞ ¹       ~ (p=0.151 n=5)
EncodeWithOptions/WithTypeIDs-4       42.86µ ± ∞ ¹   42.74µ ± ∞ ¹       ~ (p=0.690 n=5)
EncodeWithOptions/WithTokens-4        42.26µ ± ∞ ¹   42.95µ ± ∞ ¹  +1.64% (p=0.008 n=5)
EncodeWithOptions/WithOffsets-4       42.73µ ± ∞ ¹   43.12µ ± ∞ ¹       ~ (p=0.310 n=5)
EncodeWithOptions/AllOptions-4        44.38µ ± ∞ ¹   46.01µ ± ∞ ¹  +3.67% (p=0.008 n=5)
Decode/WithSpecialTokens-4            19.23µ ± ∞ ¹   18.06µ ± ∞ ¹       ~ (p=0.151 n=5)
Decode/SkipSpecialTokens-4            19.25µ ± ∞ ¹   17.87µ ± ∞ ¹  -7.17% (p=0.016 n=5)
BatchEncode-4                         433.7µ ± ∞ ¹   436.8µ ± ∞ ¹       ~ (p=0.690 n=5)
FromHuggingFace/CreationOnly-4        35.51m ± ∞ ¹   35.56m ± ∞ ¹       ~ (p=0.841 n=5)
FromHuggingFace/FullLifecycle-4       36.05m ± ∞ ¹   36.02m ± ∞ ¹       ~ (p=1.000 n=5)
VocabSize-4                           3.122m ± ∞ ¹   3.197m ± ∞ ¹       ~ (p=0.151 n=5)
EncodeDecode/Short-4                  14.14µ ± ∞ ¹   14.07µ ± ∞ ¹       ~ (p=0.310 n=5)
EncodeDecode/Medium-4                 65.76µ ± ∞ ¹   63.66µ ± ∞ ¹       ~ (p=0.310 n=5)
EncodeDecode/Long-4                   475.0µ ± ∞ ¹   467.0µ ± ∞ ¹       ~ (p=0.548 n=5)
Truncation-4                          328.4µ ± ∞ ¹   337.6µ ± ∞ ¹       ~ (p=0.310 n=5)
Padding-4                             120.6µ ± ∞ ¹   118.5µ ± ∞ ¹       ~ (p=0.151 n=5)
ConcurrentCacheRead-4                 5.762µ ± ∞ ¹   5.817µ ± ∞ ¹       ~ (p=0.286 n=5)
ConcurrentCacheValidation-4           6.607µ ± ∞ ¹   6.640µ ± ∞ ¹       ~ (p=0.056 n=5)
ConcurrentHFCacheLookup-4             10.75µ ± ∞ ¹   10.75µ ± ∞ ¹       ~ (p=0.841 n=5)
DownloadWithFailureRecovery-4          1.070 ± ∞ ¹    1.105 ± ∞ ¹       ~ (p=0.841 n=5)
ConcurrentDownloadsWithFailures-4     45.01m ± ∞ ¹   45.34m ± ∞ ¹       ~ (p=0.690 n=5)
LoadFromCache-4                       13.83µ ± ∞ ¹   13.59µ ± ∞ ¹  -1.72% (p=0.008 n=5)
FromHuggingFaceWithoutCache-4         160.4µ ± ∞ ¹   165.4µ ± ∞ ¹  +3.17% (p=0.008 n=5)
geomean                               169.7µ         169.5µ        -0.16%
¹ need >= 6 samples for confidence interval at level 0.95

                                  │ base_bench.txt │             pr_bench.txt              │
                                  │      B/op      │     B/op       vs base                │
Encode/Short-4                         920.0 ± ∞ ¹     920.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Medium-4                      1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Long-4                        6.703Ki ± ∞ ¹   6.703Ki ± ∞ ¹       ~ (p=1.000 n=5)
EncodeWithOptions/Default-4          1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTypeIDs-4      1.609Ki ± ∞ ¹   1.609Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTokens-4       1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithOffsets-4      1.703Ki ± ∞ ¹   1.703Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/AllOptions-4       2.109Ki ± ∞ ¹   2.109Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/WithSpecialTokens-4             740.0 ± ∞ ¹     740.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/SkipSpecialTokens-4             740.0 ± ∞ ¹     740.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
BatchEncode-4                        11.30Ki ± ∞ ¹   11.30Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFace/CreationOnly-4       6.128Mi ± ∞ ¹   6.151Mi ± ∞ ¹       ~ (p=0.056 n=5)
FromHuggingFace/FullLifecycle-4      6.142Mi ± ∞ ¹   6.138Mi ± ∞ ¹       ~ (p=0.548 n=5)
VocabSize-4                            288.0 ± ∞ ¹     288.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Short-4                 1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Medium-4                2.242Ki ± ∞ ¹   2.242Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Long-4                  8.430Ki ± ∞ ¹   8.430Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Truncation-4                         5.500Ki ± ∞ ¹   5.500Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Padding-4                            15.89Ki ± ∞ ¹   15.89Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheRead-4                2.062Ki ± ∞ ¹   2.062Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheValidation-4          3.023Ki ± ∞ ¹   3.023Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentHFCacheLookup-4            3.195Ki ± ∞ ¹   3.195Ki ± ∞ ¹       ~ (p=1.000 n=5)
DownloadWithFailureRecovery-4        60.02Ki ± ∞ ¹   62.77Ki ± ∞ ¹       ~ (p=0.548 n=5)
ConcurrentDownloadsWithFailures-4    18.98Ki ± ∞ ¹   18.98Ki ± ∞ ¹       ~ (p=0.397 n=5)
LoadFromCache-4                      1.695Ki ± ∞ ¹   1.695Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFaceWithoutCache-4        16.36Ki ± ∞ ¹   16.36Ki ± ∞ ¹       ~ (p=0.984 n=5)
geomean                              5.426Ki         5.436Ki        +0.18%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                  │ base_bench.txt │             pr_bench.txt             │
                                  │   allocs/op    │  allocs/op    vs base                │
Encode/Short-4                         16.00 ± ∞ ¹    16.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Medium-4                        35.00 ± ∞ ¹    35.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Long-4                          165.0 ± ∞ ¹    165.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/Default-4            35.00 ± ∞ ¹    35.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTypeIDs-4        36.00 ± ∞ ¹    36.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTokens-4         35.00 ± ∞ ¹    35.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithOffsets-4        36.00 ± ∞ ¹    36.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/AllOptions-4         41.00 ± ∞ ¹    41.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/WithSpecialTokens-4             10.00 ± ∞ ¹    10.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/SkipSpecialTokens-4             10.00 ± ∞ ¹    10.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
BatchEncode-4                          261.0 ± ∞ ¹    261.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFace/CreationOnly-4        92.20k ± ∞ ¹   92.20k ± ∞ ¹       ~ (p=0.333 n=5)
FromHuggingFace/FullLifecycle-4       92.21k ± ∞ ¹   92.21k ± ∞ ¹       ~ (p=0.921 n=5)
VocabSize-4                            5.000 ± ∞ ¹    5.000 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Short-4                   26.00 ± ∞ ¹    26.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Medium-4                  45.00 ± ∞ ¹    45.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Long-4                    175.0 ± ∞ ¹    175.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Truncation-4                           127.0 ± ∞ ¹    127.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Padding-4                              535.0 ± ∞ ¹    535.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheRead-4                  25.00 ± ∞ ¹    25.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheValidation-4            43.00 ± ∞ ¹    43.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentHFCacheLookup-4              39.00 ± ∞ ¹    39.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
DownloadWithFailureRecovery-4          456.0 ± ∞ ¹    469.0 ± ∞ ¹       ~ (p=0.421 n=5)
ConcurrentDownloadsWithFailures-4      233.0 ± ∞ ¹    233.0 ± ∞ ¹       ~ (p=0.444 n=5)
LoadFromCache-4                        7.000 ± ∞ ¹    7.000 ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFaceWithoutCache-4          219.0 ± ∞ ¹    219.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
geomean                                89.93          90.03        +0.11%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
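
To help read the "vs base" column, the sketch below reproduces the arithmetic behind it. It rests on two assumptions, neither stated in the output itself: that a delta is printed only when p is below a significance level of 0.05 (otherwise "~" is shown), and that the p-values come from an exact two-sided rank-sum (Mann-Whitney U) test. The function names are hypothetical.

```rust
/// Relative change of `pr` against `base`, in percent.
pub fn percent_delta(base: f64, pr: f64) -> f64 {
    (pr - base) / base * 100.0
}

/// Renders a "vs base" cell: the signed delta when p < alpha, otherwise "~".
/// Assumption: significance is a plain `p < alpha` comparison.
pub fn vs_base(base: f64, pr: f64, p: f64, alpha: f64) -> String {
    if p < alpha {
        format!("{:+.2}%", percent_delta(base, pr))
    } else {
        "~".to_string()
    }
}

/// Binomial coefficient C(n, k), computed iteratively so intermediate
/// values stay small and every division is exact.
fn binomial(n: u64, k: u64) -> u64 {
    let mut c = 1u64;
    for i in 0..k {
        c = c * (n - i) / (i + 1);
    }
    c
}

/// Smallest two-sided p-value an exact rank-sum test can report for
/// samples of size `n` and `m` without ties: 2 / C(n + m, n).
pub fn min_two_sided_p(n: u64, m: u64) -> f64 {
    2.0 / binomial(n + m, n) as f64
}
```

Under these assumptions, `vs_base(9.442, 9.188, 0.008, 0.05)` yields `-2.69%`, matching the Encode/Short row, while Decode/WithSpecialTokens (p=0.151) renders as `~` despite a visible gap between the medians. With n=5 per side, `min_two_sided_p(5, 5)` is 2/252, about 0.008, so every row showing p=0.008 is already at the smallest p-value this sample size can produce. Together with the "need >= 6 samples" footnote, that suggests treating the small significant deltas here with caution.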

tazarov merged commit 02a3420 into main on Mar 4, 2026
17 checks passed
tazarov deleted the codex/bump-tokenizers-0-22-2 branch on March 4, 2026 20:59

1 participant