Update tokenizers crate to 0.22.2 #27

Merged
dfalbel merged 8 commits into main from gelatinous-fisherman
Apr 16, 2026

dfalbel commented Apr 15, 2026

Summary

  • Bumps the Hugging Face tokenizers Rust dependency from 0.20.3 to 0.22.2
  • Updates 62 transitive dependencies (notably thiserror 1.x→2.x, rand 0.8→0.9, indicatif 0.17→0.18)
  • Re-vendored dependencies — tarball shrank from ~10MB to ~7.7MB due to removed transitive deps (lazy_static, old windows-targets sub-crates, etc.)
  • No Rust source code changes were needed — existing code compiles cleanly against 0.22.2
  • extendr-api stays at 0.8.1 (already latest)

Test plan

  • CI passes on all platforms (Linux, macOS, Windows)
  • Basic tokenizer encode/decode works
  • Training workflows still work (BPE, WordPiece, Unigram)

dfalbel added 8 commits April 15, 2026 15:36

Bumps the Hugging Face tokenizers Rust dependency to the latest version.
No source code changes were needed. Re-vendored dependencies (tarball
shrank from ~10MB to ~7.7MB due to removed transitive deps).

Use COPYFILE_DISABLE=1 and --no-xattrs when creating the vendor tarball
to prevent macOS xattr metadata (com.apple.provenance) from being
embedded, which causes warnings when extracted with GNU tar on Linux CI.
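
The tarball step described above can be sketched roughly as follows. This is only an illustration, not the repository's actual vendoring script: the helper name `vendor_tar_command`, the `vendor` directory, the `vendor.tar.xz` file name, and the xz compression flag are all assumptions. Only `COPYFILE_DISABLE=1` and `--no-xattrs` come from the commit message.

```python
import os


def vendor_tar_command(vendor_dir="vendor", tarball="vendor.tar.xz"):
    """Return (argv, env) for archiving vendored crates without macOS xattrs.

    Both default paths are hypothetical placeholders.
    """
    env = dict(os.environ)
    # COPYFILE_DISABLE=1 tells macOS tar not to copy extended attributes
    # or resource forks into the archive.
    env["COPYFILE_DISABLE"] = "1"
    # --no-xattrs keeps attributes such as com.apple.provenance out of the
    # archive, so GNU tar on Linux has nothing to warn about on extraction.
    argv = ["tar", "--no-xattrs", "-cJf", tarball, vendor_dir]
    return argv, env
```

Passing the returned pair to `subprocess.run(argv, env=env, check=True)` would create the archive. Building the command separately from running it keeps the flags easy to inspect.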

These WASI-only crates are never compiled for our supported targets
(Linux, macOS, Windows) but Cargo still parses their manifests when
using vendored sources. wit-bindgen v0.51.0 uses edition 2024 which
requires Cargo >= 1.85, breaking offline builds with older toolchains.

wit-bindgen v0.51.0 uses edition 2024, which Cargo < 1.85 cannot parse.
Since it is never compiled for our targets (only needed for WASI),
patch its Cargo.toml to edition 2021 after vendoring so older Cargo
can read the manifest without error.
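
A minimal sketch of such a post-vendoring patch is shown below, assuming the manifest declares the edition on a plain `edition = "2024"` line. The function names are hypothetical and this is not the script used in the PR. Vendored crates also carry a `.cargo-checksum.json` that may need attention after editing a manifest. The commit message does not say how that was handled, so it is left out here.

```python
import re
from pathlib import Path

# Matches a line such as: edition = "2024"
EDITION_RE = re.compile(r'^([ \t]*edition[ \t]*=[ \t]*)"2024"', flags=re.MULTILINE)


def downgrade_edition(manifest_text: str) -> str:
    """Rewrite `edition = "2024"` to `edition = "2021"` in Cargo.toml text."""
    return EDITION_RE.sub(r'\g<1>"2021"', manifest_text)


def patch_vendored_manifest(manifest_path: Path) -> bool:
    """Patch one vendored Cargo.toml in place; return True if it changed."""
    original = manifest_path.read_text(encoding="utf-8")
    patched = downgrade_edition(original)
    if patched == original:
        return False  # already on an older edition, nothing to do
    manifest_path.write_text(patched, encoding="utf-8")
    return True
```

This is only safe because the crate is never compiled for the supported targets; older Cargo merely has to parse the manifest.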

Remove the .Rbuildignore exclusion for tests/testthat/_snaps so that
snapshot tests work correctly during R CMD check.

unicode-segmentation 1.13.2 requires rustc 1.85+. Pin to 1.12.0 to
maintain compatibility with the MSRV of 1.81 specified in Cargo.toml.
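
The reasoning behind the pin is a plain version comparison, sketched here with hypothetical helper names. The version numbers are the ones quoted in the commit message.

```python
def parse_version(text):
    """Turn '1.85' or '1.85.0' into a comparable (major, minor, patch) tuple."""
    parts = [int(piece) for piece in text.split(".")]
    parts += [0] * (3 - len(parts))  # pad a missing minor/patch with zeros
    return tuple(parts)


def msrv_satisfies(msrv, required_rustc):
    """True if a toolchain at `msrv` can build a crate needing `required_rustc`."""
    return parse_version(msrv) >= parse_version(required_rustc)


# The package MSRV (1.81) is below what unicode-segmentation 1.13.2 needs (1.85),
# so that release cannot be used without raising the MSRV.
needs_pin = not msrv_satisfies("1.81", "1.85")
```

One common way to apply such a pin to Cargo.lock is `cargo update -p unicode-segmentation --precise 1.12.0`; the commit message does not say which mechanism was used.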

CRAN's Debian testing has rustc 1.92, so an MSRV of 1.91 is safe. This removes the
need to pin unicode-segmentation (1.13.2 requires 1.85+) and avoids
other MSRV-related compilation issues with tokenizers 0.22.2.

Match the MSRV update in Cargo.toml and DESCRIPTION.

dfalbel merged commit 9d1c297 into main on Apr 16, 2026
5 checks passed