A Swift implementation of HuggingFace tokenizers using a Rust -> C -> Swift bridge.
This is an experimental implementation. For anything beyond experimentation, use the battle-tested tokenizer from swift-transformers.
In contrast to the tokenizer in swift-transformers, this implementation uses the original Rust Tokenizers library as its core. cbindgen generates C headers from the Rust code, and those headers are then imported into Swift.
Rust (Core Tokenizer) -> C (Bridge) -> Swift (API)
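
To make the bridge concrete, here is a minimal sketch of what the Rust side of such a layer can look like. It is an illustration only: `toy_count_tokens`, `toy_normalize`, and `toy_free_string` are hypothetical names, not functions from tokenizers-sys, and the whitespace "tokenizer" is a stand-in for the real Tokenizers core. The sketch uses only the standard library.

```rust
use std::ffi::{c_char, CStr, CString};

// NOTE: all three functions are hypothetical stand-ins, not the tokenizers-sys API.
// `#[no_mangle]` keeps the symbol name stable so a C header can refer to it.
// (The Rust 2024 edition spells this attribute `#[unsafe(no_mangle)]`.)

/// Counts whitespace-separated tokens in a NUL-terminated UTF-8 string.
/// Returns -1 if `text` is null or not valid UTF-8.
#[no_mangle]
pub extern "C" fn toy_count_tokens(text: *const c_char) -> i32 {
    if text.is_null() {
        return -1;
    }
    // SAFETY: the caller promises `text` points to a valid NUL-terminated string.
    let c_str = unsafe { CStr::from_ptr(text) };
    match c_str.to_str() {
        Ok(s) => s.split_whitespace().count() as i32,
        Err(_) => -1,
    }
}

/// Lowercases the input and returns a newly allocated C string,
/// or null on invalid input. The caller owns the result and must
/// release it with `toy_free_string`.
#[no_mangle]
pub extern "C" fn toy_normalize(text: *const c_char) -> *mut c_char {
    if text.is_null() {
        return std::ptr::null_mut();
    }
    // SAFETY: same contract as in `toy_count_tokens`.
    let lowered = unsafe { CStr::from_ptr(text) }
        .to_str()
        .ok()
        .map(|s| s.to_lowercase());
    match lowered.and_then(|s| CString::new(s).ok()) {
        Some(out) => out.into_raw(), // hand ownership to the caller
        None => std::ptr::null_mut(),
    }
}

/// Frees a string previously returned by `toy_normalize`.
#[no_mangle]
pub extern "C" fn toy_free_string(ptr: *mut c_char) {
    if !ptr.is_null() {
        // SAFETY: `ptr` must come from `CString::into_raw` in `toy_normalize`.
        unsafe { drop(CString::from_raw(ptr)) };
    }
}
```

For functions like these, cbindgen would emit C declarations roughly of the form `int32_t toy_count_tokens(const char *text);`, which Swift can then import through a module map. The explicit free function reflects a common FFI design choice: memory allocated by Rust is released by Rust, so the Swift wrapper would call it once it has copied the result into a `String`.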
This currently works only on ARM Macs. Building for other platforms requires manual adjustments (for example, producing a .so instead of a .dylib).
- Create a parent directory and cd into it.
- Clone tokenizers-sys.
- Clone swift-tokenizers.
- cd tokenizers-sys
- Run ./compile-ex.sh.
- Check that ./target/release/libtokenizers_sys.dylib exists.
- cd ..
- cp ./tokenizers-sys/target/release/libtokenizers_sys.dylib ./swift-tokenizers/dependencies/libtokenizers_sys.dylib
- cd swift-tokenizers
- swift build
- swift test
- 😎
func NLLBTokenizer() async throws {
    let tokenizer = try Tokenizer.fromPretrained(name: "facebook/nllb-200-distilled-600M")
    let encoding = try tokenizer.encode("how much wood could a woodchuck chuck?")
    print(encoding.ids)
    let decoded = try tokenizer.decode(encoding.ids)
    print(decoded)
}

Right now this just links a dylib compiled from tokenizers-sys, so resolving packaging for all platforms is another step to take.
- Pass 100% of swift-transformers Tokenizer tests
- C API won't expose async, so we may want to use the Hub package and avoid using fromPretrained from the Rust package.
- Cross-platform packaging
- Drop-in replacement for swift-transformers tokenizer
- Implement Chat Templates