tokenizers-scala

Scala bindings for Hugging Face Tokenizers, a tokenization library written in Rust.

Usage

import io.brunk.tokenizers.Tokenizer

val tokenizer = Tokenizer.fromPretrained("bert-base-cased")
val encoding = tokenizer.encode("Hello, y'all! How are you 😁 ?", addSpecialTokens=true)
println(encoding.length)
// 13
println(encoding.ids)
// ArraySeq(101, 8667, 117, 194, 112, 1155, 106, 1731, 1132, 1128, 100, 136, 102)
println(encoding.tokens)
// ArraySeq([CLS], Hello, ,, y, ', all, !, How, are, you, [UNK], ?, [SEP])
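
Since `ids` and `tokens` are parallel sequences, it can help to view them side by side. The following is a minimal sketch that uses only the calls shown above (`Tokenizer.fromPretrained`, `encode`, `ids`, `tokens`, `length`) plus the standard collection method `zip`. The name `pairs` is introduced here purely for illustration.

```scala
import io.brunk.tokenizers.Tokenizer

val tokenizer = Tokenizer.fromPretrained("bert-base-cased")
val encoding = tokenizer.encode("Hello, y'all! How are you 😁 ?", addSpecialTokens = true)

// Pair each token with the id at the same position.
// `pairs` is an illustrative name, not part of the library.
val pairs = encoding.tokens.zip(encoding.ids)

// Print one "id<TAB>token" line per position.
pairs.foreach { case (token, id) => println(s"$id\t$token") }
```

Going by the output listed above, the first pair should be `[CLS]` with id 101 and the last `[SEP]` with id 102. The emoji maps to `[UNK]` (id 100), presumably because it is not in the `bert-base-cased` vocabulary.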

Installation

sbt

libraryDependencies += "io.brunk.tokenizers" %% "tokenizers" % "<version>"

Scala CLI

//> using lib "io.brunk.tokenizers::tokenizers:<version>"

Others

Copy coordinates from Maven Central for Scala 2.13 or Scala 3.

Status

Currently, we can only load and run pre-trained tokenizers. Training is not yet possible.

How to build the project

  1. Install bleep
  2. Install Rust and Cargo
  3. Compile and run the tests:
       bleep compile
       bleep test
