
Additional support for T5 Tokenizer - SentencepieceTokenizer #828

Open
@r4ghu

Description

Hi team,
I would like to request some additional features for the T5Tokenizer / SentencepieceTokenizer. I was able to convert the HuggingFace T5 tokenizer to ONNX format using the following code:

import numpy as np
from transformers import T5TokenizerFast
from onnxruntime_extensions import gen_processing_models, get_library_path
import onnxruntime as ort

# Initialize the tokenizer
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
text = "Translate this English sentence to French: Hello, how are you?"
input_ids = tokenizer.encode(text, return_tensors="np")

# Create the ONNX graphs for the tokenizer
# ort_tokenizer - Model to perform tokenization from string input to tokenized output
# ort_decoder - Model to perform decoding from tokenized input to string output
ort_tokenizer, ort_decoder = gen_processing_models(tokenizer, pre_kwargs={'CAST_TOKEN_ID': True}, post_kwargs={})

# Save the ONNX graphs
with open("tokenizer.onnx", "wb") as f:
    f.write(ort_tokenizer.SerializeToString())

with open("tokenizer_decoder.onnx", "wb") as f:
    f.write(ort_decoder.SerializeToString())

# Run inference with the ONNX models
session_options = ort.SessionOptions()
session_options.register_custom_ops_library(get_library_path())
tokenizer_session = ort.InferenceSession("tokenizer.onnx", sess_options=session_options)
decoder_session = ort.InferenceSession("tokenizer_decoder.onnx", sess_options=session_options)

# Tokenize the input
actual_ids = tokenizer_session.run(None, {'inputs': [text]})[0]
np.testing.assert_array_equal(input_ids[0], actual_ids)
print("Actual IDs:", actual_ids)

# Decode the tokenized input
output = decoder_session.run(None, {'ids': actual_ids})[0]
print("Decoded sentence:", output)
assert output == text

So far, the tokenizer works great when I pass in normal sentences. But when I add sentinel tokens (e.g. <extra_id_0>) to my input sentence, the tokenizer's behavior differs from the HuggingFace tokenizer's. Could you please add support for sentinel tokens to SentencepieceTokenizer? If this can already be achieved with a workaround on top of the existing logic, I would like to know, since it would let me simplify the preprocessing I currently do to handle sentinel tokens.
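
For reference, here is a minimal sketch of the mismatch, reusing the sessions built above. T5's sentinel tokens are <extra_id_0> through <extra_id_99>; the HuggingFace fast tokenizer treats them as atomic special tokens, while the exported SentencepieceTokenizer runs plain SentencePiece over the raw string, so the sentinels get split into subwords. The ids mentioned below are from t5-small and are illustrative only.

# Sentinel tokens diverge between the two tokenizers
sentinel_text = "The <extra_id_0> walks in <extra_id_1> park"

hf_ids = tokenizer.encode(sentinel_text, return_tensors="np")[0]
ort_ids = tokenizer_session.run(None, {'inputs': [sentinel_text]})[0]

# HuggingFace maps each sentinel to a single id (e.g. <extra_id_0> -> 32099 for t5-small),
# whereas the ONNX graph tokenizes the literal characters of "<extra_id_0>" into subwords.
print("HF :", hf_ids)
print("ORT:", ort_ids)

The workaround I currently have in mind (and would love to replace with native support) is to split the input on sentinel tokens, run the ONNX tokenizer on each plain-text piece, and splice the precomputed sentinel ids back in. A rough sketch, assuming the exported graph appends EOS on every call (it does in my setup, matching tokenizer.encode) and ignoring the whitespace subtleties SentencePiece has around split boundaries:

import re

# Precompute the sentinel ids once, offline, while the HF tokenizer is still available.
SENTINEL_IDS = {f"<extra_id_{i}>": tokenizer.convert_tokens_to_ids(f"<extra_id_{i}>")
                for i in range(100)}
EOS_ID = tokenizer.eos_token_id  # 1 for t5-small
SENTINEL_RE = re.compile(r"(<extra_id_\d+>)")

def encode_with_sentinels(text):
    # Tokenize `text`, treating <extra_id_N> as atomic tokens. Piece-wise
    # tokenization may not match HuggingFace byte-for-byte at the boundaries.
    ids = []
    for piece in SENTINEL_RE.split(text):
        if not piece:
            continue
        if piece in SENTINEL_IDS:
            ids.append(SENTINEL_IDS[piece])
        else:
            piece_ids = tokenizer_session.run(None, {'inputs': [piece.strip()]})[0]
            ids.extend(int(i) for i in piece_ids[:-1])  # drop the per-piece EOS
    ids.append(EOS_ID)
    return np.array(ids, dtype=np.int64)

print(encode_with_sentinels(sentinel_text))

This mostly works for simple inputs, but it is exactly the kind of preprocessing I would like to drop if SentencepieceTokenizer supported sentinel tokens directly.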
