Additional support for T5 Tokenizer - SentencepieceTokenizer #828
Open
Description
Hi team,
I would like to request additional features for the T5Tokenizer / SentencepieceTokenizer. I was able to convert the HuggingFace T5 tokenizer to ONNX format using the following code:
import numpy as np
from transformers import T5TokenizerFast
from onnxruntime_extensions import gen_processing_models, get_library_path
import onnxruntime as ort
# Initialize the tokenizer
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
text = "Translate this English sentence to French: Hello, how are you?"
input_ids = tokenizer.encode(text, return_tensors="np")
# Create the ONNX graphs for the tokenizer
# ort_tokenizer - Model to perform tokenization from string input to tokenized output
# ort_decoder - Model to perform decoding from tokenized input to string output
ort_tokenizer, ort_decoder = gen_processing_models(tokenizer, pre_kwargs={'CAST_TOKEN_ID': True}, post_kwargs={})
# Save the ONNX graphs
with open("tokenizer.onnx", "wb") as f:
    f.write(ort_tokenizer.SerializeToString())
with open("tokenizer_decoder.onnx", "wb") as f:
    f.write(ort_decoder.SerializeToString())
# Run inference with the ONNX models
session_options = ort.SessionOptions()
session_options.register_custom_ops_library(get_library_path())
tokenizer_session = ort.InferenceSession("tokenizer.onnx", sess_options=session_options)
decoder_session = ort.InferenceSession("tokenizer_decoder.onnx", sess_options=session_options)
# Tokenize the input
actual_ids = tokenizer_session.run(None, {'inputs':[text]})[0]
np.testing.assert_array_equal(input_ids[0], actual_ids)
print("Actual IDs:", actual_ids)
# Decode the tokenized input
output = decoder_session.run(None, {'ids': actual_ids})[0]
print("Decoded sentence:", output)
assert output == text
So far, the tokenizer works well when I pass normal sentences. But when I add sentinel tokens (the <extra_id_0> ... <extra_id_99> tokens) to the input sentence, the ONNX tokenizer's behavior diverges from the HuggingFace tokenizer's. Could you please add support for sentinel tokens to SentencepieceTokenizer? If this can already be achieved with a workaround on top of the existing logic, I would like to know, since it would simplify the preprocessing I currently do to handle sentinel tokens.
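For reference, here is a minimal reproduction of the mismatch, reusing the tokenizer and sessions from the code above (the input string is illustrative; the exact IDs the ONNX graph produces for the sentinel characters will depend on how it splits them):

# HuggingFace collapses each <extra_id_i> into a single sentinel ID
# (32099 for <extra_id_0> on t5-small), while the ONNX tokenizer
# appears to treat the sentinel string as ordinary text, so this
# assertion fails.
sentinel_text = "The <extra_id_0> jumps over the <extra_id_1> dog."
hf_ids = tokenizer.encode(sentinel_text, return_tensors="np")[0]
ort_ids = tokenizer_session.run(None, {'inputs': [sentinel_text]})[0]
np.testing.assert_array_equal(hf_ids, ort_ids)

In case it is useful to others, the workaround I have in mind looks roughly like the sketch below. encode_with_sentinels is a hypothetical helper of my own, not part of onnxruntime-extensions; it assumes the standard HF T5 convention that <extra_id_i> maps to id (vocab_size - 1 - i), that 32100 / EOS id 1 are the t5-small values, and that the generated graph appends a single EOS to every call (which the passing assertion above suggests):

import re

SENTINEL_RE = re.compile(r"<extra_id_(\d+)>")

def encode_with_sentinels(text, session, vocab_size=32100, eos_id=1):
    # Split the input on sentinel tokens; because of the capturing
    # group, odd-indexed pieces are the sentinel indices themselves.
    ids = []
    for i, span in enumerate(SENTINEL_RE.split(text)):
        if i % 2 == 1:
            # Sentinel index -> id under the HF T5 convention.
            ids.append(vocab_size - 1 - int(span))
        elif span.strip():
            # Tokenize the plain-text span with the ONNX graph and
            # drop the trailing EOS it appends.
            span_ids = [int(t) for t in session.run(None, {'inputs': [span.strip()]})[0]]
            if span_ids and span_ids[-1] == eos_id:
                span_ids = span_ids[:-1]
            ids.extend(span_ids)
    ids.append(eos_id)  # re-append a single EOS at the very end
    return np.array(ids, dtype=np.int64)

The catch is that splitting and stripping whitespace around the splice points can itself change the SentencePiece segmentation at span boundaries, which is exactly why native sentinel support inside the SentencepieceTokenizer op would be cleaner.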