
Cleanup SentencePiece tokenizer #7427

Open
wants to merge 1 commit into main

Conversation

tarekgh
Member

@tarekgh tarekgh commented Mar 19, 2025

Removing the SentencePiece tokenizer creation with options, as it is not needed for now. Also, addressing an older review comment on #7409 (review): check the special tokens before storing them in the vocabulary reverse array.

@Copilot Copilot bot review requested due to automatic review settings March 19, 2025 22:25
@tarekgh tarekgh added this to the ML.NET 5.0 milestone Mar 19, 2025


Pull Request Overview

This PR cleans up the SentencePiece tokenizer implementation by removing the JSON-based tokenizer creation and its associated tests, and by adding a validation check to ensure that the special tokens exist in the vocabulary. Key changes include:

  • Removal of the CreateUnigramTokenizerFromJson method and corresponding tests.
  • Addition of a validation check in the SentencePieceUnigramModel to ensure the BOS, EOS, and UNK token IDs are within bounds.
  • Removal of the constructors that take SentencePieceOptions and the related SentencePieceOptions file.
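The bounds check described in the second bullet can be sketched as follows. This is an illustrative Python sketch of the validation logic, not the PR's actual C# code; the function and parameter names (`validate_special_token_ids`, `pieces_count`, etc.) are assumptions for illustration:

```python
def validate_special_token_ids(pieces_count, bos_id, eos_id, unk_id):
    """Ensure the BOS, EOS, and UNK token ids are valid indexes into the
    vocabulary pieces before they are used to populate a reverse-lookup array.

    Illustrative sketch only: the real check lives in C# inside
    SentencePieceUnigramModel; these names are not the actual API.
    """
    for name, token_id in (("BOS", bos_id), ("EOS", eos_id), ("UNK", unk_id)):
        if token_id < 0 or token_id >= pieces_count:
            raise ValueError(
                f"{name} id {token_id} is out of range "
                f"for vocabulary of size {pieces_count}"
            )
```

Without such a check, an out-of-range id from the model proto would only surface later as an index error when writing into the reverse vocabulary array; validating up front turns it into a clear failure at load time.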

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Summary per file:

  • test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs: Removed creation and tests for the outdated JSON tokenizer variant.
  • src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs: Added a bounds check for special token indexes before storing them in the vocabulary reverse array.
  • src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs: Removed the constructor overload that created the tokenizer from options.
  • src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs: Removed the constructor that used SentencePieceOptions.
  • src/Microsoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs: Removed the constructor accepting SentencePieceOptions.
  • src/Microsoft.ML.Tokenizers/Model/SentencePieceOptions.cs: Removed the unused SentencePieceOptions file.
Comments suppressed due to low confidence (3)

test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs:21

  • [nitpick] Removal of the JSON tokenizer tests is intentional; please ensure that the remaining tests fully cover the expected behavior of the tokenizer with special tokens.
private static SentencePieceTokenizer _unigramTokenizerFromJson = CreateUnigramTokenizerFromJson();

src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs:31

  • The added check for BOS, EOS, and UNK tokens is critical; ensure that the condition correctly reflects the expected bounds for modelProto.Pieces to prevent runtime errors.
if (modelProto.TrainerSpec.BosId >= modelProto.Pieces.Count ||

src/Microsoft.ML.Tokenizers/Model/SentencePieceOptions.cs:1

  • Since the SentencePieceOptions file has been removed, verify that all documentation and external references have been updated accordingly to prevent integration issues.
// Licensed to the .NET Foundation under one or more agreements.
@tarekgh tarekgh requested a review from michaelgsharp March 19, 2025 22:33
@tarekgh
Member Author

tarekgh commented Mar 19, 2025

CC @ericstj


codecov bot commented Mar 19, 2025

Codecov Report

Attention: Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 68.96%. Comparing base (d4f690c) to head (66f201e).

Files with missing lines Patch % Lines
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs 40.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7427      +/-   ##
==========================================
- Coverage   69.00%   68.96%   -0.04%     
==========================================
  Files        1483     1482       -1     
  Lines      274563   274198     -365     
  Branches    28395    28347      -48     
==========================================
- Hits       189455   189103     -352     
- Misses      77672    77681       +9     
+ Partials     7436     7414      -22     
Flag Coverage Δ
Debug 68.96% <40.00%> (-0.04%) ⬇️
production 63.25% <40.00%> (-0.03%) ⬇️
test 89.46% <ø> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...soft.ML.Tokenizers/Model/SentencePieceBaseModel.cs 79.64% <ø> (+1.25%) ⬆️
...osoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs 77.43% <ø> (+2.92%) ⬆️
...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs 90.00% <ø> (+4.73%) ⬆️
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs 92.96% <ø> (-1.14%) ⬇️
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs 60.04% <40.00%> (-5.85%) ⬇️

... and 8 files with indirect coverage changes

