
Cleanup SentencePiece tokenizer #7427

Open
wants to merge 1 commit into main

Conversation

tarekgh
Member

@tarekgh tarekgh commented Mar 19, 2025

Removing the SentencePiece tokenizer creation with options, as it is not needed for now. Also, addressing an older review comment on #7409 (review): check the special tokens before storing them in the vocabulary reverse array.

@Copilot Copilot bot review requested due to automatic review settings March 19, 2025 22:25
@tarekgh tarekgh added this to the ML.NET 5.0 milestone Mar 19, 2025


Pull Request Overview

This PR cleans up the SentencePiece tokenizer implementation by removing the JSON-based tokenizer creation and its associated tests, and by adding a validation check to ensure that the special tokens exist in the vocabulary. Key changes include:

  • Removal of the CreateUnigramTokenizerFromJson method and corresponding tests.
  • Addition of a validation check in the SentencePieceUnigramModel to ensure the BOS, EOS, and UNK token IDs are within bounds.
  • Removal of the constructors that take SentencePieceOptions and the related SentencePieceOptions file.
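The bounds check described in the second bullet can be sketched as follows. This is an illustrative Python sketch of the validation logic, not the PR's actual C# code; the function and parameter names (`validate_special_token_ids`, `pieces_count`, etc.) are assumptions for illustration:

```python
def validate_special_token_ids(pieces_count, bos_id, eos_id, unk_id):
    """Ensure the BOS, EOS, and UNK token ids are valid indexes into the
    vocabulary pieces before they are used to populate a reverse-lookup array.

    Illustrative sketch only: the real check lives in C# inside
    SentencePieceUnigramModel; these names are not the actual API.
    """
    for name, token_id in (("BOS", bos_id), ("EOS", eos_id), ("UNK", unk_id)):
        if token_id < 0 or token_id >= pieces_count:
            raise ValueError(
                f"{name} id {token_id} is out of range "
                f"for vocabulary of size {pieces_count}"
            )
```

Without such a check, an out-of-range id from the model proto would only surface later as an index error when writing into the reverse vocabulary array; validating up front turns it into a clear failure at load time.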

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Summary per file:

  • test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs: Removed creation and tests for the outdated JSON tokenizer variant.
  • src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs: Added a bounds check for special token indexes before storing them in the vocabulary reverse array.
  • src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs: Removed the constructor overload that created the tokenizer from options.
  • src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs: Removed the constructor that used SentencePieceOptions.
  • src/Microsoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs: Removed the constructor accepting SentencePieceOptions.
  • src/Microsoft.ML.Tokenizers/Model/SentencePieceOptions.cs: Removed the unused SentencePieceOptions file.
Comments suppressed due to low confidence (3)

test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs:21

  • [nitpick] Removal of the JSON tokenizer tests is intentional; please ensure that the remaining tests fully cover the expected behavior of the tokenizer with special tokens.
private static SentencePieceTokenizer _unigramTokenizerFromJson = CreateUnigramTokenizerFromJson();

src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs:31

  • The added check for BOS, EOS, and UNK tokens is critical; ensure that the condition correctly reflects the expected bounds for modelProto.Pieces to prevent runtime errors.
if (modelProto.TrainerSpec.BosId >= modelProto.Pieces.Count ||

src/Microsoft.ML.Tokenizers/Model/SentencePieceOptions.cs:1

  • Since the SentencePieceOptions file has been removed, verify that all documentation and external references have been updated accordingly to prevent integration issues.
// Licensed to the .NET Foundation under one or more agreements.
@tarekgh tarekgh requested a review from michaelgsharp March 19, 2025 22:33
@tarekgh
Member Author

tarekgh commented Mar 19, 2025

CC @ericstj


codecov bot commented Mar 19, 2025

Codecov Report

Attention: Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 68.96%. Comparing base (d4f690c) to head (66f201e).

Files with missing lines Patch % Lines
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs 40.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7427      +/-   ##
==========================================
- Coverage   69.00%   68.96%   -0.04%     
==========================================
  Files        1483     1482       -1     
  Lines      274563   274198     -365     
  Branches    28395    28347      -48     
==========================================
- Hits       189455   189103     -352     
- Misses      77672    77681       +9     
+ Partials     7436     7414      -22     
Flag Coverage Δ
Debug 68.96% <40.00%> (-0.04%) ⬇️
production 63.25% <40.00%> (-0.03%) ⬇️
test 89.46% <ø> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...soft.ML.Tokenizers/Model/SentencePieceBaseModel.cs 79.64% <ø> (+1.25%) ⬆️
...osoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs 77.43% <ø> (+2.92%) ⬆️
...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs 90.00% <ø> (+4.73%) ⬆️
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs 92.96% <ø> (-1.14%) ⬇️
...t.ML.Tokenizers/Model/SentencePieceUnigramModel.cs 60.04% <40.00%> (-5.85%) ⬇️

... and 8 files with indirect coverage changes

