Cleanup SentencePiece tokenizer #7427
base: main
Conversation
Pull Request Overview
This PR cleans up the SentencePiece tokenizer implementation by removing the JSON-based tokenizer creation and its associated tests, and by adding a validation check to ensure that the special tokens exist in the vocabulary. Key changes include:
- Removal of the CreateUnigramTokenizerFromJson method and corresponding tests.
- Addition of a validation check in the SentencePieceUnigramModel to ensure the BOS, EOS, and UNK token IDs are within bounds (a sketch of this check follows the list).
- Removal of the constructors that take SentencePieceOptions and the related SentencePieceOptions file.
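The validation mentioned in the second bullet can be pictured with a minimal sketch. This is not the actual implementation in SentencePieceUnigramModel; the parameter names simply mirror the protobuf fields (BosId, EosId, UnkId, Pieces.Count) quoted in the review snippet further down, and the exception type and message are assumptions.

```csharp
using System;

internal static class SentencePieceValidation
{
    // Hedged sketch: ensure the special token ids can index the vocabulary before
    // they are used to populate the reverse (id -> token) lookup. The parameter
    // names mirror the protobuf fields quoted in the review; the exception type
    // and message are assumptions, not the library's actual behavior.
    public static void EnsureSpecialTokenIdsInRange(int bosId, int eosId, int unkId, int piecesCount)
    {
        if (bosId >= piecesCount || eosId >= piecesCount || unkId >= piecesCount)
        {
            throw new ArgumentException(
                $"Special token ids (BOS={bosId}, EOS={eosId}, UNK={unkId}) must be smaller than the vocabulary size {piecesCount}.");
        }
    }
}
```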
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | Removed creation and tests for the outdated JSON tokenizer variant. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | Added a bounds check for special token indexes before storing them in the vocabulary reverse array. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | Removed the constructor overload that created the tokenizer from options (see the usage sketch after this table). |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | Removed the constructor that used SentencePieceOptions. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceBpeModel.cs | Removed the constructor accepting SentencePieceOptions. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceOptions.cs | Removed the unused SentencePieceOptions file. |
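With the options-based constructors removed, creating the tokenizer from a SentencePiece model stream remains the supported path. The sketch below assumes the SentencePieceTokenizer.Create(Stream, ...) factory and Tokenizer.EncodeToIds from Microsoft.ML.Tokenizers; verify the exact overloads against the current API surface before relying on it.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

internal static class SentencePieceUsageExample
{
    // Hedged usage sketch: load a SentencePiece .model (protobuf) file and encode text.
    // The Create overload and its optional parameters are assumed from the library's
    // public surface; check the shipped API if the signature has changed.
    public static IReadOnlyList<int> Encode(string modelPath, string text)
    {
        using Stream modelStream = File.OpenRead(modelPath);

        SentencePieceTokenizer tokenizer = SentencePieceTokenizer.Create(
            modelStream,
            addBeginOfSentence: true,
            addEndOfSentence: false);

        return tokenizer.EncodeToIds(text);
    }
}
```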
Comments suppressed due to low confidence (3)
test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs:21
- [nitpick] Removal of the JSON tokenizer tests is intentional; please ensure that the remaining tests fully cover the expected behavior of the tokenizer with special tokens.
private static SentencePieceTokenizer _unigramTokenizerFromJson = CreateUnigramTokenizerFromJson();
src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs:31
- The added check for BOS, EOS, and UNK tokens is critical; ensure that the condition correctly reflects the expected bounds for modelProto.Pieces to prevent runtime errors.
if (modelProto.TrainerSpec.BosId >= modelProto.Pieces.Count ||
src/Microsoft.ML.Tokenizers/Model/SentencePieceOptions.cs:1
- Since the SentencePieceOptions file has been removed, verify that all documentation and external references have been updated accordingly to prevent integration issues.
// Licensed to the .NET Foundation under one or more agreements.
CC @ericstj
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #7427 +/- ##
==========================================
- Coverage 69.00% 68.96% -0.04%
==========================================
Files 1483 1482 -1
Lines 274563 274198 -365
Branches 28395 28347 -48
==========================================
- Hits 189455 189103 -352
- Misses 77672 77681 +9
+ Partials 7436 7414 -22
Removing the options-based SentencePiece tokenizer creation, as it is not needed for now. Also addressing an older comment #7409 (review) by checking the special tokens before storing them in the vocabulary reverse array.
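For readers skimming the description, here is a minimal, hypothetical illustration of the check-before-store pattern it refers to: an id is validated against the bounds of the reverse (id -> token) array before being used as an index. The type and member names are illustrative only and do not match the actual fields in SentencePieceUnigramModel.

```csharp
using System;

internal sealed class VocabularyReverseMap
{
    // Illustrative reverse lookup: maps a token id to its text.
    private readonly string[] _vocabReverse;

    public VocabularyReverseMap(int vocabSize) => _vocabReverse = new string[vocabSize];

    // Hedged sketch of the check-before-store pattern: reject out-of-range ids
    // up front instead of letting the array indexer throw later with a less
    // useful error.
    public void StoreSpecialToken(int id, string token)
    {
        if ((uint)id >= (uint)_vocabReverse.Length)
        {
            throw new ArgumentOutOfRangeException(
                nameof(id), $"Token id {id} is outside the vocabulary of size {_vocabReverse.Length}.");
        }

        _vocabReverse[id] = token;
    }
}
```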