Clean up protocol requirements: remove redundant overloads and defaults #10
Merged

DePasqualeOrg merged 1 commit into main on Mar 4, 2026
Move convenience methods (callAsFunction, bulk wrappers, short-parameter versions) from protocol requirements to extensions across Tokenizer, TokenizingModel, Normalizer, Decoder, PostProcessor, and PreTokenizer. Derive bosTokenId/eosTokenId/unknownTokenId from token strings on the Tokenizer protocol. Remove redundant default parameter values from all concrete implementations.
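The derivation of the special-token IDs can be sketched as a protocol extension. This is a minimal illustration, not the library's actual declarations: the protocol shape is simplified, and property names such as `bosToken` are assumptions.

```swift
// Simplified sketch of deriving special-token IDs from token strings.
// Property names (bosToken, etc.) are assumed for illustration.
protocol Tokenizer {
    var bosToken: String? { get }
    var eosToken: String? { get }
    var unknownToken: String? { get }
    func convertTokenToId(_ token: String) -> Int?
}

extension Tokenizer {
    // Derived on demand from the token string rather than stored,
    // mirroring how Python transformers resolves special-token IDs.
    var bosTokenId: Int? { bosToken.flatMap { convertTokenToId($0) } }
    var eosTokenId: Int? { eosToken.flatMap { convertTokenToId($0) } }
    var unknownTokenId: Int? { unknownToken.flatMap { convertTokenToId($0) } }
}
```

Because the IDs live in an extension, conformers no longer need to store or forward them; they stay consistent with the token strings automatically.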
Multiple protocols had `callAsFunction` and bulk-wrapper methods as protocol requirements even though they were pure forwarders that no conformer overrides. This PR moves them to extensions and removes redundant default parameter values from concrete implementations.

### Design principle

Each operation should have exactly one protocol requirement (the full-parameter version). Convenience entry points – short-parameter versions, `callAsFunction`, and bulk wrappers – belong in protocol extensions.

This aligns with the Python `tokenizers` library, where the internal components (`Normalizer`, `PreTokenizer`, `Decoder`, `PostProcessor`) each define a single core method. `__call__` is not part of the protocol interface – it's a convenience alias, just as `callAsFunction` is a convenience extension in Swift.

For the main `Tokenizer` protocol, the design also aligns with Python `transformers`:

- `convertTokenToId` / `convertIdToToken` are protocol requirements, matching Python's abstract methods.
- `convertTokensToIds` / `convertIdsToTokens` are derived from the single-item versions. Python does this via type checking at runtime; Swift does it via protocol extensions at compile time.
- `encode` and `decode` are the core operations, with the full-parameter version as the requirement and short-parameter versions as extension conveniences.

No conformer in the codebase overrides any of the methods moved to extensions.
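The pattern can be shown with a minimal, self-contained example. The names `Normalizing` and `LowercaseNormalizer` are illustrative, not the library's:

```swift
protocol Normalizing {
    // The single core requirement every conformer must implement.
    func normalize(text: String) -> String
}

extension Normalizing {
    // Convenience alias, analogous to Python's __call__. Because it
    // lives in an extension, it is statically dispatched and no
    // conformer needs to re-declare it.
    func callAsFunction(text: String) -> String {
        normalize(text: text)
    }
}

struct LowercaseNormalizer: Normalizing {
    // Only the core method is implemented; callAsFunction comes free.
    func normalize(text: String) -> String {
        text.lowercased()
    }
}

let normalizer = LowercaseNormalizer()
let a = normalizer.normalize(text: "Hello")  // "hello"
let b = normalizer(text: "Hello")            // same result via callAsFunction
```

Since nothing overrides the convenience entry points, keeping them out of the requirement list shrinks the conformance surface without changing behavior.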
### Changes

#### `Tokenizer` protocol

Requirements removed (moved to extensions):

- `encode(text:) -> [Int]` – calls `encode(text:addSpecialTokens: true)`
- `callAsFunction(_:addSpecialTokens:) -> [Int]` – calls `encode(text:addSpecialTokens:)`
- `decode(tokens:) -> String` – calls `decode(tokens:skipSpecialTokens: false)`
- `convertTokensToIds(_:) -> [Int?]` – maps over `convertTokenToId(_:)`
- `convertIdsToTokens(_:) -> [String?]` – maps over `convertIdToToken(_:)`
- `bosTokenId` / `eosTokenId` / `unknownTokenId` – now derived from the token string via `convertTokenToId`, matching how Python `transformers` derives them on demand rather than storing them

#### `TokenizingModel` protocol

Requirements removed (moved to extensions):

- `callAsFunction(_:) -> [String]` – calls `tokenize(text:)`
- `convertTokensToIds(_:) -> [Int?]` – maps over `convertTokenToId(_:)`
- `convertIdsToTokens(_:) -> [String?]` – maps over `convertIdToToken(_:)`

#### `Normalizer` protocol

Requirements removed (moved to extensions):

- `callAsFunction(text:) -> String` – calls `normalize(text:)`

#### `Decoder` protocol

Requirements removed (moved to extensions):

- `callAsFunction(tokens:) -> [String]` – calls `decode(tokens:)`

#### `PostProcessor` protocol

Requirements removed (moved to extensions):

- `callAsFunction(tokens:tokensPair:addSpecialTokens:) -> [String]` – calls `postProcess(...)`

#### `PreTokenizer` protocol

Requirements removed (moved to extensions):

- `preTokenize(texts:options:) -> [String]` – flatMaps over `preTokenize(text:options:)`
- `callAsFunction(texts:options:) -> [String]` – calls `preTokenize(texts:options:)`
- `callAsFunction(text:options:) -> [String]` – calls `preTokenize(text:options:)`

#### Redundant defaults removed from concrete implementations

Defaults now live in protocol extensions only. Removed from:

- `PreTrainedTokenizer.encode(text:addSpecialTokens:)` – had `= true`
- `PreTrainedTokenizer.decode(tokens:skipSpecialTokens:)` – had `= false`
- `PreTrainedTokenizer.applyChatTemplate(...)` – had all six defaults repeated
- `PostProcessor` conformers (`TemplateProcessing`, `ByteLevelPostProcessor`, `RobertaProcessing`, `BertProcessing`, `SequenceProcessing`) – had `tokensPair: nil` and `addSpecialTokens: true` defaults on `postProcess`
- `PreTokenizer` conformers (`BertPreTokenizer`, `PreTokenizerSequence`, `WhitespacePreTokenizer`, `MetaspacePreTokenizer`, `ByteLevelPreTokenizer`, `PunctuationPreTokenizer`, `DigitsPreTokenizer`, `SplitPreTokenizer`) – had an `options: [.firstSection]` default on `preTokenize`

#### `PreTrainedTokenizer`: redundant forwarders removed

- `encode(text:) -> [Int]` – just called `encode(text:addSpecialTokens: true)`
- `bosTokenId` / `eosTokenId` / `unknownTokenId` – forwarded to `model.*`, now derived by the extension
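The `PreTokenizer` bulk wrappers and `callAsFunction` conveniences above can be sketched as a self-contained extension. The `PreTokenizerOptions` option set is simplified here, and the exact signatures are assumptions based on the list above:

```swift
// Simplified stand-in for the library's pre-tokenizer options.
struct PreTokenizerOptions: OptionSet {
    let rawValue: Int
    static let firstSection = PreTokenizerOptions(rawValue: 1 << 0)
}

protocol PreTokenizer {
    // The single requirement: pre-tokenize one text.
    func preTokenize(text: String, options: PreTokenizerOptions) -> [String]
}

extension PreTokenizer {
    // Bulk version derived from the single-text requirement; the
    // [.firstSection] default lives here, not on conformers.
    func preTokenize(texts: [String], options: PreTokenizerOptions = [.firstSection]) -> [String] {
        texts.flatMap { preTokenize(text: $0, options: options) }
    }

    // callAsFunction conveniences forward to preTokenize.
    func callAsFunction(texts: [String], options: PreTokenizerOptions = [.firstSection]) -> [String] {
        preTokenize(texts: texts, options: options)
    }

    func callAsFunction(text: String, options: PreTokenizerOptions = [.firstSection]) -> [String] {
        preTokenize(text: text, options: options)
    }
}
```

Centralizing the default in the extension is what makes the per-conformer `options: [.firstSection]` defaults redundant and safe to delete.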
**Note:** `TokenizingModel` keeps the ID properties as stored requirements, since the model layer uses them on hot paths (e.g., `convertTokenToId` falls back to `unknownTokenId`, and `UnigramTokenizer` reads `unknownTokenId` from config as the primary identifier).
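The hot-path argument can be illustrated with a hypothetical conformer. `VocabModel` and its `vocab` dictionary are invented for this sketch; only the protocol requirements mirror the ones described above:

```swift
protocol TokenizingModel {
    // Still a requirement (typically a stored property on conformers)
    // because it is read on every unknown-token lookup.
    var unknownTokenId: Int? { get }
    func tokenize(text: String) -> [String]
    func convertTokenToId(_ token: String) -> Int?
}

// Hypothetical conformer illustrating the fallback.
struct VocabModel: TokenizingModel {
    let vocab: [String: Int]
    let unknownTokenId: Int?  // stored once, read on every vocab miss

    func tokenize(text: String) -> [String] {
        text.split(separator: " ").map(String.init)
    }

    func convertTokenToId(_ token: String) -> Int? {
        // Falling back to a stored ID is cheaper than re-deriving it
        // from the token string on each call.
        vocab[token] ?? unknownTokenId
    }
}
```

This is why the derivation applied to `Tokenizer` is deliberately not extended down to the model layer.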