
Clean up protocol requirements: remove redundant overloads and defaults#10

Merged
DePasqualeOrg merged 1 commit into main from clean-up-protocol-requirements
Mar 4, 2026

Conversation

@DePasqualeOrg (Owner)

Multiple protocols listed callAsFunction and bulk-wrapper methods as protocol requirements even though they were pure forwarders that no conformer overrides. This PR moves them to extensions and removes redundant default parameter values from concrete implementations.

Design principle

Each operation should have exactly one protocol requirement (the full-parameter version). Convenience entry points – short-parameter versions, callAsFunction, and bulk wrappers – belong in protocol extensions.

This aligns with the Python tokenizers library, where the internal components (Normalizer, PreTokenizer, Decoder, PostProcessor) each define a single core method. __call__ is not part of the protocol interface – it's a convenience alias, just as callAsFunction is a convenience extension in Swift.
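The pattern can be sketched as follows. This is a simplified illustration, not the library's exact code; the `Lowercase` conformer is hypothetical:

```swift
// One core requirement; conveniences live in the extension.
protocol Normalizer {
    func normalize(text: String) -> String  // the single protocol requirement
}

extension Normalizer {
    // Convenience alias, analogous to Python's __call__.
    func callAsFunction(text: String) -> String {
        normalize(text: text)
    }
}

// Hypothetical conformer: implements only the requirement,
// and gets the callable form for free from the extension.
struct Lowercase: Normalizer {
    func normalize(text: String) -> String { text.lowercased() }
}

let n = Lowercase()
assert(n(text: "Hello World") == "hello world")
```

Because the extension method is not a requirement, conformers cannot accidentally shadow it with a divergent implementation; there is exactly one customization point.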

For the main Tokenizer protocol, the design also aligns with Python transformers:

  • Single-item convertTokenToId / convertIdToToken are protocol requirements, matching Python's abstract methods.
  • Bulk convertTokensToIds / convertIdsToTokens are derived from the single-item versions. Python does this via type checking at runtime; Swift does it via protocol extensions at compile time.
  • encode and decode are the core operations, with the full-parameter version as the requirement and short-parameter versions as extension conveniences.
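The single-item/bulk relationship can be sketched as below. Signatures are simplified and `ToyTokenizer` is a made-up conformer for illustration:

```swift
protocol Tokenizer {
    // Single-item requirements, matching Python's abstract methods.
    func convertTokenToId(_ token: String) -> Int?
    func convertIdToToken(_ id: Int) -> String?
}

extension Tokenizer {
    // Bulk wrappers are pure maps over the single-item requirements.
    func convertTokensToIds(_ tokens: [String]) -> [Int?] {
        tokens.map { convertTokenToId($0) }
    }
    func convertIdsToTokens(_ ids: [Int]) -> [String?] {
        ids.map { convertIdToToken($0) }
    }
}

// Illustrative conformer with a two-entry vocabulary.
struct ToyTokenizer: Tokenizer {
    let vocab = ["a": 0, "b": 1]
    func convertTokenToId(_ token: String) -> Int? { vocab[token] }
    func convertIdToToken(_ id: Int) -> String? {
        vocab.first { $0.value == id }?.key
    }
}

let t = ToyTokenizer()
assert(t.convertTokensToIds(["a", "x", "b"]) == [0, nil, 1])
```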

No conformer in the codebase overrides any of the methods moved to extensions.

Changes

Tokenizer protocol

Requirements removed (moved to extensions):

  • encode(text:) -> [Int] – calls encode(text:addSpecialTokens: true)
  • callAsFunction(_:addSpecialTokens:) -> [Int] – calls encode(text:addSpecialTokens:)
  • decode(tokens:) -> String – calls decode(tokens:skipSpecialTokens: false)
  • convertTokensToIds(_:) -> [Int?] – maps over convertTokenToId(_:)
  • convertIdsToTokens(_:) -> [String?] – maps over convertIdToToken(_:)
  • bosTokenId / eosTokenId / unknownTokenId – now derived from the token string via convertTokenToId, matching how Python transformers derives them on demand rather than storing them
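The two kinds of derivation above (forwarding with a fixed default, and computing an ID from a token string) look roughly like this sketch. The protocol is heavily simplified and `MiniTokenizer` is illustrative:

```swift
protocol Tokenizer {
    func encode(text: String, addSpecialTokens: Bool) -> [Int]
    func convertTokenToId(_ token: String) -> Int?
    var bosToken: String? { get }
}

extension Tokenizer {
    // Short-parameter convenience: forwards with the default.
    func encode(text: String) -> [Int] {
        encode(text: text, addSpecialTokens: true)
    }
    // Derived on demand from the token string, not stored.
    var bosTokenId: Int? {
        guard let bosToken else { return nil }
        return convertTokenToId(bosToken)
    }
}

// Illustrative conformer; splits on spaces and prepends the BOS id.
struct MiniTokenizer: Tokenizer {
    let vocab = ["<s>": 0, "hello": 1]
    var bosToken: String? { "<s>" }
    func convertTokenToId(_ token: String) -> Int? { vocab[token] }
    func encode(text: String, addSpecialTokens: Bool) -> [Int] {
        let ids = text.split(separator: " ").compactMap { convertTokenToId(String($0)) }
        return addSpecialTokens ? [0] + ids : ids
    }
}

let tok = MiniTokenizer()
assert(tok.encode(text: "hello") == [0, 1])
assert(tok.bosTokenId == 0)
```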

TokenizingModel protocol

Requirements removed (moved to extensions):

  • callAsFunction(_:) -> [String] – calls tokenize(text:)
  • convertTokensToIds(_:) -> [Int?] – maps over convertTokenToId(_:)
  • convertIdsToTokens(_:) -> [String?] – maps over convertIdToToken(_:)

Normalizer protocol

Requirements removed (moved to extensions):

  • callAsFunction(text:) -> String – calls normalize(text:)

Decoder protocol

Requirements removed (moved to extensions):

  • callAsFunction(tokens:) -> [String] – calls decode(tokens:)

PostProcessor protocol

Requirements removed (moved to extensions):

  • callAsFunction(tokens:tokensPair:addSpecialTokens:) -> [String] – calls postProcess(...)

PreTokenizer protocol

Requirements removed (moved to extensions):

  • preTokenize(texts:options:) -> [String] – flatMaps over preTokenize(text:options:)
  • callAsFunction(texts:options:) -> [String] – calls preTokenize(texts:options:)
  • callAsFunction(text:options:) -> [String] – calls preTokenize(text:options:)
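Sketched out, the bulk version is a flatMap over the single-text requirement, and both callAsFunction forms forward to it. The `PreTokenizerOptions` stand-in and the `WhitespaceSplit` conformer are illustrative, not the library's exact types:

```swift
// Minimal stand-in for the library's options type.
struct PreTokenizerOptions: OptionSet {
    let rawValue: Int
    static let firstSection = PreTokenizerOptions(rawValue: 1 << 0)
}

protocol PreTokenizer {
    // The single requirement: one text in, fragments out.
    func preTokenize(text: String, options: PreTokenizerOptions) -> [String]
}

extension PreTokenizer {
    // Bulk version: flatMap over the single-text requirement.
    func preTokenize(texts: [String], options: PreTokenizerOptions) -> [String] {
        texts.flatMap { preTokenize(text: $0, options: options) }
    }
    // callAsFunction conveniences forward, supplying the default options.
    func callAsFunction(text: String, options: PreTokenizerOptions = [.firstSection]) -> [String] {
        preTokenize(text: text, options: options)
    }
    func callAsFunction(texts: [String], options: PreTokenizerOptions = [.firstSection]) -> [String] {
        preTokenize(texts: texts, options: options)
    }
}

// Illustrative conformer: splits each text on spaces.
struct WhitespaceSplit: PreTokenizer {
    func preTokenize(text: String, options: PreTokenizerOptions) -> [String] {
        text.split(separator: " ").map(String.init)
    }
}

let p = WhitespaceSplit()
assert(p(texts: ["a b", "c"]) == ["a", "b", "c"])
```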

Redundant defaults removed from concrete implementations

Defaults now live in protocol extensions only. Removed from:

  • PreTrainedTokenizer.encode(text:addSpecialTokens:) – had = true
  • PreTrainedTokenizer.decode(tokens:skipSpecialTokens:) – had = false
  • PreTrainedTokenizer.applyChatTemplate(...) – had all six defaults repeated
  • All PostProcessor conformers (TemplateProcessing, ByteLevelPostProcessor, RobertaProcessing, BertProcessing, SequenceProcessing) – had tokensPair: nil and addSpecialTokens: true defaults on postProcess
  • All PreTokenizer conformers (BertPreTokenizer, PreTokenizerSequence, WhitespacePreTokenizer, MetaspacePreTokenizer, ByteLevelPreTokenizer, PunctuationPreTokenizer, DigitsPreTokenizer, SplitPreTokenizer) – had options: [.firstSection] default on preTokenize
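The resulting shape is that each default value is written in exactly one place. Swift forbids default arguments on protocol requirements, so they can only live on the extension conveniences; conformers implement the bare requirement. A sketch, with a hypothetical `BertLike` conformer:

```swift
protocol PostProcessor {
    // The requirement carries no defaults (Swift disallows them here anyway).
    func postProcess(tokens: [String], tokensPair: [String]?, addSpecialTokens: Bool) -> [String]
}

extension PostProcessor {
    // Defaults live here, once, on the convenience entry point.
    func callAsFunction(tokens: [String], tokensPair: [String]? = nil,
                        addSpecialTokens: Bool = true) -> [String] {
        postProcess(tokens: tokens, tokensPair: tokensPair, addSpecialTokens: addSpecialTokens)
    }
}

// Conformer implements the requirement without repeating the defaults.
struct BertLike: PostProcessor {
    func postProcess(tokens: [String], tokensPair: [String]?, addSpecialTokens: Bool) -> [String] {
        guard addSpecialTokens else { return tokens }
        return ["[CLS]"] + tokens + ["[SEP]"] + (tokensPair.map { $0 + ["[SEP]"] } ?? [])
    }
}

let pp = BertLike()
assert(pp(tokens: ["hi"]) == ["[CLS]", "hi", "[SEP]"])
```

If a conformer repeated the defaults on `postProcess`, callers typed to the concrete type could silently bypass the extension and pick up a divergent default; removing them closes that gap.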

PreTrainedTokenizer redundant forwarders removed

  • encode(text:) -> [Int] – just called encode(text:addSpecialTokens: true)
  • bosTokenId / eosTokenId / unknownTokenId – forwarded to model.*, now derived by the extension

Note: TokenizingModel keeps the ID properties as stored requirements, since the model layer uses them on hot paths (e.g., convertTokenToId falls back to unknownTokenId, and UnigramTokenizer reads unknownTokenId from config as the primary identifier).
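That hot-path fallback looks roughly like this (simplified; `ToyModel` is illustrative):

```swift
protocol TokenizingModel {
    // Kept as a stored requirement: read on the hot path for every unknown token.
    var unknownTokenId: Int? { get }
    func convertTokenToId(_ token: String) -> Int?
}

// Illustrative conformer: falls back to the stored ID
// without an extra string-to-ID lookup per miss.
struct ToyModel: TokenizingModel {
    let vocab: [String: Int]
    let unknownTokenId: Int?
    func convertTokenToId(_ token: String) -> Int? {
        vocab[token] ?? unknownTokenId
    }
}

let m = ToyModel(vocab: ["a": 0], unknownTokenId: 99)
assert(m.convertTokenToId("zzz") == 99)
```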

Move convenience methods (callAsFunction, bulk wrappers, short-parameter
versions) from protocol requirements to extensions across Tokenizer,
TokenizingModel, Normalizer, Decoder, PostProcessor, and PreTokenizer.
Derive bosTokenId/eosTokenId/unknownTokenId from token strings on the
Tokenizer protocol. Remove redundant default parameter values from all
concrete implementations.
@DePasqualeOrg DePasqualeOrg merged commit b094600 into main Mar 4, 2026
3 checks passed