
Clean up protocol requirements: remove redundant overloads and defaults#10

Merged
DePasqualeOrg merged 1 commit into main from clean-up-protocol-requirements
Mar 4, 2026

Conversation

@DePasqualeOrg (Owner)

Multiple protocols listed callAsFunction and bulk-wrapper methods as protocol requirements even though they were pure forwarders that no conformer overrides. This PR moves them to extensions and removes redundant default parameter values from concrete implementations.

Design principle

Each operation should have exactly one protocol requirement (the full-parameter version). Convenience entry points – short-parameter versions, callAsFunction, and bulk wrappers – belong in protocol extensions.

This aligns with the Python tokenizers library, where the internal components (Normalizer, PreTokenizer, Decoder, PostProcessor) each define a single core method. __call__ is not part of the protocol interface – it's a convenience alias, just as callAsFunction is a convenience extension in Swift.
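The pattern can be sketched as follows. This is a simplified illustration, not the library's exact code; the `Lowercase` conformer is hypothetical:

```swift
// One core requirement; conveniences live in the extension.
protocol Normalizer {
    func normalize(text: String) -> String  // the single protocol requirement
}

extension Normalizer {
    // Convenience alias, analogous to Python's __call__.
    func callAsFunction(text: String) -> String {
        normalize(text: text)
    }
}

// Hypothetical conformer: implements only the requirement,
// and gets the callable form for free from the extension.
struct Lowercase: Normalizer {
    func normalize(text: String) -> String { text.lowercased() }
}

let n = Lowercase()
assert(n(text: "Hello World") == "hello world")
```

Because the extension method is not a requirement, conformers cannot accidentally shadow it with a divergent implementation; there is exactly one customization point.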

For the main Tokenizer protocol, the design also aligns with Python transformers:

  • Single-item convertTokenToId / convertIdToToken are protocol requirements, matching Python's abstract methods.
  • Bulk convertTokensToIds / convertIdsToTokens are derived from the single-item versions. Python does this via type checking at runtime; Swift does it via protocol extensions at compile time.
  • encode and decode are the core operations, with the full-parameter version as the requirement and short-parameter versions as extension conveniences.
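The single-item/bulk relationship can be sketched as below. Signatures are simplified and `ToyTokenizer` is a made-up conformer for illustration:

```swift
protocol Tokenizer {
    // Single-item requirements, matching Python's abstract methods.
    func convertTokenToId(_ token: String) -> Int?
    func convertIdToToken(_ id: Int) -> String?
}

extension Tokenizer {
    // Bulk wrappers are pure maps over the single-item requirements.
    func convertTokensToIds(_ tokens: [String]) -> [Int?] {
        tokens.map { convertTokenToId($0) }
    }
    func convertIdsToTokens(_ ids: [Int]) -> [String?] {
        ids.map { convertIdToToken($0) }
    }
}

// Illustrative conformer with a two-entry vocabulary.
struct ToyTokenizer: Tokenizer {
    let vocab = ["a": 0, "b": 1]
    func convertTokenToId(_ token: String) -> Int? { vocab[token] }
    func convertIdToToken(_ id: Int) -> String? {
        vocab.first { $0.value == id }?.key
    }
}

let t = ToyTokenizer()
assert(t.convertTokensToIds(["a", "x", "b"]) == [0, nil, 1])
```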

No conformer in the codebase overrides any of the methods moved to extensions.

Changes

Tokenizer protocol

Requirements removed (moved to extensions):

  • encode(text:) -> [Int] – calls encode(text:addSpecialTokens: true)
  • callAsFunction(_:addSpecialTokens:) -> [Int] – calls encode(text:addSpecialTokens:)
  • decode(tokens:) -> String – calls decode(tokens:skipSpecialTokens: false)
  • convertTokensToIds(_:) -> [Int?] – maps over convertTokenToId(_:)
  • convertIdsToTokens(_:) -> [String?] – maps over convertIdToToken(_:)
  • bosTokenId / eosTokenId / unknownTokenId – now derived from the token string via convertTokenToId, matching how Python transformers derives them on demand rather than storing them
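The two kinds of derivation above (forwarding with a fixed default, and computing an ID from a token string) look roughly like this sketch. The protocol is heavily simplified and `MiniTokenizer` is illustrative:

```swift
protocol Tokenizer {
    func encode(text: String, addSpecialTokens: Bool) -> [Int]
    func convertTokenToId(_ token: String) -> Int?
    var bosToken: String? { get }
}

extension Tokenizer {
    // Short-parameter convenience: forwards with the default.
    func encode(text: String) -> [Int] {
        encode(text: text, addSpecialTokens: true)
    }
    // Derived on demand from the token string, not stored.
    var bosTokenId: Int? {
        guard let bosToken else { return nil }
        return convertTokenToId(bosToken)
    }
}

// Illustrative conformer; splits on spaces and prepends the BOS id.
struct MiniTokenizer: Tokenizer {
    let vocab = ["<s>": 0, "hello": 1]
    var bosToken: String? { "<s>" }
    func convertTokenToId(_ token: String) -> Int? { vocab[token] }
    func encode(text: String, addSpecialTokens: Bool) -> [Int] {
        let ids = text.split(separator: " ").compactMap { convertTokenToId(String($0)) }
        return addSpecialTokens ? [0] + ids : ids
    }
}

let tok = MiniTokenizer()
assert(tok.encode(text: "hello") == [0, 1])
assert(tok.bosTokenId == 0)
```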

TokenizingModel protocol

Requirements removed (moved to extensions):

  • callAsFunction(_:) -> [String] – calls tokenize(text:)
  • convertTokensToIds(_:) -> [Int?] – maps over convertTokenToId(_:)
  • convertIdsToTokens(_:) -> [String?] – maps over convertIdToToken(_:)

Normalizer protocol

Requirements removed (moved to extensions):

  • callAsFunction(text:) -> String – calls normalize(text:)

Decoder protocol

Requirements removed (moved to extensions):

  • callAsFunction(tokens:) -> [String] – calls decode(tokens:)

PostProcessor protocol

Requirements removed (moved to extensions):

  • callAsFunction(tokens:tokensPair:addSpecialTokens:) -> [String] – calls postProcess(...)

PreTokenizer protocol

Requirements removed (moved to extensions):

  • preTokenize(texts:options:) -> [String] – flatMaps over preTokenize(text:options:)
  • callAsFunction(texts:options:) -> [String] – calls preTokenize(texts:options:)
  • callAsFunction(text:options:) -> [String] – calls preTokenize(text:options:)
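Sketched out, the bulk version is a flatMap over the single-text requirement, and both callAsFunction forms forward to it. The `PreTokenizerOptions` stand-in and the `WhitespaceSplit` conformer are illustrative, not the library's exact types:

```swift
// Minimal stand-in for the library's options type.
struct PreTokenizerOptions: OptionSet {
    let rawValue: Int
    static let firstSection = PreTokenizerOptions(rawValue: 1 << 0)
}

protocol PreTokenizer {
    // The single requirement: one text in, fragments out.
    func preTokenize(text: String, options: PreTokenizerOptions) -> [String]
}

extension PreTokenizer {
    // Bulk version: flatMap over the single-text requirement.
    func preTokenize(texts: [String], options: PreTokenizerOptions) -> [String] {
        texts.flatMap { preTokenize(text: $0, options: options) }
    }
    // callAsFunction conveniences forward, supplying the default options.
    func callAsFunction(text: String, options: PreTokenizerOptions = [.firstSection]) -> [String] {
        preTokenize(text: text, options: options)
    }
    func callAsFunction(texts: [String], options: PreTokenizerOptions = [.firstSection]) -> [String] {
        preTokenize(texts: texts, options: options)
    }
}

// Illustrative conformer: splits each text on spaces.
struct WhitespaceSplit: PreTokenizer {
    func preTokenize(text: String, options: PreTokenizerOptions) -> [String] {
        text.split(separator: " ").map(String.init)
    }
}

let p = WhitespaceSplit()
assert(p(texts: ["a b", "c"]) == ["a", "b", "c"])
```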

Redundant defaults removed from concrete implementations

Defaults now live in protocol extensions only. Removed from:

  • PreTrainedTokenizer.encode(text:addSpecialTokens:) – had = true
  • PreTrainedTokenizer.decode(tokens:skipSpecialTokens:) – had = false
  • PreTrainedTokenizer.applyChatTemplate(...) – had all six defaults repeated
  • All PostProcessor conformers (TemplateProcessing, ByteLevelPostProcessor, RobertaProcessing, BertProcessing, SequenceProcessing) – had tokensPair: nil and addSpecialTokens: true defaults on postProcess
  • All PreTokenizer conformers (BertPreTokenizer, PreTokenizerSequence, WhitespacePreTokenizer, MetaspacePreTokenizer, ByteLevelPreTokenizer, PunctuationPreTokenizer, DigitsPreTokenizer, SplitPreTokenizer) – had options: [.firstSection] default on preTokenize
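The resulting shape is that each default value is written in exactly one place. Swift forbids default arguments on protocol requirements, so they can only live on the extension conveniences; conformers implement the bare requirement. A sketch, with a hypothetical `BertLike` conformer:

```swift
protocol PostProcessor {
    // The requirement carries no defaults (Swift disallows them here anyway).
    func postProcess(tokens: [String], tokensPair: [String]?, addSpecialTokens: Bool) -> [String]
}

extension PostProcessor {
    // Defaults live here, once, on the convenience entry point.
    func callAsFunction(tokens: [String], tokensPair: [String]? = nil,
                        addSpecialTokens: Bool = true) -> [String] {
        postProcess(tokens: tokens, tokensPair: tokensPair, addSpecialTokens: addSpecialTokens)
    }
}

// Conformer implements the requirement without repeating the defaults.
struct BertLike: PostProcessor {
    func postProcess(tokens: [String], tokensPair: [String]?, addSpecialTokens: Bool) -> [String] {
        guard addSpecialTokens else { return tokens }
        return ["[CLS]"] + tokens + ["[SEP]"] + (tokensPair.map { $0 + ["[SEP]"] } ?? [])
    }
}

let pp = BertLike()
assert(pp(tokens: ["hi"]) == ["[CLS]", "hi", "[SEP]"])
```

If a conformer repeated the defaults on `postProcess`, callers typed to the concrete type could silently bypass the extension and pick up a divergent default; removing them closes that gap.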

PreTrainedTokenizer redundant forwarders removed

  • encode(text:) -> [Int] – just called encode(text:addSpecialTokens: true)
  • bosTokenId / eosTokenId / unknownTokenId – forwarded to model.*, now derived by the extension

Note: TokenizingModel keeps the ID properties as stored requirements, since the model layer uses them on hot paths (e.g., convertTokenToId falls back to unknownTokenId, and UnigramTokenizer reads unknownTokenId from config as the primary identifier).
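That hot-path fallback looks roughly like this (simplified; `ToyModel` is illustrative):

```swift
protocol TokenizingModel {
    // Kept as a stored requirement: read on the hot path for every unknown token.
    var unknownTokenId: Int? { get }
    func convertTokenToId(_ token: String) -> Int?
}

// Illustrative conformer: falls back to the stored ID
// without an extra string-to-ID lookup per miss.
struct ToyModel: TokenizingModel {
    let vocab: [String: Int]
    let unknownTokenId: Int?
    func convertTokenToId(_ token: String) -> Int? {
        vocab[token] ?? unknownTokenId
    }
}

let m = ToyModel(vocab: ["a": 0], unknownTokenId: 99)
assert(m.convertTokenToId("zzz") == 99)
```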

Move convenience methods (callAsFunction, bulk wrappers, short-parameter
versions) from protocol requirements to extensions across Tokenizer,
TokenizingModel, Normalizer, Decoder, PostProcessor, and PreTokenizer.
Derive bosTokenId/eosTokenId/unknownTokenId from token strings on the
Tokenizer protocol. Remove redundant default parameter values from all
concrete implementations.
@DePasqualeOrg DePasqualeOrg merged commit b094600 into main Mar 4, 2026
3 checks passed