Overview
Add fine-grained access to tokenizer pipeline components, enabling advanced customization and modification of tokenization behavior. This feature provides access to the internal tokenizer pipeline stages: normalization, pre-tokenization, and post-processing.
Features to Implement
Component Access Methods
Enable getting and setting individual pipeline components to customize tokenizer behavior without full retraining.
Normalizer Access
- Get Method:
func (t *Tokenizer) GetNormalizer() (Normalizer, error)
- Set Method:
func (t *Tokenizer) SetNormalizer(normalizer Normalizer) error
- Purpose: Control text normalization (lowercasing, unicode normalization, accents, etc.)
Pre-tokenizer Access
- Get Method:
func (t *Tokenizer) GetPreTokenizer() (PreTokenizer, error)
- Set Method:
func (t *Tokenizer) SetPreTokenizer(preTokenizer PreTokenizer) error
- Purpose: Control pre-tokenization splitting (whitespace, punctuation, custom patterns)
Post-processor Access
- Get Method:
func (t *Tokenizer) GetPostProcessor() (PostProcessor, error)
- Set Method:
func (t *Tokenizer) SetPostProcessor(postProcessor PostProcessor) error
- Purpose: Control post-processing (special token addition, sequence formatting)
Component Types and Configurations
- Support for common component types from the tokenizers crate
- Configuration interfaces for each component type
- Factory methods for creating standard component configurations
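One possible shape for these configuration interfaces is sketched below. It assumes each Go component serializes itself to the tagged JSON layout that the tokenizers crate writes into tokenizer.json (for example {"type":"Lowercase"}), so the configuration can cross the FFI boundary as a plain string. The Config method and the NewLowercaseStrip factory are hypothetical names used for illustration; they are not part of the proposed API above.

```go
package main

import "encoding/json"

// Normalizer is a hypothetical interface: each component can describe
// itself as a JSON configuration that the Rust layer could deserialize.
type Normalizer interface {
    Config() ([]byte, error)
}

// LowercaseNormalizer has no parameters.
type LowercaseNormalizer struct{}

func (LowercaseNormalizer) Config() ([]byte, error) {
    return json.Marshal(map[string]string{"type": "Lowercase"})
}

// StripNormalizer trims whitespace on the selected sides.
type StripNormalizer struct {
    Left, Right bool
}

func (s StripNormalizer) Config() ([]byte, error) {
    return json.Marshal(struct {
        Type  string `json:"type"`
        Left  bool   `json:"strip_left"`
        Right bool   `json:"strip_right"`
    }{"Strip", s.Left, s.Right})
}

// SequenceNormalizer chains normalizers in order.
type SequenceNormalizer struct {
    Normalizers []Normalizer
}

func (s SequenceNormalizer) Config() ([]byte, error) {
    children := make([]json.RawMessage, 0, len(s.Normalizers))
    for _, n := range s.Normalizers {
        raw, err := n.Config()
        if err != nil {
            return nil, err
        }
        children = append(children, raw)
    }
    return json.Marshal(struct {
        Type        string            `json:"type"`
        Normalizers []json.RawMessage `json:"normalizers"`
    }{"Sequence", children})
}

// NewLowercaseStrip is a hypothetical factory for a common chain.
func NewLowercaseStrip() Normalizer {
    return SequenceNormalizer{Normalizers: []Normalizer{
        LowercaseNormalizer{},
        StripNormalizer{Left: true, Right: true},
    }}
}
```

Serializing to JSON keeps the FFI surface small: a setter only needs to pass one string to Rust, at the cost of a parse on every call.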
Implementation Requirements
Go Layer Interface Definitions
Normalizer Interface
PreTokenizer Interface
PostProcessor Interface
Go Layer (tokenizers.go)
Rust Layer (src/lib.rs)
FFI Bridge (library.go)
Acceptance Criteria
Functional Requirements
Component Type Support
Integration Requirements
Compatibility Requirements
Testing Requirements
Documentation Requirements
Technical Considerations
Component Compatibility
- Validation of component combinations for model compatibility
- Warning systems for potentially incompatible component changes
- Testing component modifications against expected model inputs
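A simple check in this direction, sketched under assumptions: after changing a component, encode a few representative strings and report any produced tokens that the model vocabulary does not contain. OutOfVocabulary is a hypothetical helper, and the token slice and vocabulary map stand in for whatever the encoding and vocabulary accessors actually return.

```go
package main

// OutOfVocabulary returns the tokens that are missing from vocab,
// in first-seen order and without duplicates. A non-empty result after
// a component change suggests the new pipeline produces input the model
// was not trained on.
func OutOfVocabulary(tokens []string, vocab map[string]int) []string {
    seen := make(map[string]bool)
    var missing []string
    for _, tok := range tokens {
        if _, ok := vocab[tok]; ok || seen[tok] {
            continue
        }
        seen[tok] = true
        missing = append(missing, tok)
    }
    return missing
}
```

A warning system could run this over a small sample corpus inside each setter and log the result instead of rejecting the change outright.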
Performance Impact
- Minimize overhead for component access operations
- Efficient component configuration transfer across FFI
- Consider caching frequently accessed component configurations
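If caching turns out to be worthwhile, it could look roughly like the sketch below: a per-stage cache of serialized configurations that getters consult and setters invalidate. The configCache type and the fetch callback (a stand-in for the FFI call) are hypothetical.

```go
package main

import "sync"

// configCache memoizes serialized component configurations by stage
// name ("normalizer", "pre_tokenizer", "post_processor").
type configCache struct {
    mu      sync.RWMutex
    entries map[string][]byte
}

func newConfigCache() *configCache {
    return &configCache{entries: make(map[string][]byte)}
}

// Get returns the cached configuration for stage, calling fetch only
// on a miss. Two goroutines that miss at the same time may both call
// fetch; the last result stored wins.
func (c *configCache) Get(stage string, fetch func() ([]byte, error)) ([]byte, error) {
    c.mu.RLock()
    cfg, ok := c.entries[stage]
    c.mu.RUnlock()
    if ok {
        return cfg, nil
    }
    cfg, err := fetch()
    if err != nil {
        return nil, err
    }
    c.mu.Lock()
    c.entries[stage] = cfg
    c.mu.Unlock()
    return cfg, nil
}

// Invalidate drops the entry for stage; the matching setter would call
// this after a successful update.
func (c *configCache) Invalidate(stage string) {
    c.mu.Lock()
    delete(c.entries, stage)
    c.mu.Unlock()
}
```

Errors are deliberately not cached, so a failed FFI call is retried on the next access.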
Memory Management
- Proper cleanup of component configuration data
- Efficient serialization of complex component hierarchies
- Memory usage optimization for component chains
Error Handling
- Clear error messages for invalid component configurations
- Validation of component parameters before application
- Graceful handling of incompatible component combinations
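As an illustration of validating parameters before they reach the Rust layer, the sketch below checks a split pre-tokenizer on the Go side. The ErrInvalidComponent sentinel, the Validate method, and the lowercase behavior names are assumptions (the names follow the "isolated" spelling used in the example in this issue). Go's regexp package uses RE2 syntax, which may differ from the regex engine on the Rust side, so this is only a first-pass check.

```go
package main

import (
    "errors"
    "fmt"
    "regexp"
)

// ErrInvalidComponent is a hypothetical sentinel that callers can test
// for with errors.Is.
var ErrInvalidComponent = errors.New("invalid component configuration")

// SplitPreTokenizer mirrors the fields used in the usage example.
type SplitPreTokenizer struct {
    Pattern  string
    Behavior string
}

// splitBehaviors lists the accepted behavior names (assumed spellings).
var splitBehaviors = map[string]bool{
    "removed":              true,
    "isolated":             true,
    "merged_with_previous": true,
    "merged_with_next":     true,
    "contiguous":           true,
}

// Validate reports the first problem found, wrapped in ErrInvalidComponent.
func (s SplitPreTokenizer) Validate() error {
    if s.Pattern == "" {
        return fmt.Errorf("%w: split pattern is empty", ErrInvalidComponent)
    }
    if _, err := regexp.Compile(s.Pattern); err != nil {
        return fmt.Errorf("%w: split pattern %q: %v", ErrInvalidComponent, s.Pattern, err)
    }
    if !splitBehaviors[s.Behavior] {
        return fmt.Errorf("%w: unknown split behavior %q", ErrInvalidComponent, s.Behavior)
    }
    return nil
}
```

SetPreTokenizer could call Validate first and return its error unchanged, so the tokenizer is left untouched when the configuration is rejected.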
Serialization Integration
- Component configurations must be preserved in Save/ToJSON operations
- Proper loading of custom component configurations from saved tokenizers
- Version compatibility for component configuration formats
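Loading could be sketched as a decoder that dispatches on the "type" tag of each serialized component, again assuming the tagged JSON layout used in tokenizer.json. The Type method and DecodeNormalizer function are hypothetical. Unknown tags are rejected rather than dropped, which is one way a file written by a newer format version would surface as an explicit error.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// Normalizer is a hypothetical interface; Type reports the serialized tag.
type Normalizer interface {
    Type() string
}

type NFCNormalizer struct{}
type LowercaseNormalizer struct{}
type StripNormalizer struct {
    Left  bool `json:"strip_left"`
    Right bool `json:"strip_right"`
}
type SequenceNormalizer struct {
    Normalizers []Normalizer
}

func (NFCNormalizer) Type() string       { return "NFC" }
func (LowercaseNormalizer) Type() string { return "Lowercase" }
func (StripNormalizer) Type() string     { return "Strip" }
func (SequenceNormalizer) Type() string  { return "Sequence" }

// DecodeNormalizer rebuilds a Go component from its JSON configuration,
// recursing into the children of a Sequence.
func DecodeNormalizer(data []byte) (Normalizer, error) {
    var head struct {
        Type        string            `json:"type"`
        Normalizers []json.RawMessage `json:"normalizers"`
    }
    if err := json.Unmarshal(data, &head); err != nil {
        return nil, err
    }
    switch head.Type {
    case "NFC":
        return NFCNormalizer{}, nil
    case "Lowercase":
        return LowercaseNormalizer{}, nil
    case "Strip":
        var s StripNormalizer
        if err := json.Unmarshal(data, &s); err != nil {
            return nil, err
        }
        return s, nil
    case "Sequence":
        seq := SequenceNormalizer{}
        for _, raw := range head.Normalizers {
            child, err := DecodeNormalizer(raw)
            if err != nil {
                return nil, err
            }
            seq.Normalizers = append(seq.Normalizers, child)
        }
        return seq, nil
    default:
        return nil, fmt.Errorf("unsupported normalizer type %q", head.Type)
    }
}
```

GetNormalizer could use a decoder like this on the JSON returned from Rust, which would also give the save-and-reload check in the usage example a concrete Go type to print.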
Example Usage
// Assumes "fmt" and "log" are imported and tokenizer is a loaded *Tokenizer.

// Get current normalizer configuration
currentNormalizer, err := tokenizer.GetNormalizer()
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Current normalizer: %T\n", currentNormalizer)

// Set a custom normalizer chain
customNormalizer := &SequenceNormalizer{
    Normalizers: []Normalizer{
        &NFCNormalizer{},
        &LowercaseNormalizer{},
        &StripNormalizer{Left: true, Right: true},
    },
}
err = tokenizer.SetNormalizer(customNormalizer)
if err != nil {
    log.Fatal(err)
}

// Modify pre-tokenizer for domain-specific splitting
domainPreTokenizer := &SequencePreTokenizer{
    PreTokenizers: []PreTokenizer{
        &WhitespacePreTokenizer{},
        &SplitPreTokenizer{
            Pattern:  `[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+`, // Split IP addresses
            Behavior: "isolated",
        },
        &PunctuationPreTokenizer{},
    },
}
err = tokenizer.SetPreTokenizer(domainPreTokenizer)
if err != nil {
    log.Fatal(err)
}

// Custom post-processor for question-answering format.
// $A and $B mark where the first and second sequences are inserted
// (template syntax of the tokenizers crate); ":1" sets the type ID.
qaPostProcessor := &TemplateProcessing{
    Single: "[CLS] $A [SEP]",
    Pair:   "[CLS] $A [SEP] $B:1 [SEP]:1",
    SpecialTokens: []SpecialToken{
        {Token: "[CLS]", ID: 101},
        {Token: "[SEP]", ID: 102},
    },
}
err = tokenizer.SetPostProcessor(qaPostProcessor)
if err != nil {
    log.Fatal(err)
}

// Test the modified tokenizer
question := "What is the server IP?"
context := "The server is running on 192.168.1.100 port 8080."
encoding, err := tokenizer.EncodeSequencePair(question, context)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Modified tokenization: %v\n", encoding.Tokens)

// Save tokenizer with custom components
err = tokenizer.Save("./custom_pipeline_tokenizer.json", true)
if err != nil {
    log.Fatal(err)
}

// Load and verify component preservation
loadedTokenizer, err := LoadFromFile("./custom_pipeline_tokenizer.json")
if err != nil {
    log.Fatal(err)
}
loadedNormalizer, err := loadedTokenizer.GetNormalizer()
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Loaded normalizer type: %T\n", loadedNormalizer)
Related Issues
- Enables advanced customization of tokenization pipelines
- Supports domain-specific tokenization requirements
- Foundation for tokenizer fine-tuning and adaptation
- Completes the advanced features milestone providing full tokenizer control
Implementation Requirements
Go Layer Interface Definitions
Normalizer Interface
- Normalizer interface with common methods
- BertNormalizer - BERT-style normalization
- NFCNormalizer - Unicode NFC normalization
- NFDNormalizer - Unicode NFD normalization
- NFKCNormalizer - Unicode NFKC normalization
- NFKDNormalizer - Unicode NFKD normalization
- LowercaseNormalizer - Simple lowercasing
- StripNormalizer - Whitespace stripping
- SequenceNormalizer - Chain multiple normalizers
PreTokenizer Interface
- PreTokenizer interface with common methods
- WhitespacePreTokenizer - Split on whitespace
- BertPreTokenizer - BERT-style pre-tokenization
- ByteLevelPreTokenizer - GPT-style byte-level
- PunctuationPreTokenizer - Split on punctuation
- DigitsPreTokenizer - Handle digits specially
- SplitPreTokenizer - Custom pattern splitting
- SequencePreTokenizer - Chain multiple pre-tokenizers
PostProcessor Interface
- PostProcessor interface with common methods
- BertProcessing - BERT [CLS] and [SEP] handling
- RobertaProcessing - RoBERTa special token handling
- TemplateProcessing - Custom template-based processing
- ByteLevelProcessing - GPT-style post-processing
Go Layer (tokenizers.go)
- Tokenizer struct
Rust Layer (src/lib.rs)
- get_normalizer
- get_pre_tokenizer
- get_post_processor
- set_normalizer
- set_pre_tokenizer
- set_post_processor
FFI Bridge (library.go)