[ENH] Implement Pipeline Component Access (Get/Set Normalizer, Pre-tokenizer, Post-processor) #70

@tazarov

Overview

Add fine-grained access to the three configurable stages of the tokenizer pipeline (normalization, pre-tokenization, and post-processing) so that callers can inspect and modify tokenization behavior.

Features to Implement

Component Access Methods

Enable getting and setting individual pipeline components to customize tokenizer behavior without full retraining.

Normalizer Access

  • Get Method: func (t *Tokenizer) GetNormalizer() (Normalizer, error)
  • Set Method: func (t *Tokenizer) SetNormalizer(normalizer Normalizer) error
  • Purpose: Control text normalization (lowercasing, unicode normalization, accents, etc.)

Pre-tokenizer Access

  • Get Method: func (t *Tokenizer) GetPreTokenizer() (PreTokenizer, error)
  • Set Method: func (t *Tokenizer) SetPreTokenizer(preTokenizer PreTokenizer) error
  • Purpose: Control pre-tokenization splitting (whitespace, punctuation, custom patterns)

Post-processor Access

  • Get Method: func (t *Tokenizer) GetPostProcessor() (PostProcessor, error)
  • Set Method: func (t *Tokenizer) SetPostProcessor(postProcessor PostProcessor) error
  • Purpose: Control post-processing (special token addition, sequence formatting)

Component Types and Configurations

  • Support for the common component types from the tokenizers crate
  • Configuration interfaces for each component type
  • Factory methods for creating standard component configurations

Implementation Requirements

Go Layer Interface Definitions

Normalizer Interface

  • Define Normalizer interface with common methods
  • Implement concrete normalizer types:
    • BertNormalizer - BERT-style normalization
    • NFCNormalizer - Unicode NFC normalization
    • NFDNormalizer - Unicode NFD normalization
    • NFKCNormalizer - Unicode NFKC normalization
    • NFKDNormalizer - Unicode NFKD normalization
    • LowercaseNormalizer - Simple lowercasing
    • StripNormalizer - Whitespace stripping
    • SequenceNormalizer - Chain multiple normalizers
  • Configuration structs for each normalizer type
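One way to approach the configuration structs is to let each normalizer marshal itself into the JSON layout used by tokenizer.json files. The sketch below is only an illustration: the shape of the `Normalizer` interface and the `configJSON` helper are assumptions, not an existing API, and the field names (`type`, `normalizers`, `strip_left`, `strip_right`) follow the tokenizers serialization format as understood here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Normalizer is a hypothetical interface: a component only has to know
// how to serialize its own configuration.
type Normalizer interface {
	json.Marshaler
}

type NFCNormalizer struct{}

func (NFCNormalizer) MarshalJSON() ([]byte, error) {
	return []byte(`{"type":"NFC"}`), nil
}

type LowercaseNormalizer struct{}

func (LowercaseNormalizer) MarshalJSON() ([]byte, error) {
	return []byte(`{"type":"Lowercase"}`), nil
}

type StripNormalizer struct{ Left, Right bool }

func (s StripNormalizer) MarshalJSON() ([]byte, error) {
	return json.Marshal(struct {
		Type  string `json:"type"`
		Left  bool   `json:"strip_left"`
		Right bool   `json:"strip_right"`
	}{"Strip", s.Left, s.Right})
}

// SequenceNormalizer chains normalizers; nested entries are serialized
// through their own MarshalJSON methods.
type SequenceNormalizer struct{ Normalizers []Normalizer }

func (s SequenceNormalizer) MarshalJSON() ([]byte, error) {
	return json.Marshal(struct {
		Type        string       `json:"type"`
		Normalizers []Normalizer `json:"normalizers"`
	}{"Sequence", s.Normalizers})
}

// configJSON (hypothetical helper) returns the JSON string that would be
// handed across the FFI boundary.
func configJSON(n Normalizer) (string, error) {
	data, err := json.Marshal(n)
	if err != nil {
		return "", fmt.Errorf("serialize normalizer: %w", err)
	}
	return string(data), nil
}

func main() {
	cfg, err := configJSON(SequenceNormalizer{Normalizers: []Normalizer{
		NFCNormalizer{},
		LowercaseNormalizer{},
		StripNormalizer{Left: true, Right: true},
	}})
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg)
}
```

Keeping serialization on the component types means a Sequence needs no knowledge of its children beyond the interface, which may simplify adding new normalizer types later.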

PreTokenizer Interface

  • Define PreTokenizer interface with common methods
  • Implement concrete pre-tokenizer types:
    • WhitespacePreTokenizer - Split on whitespace
    • BertPreTokenizer - BERT-style pre-tokenization
    • ByteLevelPreTokenizer - GPT-2-style byte-level pre-tokenization
    • PunctuationPreTokenizer - Split on punctuation
    • DigitsPreTokenizer - Split digits from other characters (optionally into individual digits)
    • SplitPreTokenizer - Custom pattern splitting
    • SequencePreTokenizer - Chain multiple pre-tokenizers
  • Configuration structs for each pre-tokenizer type

PostProcessor Interface

  • Define PostProcessor interface with common methods
  • Implement concrete post-processor types:
    • BertProcessing - BERT [CLS] and [SEP] handling
    • RobertaProcessing - RoBERTa special token handling
    • TemplateProcessing - Custom template-based processing
    • ByteLevelProcessing - GPT-style post-processing
  • Configuration structs for each post-processor type

Go Layer (tokenizers.go)

  • Add component access methods to Tokenizer struct
  • Component validation and compatibility checking
  • Error handling for invalid component configurations
  • Integration with existing tokenizer functionality
  • Component serialization support for Save/Load operations

Rust Layer (src/lib.rs)

  • FFI functions for getting component configurations:
    • get_normalizer
    • get_pre_tokenizer
    • get_post_processor
  • FFI functions for setting component configurations:
    • set_normalizer
    • set_pre_tokenizer
    • set_post_processor
  • Component serialization to/from configuration structs
  • Integration with tokenizers crate component APIs
  • Memory management for component data transfer

FFI Bridge (library.go)

  • Define component data structures for FFI transfer
  • Component configuration marshaling/unmarshaling
  • Memory management for component configurations
  • Error propagation for component operations
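If component configurations cross the FFI boundary as JSON, the Go side needs to turn a payload back into a concrete type. A minimal sketch of that unmarshaling step follows, assuming a `type` discriminator field as in tokenizer.json; the `decodeNormalizer` function, the `Kind` method, and the small set of supported types are illustrative assumptions only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Normalizer is a hypothetical interface; Kind reports the type tag.
type Normalizer interface{ Kind() string }

type LowercaseNormalizer struct{}

type StripNormalizer struct {
	Left  bool `json:"strip_left"`
	Right bool `json:"strip_right"`
}

type SequenceNormalizer struct{ Normalizers []Normalizer }

func (LowercaseNormalizer) Kind() string { return "Lowercase" }
func (StripNormalizer) Kind() string     { return "Strip" }
func (SequenceNormalizer) Kind() string  { return "Sequence" }

// decodeNormalizer reads the "type" field first, then decodes the rest
// of the payload into the matching concrete type.
func decodeNormalizer(data []byte) (Normalizer, error) {
	var head struct {
		Type string `json:"type"`
	}
	if err := json.Unmarshal(data, &head); err != nil {
		return nil, fmt.Errorf("decode normalizer: %w", err)
	}
	switch head.Type {
	case "Lowercase":
		return LowercaseNormalizer{}, nil
	case "Strip":
		var s StripNormalizer
		if err := json.Unmarshal(data, &s); err != nil {
			return nil, fmt.Errorf("decode Strip: %w", err)
		}
		return s, nil
	case "Sequence":
		// Keep children as raw JSON so each can be decoded recursively.
		var raw struct {
			Normalizers []json.RawMessage `json:"normalizers"`
		}
		if err := json.Unmarshal(data, &raw); err != nil {
			return nil, fmt.Errorf("decode Sequence: %w", err)
		}
		seq := SequenceNormalizer{}
		for i, item := range raw.Normalizers {
			n, err := decodeNormalizer(item)
			if err != nil {
				return nil, fmt.Errorf("normalizers[%d]: %w", i, err)
			}
			seq.Normalizers = append(seq.Normalizers, n)
		}
		return seq, nil
	default:
		return nil, fmt.Errorf("unsupported normalizer type %q", head.Type)
	}
}

func main() {
	payload := []byte(`{"type":"Sequence","normalizers":[{"type":"Lowercase"},{"type":"Strip","strip_left":true,"strip_right":false}]}`)
	n, err := decodeNormalizer(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %+v\n", n.Kind(), n)
}
```

Returning an error for unknown type tags, rather than silently dropping them, keeps a Get call from misreporting a pipeline that contains a component the Go layer does not yet model.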

Acceptance Criteria

Functional Requirements

  • Get methods return accurate current component configurations
  • Set methods successfully update tokenizer pipeline components
  • Modified tokenizers produce expected tokenization results
  • Component changes affect encoding/decoding behavior correctly
  • All standard component types are supported and functional
  • Component chaining (Sequence types) works correctly

Component Type Support

  • All major normalizer types work correctly with various text inputs
  • Pre-tokenizer types handle different text patterns appropriately
  • Post-processor types format sequences correctly for model compatibility
  • Custom component configurations are validated and applied properly

Integration Requirements

  • Component modifications work with existing encoding/decoding methods
  • Modified tokenizers can be saved with the Save/ToJSON methods and loaded back with their components intact
  • Component changes integrate with batch processing methods
  • Dynamic token additions work with modified pipeline components

Compatibility Requirements

  • Component configurations match HuggingFace tokenizers behavior
  • Modified tokenizers maintain compatibility with model expectations
  • Standard component presets work with popular model architectures (BERT, GPT, RoBERTa, etc.)

Testing Requirements

  • Unit tests for each component get/set method
  • Integration tests verifying component behavior changes
  • Compatibility tests with various model architectures
  • Component serialization/deserialization tests
  • Performance tests for component modification overhead
  • Unicode and multilingual text handling with different normalizers
  • Edge case tests for component configuration validation
  • Cross-component interaction tests (normalizer + pre-tokenizer combinations)

Documentation Requirements

  • Go doc comments for all component interfaces and methods
  • Usage examples for common component modifications
  • Guidelines for choosing appropriate components for different use cases
  • Documentation of component behavior and configuration options
  • Examples of creating custom tokenizer pipelines

Technical Considerations

Component Compatibility

  • Validation of component combinations for model compatibility
  • Warning systems for potentially incompatible component changes
  • Testing component modifications against expected model inputs

Performance Impact

  • Minimize overhead for component access operations
  • Efficient component configuration transfer across FFI
  • Consider caching frequently accessed component configurations

Memory Management

  • Proper cleanup of component configuration data
  • Efficient serialization of complex component hierarchies
  • Memory usage optimization for component chains

Error Handling

  • Clear error messages for invalid component configurations
  • Validation of component parameters before application
  • Graceful handling of incompatible component combinations

Serialization Integration

  • Component configurations must be preserved in Save/ToJSON operations
  • Proper loading of custom component configurations from saved tokenizers
  • Version compatibility for component configuration formats

Example Usage

// Get current normalizer configuration
currentNormalizer, err := tokenizer.GetNormalizer()
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Current normalizer: %T\n", currentNormalizer)

// Set a custom normalizer chain
customNormalizer := &SequenceNormalizer{
    Normalizers: []Normalizer{
        &NFCNormalizer{},
        &LowercaseNormalizer{},
        &StripNormalizer{Left: true, Right: true},
    },
}
err = tokenizer.SetNormalizer(customNormalizer)
if err != nil {
    log.Fatal(err)
}

// Modify pre-tokenizer for domain-specific splitting
domainPreTokenizer := &SequencePreTokenizer{
    PreTokenizers: []PreTokenizer{
        &WhitespacePreTokenizer{},
        &SplitPreTokenizer{
            Pattern: `[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+`, // Isolate IPv4 addresses as their own pieces
            Behavior: "isolated",
        },
        &PunctuationPreTokenizer{},
    },
}
err = tokenizer.SetPreTokenizer(domainPreTokenizer)
if err != nil {
    log.Fatal(err)
}

// Custom post-processor for question-answering format
qaPostProcessor := &TemplateProcessing{
    Single: "[CLS] $A [SEP]",
    Pair:   "[CLS] $A [SEP] $B [SEP]",
    SpecialTokens: []SpecialToken{
        {Token: "[CLS]", ID: 101},
        {Token: "[SEP]", ID: 102},
    },
}
err = tokenizer.SetPostProcessor(qaPostProcessor)
if err != nil {
    log.Fatal(err)
}

// Test the modified tokenizer
question := "What is the server IP?"
context := "The server is running on 192.168.1.100 port 8080."
encoding, err := tokenizer.EncodeSequencePair(question, context)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Modified tokenization: %v\n", encoding.Tokens)

// Save tokenizer with custom components
err = tokenizer.Save("./custom_pipeline_tokenizer.json", true)
if err != nil {
    log.Fatal(err)
}

// Load and verify component preservation
loadedTokenizer, err := LoadFromFile("./custom_pipeline_tokenizer.json")
if err != nil {
    log.Fatal(err)
}

loadedNormalizer, err := loadedTokenizer.GetNormalizer()
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Loaded normalizer type: %T\n", loadedNormalizer)

Related Issues

  • Enables advanced customization of tokenization pipelines
  • Supports domain-specific tokenization requirements
  • Foundation for tokenizer fine-tuning and adaptation
  • Completes the advanced features milestone providing full tokenizer control
