[ENH] Implement Pipeline Component Access (Get/Set Normalizer, Pre-tokenizer, Post-processor)

## Overview

Add getters and setters for the three tokenizer pipeline stages (normalization, pre-tokenization, and post-processing) so that callers can inspect and replace each stage on an existing tokenizer.

## Features to Implement

### Component Access Methods
Enable getting and setting individual pipeline components to customize tokenizer behavior without full retraining.

### Normalizer Access
- **Get Method**: `func (t *Tokenizer) GetNormalizer() (Normalizer, error)`
- **Set Method**: `func (t *Tokenizer) SetNormalizer(normalizer Normalizer) error`
- **Purpose**: Control text normalization (lowercasing, unicode normalization, accents, etc.)

### Pre-tokenizer Access
- **Get Method**: `func (t *Tokenizer) GetPreTokenizer() (PreTokenizer, error)`
- **Set Method**: `func (t *Tokenizer) SetPreTokenizer(preTokenizer PreTokenizer) error`
- **Purpose**: Control pre-tokenization splitting (whitespace, punctuation, custom patterns)

### Post-processor Access
- **Get Method**: `func (t *Tokenizer) GetPostProcessor() (PostProcessor, error)`
- **Set Method**: `func (t *Tokenizer) SetPostProcessor(postProcessor PostProcessor) error`
- **Purpose**: Control post-processing (special token addition, sequence formatting)

### Component Types and Configurations
- Support for the common component types from the Rust `tokenizers` crate
- Configuration interfaces for each component type
- Factory methods for creating standard component configurations

## Implementation Requirements

### Go Layer Interface Definitions

#### Normalizer Interface
- [ ] Define `Normalizer` interface with common methods
- [ ] Implement concrete normalizer types:
  - `BertNormalizer` - BERT-style normalization
  - `NFCNormalizer` - Unicode NFC normalization
  - `NFDNormalizer` - Unicode NFD normalization
  - `NFKCNormalizer` - Unicode NFKC normalization
  - `NFKDNormalizer` - Unicode NFKD normalization
  - `LowercaseNormalizer` - Simple lowercasing
  - `StripNormalizer` - Whitespace stripping
  - `SequenceNormalizer` - Chain multiple normalizers
- [ ] Configuration structs for each normalizer type

#### PreTokenizer Interface
- [ ] Define `PreTokenizer` interface with common methods
- [ ] Implement concrete pre-tokenizer types:
  - `WhitespacePreTokenizer` - Split on whitespace
  - `BertPreTokenizer` - BERT-style pre-tokenization
  - `ByteLevelPreTokenizer` - GPT-2-style byte-level splitting
  - `PunctuationPreTokenizer` - Split on punctuation
  - `DigitsPreTokenizer` - Handle digits specially
  - `SplitPreTokenizer` - Custom pattern splitting
  - `SequencePreTokenizer` - Chain multiple pre-tokenizers
- [ ] Configuration structs for each pre-tokenizer type

#### PostProcessor Interface
- [ ] Define `PostProcessor` interface with common methods
- [ ] Implement concrete post-processor types:
  - `BertProcessing` - BERT [CLS] and [SEP] handling
  - `RobertaProcessing` - RoBERTa special token handling
  - `TemplateProcessing` - Custom template-based processing
  - `ByteLevelProcessing` - GPT-style post-processing
- [ ] Configuration structs for each post-processor type
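
For `TemplateProcessing`, a useful check before crossing the FFI boundary is that every special token named in a template is also declared. The sketch below assumes the HuggingFace template syntax, in which `$A` and `$B` denote the input sequences and an optional `:N` suffix sets the type ID. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// specialTokenName returns the token name of a template piece, or false if
// the piece is a sequence placeholder such as $A or $B:1.
func specialTokenName(piece string) (string, bool) {
	if strings.HasPrefix(piece, "$") {
		return "", false
	}
	// Strip an optional numeric type-ID suffix, e.g. "[SEP]:1" -> "[SEP]".
	if i := strings.LastIndex(piece, ":"); i > 0 {
		if _, err := strconv.Atoi(piece[i+1:]); err == nil {
			return piece[:i], true
		}
	}
	return piece, true
}

// undeclaredSpecialTokens lists template tokens missing from declared.
func undeclaredSpecialTokens(template string, declared map[string]uint32) []string {
	var missing []string
	for _, piece := range strings.Fields(template) {
		if name, ok := specialTokenName(piece); ok {
			if _, found := declared[name]; !found {
				missing = append(missing, name)
			}
		}
	}
	return missing
}

func main() {
	declared := map[string]uint32{"[CLS]": 101, "[SEP]": 102}
	fmt.Println(undeclaredSpecialTokens("[CLS] $A [SEP] $B:1 [SEP]:1", declared))
	fmt.Println(undeclaredSpecialTokens("[CLS] $A [EOS]", declared))
}
```

Running the check in Go yields an error that names the offending token, which is usually easier to act on than a generic failure propagated from the Rust layer.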

### Go Layer (tokenizers.go)
- [ ] Add component access methods to `Tokenizer` struct
- [ ] Component validation and compatibility checking
- [ ] Error handling for invalid component configurations
- [ ] Integration with existing tokenizer functionality
- [ ] Component serialization support for Save/Load operations

### Rust Layer (src/lib.rs)
- [ ] FFI functions for getting component configurations:
  - `get_normalizer`
  - `get_pre_tokenizer`
  - `get_post_processor`
- [ ] FFI functions for setting component configurations:
  - `set_normalizer`
  - `set_pre_tokenizer`
  - `set_post_processor`
- [ ] Component serialization to/from configuration structs
- [ ] Integration with tokenizers crate component APIs
- [ ] Memory management for component data transfer

### FFI Bridge (library.go)
- [ ] Define component data structures for FFI transfer
- [ ] Component configuration marshaling/unmarshaling
- [ ] Memory management for component configurations
- [ ] Error propagation for component operations
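
The bridge items above could be met by passing component configurations as JSON and wrapping each response in a small envelope that carries either the configuration or an error message. The sketch below shows only the Go-side decoding and uses no cgo. The envelope fields `config` and `error` and the function name `decodeBridgeResult` are assumptions, not an existing API.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// bridgeResult is an assumed envelope returned by the Rust side as JSON.
type bridgeResult struct {
	Config json.RawMessage `json:"config"`
	Error  string          `json:"error"`
}

// decodeBridgeResult turns a raw response into a config payload or a Go error.
func decodeBridgeResult(raw []byte) (json.RawMessage, error) {
	var r bridgeResult
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, fmt.Errorf("malformed bridge response: %w", err)
	}
	if r.Error != "" {
		return nil, errors.New(r.Error)
	}
	return r.Config, nil
}

func main() {
	cfg, err := decodeBridgeResult([]byte(`{"config":{"type":"Lowercase"}}`))
	fmt.Println(string(cfg), err)

	_, err = decodeBridgeResult([]byte(`{"error":"unknown normalizer type"}`))
	fmt.Println(err)
}
```

With a single JSON buffer per call, memory management reduces to one allocation that the Rust side hands over and one free call from Go, regardless of how deeply components are nested.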

## Acceptance Criteria

### Functional Requirements
- [ ] Get methods return accurate current component configurations
- [ ] Set methods successfully update tokenizer pipeline components
- [ ] Modified tokenizers produce expected tokenization results
- [ ] Component changes affect encoding/decoding behavior correctly
- [ ] All standard component types are supported and functional
- [ ] Component chaining (Sequence types) works correctly

### Component Type Support
- [ ] All major normalizer types work correctly with various text inputs
- [ ] Pre-tokenizer types handle different text patterns appropriately
- [ ] Post-processor types format sequences correctly for model compatibility
- [ ] Custom component configurations are validated and applied properly

### Integration Requirements
- [ ] Component modifications work with existing encoding/decoding methods
- [ ] Modified tokenizers round-trip through Save/ToJSON and the corresponding load methods with their components intact
- [ ] Component changes integrate with batch processing methods
- [ ] Dynamic token additions work with modified pipeline components

### Compatibility Requirements
- [ ] Component configurations match HuggingFace tokenizers behavior
- [ ] Modified tokenizers maintain compatibility with model expectations
- [ ] Standard component presets work with popular model architectures (BERT, GPT, RoBERTa, etc.)

### Testing Requirements
- [ ] Unit tests for each component get/set method
- [ ] Integration tests verifying component behavior changes
- [ ] Compatibility tests with various model architectures
- [ ] Component serialization/deserialization tests
- [ ] Performance tests for component modification overhead
- [ ] Unicode and multilingual text tests across the different normalizers
- [ ] Edge case tests for component configuration validation
- [ ] Cross-component interaction tests (normalizer + pre-tokenizer combinations)

### Documentation Requirements
- [ ] Go doc comments for all component interfaces and methods
- [ ] Usage examples for common component modifications
- [ ] Guidelines for choosing appropriate components for different use cases
- [ ] Documentation of component behavior and configuration options
- [ ] Examples of creating custom tokenizer pipelines

## Technical Considerations

### Component Compatibility
- Validation of component combinations for model compatibility
- Warning systems for potentially incompatible component changes
- Testing component modifications against expected model inputs

### Performance Impact
- Minimize overhead for component access operations
- Efficient component configuration transfer across FFI
- Consider caching frequently accessed component configurations

### Memory Management
- Proper cleanup of component configuration data
- Efficient serialization of complex component hierarchies
- Memory usage optimization for component chains

### Error Handling
- Clear error messages for invalid component configurations
- Validation of component parameters before application
- Graceful handling of incompatible component combinations
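
Parameter validation before application might look like the following sketch for a split pre-tokenizer configuration. The `SplitConfig` type and its `Validate` method are hypothetical. The behavior names follow the lowercase strings accepted by the HuggingFace Python bindings, and the pattern is checked with Go's `regexp` package, which may accept a different syntax than the Rust-side engine.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// validBehaviors lists the split behaviors assumed to be supported.
var validBehaviors = map[string]bool{
	"removed":              true,
	"isolated":             true,
	"merged_with_previous": true,
	"merged_with_next":     true,
	"contiguous":           true,
}

// SplitConfig is a hypothetical configuration struct for SplitPreTokenizer.
type SplitConfig struct {
	Pattern  string
	Behavior string
}

// Validate reports the first problem found, naming the offending field.
func (c SplitConfig) Validate() error {
	if c.Pattern == "" {
		return errors.New("split: pattern must not be empty")
	}
	// Go-side syntax check only; the Rust engine may differ in edge cases.
	if _, err := regexp.Compile(c.Pattern); err != nil {
		return fmt.Errorf("split: invalid pattern: %w", err)
	}
	if !validBehaviors[c.Behavior] {
		return fmt.Errorf("split: unknown behavior %q", c.Behavior)
	}
	return nil
}

func main() {
	configs := []SplitConfig{
		{Pattern: `[0-9]+`, Behavior: "isolated"},
		{Pattern: `(`, Behavior: "isolated"},
		{Pattern: `[0-9]+`, Behavior: "bogus"},
	}
	for _, c := range configs {
		fmt.Printf("%+v -> %v\n", c, c.Validate())
	}
}
```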

### Serialization Integration
- Component configurations must be preserved in Save/ToJSON operations
- Proper loading of custom component configurations from saved tokenizers
- Version compatibility for component configuration formats

## Example Usage

```go
// Get current normalizer configuration
currentNormalizer, err := tokenizer.GetNormalizer()
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Current normalizer: %T\n", currentNormalizer)

// Set a custom normalizer chain
customNormalizer := &SequenceNormalizer{
    Normalizers: []Normalizer{
        &NFCNormalizer{},
        &LowercaseNormalizer{},
        &StripNormalizer{Left: true, Right: true},
    },
}
err = tokenizer.SetNormalizer(customNormalizer)
if err != nil {
    log.Fatal(err)
}

// Modify pre-tokenizer for domain-specific splitting
domainPreTokenizer := &SequencePreTokenizer{
    PreTokenizers: []PreTokenizer{
        &WhitespacePreTokenizer{},
        &SplitPreTokenizer{
            Pattern:  `[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+`, // Match dotted-quad IPv4 addresses
            Behavior: "isolated",                       // Emit each match as its own piece
        },
        &PunctuationPreTokenizer{},
    },
}
err = tokenizer.SetPreTokenizer(domainPreTokenizer)
if err != nil {
    log.Fatal(err)
}

// Custom post-processor for question-answering format
qaPostProcessor := &TemplateProcessing{
    // $A and $B are the first and second input sequence (HuggingFace template syntax).
    Single: "[CLS] $A [SEP]",
    Pair:   "[CLS] $A [SEP] $B [SEP]",
    SpecialTokens: []SpecialToken{
        {Token: "[CLS]", ID: 101},
        {Token: "[SEP]", ID: 102},
    },
}
err = tokenizer.SetPostProcessor(qaPostProcessor)
if err != nil {
    log.Fatal(err)
}

// Test the modified tokenizer
question := "What is the server IP?"
context := "The server is running on 192.168.1.100 port 8080."
encoding, err := tokenizer.EncodeSequencePair(question, context)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Modified tokenization: %v\n", encoding.Tokens)

// Save tokenizer with custom components
err = tokenizer.Save("./custom_pipeline_tokenizer.json", true)
if err != nil {
    log.Fatal(err)
}

// Load and verify component preservation
loadedTokenizer, err := LoadFromFile("./custom_pipeline_tokenizer.json")
if err != nil {
    log.Fatal(err)
}

loadedNormalizer, err := loadedTokenizer.GetNormalizer()
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Loaded normalizer type: %T\n", loadedNormalizer)
```

## Related Issues
- Enables advanced customization of tokenization pipelines
- Supports domain-specific tokenization requirements
- Foundation for tokenizer fine-tuning and adaptation
- Completes the advanced features milestone providing full tokenizer control
