[ENH] Implement Serialization Support (Save, ToJSON) #69

@tazarov

Overview

Add comprehensive serialization capabilities to enable saving, loading, and sharing tokenizers in various formats. This feature provides persistence, interoperability, and the ability to export tokenizers for use in different environments and frameworks.

Features to Implement

Save Method

  • Signature: func (t *Tokenizer) Save(path string, prettyPrint bool) error
  • Purpose: Save tokenizer to file in HuggingFace tokenizer format
  • Parameters:
    • path: File path for saving the tokenizer
    • prettyPrint: Whether to format JSON output for readability
  • Format: Standard HuggingFace tokenizers JSON format for maximum compatibility

ToJSON Method

  • Signature: func (t *Tokenizer) ToJSON(prettyPrint bool) (string, error)
  • Purpose: Serialize tokenizer to JSON string representation
  • Returns: JSON string containing complete tokenizer configuration
  • Use Case: In-memory serialization, network transfer, debugging
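The prettyPrint flag could be handled on the Go side with the standard library, assuming the native layer hands back raw JSON bytes. The sketch below shows one way to do that; formatJSON is a hypothetical helper name, not part of the proposed API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// formatJSON re-emits raw JSON either indented (pretty) or compact.
// Hypothetical helper for illustration only.
func formatJSON(raw []byte, pretty bool) (string, error) {
	var buf bytes.Buffer
	var err error
	if pretty {
		// Two-space indentation, no line prefix.
		err = json.Indent(&buf, raw, "", "  ")
	} else {
		// Strip insignificant whitespace.
		err = json.Compact(&buf, raw)
	}
	if err != nil {
		return "", fmt.Errorf("format tokenizer JSON: %w", err)
	}
	return buf.String(), nil
}

func main() {
	raw := []byte(`{"version": "1.0", "model": {"type": "BPE"}}`)
	for _, pretty := range []bool{false, true} {
		out, err := formatJSON(raw, pretty)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(out)
	}
}
```

Both json.Indent and json.Compact reject malformed input, so this also acts as a cheap syntax check on whatever the native layer returns.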

Load Method Enhancement

  • Signature: func LoadFromFile(path string) (*Tokenizer, error)
  • Purpose: Enhanced loading with better error handling and validation
  • Compatibility: Load tokenizers saved by Save method or HuggingFace tokenizers
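As a rough illustration of the validation step, a loader might check that the file is a JSON object with a "model" section before handing it to the native layer. This is a minimal sketch: validateTokenizerJSON and ErrInvalidTokenizer are hypothetical names, and the single-key check is an assumption about what a useful minimum looks like, not a full schema validation.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrInvalidTokenizer is a hypothetical sentinel so callers can use errors.Is.
var ErrInvalidTokenizer = errors.New("invalid tokenizer data")

// validateTokenizerJSON performs a shallow structural check only.
func validateTokenizerJSON(data []byte) error {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(data, &top); err != nil {
		return fmt.Errorf("%w: not a JSON object: %v", ErrInvalidTokenizer, err)
	}
	model, ok := top["model"]
	if !ok || len(model) == 0 || string(model) == "null" {
		return fmt.Errorf("%w: missing \"model\" section", ErrInvalidTokenizer)
	}
	return nil
}

func main() {
	good := []byte(`{"version":"1.0","model":{"type":"BPE"}}`)
	bad := []byte(`{"version":"1.0"}`)
	fmt.Println(validateTokenizerJSON(good))
	fmt.Println(validateTokenizerJSON(bad))
}
```

Wrapping a sentinel error keeps the message readable while still letting callers distinguish corrupt files from I/O failures.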

Export Formats

  • HuggingFace JSON: Primary format for interoperability
  • Portable JSON: Simplified format for specific use cases
  • Binary format: Optional high-performance serialization

Implementation Requirements

Go Layer (tokenizers.go)

  • Add Save method to Tokenizer struct
  • Add ToJSON method to Tokenizer struct
  • Enhance LoadFromFile with better error handling
  • Add validation for saved/loaded tokenizers
  • Support for different serialization options
  • File I/O error handling and atomic writes

Serialization Components

  • Complete tokenizer state serialization:
    • Vocabulary and token mappings
    • Model configuration (BPE, WordPiece, etc.)
    • Normalizer settings
    • Pre-tokenizer configuration
    • Post-processor settings
    • Special tokens configuration
    • Added tokens (from AddTokens/AddSpecialTokens)
    • Training parameters and metadata
  • Version compatibility tracking
  • Checksum/integrity verification
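For the checksum item, one possible approach is a SHA-256 digest over the serialized bytes. The sketch below assumes the digest is stored outside the tokenizer JSON (for example in a sidecar file), since adding a field to the JSON itself could conflict with the HuggingFace schema. The helper names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// checksumHex returns the hex-encoded SHA-256 digest of the serialized tokenizer.
func checksumHex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// verifyChecksum reports whether data matches a previously recorded digest.
func verifyChecksum(data []byte, wantHex string) bool {
	return checksumHex(data) == wantHex
}

func main() {
	payload := []byte(`{"version":"1.0","model":{"type":"BPE"}}`)
	digest := checksumHex(payload)
	fmt.Println("sha256:", digest)
	fmt.Println("intact:", verifyChecksum(payload, digest))
	fmt.Println("tampered:", verifyChecksum(append(payload, ' '), digest))
}
```

A digest like this detects accidental corruption; it is not a signature and does not protect against deliberate tampering by someone who can also rewrite the digest.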

Rust Layer (src/lib.rs)

  • Add save_tokenizer FFI function
  • Add tokenizer_to_json FFI function
  • Integration with tokenizers crate serialization APIs
  • Complete state extraction from Rust tokenizer objects
  • JSON formatting and pretty-printing support
  • File writing with proper error handling
  • Memory management for serialization buffers

FFI Bridge (library.go)

  • Define serialization function signatures
  • Handle file path and JSON string transfers
  • Memory management for large JSON outputs
  • Error propagation for I/O operations
  • Support for serialization options/flags
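Error propagation across the bridge could follow a status-code convention. The sketch below is purely illustrative: the status codes, sentinel errors, and errorFromStatus are all hypothetical and do not describe the existing FFI contract.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical status codes a native serialization call might return.
const (
	statusOK          = 0
	statusInvalidPath = 1
	statusIO          = 2
	statusSerialize   = 3
)

// Hypothetical sentinel errors exposed to Go callers.
var (
	ErrInvalidPath = errors.New("invalid path")
	ErrIO          = errors.New("I/O failure")
	ErrSerialize   = errors.New("serialization failure")
)

// errorFromStatus converts a native status code plus an optional detail
// message into a Go error that works with errors.Is.
func errorFromStatus(op string, code int, detail string) error {
	var base error
	switch code {
	case statusOK:
		return nil
	case statusInvalidPath:
		base = ErrInvalidPath
	case statusIO:
		base = ErrIO
	case statusSerialize:
		base = ErrSerialize
	default:
		base = fmt.Errorf("unknown status %d", code)
	}
	if detail != "" {
		return fmt.Errorf("%s: %w: %s", op, base, detail)
	}
	return fmt.Errorf("%s: %w", op, base)
}

func main() {
	err := errorFromStatus("save", statusIO, "disk full")
	fmt.Println(err)
	fmt.Println(errors.Is(err, ErrIO))
}
```

Keeping the mapping in one function means new native error codes only need to be added in a single place on the Go side.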

Acceptance Criteria

Functional Requirements

  • Save creates valid HuggingFace-compatible tokenizer files
  • ToJSON produces complete and accurate JSON representations
  • Saved tokenizers can be loaded and function identically to original
  • JSON output includes all tokenizer state and configuration
  • Pretty-printing produces readable, formatted JSON
  • Saved files are compatible with HuggingFace tokenizers library
  • Round-trip save/load preserves all tokenizer functionality

State Preservation

  • All vocabulary tokens and IDs are preserved
  • Model configuration (BPE merges, WordPiece settings, etc.) is saved
  • Special tokens and their properties are maintained
  • Added tokens from dynamic token management are included
  • Normalizer, pre-tokenizer, and post-processor settings are preserved
  • Training metadata and parameters are stored

Compatibility Requirements

  • Saved tokenizers work with HuggingFace transformers library
  • JSON format follows HuggingFace tokenizers schema
  • Compatibility across different versions of the tokenizers library
  • Cross-platform file format compatibility
  • Unicode text handling in serialized data

Performance Requirements

  • Save operation completes efficiently for large vocabularies
  • JSON serialization memory usage is reasonable
  • File I/O operations are atomic (no partial writes on failure)
  • Large tokenizer serialization doesn't cause memory issues

Testing Requirements

  • Unit tests for Save method with various tokenizer types
  • Unit tests for ToJSON with different formatting options
  • Round-trip testing (save → load → verify functionality)
  • Compatibility tests with HuggingFace tokenizers library
  • Large vocabulary serialization tests
  • Unicode and special character handling in serialization
  • Error handling tests for I/O failures and permission issues
  • Performance benchmarks for serialization operations
  • Cross-platform serialization compatibility tests

Documentation Requirements

  • Go doc comments for all serialization methods
  • Examples of saving and loading tokenizers
  • Documentation of JSON format structure
  • Guidelines for tokenizer portability and sharing
  • Troubleshooting guide for serialization issues

Technical Considerations

File Format Compatibility

  • Strict adherence to HuggingFace tokenizers JSON schema
  • Version metadata for future compatibility tracking
  • Backward/forward compatibility strategies
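Version tracking could start from the top-level "version" field that HuggingFace tokenizer files carry. The sketch below reads that field and checks it against an allow-list; the allow-list contents and the name checkFormatVersion are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// supportedVersions is an assumed allow-list for this sketch.
var supportedVersions = map[string]bool{"1.0": true}

// checkFormatVersion extracts the "version" field and reports whether
// this (hypothetical) loader knows how to handle it.
func checkFormatVersion(data []byte) (string, error) {
	var header struct {
		Version string `json:"version"`
	}
	if err := json.Unmarshal(data, &header); err != nil {
		return "", fmt.Errorf("read format version: %w", err)
	}
	if header.Version == "" {
		return "", fmt.Errorf("read format version: no \"version\" field")
	}
	if !supportedVersions[header.Version] {
		return header.Version, fmt.Errorf("unsupported tokenizer format version %q", header.Version)
	}
	return header.Version, nil
}

func main() {
	v, err := checkFormatVersion([]byte(`{"version":"1.0","model":{}}`))
	fmt.Println(v, err)
	v, err = checkFormatVersion([]byte(`{"version":"9.9","model":{}}`))
	fmt.Println(v, err)
}
```

Returning the parsed version even on failure lets the caller include it in a diagnostic or decide to attempt a best-effort load.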

Memory Management

  • Efficient handling of large JSON outputs
  • Streaming serialization for very large tokenizers
  • Proper cleanup of serialization buffers
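If part of the output is ever assembled on the Go side, a large vocabulary can be written entry by entry instead of being marshaled into one big string first. The following is a minimal sketch under that assumption; streamVocab and the map[string]int vocabulary type are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"sort"
	"strconv"
	"strings"
)

// streamVocab writes vocab as a JSON object one entry at a time, so only
// the key list and a small write buffer are held in addition to the map.
func streamVocab(w io.Writer, vocab map[string]int) error {
	keys := make([]string, 0, len(vocab))
	for k := range vocab {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output

	bw := bufio.NewWriter(w) // write errors are sticky and surface at Flush
	bw.WriteByte('{')
	for i, k := range keys {
		if i > 0 {
			bw.WriteByte(',')
		}
		quoted, err := json.Marshal(k) // handles escaping of the token text
		if err != nil {
			return fmt.Errorf("encode token %q: %w", k, err)
		}
		bw.Write(quoted)
		bw.WriteByte(':')
		bw.WriteString(strconv.Itoa(vocab[k]))
	}
	bw.WriteByte('}')
	if err := bw.Flush(); err != nil {
		return fmt.Errorf("flush vocabulary: %w", err)
	}
	return nil
}

// vocabToString is a small convenience wrapper used for demonstration.
func vocabToString(vocab map[string]int) (string, error) {
	var sb strings.Builder
	err := streamVocab(&sb, vocab)
	return sb.String(), err
}

func main() {
	vocab := map[string]int{"hello": 0, "world": 1, "[UNK]": 2}
	if err := streamVocab(os.Stdout, vocab); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	fmt.Println()
}
```

Note that json.Encoder.Encode buffers each value fully before writing, so passing a whole tokenizer to it would not by itself bound memory use; writing entries individually does.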

Atomic Operations

  • Atomic file writes to prevent corruption
  • Temporary file usage during save operations
  • Rollback capabilities for failed saves
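A common way to get these properties is to write to a temporary file in the destination directory and rename it into place. The sketch below shows that pattern using only the standard library; atomicWriteFile is a hypothetical helper, and the rename is atomic only on platforms and filesystems that guarantee it (for example POSIX systems, when both paths are on the same filesystem).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// atomicWriteFile writes data to a temp file next to path, then renames it
// over path. On failure the temp file is removed and path is left untouched.
func atomicWriteFile(path string, data []byte, perm os.FileMode) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	tmpName := tmp.Name()
	defer func() {
		if err != nil { // rollback: discard the partial temp file
			tmp.Close()
			os.Remove(tmpName)
		}
	}()
	if _, err = tmp.Write(data); err != nil {
		return fmt.Errorf("write temp file: %w", err)
	}
	if err = tmp.Sync(); err != nil { // flush to stable storage before rename
		return fmt.Errorf("sync temp file: %w", err)
	}
	if err = tmp.Close(); err != nil {
		return fmt.Errorf("close temp file: %w", err)
	}
	if err = os.Chmod(tmpName, perm); err != nil {
		return fmt.Errorf("set permissions: %w", err)
	}
	if err = os.Rename(tmpName, path); err != nil {
		return fmt.Errorf("rename into place: %w", err)
	}
	return nil
}

// demoAtomicWrite saves twice to the same path and returns the final content.
func demoAtomicWrite() (string, error) {
	dir, err := os.MkdirTemp("", "tokenizer-save-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "tokenizer.json")
	for _, content := range []string{`{"version":"1.0"}`, `{"version":"1.0","model":{}}`} {
		if err := atomicWriteFile(path, []byte(content), 0o644); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	fmt.Println(demoAtomicWrite())
}
```

Creating the temp file in the same directory as the target keeps both on one filesystem, which is what makes the final rename a single step rather than a copy.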

Error Handling

  • Clear error messages for I/O failures
  • Validation of tokenizer state before serialization
  • Recovery from partial or corrupted save files

Security Considerations

  • File permission handling
  • Path traversal protection
  • Validation of loaded tokenizer data
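Path traversal protection matters mainly when the file name comes from untrusted input and must stay under a fixed directory. The sketch below shows one way to enforce that; safeJoin and the base-directory convention are hypothetical, and the check is lexical only, so it does not resolve symlinks.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin joins name onto baseDir and rejects results that escape baseDir.
func safeJoin(baseDir, name string) (string, error) {
	if filepath.IsAbs(name) {
		return "", fmt.Errorf("absolute path not allowed: %q", name)
	}
	full := filepath.Join(baseDir, name) // Join also cleans ".." segments
	rel, err := filepath.Rel(baseDir, full)
	if err != nil {
		return "", fmt.Errorf("resolve %q: %w", name, err)
	}
	if rel == "." || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes or equals base directory", name)
	}
	return full, nil
}

func main() {
	for _, name := range []string{"bert/tokenizer.json", "../etc/passwd", "a/../../x"} {
		p, err := safeJoin("/srv/tokenizers", name)
		fmt.Printf("%q -> %q, err=%v\n", name, p, err)
	}
}
```

Callers that accept arbitrary user-chosen paths (for example a CLI saving to the current directory) may not need this check at all; it applies when a service maps request input to files.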

Example Usage

// Save tokenizer to file
err := tokenizer.Save("/path/to/my_tokenizer.json", true)
if err != nil {
    log.Fatalf("Failed to save tokenizer: %v", err)
}

// Serialize to JSON string
jsonStr, err := tokenizer.ToJSON(true) // pretty-printed
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Tokenizer JSON: %s\n", jsonStr)

// Load saved tokenizer
loadedTokenizer, err := LoadFromFile("/path/to/my_tokenizer.json")
if err != nil {
    log.Fatal(err)
}

// Verify functionality is preserved
originalEncoding, err := tokenizer.Encode("Test text")
if err != nil {
    log.Fatal(err)
}
loadedEncoding, err := loadedTokenizer.Encode("Test text")
if err != nil {
    log.Fatal(err)
}

if !reflect.DeepEqual(originalEncoding.TokenIDs, loadedEncoding.TokenIDs) {
    log.Fatal("Tokenizer state not preserved after save/load")
}

// Save trained tokenizer for sharing
bpeTrainer := &BPETrainer{VocabSize: 30000, MinFrequency: 2}
err = customTokenizer.Train([]string{"corpus.txt"}, bpeTrainer)
if err != nil {
    log.Fatal(err)
}

// Save the trained tokenizer
err = customTokenizer.Save("./custom_domain_tokenizer.json", true)
if err != nil {
    log.Fatal(err)
}

// Export compact JSON for API transfer
compactJSON, err := customTokenizer.ToJSON(false) // no pretty printing
if err != nil {
    log.Fatal(err)
}

// Use in HTTP API
http.HandleFunc("/tokenizer", func(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    w.Write([]byte(compactJSON))
})

Related Issues

  • Enables persistence of trained tokenizers (issue #68, "[ENH] Implement Training Support (Train, TrainFromIterator)")
  • Supports sharing and distribution of custom tokenizers
  • Foundation for tokenizer versioning and compatibility management
  • Part of advanced features milestone providing complete tokenizer lifecycle management
