Overview
Add serialization support so that tokenizers can be saved, loaded, and shared in several formats. This provides persistence, interoperability, and a way to export tokenizers for use in other environments and frameworks.
Features to Implement
Save Method
Signature: func (t *Tokenizer) Save(path string, prettyPrint bool) error
Purpose: Save the tokenizer to a file in HuggingFace tokenizer format
Parameters:
path: File path for saving the tokenizer
prettyPrint: Whether to format JSON output for readability
Format: Standard HuggingFace tokenizers JSON format for maximum compatibility
ToJSON Method
Signature: func (t *Tokenizer) ToJSON(prettyPrint bool) (string, error)
Purpose: Serialize the tokenizer to a JSON string representation
Returns: JSON string containing the complete tokenizer configuration
Use Case: In-memory serialization, network transfer, debugging
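One way the prettyPrint flag could be handled on the Go side is to re-format the JSON bytes returned by the native layer. The sketch below assumes a hypothetical helper named formatJSON (not part of the proposed API) and uses only json.Indent and json.Compact from the standard library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// formatJSON is a hypothetical helper for this sketch. It re-formats raw
// JSON either indented (prettyPrint) or with insignificant whitespace removed.
func formatJSON(raw []byte, prettyPrint bool) (string, error) {
	var buf bytes.Buffer
	var err error
	if prettyPrint {
		err = json.Indent(&buf, raw, "", "  ")
	} else {
		err = json.Compact(&buf, raw)
	}
	if err != nil {
		return "", fmt.Errorf("format tokenizer JSON: %w", err)
	}
	return buf.String(), nil
}

func main() {
	// Stand-in payload; a real tokenizer file carries many more fields.
	raw := []byte(`{"version": "1.0", "model": {"type": "BPE"}}`)

	compact, err := formatJSON(raw, false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(compact)

	pretty, err := formatJSON(raw, true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(pretty)
}
```

Both standard-library functions preserve key order, so the compact and pretty forms describe the same document.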
Load Method Enhancement
Signature: func LoadFromFile(path string) (*Tokenizer, error)
Purpose: Enhanced loading with better error handling and validation
Compatibility: Load tokenizers saved by the Save method or by HuggingFace tokenizers
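The validation step in LoadFromFile could start with a cheap structural check before the data is handed to the native layer. The sketch below is an assumption about what such a check might look like: validateTokenizerJSON and ErrInvalidTokenizer are hypothetical names, and the only schema fact relied on is that HuggingFace tokenizer files carry a top-level "model" object.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrInvalidTokenizer is a hypothetical sentinel error for this sketch.
var ErrInvalidTokenizer = errors.New("invalid tokenizer file")

// validateTokenizerJSON checks that data is a JSON object containing a
// "model" object. It does not validate the full schema.
func validateTokenizerJSON(data []byte) error {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(data, &top); err != nil {
		return fmt.Errorf("%w: not a JSON object: %v", ErrInvalidTokenizer, err)
	}
	rawModel, ok := top["model"]
	if !ok {
		return fmt.Errorf("%w: missing \"model\" section", ErrInvalidTokenizer)
	}
	var model map[string]json.RawMessage
	if err := json.Unmarshal(rawModel, &model); err != nil || model == nil {
		return fmt.Errorf("%w: \"model\" is not an object", ErrInvalidTokenizer)
	}
	return nil
}

func main() {
	inputs := []string{
		`{"version":"1.0","model":{"type":"BPE"}}`,
		`{"version":"1.0"}`,
		`not json`,
	}
	for _, in := range inputs {
		if err := validateTokenizerJSON([]byte(in)); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted")
	}
}
```

Callers can use errors.Is(err, ErrInvalidTokenizer) to tell malformed files apart from I/O failures.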
Export Formats
HuggingFace JSON: Primary format for interoperability
Portable JSON: Simplified format for specific use cases
Binary format: Optional high-performance serialization
Implementation Requirements
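Go Layer (tokenizers.go)
Save method on the Tokenizer struct
ToJSON method on the Tokenizer struct
LoadFromFile with better error handling
Serialization Components
Rust Layer (src/lib.rs)
save_tokenizer FFI function
tokenizer_to_json FFI function
FFI Bridge (library.go)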
Acceptance Criteria
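Functional Requirements
Save creates valid HuggingFace-compatible tokenizer files
ToJSON produces complete and accurate JSON representations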
State Preservation
Compatibility Requirements
Performance Requirements
Testing Requirements
Documentation Requirements
Technical Considerations
File Format Compatibility
Strict adherence to HuggingFace tokenizers JSON schema
Version metadata for future compatibility tracking
Backward/forward compatibility strategies
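Version tracking could be as simple as reading the top-level "version" field before loading and comparing it against an allow-list. The sketch below assumes a hypothetical tokenizerVersion helper and an allow-list containing only "1.0", the value HuggingFace tokenizer files commonly carry; the actual policy for unknown versions is left open here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// supportedVersions is an assumed allow-list for this sketch.
var supportedVersions = map[string]bool{"1.0": true}

// tokenizerVersion extracts the top-level "version" field and reports an
// error if it is missing or not in supportedVersions.
func tokenizerVersion(data []byte) (string, error) {
	var header struct {
		Version string `json:"version"`
	}
	if err := json.Unmarshal(data, &header); err != nil {
		return "", fmt.Errorf("read tokenizer version: %w", err)
	}
	if header.Version == "" {
		return "", errors.New("tokenizer JSON has no version field")
	}
	if !supportedVersions[header.Version] {
		return header.Version, fmt.Errorf("unsupported tokenizer format version %q", header.Version)
	}
	return header.Version, nil
}

func main() {
	for _, in := range []string{`{"version":"1.0"}`, `{"version":"9.9"}`, `{}`} {
		v, err := tokenizerVersion([]byte(in))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("version:", v)
	}
}
```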
Memory Management
Efficient handling of large JSON outputs
Streaming serialization for very large tokenizers
Proper cleanup of serialization buffers
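To illustrate streaming serialization, the sketch below writes a vocabulary one entry at a time through a buffered writer instead of marshalling the whole map at once. The function names and the flat token-to-ID layout are assumptions for illustration; the real tokenizer JSON is produced by the Rust layer and has a richer structure.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"strconv"
)

// writeVocabStream writes {"token":id,...} to w entry by entry, using each
// token's slice index as its ID. Only one key is marshalled at a time.
func writeVocabStream(w io.Writer, tokens []string) error {
	bw := bufio.NewWriter(w)
	bw.WriteByte('{')
	for id, tok := range tokens {
		if id > 0 {
			bw.WriteByte(',')
		}
		key, err := json.Marshal(tok) // handles quoting and escaping
		if err != nil {
			return fmt.Errorf("encode token %d: %w", id, err)
		}
		bw.Write(key)
		bw.WriteByte(':')
		bw.WriteString(strconv.Itoa(id))
	}
	bw.WriteByte('}')
	// bufio.Writer remembers the first write error; Flush reports it.
	return bw.Flush()
}

// vocabJSON is a convenience wrapper that collects the stream in memory.
func vocabJSON(tokens []string) (string, error) {
	var buf bytes.Buffer
	if err := writeVocabStream(&buf, tokens); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := vocabJSON([]string{"hello", "##ing", "world"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

In a Save implementation the io.Writer would be the destination file, so peak memory stays close to the buffer size rather than the size of the full document.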
Atomic Operations
Atomic file writes to prevent corruption
Temporary file usage during save operations
Rollback capabilities for failed saves
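The three points above are commonly combined into a write-to-temp-then-rename pattern. The following is a minimal sketch under that assumption; writeFileAtomic is a hypothetical helper, and the rename is atomic on POSIX filesystems when the temporary file lives in the same directory as the target (behavior on other platforms may differ).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file next to path and renames
// it into place, so readers never observe a partially written file.
func writeFileAtomic(path string, data []byte, perm os.FileMode) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	tmpName := tmp.Name()
	// Roll back on any failure: close and remove the temporary file.
	defer func() {
		if err != nil {
			tmp.Close()
			os.Remove(tmpName)
		}
	}()

	if _, err = tmp.Write(data); err != nil {
		return fmt.Errorf("write temp file: %w", err)
	}
	if err = tmp.Chmod(perm); err != nil {
		return fmt.Errorf("set permissions: %w", err)
	}
	if err = tmp.Sync(); err != nil {
		return fmt.Errorf("sync temp file: %w", err)
	}
	if err = tmp.Close(); err != nil {
		return fmt.Errorf("close temp file: %w", err)
	}
	if err = os.Rename(tmpName, path); err != nil {
		return fmt.Errorf("rename into place: %w", err)
	}
	return nil
}

func main() {
	path := filepath.Join(os.TempDir(), "my_tokenizer.json")
	if err := writeFileAtomic(path, []byte(`{"version":"1.0"}`), 0o600); err != nil {
		fmt.Println("save failed:", err)
		return
	}
	fmt.Println("saved", path)
}
```

If any step fails, the previous file at path is left untouched, which gives the rollback behavior without extra bookkeeping.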
Error Handling
Clear error messages for I/O failures
Validation of tokenizer state before serialization
Recovery from partial or corrupted save files
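Clear I/O errors and detection of corrupted files could be built on wrapped errors, so callers get both a readable message and a way to branch on the cause. The sketch below is illustrative only: readTokenizerFile, isNotFound, and ErrCorruptTokenizer are hypothetical names, and a syntax check with json.Valid stands in for fuller validation.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// ErrCorruptTokenizer is a hypothetical sentinel for files that exist but
// do not contain valid JSON (for example after an interrupted save).
var ErrCorruptTokenizer = errors.New("tokenizer file is empty or corrupted")

// readTokenizerFile reads path and wraps every failure with the path so
// the message is actionable, while keeping the cause inspectable.
func readTokenizerFile(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("load tokenizer %q: %w", path, err)
	}
	if len(data) == 0 || !json.Valid(data) {
		return nil, fmt.Errorf("load tokenizer %q: %w", path, ErrCorruptTokenizer)
	}
	return data, nil
}

// isNotFound reports whether err was caused by a missing file.
func isNotFound(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}

func main() {
	_, err := readTokenizerFile("missing_tokenizer.json")
	switch {
	case isNotFound(err):
		fmt.Println("tokenizer file does not exist:", err)
	case errors.Is(err, ErrCorruptTokenizer):
		fmt.Println("tokenizer file needs to be regenerated:", err)
	case err != nil:
		fmt.Println("unexpected I/O failure:", err)
	}
}
```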
Security Considerations
File permission handling
Path traversal protection
Validation of loaded tokenizer data
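Path traversal protection might look like the sketch below when save paths come from untrusted input. It assumes a hypothetical resolveUnder helper that confines paths to a base directory; it does not resolve symlinks, so it is a starting point rather than a complete defense. Restrictive file permissions (for example 0o600) would be applied separately when the file is written.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// ErrPathEscapes is a hypothetical sentinel error for this sketch.
var ErrPathEscapes = errors.New("path escapes the tokenizer directory")

// resolveUnder joins userPath onto baseDir and rejects absolute paths and
// any result that would land outside baseDir after cleaning.
func resolveUnder(baseDir, userPath string) (string, error) {
	if filepath.IsAbs(userPath) {
		return "", fmt.Errorf("%w: %q is absolute", ErrPathEscapes, userPath)
	}
	full := filepath.Join(baseDir, userPath) // Join also cleans ".." segments
	rel, err := filepath.Rel(baseDir, full)
	if err != nil {
		return "", fmt.Errorf("%w: %v", ErrPathEscapes, err)
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("%w: %q", ErrPathEscapes, userPath)
	}
	return full, nil
}

func main() {
	for _, p := range []string{"custom/tokenizer.json", "../secrets.json", "a/../../b.json"} {
		full, err := resolveUnder("models", p)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("ok:", full)
	}
}
```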
Example Usage
// Save tokenizer to file
err := tokenizer.Save("/path/to/my_tokenizer.json", true)
if err != nil {
	log.Fatalf("Failed to save tokenizer: %v", err)
}

// Serialize to JSON string
jsonStr, err := tokenizer.ToJSON(true) // pretty-printed
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Tokenizer JSON: %s", jsonStr)

// Load saved tokenizer
loadedTokenizer, err := LoadFromFile("/path/to/my_tokenizer.json")
if err != nil {
	log.Fatal(err)
}

// Verify functionality is preserved
originalEncoding, _ := tokenizer.Encode("Test text")
loadedEncoding, _ := loadedTokenizer.Encode("Test text")
if !reflect.DeepEqual(originalEncoding.TokenIDs, loadedEncoding.TokenIDs) {
	log.Fatal("Tokenizer state not preserved after save/load")
}

// Save trained tokenizer for sharing
bpeTrainer := &BPETrainer{VocabSize: 30000, MinFrequency: 2}
err = customTokenizer.Train([]string{"corpus.txt"}, bpeTrainer)
if err != nil {
	log.Fatal(err)
}

// Save the trained tokenizer
err = customTokenizer.Save("./custom_domain_tokenizer.json", true)
if err != nil {
	log.Fatal(err)
}

// Export compact JSON for API transfer
compactJSON, err := customTokenizer.ToJSON(false) // no pretty printing
if err != nil {
	log.Fatal(err)
}

// Use in HTTP API
http.HandleFunc("/tokenizer", func(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(compactJSON))
})
Related Issues
Enables persistence of trained tokenizers (issue #68: [ENH] Implement Training Support (Train, TrainFromIterator))
Supports sharing and distribution of custom tokenizers
Foundation for tokenizer versioning and compatibility management
Part of advanced features milestone providing complete tokenizer lifecycle management