Overview
Add tokenizer training so that custom tokenizers can be created from scratch and existing tokenizers can be adapted to new domains. This feature provides the foundation for building domain-specific tokenizers and for fine-tuning existing models.
Features to Implement
Train Method
- Signature:
func (t *Tokenizer) Train(files []string, trainer TokenizerTrainer) error
- Purpose: Train tokenizer on text files using specified training algorithm
- Parameters:
- files: List of training text file paths
- trainer: Configuration for training algorithm (BPE, WordPiece, Unigram, etc.)
TrainFromIterator Method
- Signature:
func (t *Tokenizer) TrainFromIterator(iterator TextIterator, trainer TokenizerTrainer) error
- Purpose: Train tokenizer from an iterator/stream of text data
- Parameters:
- iterator: Interface for streaming text data
- trainer: Training configuration
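The issue does not pin down what TextIterator looks like, so the following is only a sketch of one possible shape. The method set (Next, Err) and the helper writeTempCorpus are assumptions made for illustration. FileLineIterator and its Files field follow the names used in the Example Usage section. The iterator reads one line at a time, so a corpus never has to be loaded into memory in full.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// TextIterator is a hypothetical shape for the streaming interface named in
// TrainFromIterator. Next returns false once the stream is exhausted or an
// error occurred; Err reports that error, if any.
type TextIterator interface {
	Next() (string, bool)
	Err() error
}

// FileLineIterator yields the lines of each file in Files, in order.
// Files are opened lazily, so at most one file is open at a time.
type FileLineIterator struct {
	Files []string

	idx     int // index of the next file to open
	file    *os.File
	scanner *bufio.Scanner
	err     error
}

func (it *FileLineIterator) Next() (string, bool) {
	for it.err == nil {
		if it.scanner == nil {
			if it.idx >= len(it.Files) {
				return "", false // all files consumed
			}
			f, err := os.Open(it.Files[it.idx])
			if err != nil {
				it.err = err
				return "", false
			}
			it.idx++
			it.file = f
			it.scanner = bufio.NewScanner(f)
			// Allow lines up to 1 MiB; the Scanner default limit is 64 KiB.
			it.scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
		}
		if it.scanner.Scan() {
			return it.scanner.Text(), true
		}
		// Current file is finished (or failed): close it and move on.
		it.err = it.scanner.Err()
		it.file.Close()
		it.file, it.scanner = nil, nil
	}
	return "", false
}

func (it *FileLineIterator) Err() error { return it.err }

// writeTempCorpus writes lines to a temporary file and returns its path.
// It exists only to make this sketch runnable.
func writeTempCorpus(lines ...string) (string, error) {
	f, err := os.CreateTemp("", "corpus-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	for _, l := range lines {
		if _, err := fmt.Fprintln(f, l); err != nil {
			return "", err
		}
	}
	return f.Name(), nil
}

func main() {
	path, err := writeTempCorpus("first line", "second line")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(path)

	var it TextIterator = &FileLineIterator{Files: []string{path}}
	for line, ok := it.Next(); ok; line, ok = it.Next() {
		fmt.Println(line)
	}
	if err := it.Err(); err != nil {
		fmt.Println("iteration failed:", err)
	}
}
```

Separating Err from Next mirrors bufio.Scanner and keeps the read loop short. A channel-based or callback-based design would also work and may suit the FFI boundary better.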
Training Configuration Types
- Support for different training algorithms:
- BPE (Byte-Pair Encoding): Most common subword tokenization
- WordPiece: Used by BERT and similar models
- Unigram: Used by SentencePiece and T5
- Word-level: Simple whitespace/punctuation splitting
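To make the BPE entry concrete, here is a toy sketch of the merge loop at the core of BPE training: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a simplified illustration, not the implementation in the tokenizers crate. All names (pair, learnBPE, and so on) are hypothetical, and ties are broken lexicographically only to make the output deterministic.

```go
package main

import (
	"fmt"
	"strings"
)

// pair is an adjacent pair of symbols inside a word.
type pair struct{ left, right string }

// less orders pairs lexicographically; it is used only to break ties.
func less(a, b pair) bool {
	if a.left != b.left {
		return a.left < b.left
	}
	return a.right < b.right
}

// countPairs tallies adjacent symbol pairs, weighted by word frequency.
// Each key of words is a word written as space-separated symbols.
func countPairs(words map[string]int) map[pair]int {
	counts := make(map[pair]int)
	for w, freq := range words {
		syms := strings.Split(w, " ")
		for i := 0; i+1 < len(syms); i++ {
			counts[pair{syms[i], syms[i+1]}] += freq
		}
	}
	return counts
}

// bestPair returns the most frequent pair and its count (0 if none exist).
func bestPair(counts map[pair]int) (pair, int) {
	var best pair
	bestCount := 0
	for p, c := range counts {
		if c > bestCount || (c == bestCount && less(p, best)) {
			best, bestCount = p, c
		}
	}
	return best, bestCount
}

// mergePair returns a new word table in which every occurrence of p
// has been replaced by the single merged symbol.
func mergePair(words map[string]int, p pair) map[string]int {
	out := make(map[string]int, len(words))
	for w, freq := range words {
		syms := strings.Split(w, " ")
		merged := make([]string, 0, len(syms))
		for i := 0; i < len(syms); i++ {
			if i+1 < len(syms) && syms[i] == p.left && syms[i+1] == p.right {
				merged = append(merged, p.left+p.right)
				i++ // skip the right half of the merged pair
			} else {
				merged = append(merged, syms[i])
			}
		}
		out[strings.Join(merged, " ")] += freq
	}
	return out
}

// learnBPE performs up to numMerges merges and stops early once the best
// pair occurs fewer than minFrequency times.
func learnBPE(words map[string]int, numMerges, minFrequency int) []pair {
	var merges []pair
	for len(merges) < numMerges {
		p, c := bestPair(countPairs(words))
		if c == 0 || c < minFrequency {
			break
		}
		words = mergePair(words, p)
		merges = append(merges, p)
	}
	return merges
}

func main() {
	words := map[string]int{
		"l o w":       5,
		"l o w e r":   2,
		"n e w e s t": 6,
		"w i d e s t": 3,
	}
	for _, m := range learnBPE(words, 3, 2) {
		fmt.Printf("%s + %s\n", m.left, m.right)
	}
	// Prints:
	// e + s
	// es + t
	// l + o
}
```

In this sketch the vocabulary size corresponds to the number of merges plus the initial symbols, and the frequency threshold plays the role of MinFrequency.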
Training Parameters
- Vocabulary size configuration
- Special tokens specification
- Training algorithm hyperparameters
- Merge operations and frequency thresholds
Technical Considerations
Memory Management
- Efficient processing of large training corpora
- Streaming support to avoid loading entire datasets in memory
- Proper cleanup of training artifacts and intermediate data
Training Algorithm Integration
- Leverage the training implementations in the Rust tokenizers crate
- Ensure compatibility with HuggingFace tokenizer formats
- Support for advanced training features (merges, special token handling)
Progress Monitoring
- Real-time progress reporting during training
- Ability to cancel long-running training operations
- Training metrics and statistics collection
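One way the Go side could expose progress reporting and cancellation is a context.Context plus a callback. The sketch below is a minimal illustration under that assumption; ProgressFunc, processBatches, and demoRun are hypothetical names. It also assumes training can be driven in batches from Go. Cancelling inside a single blocking FFI call would need cooperation from the Rust layer, which this sketch does not cover.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ProgressFunc is a hypothetical callback invoked after each finished batch.
type ProgressFunc func(done, total int)

// processBatches stands in for a long-running training loop. It checks for
// cancellation before every batch and reports progress after every batch.
func processBatches(ctx context.Context, total int, step func(i int), report ProgressFunc) error {
	for i := 0; i < total; i++ {
		select {
		case <-ctx.Done():
			return ctx.Err() // context.Canceled or context.DeadlineExceeded
		default:
		}
		step(i)
		if report != nil {
			report(i+1, total)
		}
	}
	return nil
}

// demoRun runs total no-op batches and cancels once cancelAt batches have
// finished (cancelAt <= 0 disables cancellation). It returns the number of
// finished batches and the error from processBatches.
func demoRun(total, cancelAt int) (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	finished := 0
	err := processBatches(ctx, total, func(int) {}, func(done, _ int) {
		finished = done
		if done == cancelAt {
			cancel() // simulate the caller cancelling mid-run
		}
	})
	return finished, err
}

func main() {
	n, err := demoRun(10, 3)
	fmt.Printf("finished %d batches\n", n)
	if errors.Is(err, context.Canceled) {
		fmt.Println("training cancelled")
	}
}
```

Using context.Context also gives callers deadlines (context.WithTimeout) without any extra API surface.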
Error Handling
- Clear error messages for training failures
- Validation of training parameters before starting
- Graceful handling of corrupted or malformed training data
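Parameter validation before training could look roughly like the sketch below. The BPETrainer fields follow the Example Usage section. The Validate method and the specific rules it enforces are assumptions chosen for illustration, not requirements taken from the tokenizers crate. All problems are collected and returned together so the caller sees every issue in one error.

```go
package main

import (
	"errors"
	"fmt"
)

// BPETrainer mirrors the fields used in the Example Usage section.
type BPETrainer struct {
	VocabSize       int
	MinFrequency    int
	SpecialTokens   []string
	ShowProgress    bool
	EndOfWordSuffix string
}

// Validate is a hypothetical pre-flight check. The rules are illustrative.
// It returns nil when no problem is found.
func (b *BPETrainer) Validate() error {
	var errs []error
	if b.VocabSize <= 0 {
		errs = append(errs, fmt.Errorf("VocabSize must be positive, got %d", b.VocabSize))
	} else if b.VocabSize < len(b.SpecialTokens) {
		errs = append(errs, fmt.Errorf("VocabSize %d is smaller than the number of special tokens (%d)",
			b.VocabSize, len(b.SpecialTokens)))
	}
	if b.MinFrequency < 0 {
		errs = append(errs, fmt.Errorf("MinFrequency must not be negative, got %d", b.MinFrequency))
	}
	seen := make(map[string]bool, len(b.SpecialTokens))
	for i, tok := range b.SpecialTokens {
		switch {
		case tok == "":
			errs = append(errs, fmt.Errorf("SpecialTokens[%d] is empty", i))
		case seen[tok]:
			errs = append(errs, fmt.Errorf("SpecialTokens[%d] duplicates %q", i, tok))
		}
		seen[tok] = true
	}
	// errors.Join (Go 1.20+) returns nil when errs is empty.
	return errors.Join(errs...)
}

func main() {
	bad := &BPETrainer{VocabSize: 0, SpecialTokens: []string{"[UNK]", "[UNK]"}}
	if err := bad.Validate(); err != nil {
		fmt.Println("invalid trainer configuration:")
		fmt.Println(err)
	}
}
```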
Performance Optimization
- Parallel processing where possible
- Efficient data structures for vocabulary building
- Memory-mapped file access for large training files
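As an illustration of the parallel processing point, the sketch below shards word counting (the frequency-gathering step that precedes vocabulary building) across goroutines and merges the partial results. It is a general-purpose pattern shown in Go for readability. In this project the heavy lifting would more likely happen in the Rust layer, and the function names here are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// countWords splits each line on whitespace and tallies word frequencies.
func countWords(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		for _, word := range strings.Fields(line) {
			counts[word]++
		}
	}
	return counts
}

// countWordsParallel splits lines into contiguous chunks, counts each chunk
// in its own goroutine, and merges the partial maps afterwards. Each
// goroutine writes only to its own slot of partials, so no lock is needed.
func countWordsParallel(lines []string, workers int) map[string]int {
	if workers < 1 {
		workers = 1
	}
	partials := make([]map[string]int, workers)
	chunk := (len(lines) + workers - 1) / workers // ceiling division

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start := w * chunk
		if start >= len(lines) {
			break // fewer chunks than workers
		}
		end := start + chunk
		if end > len(lines) {
			end = len(lines)
		}
		wg.Add(1)
		go func(slot int, part []string) {
			defer wg.Done()
			partials[slot] = countWords(part)
		}(w, lines[start:end])
	}
	wg.Wait()

	total := make(map[string]int)
	for _, p := range partials { // ranging over a nil map is a no-op
		for word, n := range p {
			total[word] += n
		}
	}
	return total
}

func main() {
	lines := []string{"the cat", "the dog", "a cat", "the end"}
	counts := countWordsParallel(lines, 3)
	fmt.Println(counts["the"], counts["cat"], counts["dog"])
}
```

Merging per-worker maps at the end avoids contention on a shared map. The trade-off is extra memory proportional to the number of workers.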
Example Usage
// BPE training from files
bpeTrainer := &BPETrainer{
    VocabSize:       30000,
    MinFrequency:    2,
    SpecialTokens:   []string{"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"},
    ShowProgress:    true,
    EndOfWordSuffix: "</w>",
}

trainingFiles := []string{"corpus1.txt", "corpus2.txt", "corpus3.txt"}
err := tokenizer.Train(trainingFiles, bpeTrainer)
if err != nil {
    log.Fatalf("Training failed: %v", err)
}

// WordPiece training from iterator
wpTrainer := &WordPieceTrainer{
    VocabSize:            30000,
    MinFrequency:         2,
    SpecialTokens:        []string{"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"},
    UNKToken:             "[UNK]",
    MaxInputCharsPerWord: 100,
}

// Custom text iterator
textIterator := &FileLineIterator{
    Files: []string{"domain_corpus.txt"},
}

err = tokenizer.TrainFromIterator(textIterator, wpTrainer)
if err != nil {
    log.Fatalf("Training from iterator failed: %v", err)
}

// Test the trained tokenizer
encoding, err := tokenizer.Encode("This is a test of the trained tokenizer.")
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Encoded: %v", encoding.Tokens)

// Unigram training for SentencePiece-style tokenizer
unigramTrainer := &UnigramTrainer{
    VocabSize:       32000,
    SpecialTokens:   []string{"<unk>", "<s>", "</s>"},
    UNKToken:        "<unk>",
    ShrinkingFactor: 0.75,
}

err = tokenizer.Train([]string{"multilingual_corpus.txt"}, unigramTrainer)
if err != nil {
    log.Fatal(err)
}
Related Issues
- Builds upon vocabulary access and token management features
- Enables custom tokenizer creation for domain-specific applications
- Foundation for advanced NLP pipeline customization
- Part of advanced features milestone providing complete tokenizer API
Implementation Requirements
Go Layer (tokenizers.go)
- TokenizerTrainer interface for training configuration
- BPETrainer
- WordPieceTrainer
- UnigramTrainer
- WordLevelTrainer
- TextIterator interface for streaming text data
- Train method on Tokenizer struct
- TrainFromIterator method on Tokenizer struct
Trainer Configuration Structs
- BPETrainer with fields: VocabSize int, MinFrequency int, SpecialTokens []string, ShowProgress bool, EndOfWordSuffix string
- WordPieceTrainer with fields: VocabSize int, MinFrequency int, SpecialTokens []string, UNKToken string, MaxInputCharsPerWord int
- UnigramTrainer with fields: VocabSize int, SpecialTokens []string, UNKToken string, ShrinkingFactor float64
- WordLevelTrainer with fields: VocabSize int, MinFrequency int, SpecialTokens []string, UNKToken string
Rust Layer (src/lib.rs)
- train_bpe
- train_wordpiece
- train_unigram
- train_wordlevel
FFI Bridge (library.go)
Acceptance Criteria
Functional Requirements
- Train successfully creates functional tokenizers from text files
- TrainFromIterator works with streaming text data
Training Algorithm Support
Data Handling
Performance Requirements
Testing Requirements
Documentation Requirements