[ENH] Implement Training Support (Train, TrainFromIterator) #68

@tazarov

Overview

Add tokenizer training so that custom tokenizers can be created from scratch and existing tokenizers can be adapted to new domains. This feature is the foundation for building domain-specific tokenizers and for retraining an existing tokenizer's vocabulary on new corpora.

Features to Implement

Train Method

  • Signature: func (t *Tokenizer) Train(files []string, trainer TokenizerTrainer) error
  • Purpose: Train tokenizer on text files using specified training algorithm
  • Parameters:
    • files: List of training text file paths
    • trainer: Configuration for training algorithm (BPE, WordPiece, Unigram, etc.)

TrainFromIterator Method

  • Signature: func (t *Tokenizer) TrainFromIterator(iterator TextIterator, trainer TokenizerTrainer) error
  • Purpose: Train tokenizer from an iterator/stream of text data
  • Parameters:
    • iterator: Interface for streaming text data
    • trainer: Training configuration
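
The issue names TextIterator without fixing its shape. One possible shape, sketched below as an assumption rather than a decided API, is a single Next method that returns io.EOF when the stream is exhausted, mirroring io.Reader conventions. SliceIterator and Drain are hypothetical helpers added only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// TextIterator is an assumed shape for the interface named in this issue.
// Next returns the next piece of text, or io.EOF once the stream is exhausted.
type TextIterator interface {
	Next() (string, error)
}

// SliceIterator (hypothetical) adapts an in-memory slice to TextIterator.
type SliceIterator struct {
	items []string
	pos   int
}

func (s *SliceIterator) Next() (string, error) {
	if s.pos >= len(s.items) {
		return "", io.EOF
	}
	text := s.items[s.pos]
	s.pos++
	return text, nil
}

// Drain (hypothetical) reads an iterator to exhaustion. A real
// TrainFromIterator would hand the text to the native trainer in batches
// instead of collecting it in memory.
func Drain(it TextIterator) ([]string, error) {
	var out []string
	for {
		text, err := it.Next()
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		out = append(out, text)
	}
}

func main() {
	it := &SliceIterator{items: []string{"first document", "second document"}}
	texts, err := Drain(it)
	if err != nil {
		fmt.Println("iteration failed:", err)
		return
	}
	fmt.Println(len(texts), "texts read")
}
```

Returning an error from Next lets file-backed or network-backed iterators report read failures instead of silently truncating the training data.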

Training Configuration Types

  • Support for different training algorithms:
    • BPE (Byte-Pair Encoding): Widely used subword tokenization (e.g. GPT-2, RoBERTa)
    • WordPiece: Used by BERT and similar models
    • Unigram: Used by SentencePiece and T5
    • Word-level: Maps whole words to IDs; word boundaries come from whitespace/punctuation splitting

Training Parameters

  • Vocabulary size configuration
  • Special tokens specification
  • Training algorithm hyperparameters
  • Merge operations and frequency thresholds

Implementation Requirements

Go Layer (tokenizers.go)

  • Add TokenizerTrainer interface for training configuration
  • Implement concrete trainer types:
    • BPETrainer
    • WordPieceTrainer
    • UnigramTrainer
    • WordLevelTrainer
  • Add TextIterator interface for streaming text data
  • Add Train method to Tokenizer struct
  • Add TrainFromIterator method to Tokenizer struct
  • Training progress callback support
  • Configuration validation and error handling

Trainer Configuration Structs

  • BPETrainer with fields:
    • VocabSize int
    • MinFrequency int
    • SpecialTokens []string
    • ShowProgress bool
    • EndOfWordSuffix string
  • WordPieceTrainer with fields:
    • VocabSize int
    • MinFrequency int
    • SpecialTokens []string
    • UNKToken string
    • MaxInputCharsPerWord int
  • UnigramTrainer with fields:
    • VocabSize int
    • SpecialTokens []string
    • UNKToken string
    • ShrinkingFactor float64
  • WordLevelTrainer with fields:
    • VocabSize int
    • MinFrequency int
    • SpecialTokens []string
    • UNKToken string

Rust Layer (src/lib.rs)

  • Add training FFI functions for each trainer type:
    • train_bpe
    • train_wordpiece
    • train_unigram
    • train_wordlevel
  • Integration with tokenizers crate training APIs
  • File reading and text processing for training data
  • Iterator-based training support
  • Progress reporting through callbacks
  • Memory-efficient handling of large training datasets
  • Training result serialization back to Go

FFI Bridge (library.go)

  • Define training function signatures
  • Handle trainer configuration marshaling
  • File path and iterator data transfer
  • Progress callback mechanism across FFI boundary
  • Memory management for training data and results
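
One way to handle "trainer configuration marshaling" is to serialize each trainer to JSON on the Go side and let the Rust side deserialize it. The sketch below assumes that approach. The FFIName method, the JSON field names, and MarshalTrainer are all hypothetical; only the trainer fields and the train_bpe / train_unigram function names come from this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TokenizerTrainer is an assumed shape for the interface named in this issue.
// FFIName (hypothetical) selects which native training function to call.
type TokenizerTrainer interface {
	FFIName() string
}

// BPETrainer uses the fields from this issue; the JSON names are assumptions.
type BPETrainer struct {
	VocabSize       int      `json:"vocab_size"`
	MinFrequency    int      `json:"min_frequency"`
	SpecialTokens   []string `json:"special_tokens"`
	ShowProgress    bool     `json:"show_progress"`
	EndOfWordSuffix string   `json:"end_of_word_suffix,omitempty"`
}

func (*BPETrainer) FFIName() string { return "train_bpe" }

// UnigramTrainer uses the fields from this issue; the JSON names are assumptions.
type UnigramTrainer struct {
	VocabSize       int      `json:"vocab_size"`
	SpecialTokens   []string `json:"special_tokens"`
	UNKToken        string   `json:"unk_token"`
	ShrinkingFactor float64  `json:"shrinking_factor"`
}

func (*UnigramTrainer) FFIName() string { return "train_unigram" }

// MarshalTrainer (hypothetical) returns the native function name and the JSON
// payload that would be passed across the FFI boundary.
func MarshalTrainer(t TokenizerTrainer) (string, []byte, error) {
	payload, err := json.Marshal(t)
	if err != nil {
		return "", nil, fmt.Errorf("marshal trainer config: %w", err)
	}
	return t.FFIName(), payload, nil
}

func main() {
	name, payload, err := MarshalTrainer(&UnigramTrainer{
		VocabSize:       32000,
		SpecialTokens:   []string{"[UNK]"},
		UNKToken:        "[UNK]",
		ShrinkingFactor: 0.75,
	})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Printf("%s receives %d bytes of config\n", name, len(payload))
}
```

A single JSON payload keeps the C function signatures small (one pointer and one length per call) at the cost of a parse step on the Rust side, which should be negligible next to training time.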

Acceptance Criteria

Functional Requirements

  • Train successfully creates functional tokenizers from text files
  • TrainFromIterator works with streaming text data
  • All trainer types (BPE, WordPiece, Unigram, WordLevel) produce working tokenizers
  • Trained tokenizers can encode/decode text correctly
  • Special tokens are properly integrated into trained vocabularies
  • Training parameters affect vocabulary generation as expected
  • Vocabulary size constraints are respected

Training Algorithm Support

  • BPE training produces subword vocabularies with merge operations
  • WordPiece training creates WordPiece-compatible vocabularies
  • Unigram training produces probabilistic subword models
  • Word-level training creates word-based vocabularies
  • Each algorithm respects its specific hyperparameters
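
To make "respects its specific hyperparameters" concrete for the simplest case, here is a pure-Go illustration of how VocabSize, MinFrequency, and SpecialTokens could shape a word-level vocabulary. This is not the tokenizers crate implementation, which the actual training would delegate to. buildWordLevelVocab is a hypothetical helper that could serve as a reference in tests, and whitespace splitting stands in for a real pre-tokenizer.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildWordLevelVocab (hypothetical) assigns IDs to special tokens first, then
// to words in descending frequency order, skipping words seen fewer than
// minFrequency times and stopping once vocabSize entries exist.
func buildWordLevelVocab(lines []string, vocabSize, minFrequency int, specialTokens []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		for _, word := range strings.Fields(line) {
			counts[word]++
		}
	}

	vocab := make(map[string]int)
	add := func(token string) {
		if _, exists := vocab[token]; !exists && len(vocab) < vocabSize {
			id := len(vocab)
			vocab[token] = id
		}
	}
	for _, tok := range specialTokens {
		add(tok)
	}

	words := make([]string, 0, len(counts))
	for word, n := range counts {
		if n >= minFrequency {
			words = append(words, word)
		}
	}
	// Sort by frequency, breaking ties alphabetically so IDs are deterministic.
	sort.Slice(words, func(i, j int) bool {
		if counts[words[i]] != counts[words[j]] {
			return counts[words[i]] > counts[words[j]]
		}
		return words[i] < words[j]
	})
	for _, word := range words {
		add(word)
	}
	return vocab
}

func main() {
	corpus := []string{"the cat sat", "the cat ran", "the dog"}
	vocab := buildWordLevelVocab(corpus, 4, 2, []string{"[UNK]"})
	fmt.Println(vocab)
}
```

With this corpus, "dog", "sat", and "ran" each occur once and are dropped by MinFrequency 2, so the vocabulary holds three entries even though VocabSize allows four.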

Data Handling

  • Large training files are processed efficiently without memory issues
  • Iterator-based training supports streaming large datasets
  • Unicode text is handled correctly across all training algorithms
  • Training progress can be monitored through callbacks
  • Malformed training data is handled gracefully with clear errors

Performance Requirements

  • Training completes in reasonable time for typical dataset sizes
  • Memory usage scales appropriately with dataset size
  • Training can be interrupted and restarted if needed
  • Progress reporting doesn't significantly impact training speed

Testing Requirements

  • Unit tests for each trainer type with small datasets
  • Integration tests comparing trained tokenizers with reference implementations
  • Performance tests with large training datasets
  • Memory usage tests for training on large corpora
  • Unicode and multilingual training tests
  • Special token handling tests in training
  • Error handling tests for malformed training data
  • Iterator vs file-based training comparison tests

Documentation Requirements

  • Go doc comments for all training interfaces and methods
  • Comprehensive training examples for each algorithm type
  • Guidelines for choosing training parameters
  • Performance tuning recommendations
  • Examples of domain-specific tokenizer training

Technical Considerations

Memory Management

  • Efficient processing of large training corpora
  • Streaming support to avoid loading entire datasets in memory
  • Proper cleanup of training artifacts and intermediate data
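
The Example Usage section refers to a FileLineIterator without defining it. A minimal streaming sketch is below, assuming the iterator exposes Next() (string, error) and signals the end with io.EOF. It holds at most one line in memory and closes each file before opening the next. The unexported fields, writeTempCorpus, and countLines are hypothetical scaffolding.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"os"
)

// FileLineIterator streams lines from Files one at a time.
type FileLineIterator struct {
	Files []string

	next    int // index of the next file to open
	file    *os.File
	scanner *bufio.Scanner
}

// Next returns the next line, or io.EOF after the last line of the last file.
func (it *FileLineIterator) Next() (string, error) {
	for {
		if it.scanner == nil {
			if it.next >= len(it.Files) {
				return "", io.EOF
			}
			f, err := os.Open(it.Files[it.next])
			if err != nil {
				return "", fmt.Errorf("open training file: %w", err)
			}
			it.next++
			it.file = f
			it.scanner = bufio.NewScanner(f)
			// Raise the default 64 KiB line limit to 1 MiB for long lines.
			it.scanner.Buffer(make([]byte, 0, 64*1024), 1<<20)
		}
		if it.scanner.Scan() {
			return it.scanner.Text(), nil
		}
		// Current file is finished or failed: release it before moving on.
		err := it.scanner.Err()
		it.file.Close()
		it.file, it.scanner = nil, nil
		if err != nil {
			return "", fmt.Errorf("read training file: %w", err)
		}
	}
}

// writeTempCorpus is demo scaffolding that writes lines to a temporary file.
func writeTempCorpus(lines ...string) (string, error) {
	f, err := os.CreateTemp("", "corpus-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	for _, line := range lines {
		if _, err := fmt.Fprintln(f, line); err != nil {
			return "", err
		}
	}
	return f.Name(), nil
}

// countLines drains the iterator and reports how many lines it produced.
func countLines(it *FileLineIterator) (int, error) {
	n := 0
	for {
		_, err := it.Next()
		if errors.Is(err, io.EOF) {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		n++
	}
}

func main() {
	path, err := writeTempCorpus("first line", "second line")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(path)

	n, err := countLines(&FileLineIterator{Files: []string{path, path}})
	if err != nil {
		fmt.Println("streaming failed:", err)
		return
	}
	fmt.Println(n, "lines streamed")
}
```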

Training Algorithm Integration

  • Leverage tokenizers crate training implementations
  • Ensure compatibility with HuggingFace tokenizer formats
  • Support for advanced training features (merges, special token handling)

Progress Monitoring

  • Real-time progress reporting during training
  • Ability to cancel long-running training operations
  • Training metrics and statistics collection
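
On the Go side, progress reporting and cancellation could be combined by passing a context.Context and a callback into the training loop. The sketch below shows only that control flow; ProgressFunc, trainBatches, and runDemo are hypothetical names, and the native training call is replaced by a comment.

```go
package main

import (
	"context"
	"fmt"
)

// ProgressFunc (hypothetical) receives the number of batches done and the total.
type ProgressFunc func(done, total int)

// trainBatches (hypothetical) checks for cancellation before each batch and
// reports progress after it. It returns how many batches were processed.
func trainBatches(ctx context.Context, batches []string, report ProgressFunc) (int, error) {
	for i := range batches {
		if err := ctx.Err(); err != nil {
			return i, err
		}
		// A real implementation would pass batches[i] to the native trainer here.
		if report != nil {
			report(i+1, len(batches))
		}
	}
	return len(batches), nil
}

// runDemo processes ten placeholder batches and cancels once cancelAt batches
// are done. A cancelAt of zero or less never cancels.
func runDemo(cancelAt int) (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	batches := make([]string, 10)
	return trainBatches(ctx, batches, func(done, total int) {
		if done == cancelAt {
			cancel()
		}
	})
}

func main() {
	n, err := runDemo(3)
	fmt.Printf("processed %d batches, err=%v\n", n, err)
}
```

Cancellation here takes effect only between batches. Interrupting a training call that is already running inside Rust would need a separate mechanism across the FFI boundary, which this sketch does not cover.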

Error Handling

  • Clear error messages for training failures
  • Validation of training parameters before starting
  • Graceful handling of corrupted or malformed training data

Performance Optimization

  • Parallel processing where possible
  • Efficient data structures for vocabulary building
  • Memory-mapped file access for large training files

Example Usage

// BPE Training from files
bpeTrainer := &BPETrainer{
    VocabSize:        30000,
    MinFrequency:     2,
    SpecialTokens:    []string{"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"},
    ShowProgress:     true,
    EndOfWordSuffix:  "</w>",
}

trainingFiles := []string{"corpus1.txt", "corpus2.txt", "corpus3.txt"}
err := tokenizer.Train(trainingFiles, bpeTrainer)
if err != nil {
    log.Fatalf("Training failed: %v", err)
}

// WordPiece Training from iterator
wpTrainer := &WordPieceTrainer{
    VocabSize:             30000,
    MinFrequency:          2,
    SpecialTokens:         []string{"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"},
    UNKToken:              "[UNK]",
    MaxInputCharsPerWord:  100,
}

// Custom text iterator
textIterator := &FileLineIterator{
    Files: []string{"domain_corpus.txt"},
}

err = tokenizer.TrainFromIterator(textIterator, wpTrainer)
if err != nil {
    log.Fatalf("Training from iterator failed: %v", err)
}

// Test the trained tokenizer
encoding, err := tokenizer.Encode("This is a test of the trained tokenizer.")
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Encoded: %v\n", encoding.Tokens)

// Unigram training for SentencePiece-style tokenizer
unigramTrainer := &UnigramTrainer{
    VocabSize:        32000,
    SpecialTokens:    []string{"<unk>", "<s>", "</s>"},
    UNKToken:         "<unk>",
    ShrinkingFactor:  0.75,
}

err = tokenizer.Train([]string{"multilingual_corpus.txt"}, unigramTrainer)
if err != nil {
    log.Fatal(err)
}

Related Issues

  • Builds upon vocabulary access and token management features
  • Enables custom tokenizer creation for domain-specific applications
  • Foundation for advanced NLP pipeline customization
  • Part of advanced features milestone providing complete tokenizer API
