[ENH] Implement Enhanced Encoding Information (WordIDs, SequenceIDs, Mapping Methods) #67

@tazarov

Overview

Extend the Encoding struct to expose detailed information about the tokenization process: word-to-token mappings, sequence identification, and character positions. This makes it possible to analyze how text is tokenized and supports downstream tasks such as alignment, highlighting, and sequence-pair processing.

Features to Implement

Enhanced Encoding Struct

  • Extend existing Encoding struct with additional fields for detailed tokenization information
  • Maintain backward compatibility with existing encoding functionality

Word ID Tracking

  • Field: WordIDs []int - Maps each token to its original word index
  • Purpose: Track which tokens belong to which words in the original text
  • Use Case: Word-level analysis, attention visualization, alignment tasks

Sequence ID Tracking

  • Field: SequenceIDs []int - Identifies which sequence each token belongs to
  • Purpose: Support multi-sequence inputs (e.g., question-answer pairs)
  • Use Case: BERT-style sequence pair processing, segment identification

Token Position Mapping

  • Field: TokenToWordMapping []int - Direct token-to-word index mapping
  • Field: WordToTokenMapping [][]int - Maps each word to its constituent token indices
  • Purpose: Bidirectional mapping between words and tokens

Character Span Information

  • Field: CharSpans [][2]int - Character start/end positions for each token
  • Purpose: Map tokens back to original character positions in input text
  • Use Case: Highlighting, extraction, character-level analysis

Implementation Requirements

Go Layer (tokenizers.go)

  • Extend Encoding struct with new fields:
    • WordIDs []int
    • SequenceIDs []int
    • CharSpans [][2]int
    • TokenToWordMapping []int
    • WordToTokenMapping [][]int
  • Add methods to access and query encoding information:
    • func (e *Encoding) GetWordIDs() []int
    • func (e *Encoding) GetSequenceIDs() []int
    • func (e *Encoding) GetCharSpans() [][2]int
    • func (e *Encoding) TokensForWord(wordIndex int) []int
    • func (e *Encoding) WordForToken(tokenIndex int) int
  • Update encoding creation to populate new fields
  • Maintain backward compatibility with existing code

Rust Layer (src/lib.rs)

  • Modify encoding FFI functions to return enhanced information
  • Extract word IDs from tokenizers crate encoding
  • Extract sequence IDs for multi-sequence inputs
  • Calculate character spans for token positions
  • Build word-to-token and token-to-word mappings
  • Serialize the complex encoding data efficiently

FFI Bridge (library.go)

  • Update encoding data structures for enhanced information transfer
  • Handle complex data marshaling (arrays of arrays, spans)
  • Manage memory for the enhanced encoding data
  • Preserve backward compatibility with existing encoding transfers

Acceptance Criteria

Functional Requirements

  • WordIDs correctly maps tokens to their source words
  • SequenceIDs properly identifies sequence boundaries for multi-sequence inputs
  • CharSpans accurately represents character positions for each token
  • Word-to-token mapping correctly identifies all tokens for each word
  • Token-to-word mapping correctly identifies source word for each token
  • Enhanced information works with all encoding options (padding, truncation, etc.)
  • Backward compatibility maintained - existing code continues to work unchanged

Multi-Sequence Support

  • Proper sequence ID assignment for BERT-style [CLS] text1 [SEP] text2 [SEP] format
  • Correct handling of special tokens in sequence identification
  • Support for variable-length sequence pairs

Subword Token Handling

  • Correct word ID assignment for subword tokens (e.g., WordPiece, BPE)
  • Proper handling of tokens that span word boundaries
  • Accurate character span calculation for subword tokens

Testing Requirements

  • Unit tests for enhanced encoding with single sequences
  • Unit tests for multi-sequence encoding (BERT-style pairs)
  • Tests for subword tokenization scenarios (WordPiece, BPE, SentencePiece)
  • Character span accuracy tests with Unicode text
  • Word-to-token mapping tests with various text types
  • Integration tests with different tokenizer models
  • Performance tests for enhanced encoding overhead
  • Backward compatibility tests ensuring existing code works

Documentation Requirements

  • Go doc comments for all new fields and methods
  • Usage examples for enhanced encoding analysis
  • Documentation of use cases for each type of mapping
  • Examples of multi-sequence processing

Technical Considerations

Memory Efficiency

  • Additional encoding information increases memory usage
  • Consider lazy computation of mappings if not always needed
  • Efficient storage of sparse mapping data

Performance Impact

  • Minimize overhead when enhanced information is not needed
  • Efficient computation of word-to-token mappings
  • Fast character span calculation

Unicode Handling

  • Proper character span calculation with multi-byte UTF-8 characters
  • Correct word boundary detection across different languages
  • Handling of complex scripts and text normalization

Special Token Integration

  • Proper handling of special tokens (CLS, SEP, PAD, etc.) in mappings
  • Clear documentation of how special tokens affect word/sequence IDs
  • Consistent behavior across different tokenizer types

Example Usage

// Single sequence encoding with enhanced information
text := "Hello world! How are you?"
encoding, err := tokenizer.Encode(text)
if err != nil {
    log.Fatal(err)
}

// Access enhanced information
wordIDs := encoding.GetWordIDs()
charSpans := encoding.GetCharSpans()

fmt.Printf("Tokens: %v\n", encoding.Tokens)
fmt.Printf("Word IDs: %v\n", wordIDs)
fmt.Printf("Character spans: %v\n", charSpans)

// Find the tokens for a specific word
word2Tokens := encoding.TokensForWord(2) // Get tokens for word index 2
fmt.Printf("Tokens for word 2: %v\n", word2Tokens)

// Find the word for a specific token
token5Word := encoding.WordForToken(5) // Get word for token index 5
fmt.Printf("Token 5 belongs to word: %d\n", token5Word)

// Multi-sequence encoding
question := "What is the capital?"
context := "The capital of France is Paris."
encoding, err = tokenizer.EncodeSequencePair(question, context)
if err != nil {
    log.Fatal(err)
}

sequenceIDs := encoding.GetSequenceIDs()
fmt.Printf("Sequence IDs: %v\n", sequenceIDs) // 0s for question, 1s for context

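// Character-level analysis of the single-sequence input.
// Re-encode text first: encoding currently holds the question/context pair.
encoding, err = tokenizer.Encode(text)
if err != nil {
    log.Fatal(err)
}
runes := []rune(text) // assumes spans are character (rune) offsets, not byte offsets
for i, span := range encoding.GetCharSpans() {
    if span[0] >= 0 { // Valid span (not special token)
        tokenText := string(runes[span[0]:span[1]])
        fmt.Printf("Token %d %q (source text %q) spans chars %d-%d\n", i, encoding.Tokens[i], tokenText, span[0], span[1])
    }
}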

Related Issues

  • Builds upon core encoding functionality
  • Enables advanced tokenization analysis and visualization
  • Foundation for attention analysis and model interpretability features
  • Part of extended functionality milestone for advanced use cases
