Overview
Extend the Encoding struct to provide detailed information about the tokenization process, including word-to-token mappings, sequence identification, and positional information. This enables advanced analysis of how text is tokenized and supports sophisticated downstream processing.
Features to Implement
Enhanced Encoding Struct
- Extend the existing Encoding struct with additional fields for detailed tokenization information
- Maintain backward compatibility with existing encoding functionality
Word ID Tracking
- Field: WordIDs []int - Maps each token to its original word index
- Purpose: Track which tokens belong to which words in the original text
- Use Case: Word-level analysis, attention visualization, alignment tasks
Sequence ID Tracking
- Field: SequenceIDs []int - Identifies which sequence each token belongs to
- Purpose: Support multi-sequence inputs (e.g., question-answer pairs)
- Use Case: BERT-style sequence pair processing, segment identification
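To make the sequence ID idea concrete, here is a minimal sketch of how a caller might group token indices by sequence. The helper name groupBySequence and the use of a negative ID for special tokens are assumptions made for illustration; the issue does not specify how special tokens are represented in SequenceIDs.

```go
package main

import "fmt"

// groupBySequence returns, for each non-negative sequence ID, the indices of
// the tokens assigned to it. Negative IDs are treated as "no sequence"
// (an assumed convention for special tokens, not something the issue defines).
func groupBySequence(sequenceIDs []int) map[int][]int {
	groups := make(map[int][]int)
	for tok, seq := range sequenceIDs {
		if seq < 0 {
			continue
		}
		groups[seq] = append(groups[seq], tok)
	}
	return groups
}

func main() {
	// Hypothetical pair layout: [CLS] q q [SEP] c c c [SEP]
	sequenceIDs := []int{-1, 0, 0, -1, 1, 1, 1, -1}
	groups := groupBySequence(sequenceIDs)
	fmt.Println("question tokens:", groups[0])
	fmt.Println("context tokens:", groups[1])
}
```

If special tokens were instead assigned to sequence 0 or 1, the same helper would simply include them in the corresponding group.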
Token Position Mapping
- Field: TokenToWordMapping []int - Direct token-to-word index mapping
- Field: WordToTokenMapping [][]int - Maps each word to its constituent token indices
- Purpose: Bidirectional mapping between words and tokens
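The two mappings carry the same information in opposite directions, so one can be derived from the other. The sketch below inverts a token-to-word slice into a word-to-token slice. The function name buildWordToToken and the rule that negative word indices mean "no word" are assumptions for illustration, not part of the proposed API.

```go
package main

import "fmt"

// buildWordToToken inverts a token-to-word mapping. Entry i of the result
// lists the token indices whose word index is i. Tokens with a negative word
// index (assumed here to mark special tokens) are skipped.
func buildWordToToken(tokenToWord []int) [][]int {
	maxWord := -1
	for _, w := range tokenToWord {
		if w > maxWord {
			maxWord = w
		}
	}
	wordToToken := make([][]int, maxWord+1)
	for tok, w := range tokenToWord {
		if w < 0 {
			continue
		}
		wordToToken[w] = append(wordToToken[w], tok)
	}
	return wordToToken
}

func main() {
	// Hypothetical tokens: [CLS] token ##izers are fun [SEP]
	tokenToWord := []int{-1, 0, 0, 1, 2, -1}
	fmt.Println(buildWordToToken(tokenToWord)) // [[1 2] [3] [4]]
}
```

A word split into several subword tokens shows up as an inner slice with more than one entry, which is what TokensForWord would be expected to return for that word.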
Character Span Information
- Field: CharSpans [][2]int - Character start/end positions for each token
- Purpose: Map tokens back to original character positions in input text
- Use Case: Highlighting, extraction, character-level analysis
Implementation Requirements
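Go Layer (tokenizers.go)
- Encoding struct with new fields:
  - WordIDs []int
  - SequenceIDs []int
  - CharSpans [][2]int
  - TokenToWordMapping []int
  - WordToTokenMapping [][]int
- func (e *Encoding) GetWordIDs() []int
- func (e *Encoding) GetSequenceIDs() []int
- func (e *Encoding) GetCharSpans() [][2]int
- func (e *Encoding) TokensForWord(wordIndex int) []int
- func (e *Encoding) WordForToken(tokenIndex int) int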
Rust Layer (src/lib.rs)
FFI Bridge (library.go)
Acceptance Criteria
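Functional Requirements
- WordIDs correctly maps tokens to their source words
- SequenceIDs properly identifies sequence boundaries for multi-sequence inputs
- CharSpans accurately represents character positions for each token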
Multi-Sequence Support
Subword Token Handling
Testing Requirements
Documentation Requirements
Technical Considerations
Memory Efficiency
- Additional encoding information increases memory usage
- Consider lazy computation of mappings if not always needed
- Efficient storage of sparse mapping data
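One way to read the lazy-computation suggestion is to store only the per-token word IDs and build the word-to-token mapping on first access. The following is a sketch under that assumption. The type lazyEncoding, its WordToToken method, and the builds counter are hypothetical and exist only to illustrate the pattern with sync.Once.

```go
package main

import (
	"fmt"
	"sync"
)

// lazyEncoding is a hypothetical variant of Encoding that stores only the
// per-token word IDs and derives the word-to-token mapping on first use.
type lazyEncoding struct {
	WordIDs []int // word index per token; negative means "no word" (assumed)

	once        sync.Once
	wordToToken [][]int
	builds      int // counts how often the mapping was built (illustration only)
}

// WordToToken returns the word-to-token mapping, building it at most once.
func (e *lazyEncoding) WordToToken() [][]int {
	e.once.Do(func() {
		e.builds++
		for tok, w := range e.WordIDs {
			if w < 0 {
				continue
			}
			// Grow the outer slice until index w exists.
			for len(e.wordToToken) <= w {
				e.wordToToken = append(e.wordToToken, nil)
			}
			e.wordToToken[w] = append(e.wordToToken[w], tok)
		}
	})
	return e.wordToToken
}

func main() {
	enc := &lazyEncoding{WordIDs: []int{-1, 0, 0, 1, -1}}
	fmt.Println(enc.WordToToken()) // built on the first call
	fmt.Println(enc.WordToToken()) // served from the cached slice
	fmt.Println("builds:", enc.builds)
}
```

The trade-off is that the value must be used through a pointer (a struct containing sync.Once should not be copied), and callers that never ask for the mapping pay nothing for it.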
Performance Impact
- Minimize overhead when enhanced information is not needed
- Efficient computation of word-to-token mappings
- Fast character span calculation
Unicode Handling
- Proper character span calculation with multi-byte UTF-8 characters
- Correct word boundary detection across different languages
- Handling of complex scripts and text normalization
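The multi-byte concern matters in Go specifically because string slicing works on byte offsets, while "character" positions are usually understood as rune (code point) positions. The sketch below converts a byte span into a rune span using the standard unicode/utf8 package. The helper name byteToRuneSpan is hypothetical, and whether CharSpans should hold byte or rune offsets is left open by the issue.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteToRuneSpan converts a [start, end) byte span into the equivalent
// [start, end) rune span. It assumes both offsets fall on rune boundaries.
func byteToRuneSpan(text string, span [2]int) [2]int {
	return [2]int{
		utf8.RuneCountInString(text[:span[0]]),
		utf8.RuneCountInString(text[:span[1]]),
	}
}

func main() {
	// "héllo wörld" written with escapes so the accented letters are
	// guaranteed to be single code points of two bytes each.
	text := "h\u00e9llo w\u00f6rld"

	// Byte span of the second word: it starts at byte 7 and ends at byte 13.
	byteSpan := [2]int{7, 13}
	fmt.Println(text[byteSpan[0]:byteSpan[1]]) // slicing uses byte offsets
	fmt.Println(byteToRuneSpan(text, byteSpan)) // [6 11]
}
```

Whichever unit is chosen, documenting it alongside CharSpans would let callers know whether direct string slicing is safe.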
Special Token Integration
- Proper handling of special tokens (CLS, SEP, PAD, etc.) in mappings
- Clear documentation of how special tokens affect word/sequence IDs
- Consistent behavior across different tokenizer types
Example Usage
// Single sequence encoding with enhanced information
text := "Hello world! How are you?"
encoding, err := tokenizer.Encode(text)
if err != nil {
	log.Fatal(err)
}

// Access enhanced information
wordIDs := encoding.GetWordIDs()
charSpans := encoding.GetCharSpans()
fmt.Printf("Tokens: %v\n", encoding.Tokens)
fmt.Printf("Word IDs: %v\n", wordIDs)
fmt.Printf("Character spans: %v\n", charSpans)

// Find tokens for a specific word
word2Tokens := encoding.TokensForWord(2) // token indices for word index 2
fmt.Printf("Tokens for word 2: %v\n", word2Tokens)

// Find the word for a specific token
token5Word := encoding.WordForToken(5) // word index for token index 5
fmt.Printf("Token 5 belongs to word: %d\n", token5Word)

// Multi-sequence encoding (kept in its own variable so that encoding
// still refers to text in the character-level analysis below)
question := "What is the capital?"
context := "The capital of France is Paris."
pairEncoding, err := tokenizer.EncodeSequencePair(question, context)
if err != nil {
	log.Fatal(err)
}
sequenceIDs := pairEncoding.GetSequenceIDs()
fmt.Printf("Sequence IDs: %v\n", sequenceIDs) // 0s for question, 1s for context

// Character-level analysis of the single-sequence encoding
for i, span := range encoding.GetCharSpans() {
	if span[0] >= 0 { // Valid span (not special token)
		// Go slices strings by byte offset; this is only correct if the
		// spans are byte offsets into text (or text is pure ASCII).
		tokenText := text[span[0]:span[1]]
		fmt.Printf("Token %d '%s' (text %q) spans chars %d-%d\n", i, encoding.Tokens[i], tokenText, span[0], span[1])
	}
}
Related Issues
- Builds upon core encoding functionality
- Enables advanced tokenization analysis and visualization
- Foundation for attention analysis and model interpretability features
- Part of extended functionality milestone for advanced use cases