[ENH] Implement Enhanced Encoding Information (WordIDs, SequenceIDs, Mapping Methods) #67

## Overview

Extend the `Encoding` struct to expose how each token relates to the input: the word it came from, the sequence it belongs to, and its character span in the original text. This supports word-level analysis, sequence-pair models, and mapping model output back to the source text.

## Features to Implement

### Enhanced Encoding Struct
- Extend existing `Encoding` struct with additional fields for detailed tokenization information
- Maintain backward compatibility with existing encoding functionality

### Word ID Tracking
- **Field**: `WordIDs []int` - Maps each token to its original word index  
- **Purpose**: Track which tokens belong to which words in the original text
- **Use Case**: Word-level analysis, attention visualization, alignment tasks

### Sequence ID Tracking  
- **Field**: `SequenceIDs []int` - Identifies which sequence each token belongs to
- **Purpose**: Support multi-sequence inputs (e.g., question-answer pairs)
- **Use Case**: BERT-style sequence pair processing, segment identification

### Token Position Mapping
- **Field**: `TokenToWordMapping []int` - Direct token-to-word index mapping
- **Field**: `WordToTokenMapping [][]int` - Maps each word to its constituent token indices
- **Purpose**: Bidirectional mapping between words and tokens

### Character Span Information
- **Field**: `CharSpans [][2]int` - Character start/end positions for each token
- **Purpose**: Map tokens back to original character positions in input text
- **Use Case**: Highlighting, extraction, character-level analysis

## Implementation Requirements

### Go Layer (tokenizers.go)
- [ ] Extend `Encoding` struct with new fields:
  - `WordIDs []int`
  - `SequenceIDs []int`  
  - `CharSpans [][2]int`
  - `TokenToWordMapping []int`
  - `WordToTokenMapping [][]int`
- [ ] Add methods to access and query encoding information:
  - `func (e *Encoding) GetWordIDs() []int`
  - `func (e *Encoding) GetSequenceIDs() []int`
  - `func (e *Encoding) GetCharSpans() [][2]int`
  - `func (e *Encoding) TokensForWord(wordIndex int) []int`
  - `func (e *Encoding) WordForToken(tokenIndex int) int`
- [ ] Update encoding creation to populate new fields
- [ ] Maintain backward compatibility with existing code

### Rust Layer (src/lib.rs)
- [ ] Modify encoding FFI functions to return enhanced information
- [ ] Extract word IDs from tokenizers crate encoding
- [ ] Extract sequence IDs for multi-sequence inputs
- [ ] Calculate character spans for token positions
- [ ] Build word-to-token and token-to-word mappings
- [ ] Serialize nested encoding data (spans, per-word token lists) efficiently for transfer to Go

### FFI Bridge (library.go)
- [ ] Update encoding data structures for enhanced information transfer
- [ ] Handle complex data marshaling (arrays of arrays, spans)
- [ ] Manage memory for enhanced encoding data across the FFI boundary
- [ ] Keep existing encoding transfers working unchanged

## Acceptance Criteria

### Functional Requirements
- [ ] `WordIDs` correctly maps tokens to their source words
- [ ] `SequenceIDs` properly identifies sequence boundaries for multi-sequence inputs
- [ ] `CharSpans` accurately represents character positions for each token
- [ ] Word-to-token mapping correctly identifies all tokens for each word
- [ ] Token-to-word mapping correctly identifies source word for each token
- [ ] Enhanced information works with all encoding options (padding, truncation, etc.)
- [ ] Backward compatibility maintained - existing code continues to work unchanged

### Multi-Sequence Support
- [ ] Proper sequence ID assignment for BERT-style [CLS] text1 [SEP] text2 [SEP] format
- [ ] Correct handling of special tokens in sequence identification
- [ ] Support for variable-length sequence pairs

### Subword Token Handling
- [ ] Correct word ID assignment for subword tokens (e.g., WordPiece, BPE)
- [ ] Proper handling of tokens that span word boundaries
- [ ] Accurate character span calculation for subword tokens

### Testing Requirements
- [ ] Unit tests for enhanced encoding with single sequences
- [ ] Unit tests for multi-sequence encoding (BERT-style pairs)
- [ ] Tests for subword tokenization scenarios (WordPiece, BPE, SentencePiece)
- [ ] Character span accuracy tests with Unicode text
- [ ] Word-to-token mapping tests with various text types
- [ ] Integration tests with different tokenizer models
- [ ] Performance tests for enhanced encoding overhead
- [ ] Backward compatibility tests ensuring existing code works

### Documentation Requirements
- [ ] Go doc comments for all new fields and methods
- [ ] Usage examples for enhanced encoding analysis
- [ ] Documentation of use cases for each type of mapping
- [ ] Examples of multi-sequence processing

## Technical Considerations

### Memory Efficiency
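- The new fields add memory per encoding: one entry per token for each of `WordIDs`, `SequenceIDs`, and `TokenToWordMapping`, two per token for `CharSpans`, plus one inner slice per word for `WordToTokenMapping`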
- Consider lazy computation of mappings if not always needed
- Efficient storage of sparse mapping data

### Performance Impact
- Minimize overhead when enhanced information is not needed
- Efficient computation of word-to-token mappings
- Fast character span calculation

### Unicode Handling
- Proper character span calculation with multi-byte UTF-8 characters
- Correct word boundary detection across different languages
- Handling of complex scripts and text normalization

### Special Token Integration
- Proper handling of special tokens (CLS, SEP, PAD, etc.) in mappings
- Clear documentation of how special tokens affect word/sequence IDs
- Consistent behavior across different tokenizer types

## Example Usage

```go
// Requires the "fmt" and "log" imports and an initialized tokenizer.

// Single sequence encoding with enhanced information
text := "Hello world! How are you?"
encoding, err := tokenizer.Encode(text)
if err != nil {
    log.Fatal(err)
}

// Access enhanced information
wordIDs := encoding.GetWordIDs()
charSpans := encoding.GetCharSpans()

fmt.Printf("Tokens: %v\n", encoding.Tokens)
fmt.Printf("Word IDs: %v\n", wordIDs)
fmt.Printf("Character spans: %v\n", charSpans)

// Find tokens for a specific word
word2Tokens := encoding.TokensForWord(2) // Get token indices for word index 2
fmt.Printf("Tokens for word 2: %v\n", word2Tokens)

// Find word for a specific token
token5Word := encoding.WordForToken(5) // Get word index for token index 5
fmt.Printf("Token 5 belongs to word: %d\n", token5Word)

// Character-level analysis, run against the same text the spans refer to.
// Spans are character offsets, so index a []rune rather than the string's bytes.
runes := []rune(text)
for i, span := range charSpans {
    if span[0] >= 0 { // Valid span (not special token)
        tokenText := string(runes[span[0]:span[1]])
        fmt.Printf("Token %d %q covers %q (chars %d-%d)\n", i, encoding.Tokens[i], tokenText, span[0], span[1])
    }
}

// Multi-sequence encoding
question := "What is the capital?"
context := "The capital of France is Paris."
encoding, err = tokenizer.EncodeSequencePair(question, context)
if err != nil {
    log.Fatal(err)
}

sequenceIDs := encoding.GetSequenceIDs()
fmt.Printf("Sequence IDs: %v\n", sequenceIDs) // 0s for question, 1s for context
```

## Related Issues
- Builds upon core encoding functionality
- Enables advanced tokenization analysis and visualization
- Foundation for attention analysis and model interpretability features
- Part of extended functionality milestone for advanced use cases
