Implement EncodePair method for Tokenizer#96
Conversation
Implements EncodePair and EncodePairs methods to encode sequence pairs,
enabling efficient query-document pair encoding for reranking tasks.
**Key Features:**
- EncodePairs: Batch encoding of multiple sequence pairs with parallel processing
- EncodePair: Convenience wrapper for single pair encoding
- Zero ABI breaking changes: New FFI function encode_batch_pairs
**Implementation:**
- Rust: encode_batch_pairs FFI function using tokenizer.encode_batch
- Go: EncodePairs method with array handling and EncodePair convenience wrapper
- Tests: Comprehensive test coverage for both single and batch pair encoding
**Use Case:**
Rerankers typically score 1 query against N documents. Batch pair encoding
provides significant performance improvement through parallel processing.
**API:**
```go
// Single pair
result, err := tokenizer.EncodePair("query", "document", opts...)
// Batch pairs (efficient for reranking)
results, err := tokenizer.EncodePairs(
[]string{"query1", "query2"},
[]string{"doc1", "doc2"},
opts...
)
```
Code ReviewThis PR implements batch pair encoding functionality with solid implementation and comprehensive testing. Here's my analysis: ✅ Strengths1. Clean API Design
2. Robust Error Handling
3. Comprehensive Test Coverage
4. Memory Management
|
When Rust source files (src/ or Cargo.*) change in a PR, force local library build instead of downloading from releases. This ensures CI tests run against the latest code changes including new FFI functions. Fixes the encode_batch_pairs undefined symbol error in CI by building the library locally when Rust code changes are detected.
The previous approach using git diff HEAD~1 failed because GitHub Actions uses shallow clones (fetch-depth: 1) where HEAD~1 doesn't exist. The error was silenced, causing the script to always download old releases. New approach: Detect PR context using GITHUB_BASE_REF environment variable which is only set during pull_request events. When set, always build locally to ensure tests run against the latest code. Benefits: - Simple and reliable - no git history needed - PRs always test current code (correctness over speed) - Pushes to main still use cached releases (fast) - No workflow changes required
Follow-up: Critical Null-Termination Issue Still PresentThe critical null-termination bug I identified in my previous review (lines 452-453 in tokenizers.go) has not been addressed and remains a blocker for this PR. The ProblemGo strings are not null-terminated, but the Rust FFI expects null-terminated C strings. Using Current Code (BROKEN):for i := 0; i < len(sequences); i++ {
cSequences[i] = unsafe.StringData(sequences[i]) // ❌ No null terminator
cPairs[i] = unsafe.StringData(pairs[i]) // ❌ No null terminator
}Why Tests Pass (False Positive)Tests may currently pass due to:
This is undefined behavior and will fail unpredictably in production. Required FixAdd null terminators to all strings: // Convert Go strings to null-terminated C strings
cSequences := make([]*byte, len(sequences))
cPairs := make([]*byte, len(pairs))
cSeqBytes := make([][]byte, len(sequences))
cPairBytes := make([][]byte, len(pairs))
for i := 0; i < len(sequences); i++ {
// Append null terminator and keep reference to prevent GC
cSeqBytes[i] = append([]byte(sequences[i]), 0)
cPairBytes[i] = append([]byte(pairs[i]), 0)
cSequences[i] = &cSeqBytes[i][0]
cPairs[i] = &cPairBytes[i][0]
}Alternative: Use syscall Packageimport "syscall"
cSeqPtrs := make([]*byte, len(sequences))
cPairPtrs := make([]*byte, len(pairs))
for i := 0; i < len(sequences); i++ {
seqPtr, err := syscall.BytePtrFromString(sequences[i])
if err != nil {
return nil, errors.Wrap(err, "invalid sequence string")
}
pairPtr, err := syscall.BytePtrFromString(pairs[i])
if err != nil {
return nil, errors.Wrap(err, "invalid pair string")
}
cSeqPtrs[i] = seqPtr
cPairPtrs[i] = pairPtr
}Verification NeededAfter fixing, please add a test with strings that would expose this bug: func TestEncodePairsNullTermination(t *testing.T) {
// String without natural null bytes nearby in memory
longSeq := strings.Repeat("A", 100)
longPair := strings.Repeat("B", 100)
results, err := tok.EncodePairs([]string{longSeq}, []string{longPair})
require.NoError(t, err)
// Verify encoding is correct length
}This must be fixed before merge. All other aspects of the PR are excellent, but this bug could cause crashes, data corruption, or security issues in production. |
Go strings are not null-terminated, but Rust's CStr::from_ptr() expects null-terminated C strings. The existing Encode() method works because purego automatically adds null terminators when passing 'string' type parameters. However, EncodePairs() uses **byte (array of byte pointers) which bypasses purego's automatic conversion. Using unsafe.StringData() without null terminators causes undefined behavior when Rust scans for the null byte. Changes: - Manually append null terminator (0 byte) to each string - Keep references to prevent GC during FFI call - Add test with long strings to verify null termination works This fixes a critical bug where memory corruption or crashes could occur when Rust's CStr::from_ptr() reads past string boundaries.
Code Review SummaryThis PR implements pair encoding functionality for reranking use cases. The implementation is solid with good test coverage. Here are my findings: Strengths ✓
Issues Found1. Memory Leak in Error Path (src/lib.rs:406-415)Severity: High In Location: src/lib.rs:406-415 for allocated_token in vec_tokens {
drop(std::ffi::CString::from_raw(allocated_token));
}
return ERROR_CSTRING_CONVERSION_FAILED;Fix: Wrap buffer allocation in RAII guards or clean up all previously written buffers on error. 2. Inconsistent Null Termination Handling (tokenizers.go:448-453)Severity: Medium The code appends null terminators to create C strings, but Go's string-to-byte conversion may already include unexpected data patterns. While the test with long strings (100+ chars) likely catches issues, the implementation could be more explicit. Location: tokenizers.go:448-453 cSeqBytes[i] = append([]byte(sequences[i]), 0)
cPairBytes[i] = append([]byte(pairs[i]), 0)Recommendation: Consider using a helper function to centralize C string creation for consistency with other encoding methods. 3. CI Script DuplicationSeverity: Low The PR detection logic is duplicated across Linux, macOS, and Windows sections in the action.yml file. This increases maintenance burden. Recommendation: Consider extracting to a shared script or using composite actions for better DRY. Minor Suggestions
Test Coverage AssessmentExcellent coverage including:
Security Considerations
VerdictApprove with minor fix required for the memory leak in error path. The other issues are low priority but should be addressed in follow-up work. Great implementation overall! Closes #95 effectively. |
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Go Benchmark'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.
| Benchmark suite | Current: 5294667 | Previous: 828678a | Ratio |
|---|---|---|---|
BenchmarkDecode/WithSpecialTokens |
20150 ns/op 740 B/op 10 allocs/op |
16745 ns/op 740 B/op 10 allocs/op |
1.20 |
BenchmarkDecode/WithSpecialTokens - ns/op |
20150 ns/op |
16745 ns/op |
1.20 |
This comment was automatically generated by workflow using github-action-benchmark.
CC: @tazarov
Benchmark Comparison |
When encode_batch_pairs encounters an error during buffer allocation (e.g., CString conversion failure), it was only cleaning up the current iteration's tokens but leaking all buffers from previous successful iterations. Changes: - Use two-phase allocation: collect all buffers in temp storage first - Only write to output array if ALL allocations succeed - On error, clean up all buffers in temp storage using new helper - Add free_buffer_contents() helper for cleanup without pointer deref This ensures either: - All buffers successfully written to output (caller frees them), OR - No buffers written to output (all cleaned up, error returned) Fixes potential multi-megabyte memory leak when processing large batches with malformed token data containing interior null bytes.
PR Review - EncodePair ImplementationExcellent work! This PR successfully implements pair encoding for reranking use cases. The code addresses all previously identified issues. ✅ Strengths
🔍 Minor Suggestions
📊 Verification
✅ RecommendationApprove and merge. All critical issues resolved. Minor suggestions are optional enhancements for future PRs. Closes #95 effectively. |
Implements EncodePair and EncodePairs methods to encode sequence pairs, enabling efficient query-document pair encoding for reranking tasks.
Key Features:
Implementation:
Use Case:
Rerankers typically score 1 query against N documents. Batch pair encoding provides significant performance improvement through parallel processing.
API:
Pull Request
Description
Type of Change
Testing
Checklist
Platform Testing
If this PR affects cross-platform functionality, please test on:
Additional Notes
Closes #95