Implement EncodePair method for Tokenizer by tazarov · Pull Request #96 · amikos-tech/pure-tokenizers

tazarov · 2025-11-07T10:57:45Z

Implements EncodePair and EncodePairs methods to encode sequence pairs, enabling efficient query-document pair encoding for reranking tasks.

Key Features:

EncodePairs: Batch encoding of multiple sequence pairs with parallel processing
EncodePair: Convenience wrapper for single pair encoding
Zero ABI breaking changes: New FFI function encode_batch_pairs

Implementation:

Rust: encode_batch_pairs FFI function using tokenizer.encode_batch
Go: EncodePairs method with array handling and EncodePair convenience wrapper
Tests: Comprehensive test coverage for both single and batch pair encoding

Use Case:
Rerankers typically score one query against N documents. Batch pair encoding provides a significant performance improvement here because the pairs are tokenized in parallel.

API:

// Single pair
result, err := tokenizer.EncodePair("query", "document", opts...)

// Batch pairs (efficient for reranking)
results, err := tokenizer.EncodePairs(
    []string{"query1", "query2"},
    []string{"doc1", "doc2"},
    opts..., // trailing comma is required when the closing parenthesis is on its own line
)
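For the one-query-against-N-documents case, the query has to be repeated once per document so that both slices passed to EncodePairs have the same length. A minimal sketch of that preparation step follows; rerankPairs is a hypothetical helper, not part of this PR, and the tokenizer call is shown only as a comment because its exact result type is not stated here.

```go
package main

import "fmt"

// rerankPairs is a hypothetical helper (not part of the PR). It repeats the
// query once per document so that the two returned slices have equal length,
// which is what a batch pair-encoding call expects.
func rerankPairs(query string, docs []string) (sequences, pairs []string) {
	sequences = make([]string, len(docs))
	for i := range docs {
		sequences[i] = query
	}
	return sequences, docs
}

func main() {
	docs := []string{"doc about cats", "doc about dogs", "doc about birds"}
	sequences, pairs := rerankPairs("what do cats eat", docs)

	// With a real tokenizer the next step would presumably be:
	//   results, err := tokenizer.EncodePairs(sequences, pairs, opts...)
	for i := range sequences {
		fmt.Printf("pair %d: (%q, %q)\n", i, sequences[i], pairs[i])
	}
}
```

Returning the documents slice unchanged avoids a copy; only the repeated query slice is allocated.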

Pull Request

Description

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Performance improvement
Code refactoring

Testing

Tests pass locally
Added tests for new functionality
Tested on multiple platforms (if applicable)
Cross-compilation works (if Rust changes)

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Platform Testing

If this PR affects cross-platform functionality, please test on:

Additional Notes

Closes #95


claude · 2025-11-07T10:59:36Z

Code Review

This PR implements batch pair encoding functionality with solid implementation and comprehensive testing. Here's my analysis:

✅ Strengths

1. Clean API Design

EncodePair (convenience) and EncodePairs (batch) provide an intuitive API
Leverages existing EncodeOptions infrastructure consistently
Zero ABI breaking changes - new FFI function only

2. Robust Error Handling

Proper validation: null checks, length matching, empty batch handling
Memory cleanup on token conversion failure (src/lib.rs:435-438)
Consistent error propagation pattern

3. Comprehensive Test Coverage

Single pair, batch pairs, empty inputs, mismatched lengths
Tests verify tokens, type IDs, attention masks, offsets
Edge case: empty second sequence

4. Memory Management

Proper buffer cleanup in Go (tokenizers.go:509)
std::mem::forget pattern matches existing code
Token cleanup on CString conversion failure

⚠️ Critical Issue: Null-Terminated Strings

Problem: The Go code uses unsafe.StringData to pass strings to C (tokenizers.go:452-453):

cSequences[i] = unsafe.StringData(sequences[i])
cPairs[i] = unsafe.StringData(pairs[i])

Issue: unsafe.StringData returns a pointer to the string's underlying byte array, but Go strings are NOT null-terminated. The Rust FFI expects null-terminated C strings (*const libc::c_char) per the safety documentation (src/lib.rs:335-336).

Impact: This will cause undefined behavior - the Rust CStr::from_ptr will read past the string boundary until it finds a null byte, potentially:

Reading garbage data
Segmentation faults
Data corruption
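To make the failure mode concrete, here is a small self-contained Go sketch, not code from the PR. It takes a substring of a longer string and reads the byte that sits directly after the substring in memory. byteAt is a hypothetical helper written for this illustration; it is only safe here because the offset stays inside the parent string's allocation.

```go
package main

import (
	"fmt"
	"unsafe"
)

// byteAt reads the byte at offset i from the start of s's backing data and
// ignores len(s), the way a C-style scan for a terminator would. The caller
// must guarantee that offset i is still inside the same allocation.
func byteAt(s string, i int) byte {
	return *(*byte)(unsafe.Add(unsafe.Pointer(unsafe.StringData(s)), i))
}

func main() {
	full := string([]byte("query document"))
	sub := full[:5] // "query", sharing memory with full

	// The byte right after the substring is the next byte of the parent
	// string, not a NUL terminator.
	fmt.Printf("len(sub)=%d, byte after sub=%q\n", len(sub), byteAt(sub, len(sub)))
}
```

A consumer that scans for a zero byte starting at the substring's data pointer would therefore keep reading into "document" and beyond.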

Fix Required: Strings must be null-terminated. Compare with the single Encode function (tokenizers.go:389) which passes the string directly - purego handles this automatically for single strings but NOT for string arrays.

Recommended Solution:

// Convert Go strings to null-terminated C strings
cSequences := make([]*byte, len(sequences))
cPairs := make([]*byte, len(pairs))

for i := 0; i < len(sequences); i++ {
    // Append null terminator
    seqBytes := append([]byte(sequences[i]), 0)
    pairBytes := append([]byte(pairs[i]), 0)
    cSequences[i] = &seqBytes[0]
    cPairs[i] = &pairBytes[0]
}

Alternatively, use syscall.BytePtrFromString which handles null termination.

🔍 Minor Issues

1. Incomplete Error Cleanup (src/lib.rs:434-440)
When token CString conversion fails mid-iteration, only tokens are cleaned but previously allocated resources (IDs, type_ids, etc.) from earlier iterations leak.

Suggestion: Track allocated buffers and clean them on error, or use RAII patterns.

2. Missing Documentation
The new FFI function should be documented in project docs (CLAUDE.md mentions ABI compatibility but doesn't list available functions).

3. Test Gap
No test for invalid UTF-8 in pair encoding (though existing validation handles it).

📊 Performance

Parallel batch encoding via tokenizer.encode_batch (src/lib.rs:404) is excellent for reranking use cases. Good choice.

🔐 Security

No security concerns beyond the null-termination issue which could cause crashes/memory corruption.

✅ Verdict

Strong implementation but the null-termination bug is a blocker that must be fixed before merge. The tests currently pass likely due to lucky memory layout having zeros after strings, but this is unreliable.

After fixing the string handling:

✅ Merge-ready
Consider adding the suggested error cleanup improvement in a follow-up

When Rust source files (src/ or Cargo.*) change in a PR, force local library build instead of downloading from releases. This ensures CI tests run against the latest code changes including new FFI functions. Fixes the encode_batch_pairs undefined symbol error in CI by building the library locally when Rust code changes are detected.

The previous approach using git diff HEAD~1 failed because GitHub Actions uses shallow clones (fetch-depth: 1) where HEAD~1 doesn't exist. The error was silenced, causing the script to always download old releases.

New approach: Detect PR context using GITHUB_BASE_REF environment variable which is only set during pull_request events. When set, always build locally to ensure tests run against the latest code.

Benefits:

Simple and reliable - no git history needed
PRs always test current code (correctness over speed)
Pushes to main still use cached releases (fast)
No workflow changes required

claude · 2025-11-07T12:27:04Z

Follow-up: Critical Null-Termination Issue Still Present

The critical null-termination bug I identified in my previous review (lines 452-453 in tokenizers.go) has not been addressed and remains a blocker for this PR.

The Problem

Go strings are not null-terminated, but the Rust FFI expects null-terminated C strings. Using unsafe.StringData without adding null terminators will cause undefined behavior.

Current Code (BROKEN):

for i := 0; i < len(sequences); i++ {
    cSequences[i] = unsafe.StringData(sequences[i])  // ❌ No null terminator
    cPairs[i] = unsafe.StringData(pairs[i])          // ❌ No null terminator
}

Why Tests Pass (False Positive)

Tests may currently pass due to:

Lucky memory layout with zeros after strings
Short test strings where garbage bytes happen to be non-printable
Specific allocator behavior that isn't guaranteed

This is undefined behavior and will fail unpredictably in production.

Required Fix

Add null terminators to all strings:

// Convert Go strings to null-terminated C strings
cSequences := make([]*byte, len(sequences))
cPairs := make([]*byte, len(pairs))
cSeqBytes := make([][]byte, len(sequences))
cPairBytes := make([][]byte, len(pairs))

for i := 0; i < len(sequences); i++ {
    // Append null terminator and keep reference to prevent GC
    cSeqBytes[i] = append([]byte(sequences[i]), 0)
    cPairBytes[i] = append([]byte(pairs[i]), 0)
    cSequences[i] = &cSeqBytes[i][0]
    cPairs[i] = &cPairBytes[i][0]
}

Alternative: Use syscall Package

import "syscall"

cSeqPtrs := make([]*byte, len(sequences))
cPairPtrs := make([]*byte, len(pairs))

for i := 0; i < len(sequences); i++ {
    seqPtr, err := syscall.BytePtrFromString(sequences[i])
    if err != nil {
        return nil, errors.Wrap(err, "invalid sequence string")
    }
    pairPtr, err := syscall.BytePtrFromString(pairs[i])
    if err != nil {
        return nil, errors.Wrap(err, "invalid pair string")
    }
    cSeqPtrs[i] = seqPtr
    cPairPtrs[i] = pairPtr
}

Verification Needed

After fixing, please add a test with strings that would expose this bug:

func TestEncodePairsNullTermination(t *testing.T) {
    // String without natural null bytes nearby in memory
    longSeq := strings.Repeat("A", 100)
    longPair := strings.Repeat("B", 100)
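
    // tok is assumed to be a Tokenizer created in the test setup
    results, err := tok.EncodePairs([]string{longSeq}, []string{longPair})
    require.NoError(t, err)
    // One input pair must yield exactly one encoding
    require.Len(t, results, 1)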
}

This must be fixed before merge. All other aspects of the PR are excellent, but this bug could cause crashes, data corruption, or security issues in production.

Go strings are not null-terminated, but Rust's CStr::from_ptr() expects null-terminated C strings. The existing Encode() method works because purego automatically adds null terminators when passing 'string' type parameters. However, EncodePairs() uses **byte (array of byte pointers) which bypasses purego's automatic conversion. Using unsafe.StringData() without null terminators causes undefined behavior when Rust scans for the null byte.

Changes:

Manually append null terminator (0 byte) to each string
Keep references to prevent GC during FFI call
Add test with long strings to verify null termination works

This fixes a critical bug where memory corruption or crashes could occur when Rust's CStr::from_ptr() reads past string boundaries.

claude · 2025-11-07T12:44:21Z

Code Review Summary

This PR implements pair encoding functionality for reranking use cases. The implementation is solid with good test coverage. Here are my findings:

Strengths ✓

Zero ABI breaking changes - Correctly adds new FFI function without modifying existing ones
Comprehensive test coverage - Tests cover single pairs, batches, edge cases, and long strings
Good CI improvements - PR detection logic ensures local builds for changed Rust code
Memory safety - Proper cleanup with defer and explicit buffer freeing
Error handling - Thorough validation and descriptive error messages
Clear documentation - Good API examples in PR description

Issues Found

1. Memory Leak in Error Path (src/lib.rs:406-415)

Severity: High

In encode_batch_pairs, if token CString conversion fails mid-loop, already allocated tokens are cleaned up, but other buffers (ids, type_ids, etc.) for previous iterations are leaked.

Location: src/lib.rs:406-415

for allocated_token in vec_tokens {
    drop(std::ffi::CString::from_raw(allocated_token));
}
return ERROR_CSTRING_CONVERSION_FAILED;

Fix: Wrap buffer allocation in RAII guards or clean up all previously written buffers on error.

2. Inconsistent Null Termination Handling (tokenizers.go:448-453)

Severity: Medium

The code appends null terminators to create C strings, but it does not check for NUL bytes inside the Go strings themselves. A string with an interior NUL byte would be cut off at that byte when Rust's CStr::from_ptr reads it. While the test with long strings (100+ chars) likely catches a missing terminator, the implementation could be more explicit.

Location: tokenizers.go:448-453

cSeqBytes[i] = append([]byte(sequences[i]), 0)
cPairBytes[i] = append([]byte(pairs[i]), 0)

Recommendation: Consider using a helper function to centralize C string creation for consistency with other encoding methods.

3. CI Script Duplication

Severity: Low

The PR detection logic is duplicated across Linux, macOS, and Windows sections in the action.yml file. This increases maintenance burden.

Recommendation: Consider extracting to a shared script or using composite actions for better DRY.

Minor Suggestions

Type safety in Go: The unsafe pointer casts at tokenizers.go:467-468 are necessary but consider adding helper functions to make the intent clearer
Documentation: Add godoc examples for EncodePairs showing the reranking use case mentioned in the PR description
Performance: Consider adding a benchmark comparing single Encode calls in a loop vs EncodePairs to demonstrate the performance benefit

Test Coverage Assessment

Excellent coverage including:

Single pair encoding
Batch encoding with multiple pairs
Empty pairs and batches
Mismatched array lengths
Various encoding options
Long string null termination edge case

Security Considerations

No security concerns identified
Proper input validation for null pointers and array bounds
Memory safety maintained with proper cleanup

Verdict

Approve with minor fix required for the memory leak in error path. The other issues are low priority but should be addressed in follow-up work.

Great implementation overall! Closes #95 effectively.

github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Go Benchmark'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

Benchmark suite	Current: `5294667`	Previous: `828678a`	Ratio
`BenchmarkDecode/WithSpecialTokens`	`20150` ns/op 740 B/op 10 allocs/op	`16745` ns/op 740 B/op 10 allocs/op	`1.20`
`BenchmarkDecode/WithSpecialTokens - ns/op`	`20150` ns/op	`16745` ns/op	`1.20`

This comment was automatically generated by workflow using github-action-benchmark.

CC: @tazarov

github-actions · 2025-11-07T13:01:37Z

Benchmark Comparison

goos: linux
goarch: amd64
pkg: github.com/amikos-tech/pure-tokenizers
cpu: AMD EPYC 7763 64-Core Processor                
                                  │ base_bench.txt │            pr_bench.txt            │
                                  │     sec/op     │    sec/op     vs base              │
Encode/Short-4                        9.452µ ± ∞ ¹   9.698µ ± ∞ ¹  +2.60% (p=0.008 n=5)
Encode/Medium-4                       42.89µ ± ∞ ¹   43.99µ ± ∞ ¹  +2.56% (p=0.008 n=5)
Encode/Long-4                         333.8µ ± ∞ ¹   335.6µ ± ∞ ¹       ~ (p=0.421 n=5)
EncodeWithOptions/Default-4           42.55µ ± ∞ ¹   43.80µ ± ∞ ¹       ~ (p=0.095 n=5)
EncodeWithOptions/WithTypeIDs-4       43.09µ ± ∞ ¹   43.92µ ± ∞ ¹  +1.93% (p=0.008 n=5)
EncodeWithOptions/WithTokens-4        43.24µ ± ∞ ¹   43.42µ ± ∞ ¹       ~ (p=0.421 n=5)
EncodeWithOptions/WithOffsets-4       43.31µ ± ∞ ¹   43.86µ ± ∞ ¹       ~ (p=0.056 n=5)
EncodeWithOptions/AllOptions-4        45.19µ ± ∞ ¹   46.42µ ± ∞ ¹  +2.73% (p=0.008 n=5)
Decode/WithSpecialTokens-4            19.95µ ± ∞ ¹   18.99µ ± ∞ ¹  -4.85% (p=0.016 n=5)
Decode/SkipSpecialTokens-4            20.07µ ± ∞ ¹   19.15µ ± ∞ ¹  -4.60% (p=0.008 n=5)
BatchEncode-4                         443.2µ ± ∞ ¹   448.2µ ± ∞ ¹       ~ (p=0.151 n=5)
FromHuggingFace/CreationOnly-4        37.67m ± ∞ ¹   37.60m ± ∞ ¹       ~ (p=0.841 n=5)
FromHuggingFace/FullLifecycle-4       37.69m ± ∞ ¹   37.39m ± ∞ ¹       ~ (p=0.095 n=5)
VocabSize-4                           3.161m ± ∞ ¹   3.253m ± ∞ ¹       ~ (p=0.421 n=5)
EncodeDecode/Short-4                  14.80µ ± ∞ ¹   14.30µ ± ∞ ¹  -3.36% (p=0.016 n=5)
EncodeDecode/Medium-4                 65.70µ ± ∞ ¹   67.09µ ± ∞ ¹  +2.12% (p=0.008 n=5)
EncodeDecode/Long-4                   494.8µ ± ∞ ¹   498.2µ ± ∞ ¹  +0.69% (p=0.008 n=5)
Truncation-4                          326.7µ ± ∞ ¹   338.8µ ± ∞ ¹       ~ (p=0.222 n=5)
Padding-4                             120.4µ ± ∞ ¹   123.4µ ± ∞ ¹  +2.49% (p=0.008 n=5)
ConcurrentCacheRead-4                 4.659µ ± ∞ ¹   4.686µ ± ∞ ¹       ~ (p=0.222 n=5)
ConcurrentCacheValidation-4           5.548µ ± ∞ ¹   5.435µ ± ∞ ¹  -2.04% (p=0.016 n=5)
ConcurrentHFCacheLookup-4             9.164µ ± ∞ ¹   8.966µ ± ∞ ¹  -2.16% (p=0.008 n=5)
DownloadWithFailureRecovery-4          1.153 ± ∞ ¹    1.076 ± ∞ ¹       ~ (p=0.310 n=5)
ConcurrentDownloadsWithFailures-4     43.02m ± ∞ ¹   45.91m ± ∞ ¹       ~ (p=0.056 n=5)
FromHuggingFaceWithCache-4            10.55µ ± ∞ ¹   10.40µ ± ∞ ¹  -1.45% (p=0.008 n=5)
FromHuggingFaceWithoutCache-4         144.3µ ± ∞ ¹   141.1µ ± ∞ ¹  -2.20% (p=0.008 n=5)
geomean                               166.4µ         166.8µ        +0.23%
¹ need >= 6 samples for confidence interval at level 0.95

                                  │ base_bench.txt │             pr_bench.txt              │
                                  │      B/op      │     B/op       vs base                │
Encode/Short-4                         920.0 ± ∞ ¹     920.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Medium-4                      1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Long-4                        6.703Ki ± ∞ ¹   6.703Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/Default-4          1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTypeIDs-4      1.609Ki ± ∞ ¹   1.609Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTokens-4       1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithOffsets-4      1.703Ki ± ∞ ¹   1.703Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/AllOptions-4       2.109Ki ± ∞ ¹   2.109Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/WithSpecialTokens-4             740.0 ± ∞ ¹     740.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/SkipSpecialTokens-4             740.0 ± ∞ ¹     740.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
BatchEncode-4                        11.30Ki ± ∞ ¹   11.30Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFace/CreationOnly-4       6.146Mi ± ∞ ¹   6.146Mi ± ∞ ¹       ~ (p=0.841 n=5)
FromHuggingFace/FullLifecycle-4      6.154Mi ± ∞ ¹   6.134Mi ± ∞ ¹       ~ (p=0.310 n=5)
VocabSize-4                            288.0 ± ∞ ¹     288.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Short-4                 1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Medium-4                2.242Ki ± ∞ ¹   2.242Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Long-4                  8.430Ki ± ∞ ¹   8.430Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Truncation-4                         5.500Ki ± ∞ ¹   5.500Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Padding-4                            15.89Ki ± ∞ ¹   15.89Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheRead-4                2.062Ki ± ∞ ¹   2.062Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheValidation-4          3.023Ki ± ∞ ¹   3.023Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentHFCacheLookup-4            3.180Ki ± ∞ ¹   3.179Ki ± ∞ ¹       ~ (p=0.119 n=5)
DownloadWithFailureRecovery-4        59.52Ki ± ∞ ¹   59.97Ki ± ∞ ¹       ~ (p=0.841 n=5)
ConcurrentDownloadsWithFailures-4    18.82Ki ± ∞ ¹   18.79Ki ± ∞ ¹       ~ (p=0.421 n=5)
FromHuggingFaceWithCache-4           1.727Ki ± ∞ ¹   1.727Ki ± ∞ ¹       ~ (p=1.000 n=5)
FromHuggingFaceWithoutCache-4        16.21Ki ± ∞ ¹   16.21Ki ± ∞ ¹       ~ (p=0.111 n=5)
geomean                              5.425Ki         5.425Ki        +0.01%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                  │ base_bench.txt │             pr_bench.txt             │
                                  │   allocs/op    │  allocs/op    vs base                │
Encode/Short-4                         16.00 ± ∞ ¹    16.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Medium-4                        35.00 ± ∞ ¹    35.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Long-4                          165.0 ± ∞ ¹    165.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/Default-4            35.00 ± ∞ ¹    35.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTypeIDs-4        36.00 ± ∞ ¹    36.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTokens-4         35.00 ± ∞ ¹    35.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithOffsets-4        36.00 ± ∞ ¹    36.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/AllOptions-4         41.00 ± ∞ ¹    41.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/WithSpecialTokens-4             10.00 ± ∞ ¹    10.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/SkipSpecialTokens-4             10.00 ± ∞ ¹    10.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
BatchEncode-4                          261.0 ± ∞ ¹    261.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFace/CreationOnly-4        92.19k ± ∞ ¹   92.19k ± ∞ ¹  +0.01% (p=0.008 n=5)
FromHuggingFace/FullLifecycle-4       92.19k ± ∞ ¹   92.20k ± ∞ ¹  +0.01% (p=0.008 n=5)
VocabSize-4                            5.000 ± ∞ ¹    5.000 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Short-4                   26.00 ± ∞ ¹    26.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Medium-4                  45.00 ± ∞ ¹    45.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Long-4                    175.0 ± ∞ ¹    175.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Truncation-4                           127.0 ± ∞ ¹    127.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Padding-4                              535.0 ± ∞ ¹    535.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheRead-4                  25.00 ± ∞ ¹    25.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheValidation-4            43.00 ± ∞ ¹    43.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentHFCacheLookup-4              38.00 ± ∞ ¹    38.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
DownloadWithFailureRecovery-4          457.0 ± ∞ ¹    450.0 ± ∞ ¹       ~ (p=0.548 n=5)
ConcurrentDownloadsWithFailures-4      231.0 ± ∞ ¹    231.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFaceWithCache-4             7.000 ± ∞ ¹    7.000 ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFaceWithoutCache-4          217.0 ± ∞ ¹    217.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
geomean                                89.79          89.74        -0.06%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

When encode_batch_pairs encounters an error during buffer allocation (e.g., CString conversion failure), it was only cleaning up the current iteration's tokens but leaking all buffers from previous successful iterations.

Changes:

Use two-phase allocation: collect all buffers in temp storage first
Only write to output array if ALL allocations succeed
On error, clean up all buffers in temp storage using new helper
Add free_buffer_contents() helper for cleanup without pointer deref

This ensures either:

All buffers successfully written to output (caller frees them), OR
No buffers written to output (all cleaned up, error returned)

Fixes potential multi-megabyte memory leak when processing large batches with malformed token data containing interior null bytes.

claude · 2025-11-07T13:27:25Z

PR Review - EncodePair Implementation

Excellent work! This PR successfully implements pair encoding for reranking use cases. The code addresses all previously identified issues.

✅ Strengths

Critical fixes applied: Null-termination (tokenizers.go:456-459) and memory cleanup (src/lib.rs:444-446) properly implemented
Clean API design: EncodePair and EncodePairs provide intuitive convenience and batch interfaces
Zero ABI breaking changes: New FFI function only, maintains backward compatibility
Comprehensive testing: Covers edge cases including long strings, empty inputs, mismatched lengths
Proper error handling: Thorough validation with clear error messages
CI improvements: PR detection logic ensures Rust changes are built locally for testing

🔍 Minor Suggestions

Benchmark for pairs: Consider adding BenchmarkEncodePair and BenchmarkEncodePairs to demonstrate performance benefits over sequential Encode calls
Godoc example: Add a code example showing the reranking use case in function documentation
CI script organization (.github/actions/get-rust-library/action.yml): The PR detection logic is duplicated across three platforms. Minor maintenance improvement opportunity for future refactoring.

📊 Verification

Memory safety: Two-phase allocation pattern prevents leaks ✓
Test coverage: All critical paths tested ✓
Performance: Parallel batch encoding via encode_batch ✓
Documentation: Clear comments on null-termination ✓

✅ Recommendation

Approve and merge. All critical issues resolved. Minor suggestions are optional enhancements for future PRs.

Closes #95 effectively.

tazarov added 2 commits November 7, 2025 13:22

github-actions Bot reviewed Nov 7, 2025

tazarov merged commit 0a5055c into main Nov 7, 2025
21 checks passed

tazarov deleted the claude/implement-encode-pair-method-011CUtLHXzsJLofz1cnS49Ps branch November 7, 2025 14:02

tazarov mentioned this pull request Nov 8, 2025

[DOC] Add comprehensive release documentation and automation skill #97

Open

3 tasks

tazarov merged 5 commits into main from claude/implement-encode-pair-method-011CUtLHXzsJLofz1cnS49Ps