Implement EncodePair method for Tokenizer #96

Merged
tazarov merged 5 commits into main from claude/implement-encode-pair-method-011CUtLHXzsJLofz1cnS49Ps
Nov 7, 2025
Conversation

@tazarov tazarov commented Nov 7, 2025

Contributor

Implements EncodePair and EncodePairs methods to encode sequence pairs, enabling efficient query-document pair encoding for reranking tasks.

Key Features:

  • EncodePairs: Batch encoding of multiple sequence pairs with parallel processing
  • EncodePair: Convenience wrapper for single pair encoding
  • Zero ABI breaking changes: New FFI function encode_batch_pairs

Implementation:

  • Rust: encode_batch_pairs FFI function using tokenizer.encode_batch
  • Go: EncodePairs method with array handling and EncodePair convenience wrapper
  • Tests: Comprehensive test coverage for both single and batch pair encoding

Use Case:
Rerankers typically score 1 query against N documents. Batch pair encoding provides significant performance improvement through parallel processing.

API:

// Single pair
result, err := tokenizer.EncodePair("query", "document", opts...)

// Batch pairs (efficient for reranking)
results, err := tokenizer.EncodePairs(
    []string{"query1", "query2"},
    []string{"doc1", "doc2"},
    opts..., // trailing comma is required when the closing paren is on its own line
)
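
For the reranking case described above (one query scored against N documents), the two parallel slices can be built by repeating the query once per document. The following is a minimal sketch: `buildPairs` is a hypothetical helper, not part of the library, and the `EncodePairs` call appears only as a comment because tokenizer construction is not shown in this description.

```go
package main

import "fmt"

// buildPairs is a hypothetical helper (not part of the library). It expands
// one query and N documents into two parallel slices of equal length, which
// is the shape EncodePairs takes according to the API shown above.
func buildPairs(query string, docs []string) (queries, documents []string) {
	queries = make([]string, len(docs))
	documents = make([]string, len(docs))
	for i, d := range docs {
		queries[i] = query
		documents[i] = d
	}
	return queries, documents
}

func main() {
	q, d := buildPairs("what is a tokenizer?", []string{"doc A", "doc B", "doc C"})
	fmt.Println(len(q), len(d)) // both slices have one entry per document

	// The slices would then be handed to the batch method, for example:
	//   results, err := tokenizer.EncodePairs(q, d, opts...)
}
```

Keeping the two slices the same length matters because the reviews below note that mismatched lengths are rejected.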

Pull Request

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Testing

  • Tests pass locally
  • Added tests for new functionality
  • Tested on multiple platforms (if applicable)
  • Cross-compilation works (if Rust changes)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Platform Testing

If this PR affects cross-platform functionality, please test on:

  • Linux (x86_64)
  • Linux (aarch64)
  • macOS (Intel)
  • macOS (Apple Silicon)
  • Windows (x86_64)

Additional Notes

Closes #95


claude Bot commented Nov 7, 2025


Code Review

This PR implements batch pair encoding functionality with solid implementation and comprehensive testing. Here's my analysis:

✅ Strengths

1. Clean API Design

  • EncodePair (convenience) and EncodePairs (batch) provide intuitive API
  • Leverages existing EncodeOptions infrastructure consistently
  • Zero ABI breaking changes - new FFI function only

2. Robust Error Handling

  • Proper validation: null checks, length matching, empty batch handling
  • Memory cleanup on token conversion failure (src/lib.rs:435-438)
  • Consistent error propagation pattern

3. Comprehensive Test Coverage

  • Single pair, batch pairs, empty inputs, mismatched lengths
  • Tests verify tokens, type IDs, attention masks, offsets
  • Edge case: empty second sequence

4. Memory Management

  • Proper buffer cleanup in Go (tokenizers.go:509)
  • std::mem::forget pattern matches existing code
  • Token cleanup on CString conversion failure

⚠️ Critical Issue: Null-Terminated Strings

Problem: The Go code uses unsafe.StringData to pass strings to C (tokenizers.go:452-453):

cSequences[i] = unsafe.StringData(sequences[i])
cPairs[i] = unsafe.StringData(pairs[i])

Issue: unsafe.StringData returns a pointer to the string's underlying byte array, but Go strings are NOT null-terminated. The Rust FFI expects null-terminated C strings (*const libc::c_char) per the safety documentation (src/lib.rs:335-336).

Impact: This will cause undefined behavior - the Rust CStr::from_ptr will read past the string boundary until it finds a null byte, potentially:

  • Reading garbage data
  • Segmentation faults
  • Data corruption

Fix Required: Strings must be null-terminated. Compare with the single Encode function (tokenizers.go:389) which passes the string directly - purego handles this automatically for single strings but NOT for string arrays.

Recommended Solution:

// Convert Go strings to null-terminated C strings
cSequences := make([]*byte, len(sequences))
cPairs := make([]*byte, len(pairs))

for i := 0; i < len(sequences); i++ {
    // Append null terminator
    seqBytes := append([]byte(sequences[i]), 0)
    pairBytes := append([]byte(pairs[i]), 0)
    cSequences[i] = &seqBytes[0]
    cPairs[i] = &pairBytes[0]
}

Alternatively, use syscall.BytePtrFromString which handles null termination.
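
To make the failure mode concrete, here is a small pure-Go sketch of what a C-style reader does: it scans forward until it meets a 0 byte. `cBytes` and `scanToNul` are illustrative names, not code from this PR. The scan is bounded by the slice length only so that the example stays memory-safe; a real `CStr::from_ptr` call has no such bound.

```go
package main

import "fmt"

// cBytes copies s and appends the 0 byte that C-style readers look for.
func cBytes(s string) []byte {
	return append([]byte(s), 0)
}

// scanToNul mimics a C string reader: it returns the number of bytes before
// the first 0 byte, or -1 if the buffer ends without one. A real C reader
// would not stop at the end of the buffer; it would continue into whatever
// memory happens to follow.
func scanToNul(buf []byte) int {
	for i, b := range buf {
		if b == 0 {
			return i
		}
	}
	return -1
}

func main() {
	// With a terminator, the scan stops at the true string length.
	fmt.Println(scanToNul(cBytes("query")))
	// Without one, no terminator exists inside the string's own bytes.
	fmt.Println(scanToNul([]byte("query")))
}
```

The second call is the situation `unsafe.StringData` creates: the string's bytes carry no terminator of their own, so the reader's stopping point depends on neighboring memory.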

🔍 Minor Issues

1. Incomplete Error Cleanup (src/lib.rs:434-440)
When token CString conversion fails mid-iteration, only tokens are cleaned but previously allocated resources (IDs, type_ids, etc.) from earlier iterations leak.

Suggestion: Track allocated buffers and clean them on error, or use RAII patterns.

2. Missing Documentation
The new FFI function should be documented in project docs (CLAUDE.md mentions ABI compatibility but doesn't list available functions).

3. Test Gap
No test for invalid UTF-8 in pair encoding (though existing validation handles it).

📊 Performance

Parallel batch encoding via tokenizer.encode_batch (src/lib.rs:404) is excellent for reranking use cases. Good choice.

🔐 Security

No security concerns beyond the null-termination issue which could cause crashes/memory corruption.

✅ Verdict

Strong implementation but the null-termination bug is a blocker that must be fixed before merge. The tests currently pass likely due to lucky memory layout having zeros after strings, but this is unreliable.

After fixing the string handling:

  • ✅ Merge-ready
  • Consider adding the suggested error cleanup improvement in a follow-up

When Rust source files (src/ or Cargo.*) change in a PR, force local
library build instead of downloading from releases. This ensures CI tests
run against the latest code changes including new FFI functions.

Fixes the encode_batch_pairs undefined symbol error in CI by building
the library locally when Rust code changes are detected.

The previous approach using git diff HEAD~1 failed because actions/checkout
defaults to a shallow clone (fetch-depth: 1), where HEAD~1 doesn't exist. The
error was silenced, causing the script to always download old releases.

New approach: Detect PR context using the GITHUB_BASE_REF environment variable,
which is only set for pull_request and pull_request_target events. When set,
always build locally to ensure tests run against the latest code.

Benefits:
- Simple and reliable - no git history needed
- PRs always test current code (correctness over speed)
- Pushes to main still use cached releases (fast)
- No workflow changes required
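
The decision rule described above is small enough to sketch. The action itself is presumably shell inside a YAML file, which is not shown here; the Go below only illustrates the logic, and `shouldBuildLocally` is a hypothetical name.

```go
package main

import (
	"fmt"
	"os"
)

// shouldBuildLocally reports whether the library should be built from source.
// GITHUB_BASE_REF is non-empty only on pull request events, so a non-empty
// value is treated as "this is a PR, test the current code".
// The lookup function is injected so the rule can be exercised without
// touching the real environment.
func shouldBuildLocally(getenv func(string) string) bool {
	return getenv("GITHUB_BASE_REF") != ""
}

func main() {
	if shouldBuildLocally(os.Getenv) {
		fmt.Println("PR context: build the Rust library locally")
	} else {
		fmt.Println("non-PR context: download the released library")
	}
}
```

Because the rule reads a single environment variable, it needs no git history, which is what made the `HEAD~1` approach fail on shallow clones.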

claude Bot commented Nov 7, 2025


Follow-up: Critical Null-Termination Issue Still Present

The critical null-termination bug I identified in my previous review (lines 452-453 in tokenizers.go) has not been addressed and remains a blocker for this PR.

The Problem

Go strings are not null-terminated, but the Rust FFI expects null-terminated C strings. Using unsafe.StringData without adding null terminators will cause undefined behavior.

Current Code (BROKEN):

for i := 0; i < len(sequences); i++ {
    cSequences[i] = unsafe.StringData(sequences[i])  // ❌ No null terminator
    cPairs[i] = unsafe.StringData(pairs[i])          // ❌ No null terminator
}

Why Tests Pass (False Positive)

Tests may currently pass due to:

  1. Lucky memory layout with zeros after strings
  2. Short test strings where garbage bytes happen to be non-printable
  3. Specific allocator behavior that isn't guaranteed

This is undefined behavior and will fail unpredictably in production.

Required Fix

Add null terminators to all strings:

// Convert Go strings to null-terminated C strings
cSequences := make([]*byte, len(sequences))
cPairs := make([]*byte, len(pairs))
cSeqBytes := make([][]byte, len(sequences))
cPairBytes := make([][]byte, len(pairs))

for i := 0; i < len(sequences); i++ {
    // Append null terminator and keep reference to prevent GC
    cSeqBytes[i] = append([]byte(sequences[i]), 0)
    cPairBytes[i] = append([]byte(pairs[i]), 0)
    cSequences[i] = &cSeqBytes[i][0]
    cPairs[i] = &cPairBytes[i][0]
}

Alternative: Use syscall Package

import "syscall"

cSeqPtrs := make([]*byte, len(sequences))
cPairPtrs := make([]*byte, len(pairs))

for i := 0; i < len(sequences); i++ {
    seqPtr, err := syscall.BytePtrFromString(sequences[i])
    if err != nil {
        return nil, errors.Wrap(err, "invalid sequence string")
    }
    pairPtr, err := syscall.BytePtrFromString(pairs[i])
    if err != nil {
        return nil, errors.Wrap(err, "invalid pair string")
    }
    cSeqPtrs[i] = seqPtr
    cPairPtrs[i] = pairPtr
}

Verification Needed

After fixing, please add a test with strings that would expose this bug:

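func TestEncodePairsNullTermination(t *testing.T) {
    // Long strings with no NUL bytes of their own, so a missing terminator
    // would force the reader past the end of the string.
    longSeq := strings.Repeat("A", 100)
    longPair := strings.Repeat("B", 100)

    // tok is assumed to be a tokenizer initialized elsewhere in the test suite.
    results, err := tok.EncodePairs([]string{longSeq}, []string{longPair})
    require.NoError(t, err)
    // One input pair should yield exactly one encoding.
    require.Len(t, results, 1)
}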

This must be fixed before merge. All other aspects of the PR are excellent, but this bug could cause crashes, data corruption, or security issues in production.

Go strings are not null-terminated, but Rust's CStr::from_ptr() expects
null-terminated C strings. The existing Encode() method works because
purego automatically adds null terminators when passing 'string' type
parameters.

However, EncodePairs() uses **byte (array of byte pointers) which bypasses
purego's automatic conversion. Using unsafe.StringData() without null
terminators causes undefined behavior when Rust scans for the null byte.

Changes:
- Manually append null terminator (0 byte) to each string
- Keep references to prevent GC during FFI call
- Add test with long strings to verify null termination works

This fixes a critical bug where memory corruption or crashes could occur
when Rust's CStr::from_ptr() reads past string boundaries.

claude Bot commented Nov 7, 2025


Code Review Summary

This PR implements pair encoding functionality for reranking use cases. The implementation is solid with good test coverage. Here are my findings:

Strengths ✓

  1. Zero ABI breaking changes - Correctly adds new FFI function without modifying existing ones
  2. Comprehensive test coverage - Tests cover single pairs, batches, edge cases, and long strings
  3. Good CI improvements - PR detection logic ensures local builds for changed Rust code
  4. Memory safety - Proper cleanup with defer and explicit buffer freeing
  5. Error handling - Thorough validation and descriptive error messages
  6. Clear documentation - Good API examples in PR description

Issues Found

1. Memory Leak in Error Path (src/lib.rs:406-415)

Severity: High

In encode_batch_pairs, if token CString conversion fails mid-loop, already allocated tokens are cleaned up, but other buffers (ids, type_ids, etc.) for previous iterations are leaked.

Location: src/lib.rs:406-415

for allocated_token in vec_tokens {
    drop(std::ffi::CString::from_raw(allocated_token));
}
return ERROR_CSTRING_CONVERSION_FAILED;

Fix: Wrap buffer allocation in RAII guards or clean up all previously written buffers on error.

2. Inconsistent Null Termination Handling (tokenizers.go:448-453)

Severity: Medium

The code appends null terminators to create C strings, but a Go string may itself contain an interior NUL byte, in which case the Rust side would stop reading at that byte and silently truncate the input. While the test with long strings (100+ chars) likely catches a missing terminator, the implementation could be more explicit.

Location: tokenizers.go:448-453

cSeqBytes[i] = append([]byte(sequences[i]), 0)
cPairBytes[i] = append([]byte(pairs[i]), 0)

Recommendation: Consider using a helper function to centralize C string creation for consistency with other encoding methods.
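
One possible shape for such a helper is sketched below. `toCStrings` is a hypothetical name, and rejecting interior NUL bytes is an added assumption (it mirrors what `syscall.BytePtrFromString` does) rather than something this PR is known to implement.

```go
package main

import (
	"fmt"
	"strings"
)

// toCStrings is a hypothetical helper that centralizes C string creation.
// For each input it makes a null-terminated copy and returns a pointer to the
// copy's first byte. The backing slices are returned as well so the caller
// can keep them referenced until the foreign call has returned.
func toCStrings(strs []string) (ptrs []*byte, backing [][]byte, err error) {
	ptrs = make([]*byte, len(strs))
	backing = make([][]byte, len(strs))
	for i, s := range strs {
		// Assumption: an interior NUL is treated as an error, because a C
		// reader would stop there and see a truncated string.
		if strings.IndexByte(s, 0) >= 0 {
			return nil, nil, fmt.Errorf("string %d contains an interior NUL byte", i)
		}
		backing[i] = append([]byte(s), 0)
		ptrs[i] = &backing[i][0]
	}
	return ptrs, backing, nil
}

func main() {
	ptrs, backing, err := toCStrings([]string{"query", ""})
	fmt.Println(len(ptrs), len(backing[0]), len(backing[1]), err)

	_, _, err = toCStrings([]string{"bad\x00input"})
	fmt.Println(err != nil)
}
```

A caller would typically follow the foreign call with `runtime.KeepAlive(backing)` so the copies are not collected while native code is still reading them.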

3. CI Script Duplication

Severity: Low

The PR detection logic is duplicated across Linux, macOS, and Windows sections in the action.yml file. This increases maintenance burden.

Recommendation: Consider extracting to a shared script or using composite actions for better DRY.

Minor Suggestions

  1. Type safety in Go: The unsafe pointer casts at tokenizers.go:467-468 are necessary but consider adding helper functions to make the intent clearer

  2. Documentation: Add godoc examples for EncodePairs showing the reranking use case mentioned in the PR description

  3. Performance: Consider adding a benchmark comparing single Encode calls in a loop vs EncodePairs to demonstrate the performance benefit

Test Coverage Assessment

Excellent coverage including:

  • Single pair encoding
  • Batch encoding with multiple pairs
  • Empty pairs and batches
  • Mismatched array lengths
  • Various encoding options
  • Long string null termination edge case

Security Considerations

  • No security concerns identified
  • Proper input validation for null pointers and array bounds
  • Memory safety maintained with proper cleanup

Verdict

Approve with minor fix required for the memory leak in error path. The other issues are low priority but should be addressed in follow-up work.

Great implementation overall! Closes #95 effectively.

@github-actions github-actions Bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Go Benchmark'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

| Benchmark suite | Current: 5294667 | Previous: 828678a | Ratio |
|---|---|---|---|
| BenchmarkDecode/WithSpecialTokens | 20150 ns/op 740 B/op 10 allocs/op | 16745 ns/op 740 B/op 10 allocs/op | 1.20 |
| BenchmarkDecode/WithSpecialTokens - ns/op | 20150 ns/op | 16745 ns/op | 1.20 |

This comment was automatically generated by workflow using github-action-benchmark.
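
The reported ratio can be reproduced from the two ns/op figures above. This sketch assumes the alert fires when current divided by previous is strictly greater than the threshold; `regression` is an illustrative name, not part of the benchmark tooling.

```go
package main

import "fmt"

// regression returns current/previous and whether that ratio exceeds the
// threshold. Larger ns/op means slower, so a ratio above 1 is a slowdown.
func regression(currentNs, previousNs, threshold float64) (ratio float64, alert bool) {
	ratio = currentNs / previousNs
	return ratio, ratio > threshold
}

func main() {
	// Figures taken from the alert: 20150 ns/op now, 16745 ns/op before.
	ratio, alert := regression(20150, 16745, 1.20)
	fmt.Printf("ratio=%.4f alert=%v\n", ratio, alert) // prints "ratio=1.2033 alert=true"
}
```

The ratio sits only just above the 1.20 threshold, and the benchstat comparison later in this thread reports the same benchmark as 4.85% faster, so the alert may reflect run-to-run noise rather than a real slowdown.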

CC: @tazarov


github-actions Bot commented Nov 7, 2025


Benchmark Comparison

goos: linux
goarch: amd64
pkg: github.com/amikos-tech/pure-tokenizers
cpu: AMD EPYC 7763 64-Core Processor                
                                  │ base_bench.txt │            pr_bench.txt            │
                                  │     sec/op     │    sec/op     vs base              │
Encode/Short-4                        9.452µ ± ∞ ¹   9.698µ ± ∞ ¹  +2.60% (p=0.008 n=5)
Encode/Medium-4                       42.89µ ± ∞ ¹   43.99µ ± ∞ ¹  +2.56% (p=0.008 n=5)
Encode/Long-4                         333.8µ ± ∞ ¹   335.6µ ± ∞ ¹       ~ (p=0.421 n=5)
EncodeWithOptions/Default-4           42.55µ ± ∞ ¹   43.80µ ± ∞ ¹       ~ (p=0.095 n=5)
EncodeWithOptions/WithTypeIDs-4       43.09µ ± ∞ ¹   43.92µ ± ∞ ¹  +1.93% (p=0.008 n=5)
EncodeWithOptions/WithTokens-4        43.24µ ± ∞ ¹   43.42µ ± ∞ ¹       ~ (p=0.421 n=5)
EncodeWithOptions/WithOffsets-4       43.31µ ± ∞ ¹   43.86µ ± ∞ ¹       ~ (p=0.056 n=5)
EncodeWithOptions/AllOptions-4        45.19µ ± ∞ ¹   46.42µ ± ∞ ¹  +2.73% (p=0.008 n=5)
Decode/WithSpecialTokens-4            19.95µ ± ∞ ¹   18.99µ ± ∞ ¹  -4.85% (p=0.016 n=5)
Decode/SkipSpecialTokens-4            20.07µ ± ∞ ¹   19.15µ ± ∞ ¹  -4.60% (p=0.008 n=5)
BatchEncode-4                         443.2µ ± ∞ ¹   448.2µ ± ∞ ¹       ~ (p=0.151 n=5)
FromHuggingFace/CreationOnly-4        37.67m ± ∞ ¹   37.60m ± ∞ ¹       ~ (p=0.841 n=5)
FromHuggingFace/FullLifecycle-4       37.69m ± ∞ ¹   37.39m ± ∞ ¹       ~ (p=0.095 n=5)
VocabSize-4                           3.161m ± ∞ ¹   3.253m ± ∞ ¹       ~ (p=0.421 n=5)
EncodeDecode/Short-4                  14.80µ ± ∞ ¹   14.30µ ± ∞ ¹  -3.36% (p=0.016 n=5)
EncodeDecode/Medium-4                 65.70µ ± ∞ ¹   67.09µ ± ∞ ¹  +2.12% (p=0.008 n=5)
EncodeDecode/Long-4                   494.8µ ± ∞ ¹   498.2µ ± ∞ ¹  +0.69% (p=0.008 n=5)
Truncation-4                          326.7µ ± ∞ ¹   338.8µ ± ∞ ¹       ~ (p=0.222 n=5)
Padding-4                             120.4µ ± ∞ ¹   123.4µ ± ∞ ¹  +2.49% (p=0.008 n=5)
ConcurrentCacheRead-4                 4.659µ ± ∞ ¹   4.686µ ± ∞ ¹       ~ (p=0.222 n=5)
ConcurrentCacheValidation-4           5.548µ ± ∞ ¹   5.435µ ± ∞ ¹  -2.04% (p=0.016 n=5)
ConcurrentHFCacheLookup-4             9.164µ ± ∞ ¹   8.966µ ± ∞ ¹  -2.16% (p=0.008 n=5)
DownloadWithFailureRecovery-4          1.153 ± ∞ ¹    1.076 ± ∞ ¹       ~ (p=0.310 n=5)
ConcurrentDownloadsWithFailures-4     43.02m ± ∞ ¹   45.91m ± ∞ ¹       ~ (p=0.056 n=5)
FromHuggingFaceWithCache-4            10.55µ ± ∞ ¹   10.40µ ± ∞ ¹  -1.45% (p=0.008 n=5)
FromHuggingFaceWithoutCache-4         144.3µ ± ∞ ¹   141.1µ ± ∞ ¹  -2.20% (p=0.008 n=5)
geomean                               166.4µ         166.8µ        +0.23%
¹ need >= 6 samples for confidence interval at level 0.95

                                  │ base_bench.txt │             pr_bench.txt              │
                                  │      B/op      │     B/op       vs base                │
Encode/Short-4                         920.0 ± ∞ ¹     920.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Medium-4                      1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Long-4                        6.703Ki ± ∞ ¹   6.703Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/Default-4          1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTypeIDs-4      1.609Ki ± ∞ ¹   1.609Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTokens-4       1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithOffsets-4      1.703Ki ± ∞ ¹   1.703Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/AllOptions-4       2.109Ki ± ∞ ¹   2.109Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/WithSpecialTokens-4             740.0 ± ∞ ¹     740.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/SkipSpecialTokens-4             740.0 ± ∞ ¹     740.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
BatchEncode-4                        11.30Ki ± ∞ ¹   11.30Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFace/CreationOnly-4       6.146Mi ± ∞ ¹   6.146Mi ± ∞ ¹       ~ (p=0.841 n=5)
FromHuggingFace/FullLifecycle-4      6.154Mi ± ∞ ¹   6.134Mi ± ∞ ¹       ~ (p=0.310 n=5)
VocabSize-4                            288.0 ± ∞ ¹     288.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Short-4                 1.516Ki ± ∞ ¹   1.516Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Medium-4                2.242Ki ± ∞ ¹   2.242Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Long-4                  8.430Ki ± ∞ ¹   8.430Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Truncation-4                         5.500Ki ± ∞ ¹   5.500Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
Padding-4                            15.89Ki ± ∞ ¹   15.89Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheRead-4                2.062Ki ± ∞ ¹   2.062Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheValidation-4          3.023Ki ± ∞ ¹   3.023Ki ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentHFCacheLookup-4            3.180Ki ± ∞ ¹   3.179Ki ± ∞ ¹       ~ (p=0.119 n=5)
DownloadWithFailureRecovery-4        59.52Ki ± ∞ ¹   59.97Ki ± ∞ ¹       ~ (p=0.841 n=5)
ConcurrentDownloadsWithFailures-4    18.82Ki ± ∞ ¹   18.79Ki ± ∞ ¹       ~ (p=0.421 n=5)
FromHuggingFaceWithCache-4           1.727Ki ± ∞ ¹   1.727Ki ± ∞ ¹       ~ (p=1.000 n=5)
FromHuggingFaceWithoutCache-4        16.21Ki ± ∞ ¹   16.21Ki ± ∞ ¹       ~ (p=0.111 n=5)
geomean                              5.425Ki         5.425Ki        +0.01%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

                                  │ base_bench.txt │             pr_bench.txt             │
                                  │   allocs/op    │  allocs/op    vs base                │
Encode/Short-4                         16.00 ± ∞ ¹    16.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Medium-4                        35.00 ± ∞ ¹    35.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Encode/Long-4                          165.0 ± ∞ ¹    165.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/Default-4            35.00 ± ∞ ¹    35.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTypeIDs-4        36.00 ± ∞ ¹    36.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithTokens-4         35.00 ± ∞ ¹    35.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/WithOffsets-4        36.00 ± ∞ ¹    36.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeWithOptions/AllOptions-4         41.00 ± ∞ ¹    41.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/WithSpecialTokens-4             10.00 ± ∞ ¹    10.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
Decode/SkipSpecialTokens-4             10.00 ± ∞ ¹    10.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
BatchEncode-4                          261.0 ± ∞ ¹    261.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFace/CreationOnly-4        92.19k ± ∞ ¹   92.19k ± ∞ ¹  +0.01% (p=0.008 n=5)
FromHuggingFace/FullLifecycle-4       92.19k ± ∞ ¹   92.20k ± ∞ ¹  +0.01% (p=0.008 n=5)
VocabSize-4                            5.000 ± ∞ ¹    5.000 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Short-4                   26.00 ± ∞ ¹    26.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Medium-4                  45.00 ± ∞ ¹    45.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
EncodeDecode/Long-4                    175.0 ± ∞ ¹    175.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Truncation-4                           127.0 ± ∞ ¹    127.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
Padding-4                              535.0 ± ∞ ¹    535.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheRead-4                  25.00 ± ∞ ¹    25.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentCacheValidation-4            43.00 ± ∞ ¹    43.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
ConcurrentHFCacheLookup-4              38.00 ± ∞ ¹    38.00 ± ∞ ¹       ~ (p=1.000 n=5) ²
DownloadWithFailureRecovery-4          457.0 ± ∞ ¹    450.0 ± ∞ ¹       ~ (p=0.548 n=5)
ConcurrentDownloadsWithFailures-4      231.0 ± ∞ ¹    231.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFaceWithCache-4             7.000 ± ∞ ¹    7.000 ± ∞ ¹       ~ (p=1.000 n=5) ²
FromHuggingFaceWithoutCache-4          217.0 ± ∞ ¹    217.0 ± ∞ ¹       ~ (p=1.000 n=5) ²
geomean                                89.79          89.74        -0.06%
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal

When encode_batch_pairs encounters an error during buffer allocation
(e.g., CString conversion failure), it was only cleaning up the current
iteration's tokens but leaking all buffers from previous successful
iterations.

Changes:
- Use two-phase allocation: collect all buffers in temp storage first
- Only write to output array if ALL allocations succeed
- On error, clean up all buffers in temp storage using new helper
- Add free_buffer_contents() helper for cleanup without pointer deref

This ensures either:
- All buffers successfully written to output (caller frees them), OR
- No buffers written to output (all cleaned up, error returned)

Fixes potential multi-megabyte memory leak when processing large batches
with malformed token data containing interior null bytes.

claude Bot commented Nov 7, 2025


PR Review - EncodePair Implementation

Excellent work! This PR successfully implements pair encoding for reranking use cases. The code addresses all previously identified issues.

✅ Strengths

  1. Critical fixes applied: Null-termination (tokenizers.go:456-459) and memory cleanup (src/lib.rs:444-446) properly implemented
  2. Clean API design: EncodePair and EncodePairs provide intuitive convenience and batch interfaces
  3. Zero ABI breaking changes: New FFI function only, maintains backward compatibility
  4. Comprehensive testing: Covers edge cases including long strings, empty inputs, mismatched lengths
  5. Proper error handling: Thorough validation with clear error messages
  6. CI improvements: PR detection logic ensures Rust changes are built locally for testing

🔍 Minor Suggestions

  1. Benchmark for pairs: Consider adding BenchmarkEncodePair and BenchmarkEncodePairs to demonstrate performance benefits over sequential Encode calls

  2. Godoc example: Add a code example showing the reranking use case in function documentation

  3. CI script organization (.github/actions/get-rust-library/action.yml): The PR detection logic is duplicated across three platforms. Minor maintenance improvement opportunity for future refactoring.

📊 Verification

  • Memory safety: Two-phase allocation pattern prevents leaks ✓
  • Test coverage: All critical paths tested ✓
  • Performance: Parallel batch encoding via encode_batch
  • Documentation: Clear comments on null-termination ✓

✅ Recommendation

Approve and merge. All critical issues resolved. Minor suggestions are optional enhancements for future PRs.

Closes #95 effectively.

@tazarov tazarov merged commit 0a5055c into main Nov 7, 2025
21 checks passed
@tazarov tazarov deleted the claude/implement-encode-pair-method-011CUtLHXzsJLofz1cnS49Ps branch November 7, 2025 14:02
Development

Successfully merging this pull request may close these issues.

[ENH] Pair encoding

2 participants