Implement zero-copy tokenization for identifiers, strings, and comments #2136

eyalleshem · 2025-12-18T20:49:58Z

This PR implements zero-copy tokenization by using borrowed strings (&str) instead of owned strings (String) for identifiers, string literals, and comments. This eliminates unnecessary string allocations during the tokenization process.

Changes

- Modified Token variants to store &'a str instead of String for:
  - Word tokens (identifiers like table/column names)
  - SingleQuotedString literals
  - Whitespace
  - Comments (single-line and multi-line)
- Implemented case-insensitive keyword lookup without to_uppercase() allocation
- Added tokenize_bench criterion benchmark for performance measurement
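
To make the first two items concrete, here is a minimal sketch of a borrowed word token combined with an allocation-free keyword lookup. It is not the code from this PR: the `Keyword` enum, the `KEYWORDS` table, `lookup_keyword`, `Word`, and `next_word` are all hypothetical names chosen for illustration, and the keyword set is deliberately tiny. The only standard-library call it relies on for the case-insensitive comparison is `str::eq_ignore_ascii_case`.

```rust
/// Hypothetical, deliberately tiny keyword set (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Keyword {
    Select,
    From,
    Where,
}

/// Hypothetical lookup table, stored in uppercase.
const KEYWORDS: &[(&str, Keyword)] = &[
    ("SELECT", Keyword::Select),
    ("FROM", Keyword::From),
    ("WHERE", Keyword::Where),
];

/// Case-insensitive lookup that compares bytes in place instead of
/// building an uppercased copy of `word` first.
pub fn lookup_keyword(word: &str) -> Option<Keyword> {
    KEYWORDS
        .iter()
        .find(|(name, _)| name.eq_ignore_ascii_case(word))
        .map(|&(_, kw)| kw)
}

/// A word token whose text is a slice of the SQL source, not a copy.
#[derive(Debug, PartialEq, Eq)]
pub struct Word<'a> {
    pub value: &'a str,
    pub keyword: Option<Keyword>,
}

/// Reads one identifier-like word from the start of `src`.
/// Returns the token and the remaining input, or `None` if `src`
/// does not start with an ASCII letter, digit, or underscore.
pub fn next_word(src: &str) -> Option<(Word<'_>, &str)> {
    let end = src
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(src.len());
    if end == 0 {
        return None;
    }
    let (value, rest) = src.split_at(end);
    let keyword = lookup_keyword(value);
    Some((Word { value, keyword }, rest))
}
```

Calling `next_word("select * from t")` yields a `Word` whose `value` points into the input string and whose `keyword` is `Some(Keyword::Select)`. A linear scan is used here only to keep the sketch short; a real keyword table would more likely use a sorted array or a hash-based lookup.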

Performance Impact

Benchmark results using a complex 27KB SQL query with CTEs, joins, window functions, and extensive comments:

tokenization/tokenize_complex_sql
time: [254.68 µs 254.81 µs 254.97 µs]
change: [−60.885% −60.682% −60.482%] (p = 0.00 < 0.05)
Performance has improved.

This change introduces a lifetime parameter 'a to BorrowedToken enum to prepare for zero-copy tokenization support. This is a foundational step toward reducing memory allocations during SQL parsing.

Changes:
- Added lifetime parameter to BorrowedToken<'a> enum
- Added _Phantom(Cow<'a, str>) variant to carry the lifetime
- Implemented Visit and VisitMut traits for Cow<'a, str> to support the visitor pattern with the new lifetime parameter
- Fixed lifetime issues in visitor tests by using tokenized_owned() instead of tokenize() where owned tokens are required
- Type alias Token = BorrowedToken<'static> maintains backward compatibility
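
The backward-compatibility idea in this commit can be sketched roughly as follows. This is a simplified stand-in, not the crate's actual definition: the enum has only three variants, and the `into_owned` method is a hypothetical helper added here to show how a borrowed token can be detached from its source. Only the names `BorrowedToken<'a>` and `Token = BorrowedToken<'static>` come from the commit message.

```rust
use std::borrow::Cow;

/// Simplified stand-in for a lifetime-parameterized token enum.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BorrowedToken<'a> {
    /// Text that may borrow from the source or own its data.
    Word(Cow<'a, str>),
    Comma,
    EOF,
}

/// Existing code that names `Token` keeps compiling: a token with a
/// `'static` lifetime never refers to a shorter-lived source buffer.
pub type Token = BorrowedToken<'static>;

impl<'a> BorrowedToken<'a> {
    /// Hypothetical helper: copies any borrowed text so the result
    /// no longer depends on the lifetime of the source string.
    pub fn into_owned(self) -> Token {
        match self {
            BorrowedToken::Word(s) => BorrowedToken::Word(Cow::Owned(s.into_owned())),
            BorrowedToken::Comma => BorrowedToken::Comma,
            BorrowedToken::EOF => BorrowedToken::EOF,
        }
    }
}
```

With this shape, callers that need tokens to outlive the SQL text pay for the copy explicitly, while callers that consume tokens immediately can keep borrowing.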

…hitespace

Convert token string fields to use Cow<'a, str> to enable zero-copy tokenization for commonly used tokens:
- Word.value: Regular identifiers and keywords now borrow from source
- SingleQuotedString: String literals borrow when no escape processing needed
- Whitespace: Single-line and multi-line comments borrow from source

Also add benchmark for measuring tokenization performance
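
The "borrow when no escape processing needed" behavior for string literals can be illustrated with a small sketch. The function name `unescape_single_quoted` is hypothetical, and it assumes only one escape form (a doubled single quote inside the literal); the tokenizer in this PR may handle more cases and is structured differently.

```rust
use std::borrow::Cow;

/// Takes the text between the quotes of a single-quoted SQL literal.
/// Assumption for this sketch: the only escape is a doubled quote ('').
/// If no escape is present, the input slice is returned as-is and
/// nothing is allocated; otherwise a new String is built.
pub fn unescape_single_quoted(body: &str) -> Cow<'_, str> {
    if body.contains("''") {
        Cow::Owned(body.replace("''", "'"))
    } else {
        Cow::Borrowed(body)
    }
}
```

The common case, a literal without escapes, stays on the borrowed path, which is where the allocation savings would come from.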

eyalleshem force-pushed the tokenize_with_borrow_1 branch from 60646bd to 247ca2d Compare December 18, 2025 21:41

eyalleshem force-pushed the tokenize_with_borrow_1 branch from 247ca2d to b26333e Compare December 18, 2025 21:44
