
@eyalleshem
Contributor

This PR implements zero-copy tokenization by using borrowed strings (&str) instead of owned strings (String) for identifiers, string literals, and comments. This eliminates unnecessary string allocations during the tokenization process.
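
To illustrate the idea of borrowing from the input instead of allocating, here is a minimal sketch of a tokenizer whose tokens are slices of the source string. It is not the tokenizer from this PR: `SimpleToken`, `is_word_char`, and `tokenize_simple` are hypothetical names, and the token set is far smaller than the real one.

```rust
/// Hypothetical, simplified token type: every text payload is a slice of the input.
#[derive(Debug, PartialEq)]
pub enum SimpleToken<'a> {
    Word(&'a str),
    Whitespace(&'a str),
    Other(char),
}

/// Assumed definition of a "word" character for this sketch only.
fn is_word_char(ch: char) -> bool {
    ch.is_alphanumeric() || ch == '_'
}

/// Split `src` into tokens without copying any of its text.
pub fn tokenize_simple(src: &str) -> Vec<SimpleToken<'_>> {
    let mut tokens = Vec::new();
    let mut rest = src;
    while let Some(c) = rest.chars().next() {
        // Length in bytes of the run that starts at `c`.
        let len = if is_word_char(c) {
            rest.find(|ch: char| !is_word_char(ch)).unwrap_or(rest.len())
        } else if c.is_whitespace() {
            rest.find(|ch: char| !ch.is_whitespace()).unwrap_or(rest.len())
        } else {
            c.len_utf8()
        };
        // `head` still points into `src`; no String is created.
        let (head, tail) = rest.split_at(len);
        tokens.push(if is_word_char(c) {
            SimpleToken::Word(head)
        } else if c.is_whitespace() {
            SimpleToken::Whitespace(head)
        } else {
            SimpleToken::Other(c)
        });
        rest = tail;
    }
    tokens
}
```

The returned tokens cannot outlive `src`, which is the trade-off that motivates a lifetime parameter on the token type.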

Changes

  • Modified Token variants to store &'a str instead of String for:
    • Word tokens (identifiers like table/column names)
    • SingleQuotedString literals
    • Whitespace
    • Comments (single-line and multi-line)
  • Implemented case-insensitive keyword lookup without to_uppercase() allocation
  • Added tokenize_bench criterion benchmark for performance measurement
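
As a rough sketch of how a keyword lookup can avoid the `to_uppercase()` allocation, the comparison can be done in place with `str::eq_ignore_ascii_case`. The `Keyword` enum, the `KEYWORDS` table, and `lookup_keyword` below are hypothetical stand-ins, and the linear scan is only for brevity; the PR's actual lookup may be structured differently.

```rust
/// Hypothetical keyword set; a real SQL keyword list is much larger.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Keyword {
    From,
    Select,
    Where,
}

/// Keyword text is stored once, in uppercase ASCII.
const KEYWORDS: &[(&str, Keyword)] = &[
    ("FROM", Keyword::From),
    ("SELECT", Keyword::Select),
    ("WHERE", Keyword::Where),
];

/// Look up `word` without building an uppercased copy of it.
pub fn lookup_keyword(word: &str) -> Option<Keyword> {
    KEYWORDS
        .iter()
        // Compares byte by byte, ignoring ASCII case; no allocation.
        .find(|(text, _)| text.eq_ignore_ascii_case(word))
        .map(|&(_, kw)| kw)
}
```

With this shape, `lookup_keyword("select")` and `lookup_keyword("SELECT")` resolve to the same keyword, while a non-keyword such as `users` yields `None`.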

Performance Impact

Benchmark results for a complex 27 KB SQL query with CTEs, joins, window functions, and extensive comments:

tokenization/tokenize_complex_sql
time: [254.68 µs 254.81 µs 254.97 µs]
change: [−60.885% −60.682% −60.482%] (p = 0.00 < 0.05)
Performance has improved.

  This change introduces a lifetime parameter 'a to the BorrowedToken enum
  to prepare for zero-copy tokenization support. This is a foundational
  step toward reducing memory allocations during SQL parsing.

  Changes:
  - Added lifetime parameter to BorrowedToken<'a> enum
  - Added _Phantom(Cow<'a, str>) variant to carry the lifetime
  - Implemented Visit and VisitMut traits for Cow<'a, str> to support
    the visitor pattern with the new lifetime parameter
  - Fixed lifetime issues in visitor tests by using tokenized_owned()
    instead of tokenize() where owned tokens are required
  - Type alias Token = BorrowedToken<'static> maintains backward
    compatibility
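
The shape described in this commit message can be sketched as follows. Only the names `BorrowedToken`, `_Phantom`, and `Token` come from the text above; the other variants and the `describe` function are hypothetical placeholders, and the real enum has many more variants.

```rust
use std::borrow::Cow;

/// Token type parameterized over the lifetime of the source text.
#[derive(Debug, Clone, PartialEq)]
pub enum BorrowedToken<'a> {
    // Placeholder variants for this sketch; they still own their data.
    Comma,
    Number(String),
    /// Not meant to be constructed; it exists so that `'a` is used
    /// before any real variant borrows from the input.
    #[doc(hidden)]
    _Phantom(Cow<'a, str>),
}

/// Existing code that names `Token` keeps compiling unchanged.
pub type Token = BorrowedToken<'static>;

/// Hypothetical consumer written against the old `Token` name.
pub fn describe(token: &Token) -> &'static str {
    match token {
        BorrowedToken::Comma => "comma",
        BorrowedToken::Number(_) => "number",
        BorrowedToken::_Phantom(_) => "phantom",
    }
}
```

Because `Token` is an alias for the `'static` instantiation, callers that hold owned tokens are unaffected, while new code can opt into `BorrowedToken<'a>` once variants start borrowing.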
@eyalleshem eyalleshem force-pushed the tokenize_with_borrow_1 branch from 60646bd to 247ca2d Compare December 18, 2025 21:41
…hitespace

  Convert token string fields to use Cow<'a, str> to enable zero-copy tokenization
  for commonly used tokens:
  - Word.value: Regular identifiers and keywords now borrow from source
  - SingleQuotedString: String literals borrow when no escape processing needed
  - Whitespace: Single-line and multi-line comments borrow from source
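
The "borrow when no escape processing needed" behavior can be illustrated with `Cow`. This is a sketch under the assumption that the only escape is a doubled quote (`''`); `unescape_single_quoted` is a hypothetical helper, not a function from the PR.

```rust
use std::borrow::Cow;

/// `body` is the text between the quotes of a single-quoted SQL string.
/// Assumption for this sketch: `''` is the only escape sequence.
pub fn unescape_single_quoted(body: &str) -> Cow<'_, str> {
    if body.contains("''") {
        // An escape is present, so a new String has to be built.
        Cow::Owned(body.replace("''", "'"))
    } else {
        // Nothing to rewrite: hand back a slice of the source.
        Cow::Borrowed(body)
    }
}
```

The common case (no escapes) costs no allocation, and callers can treat both cases uniformly through `Deref<Target = str>`.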

Also add a benchmark for measuring tokenization performance.
@eyalleshem eyalleshem force-pushed the tokenize_with_borrow_1 branch from 247ca2d to b26333e Compare December 18, 2025 21:44