
Conversation

@eshishki (Contributor) commented Dec 24, 2025

This patch adds proper UTF-8 character-based n-gram computation for
ngram_search and ngram_search_case_insensitive functions.

Previously, n-grams were computed byte-by-byte, which produced incorrect
results for non-ASCII text (Cyrillic, Chinese, etc.). Now n-grams are
computed based on UTF-8 characters using the existing UTF8_BYTE_LENGTH_TABLE.
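
As a minimal illustration of the character-based approach (not the patch's code; the byte-length helper below stands in for the UTF8_BYTE_LENGTH_TABLE lookup):

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Length of a UTF-8 sequence from its lead byte (stand-in for UTF8_BYTE_LENGTH_TABLE).
static size_t utf8_char_len(uint8_t lead) {
    if (lead < 0x80) return 1;          // 0xxxxxxx: ASCII
    if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;                           // invalid lead byte: fall back to one byte
}

// Collect n-grams of n *characters* (not bytes): "привет" with n = 4 yields
// "прив", "риве", "ивет" instead of byte fragments that split code points.
static std::vector<std::string_view> char_ngrams(std::string_view s, size_t n) {
    std::vector<size_t> starts;  // byte offset of every character start
    for (size_t i = 0; i < s.size(); i += utf8_char_len(static_cast<uint8_t>(s[i]))) {
        starts.push_back(i);
    }
    starts.push_back(s.size());

    std::vector<std::string_view> grams;
    for (size_t i = 0; i + n < starts.size(); ++i) {
        grams.push_back(s.substr(starts[i], starts[i + n] - starts[i]));
    }
    return grams;
}
```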

Changes:

  • Add session variable ngram_search_support_utf8 (default: false)
  • Add utf8_tolower() utility function using ICU for proper Unicode case folding
  • Fix ngram_search to iterate by UTF-8 characters when enabled
  • Fix bloom filter index writer to use UTF-8 case folding
  • Fix bloom filter query to use UTF-8 case folding

Note: Bloom filter index already used UTF-8 for n-gram extraction,
but case-insensitive mode used ASCII tolower. This is now fixed.


Fixes #67208

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.0
    • 3.5
    • 3.4
    • 3.3

Note

Adds optional UTF-8 character-based n-gram processing and Unicode case-folding for string similarity and indexing.

  • Introduces session/runtime flag ngram_search_support_utf8 (FE SessionVariable, thrift, BE RuntimeState) to enable UTF-8 mode
  • Reworks ngram_search (and case-insensitive variant) to iterate by UTF-8 characters, preserve ASCII fast paths, and use ICU-based utf8_tolower; dispatches based on the flag
  • Updates n-gram Bloom filter writer to perform UTF-8-aware lowercasing and detect ASCII for fast paths
  • Adjusts function-call bloom filter preprocessing: early UTF-8 validation and Unicode lowercasing for case-insensitive queries (a validation sketch follows this note)
  • Adds util/utf8.cpp (ICU-backed) and wires it in CMake; adds SQL tests covering Cyrillic/Chinese cases

Written by Cursor Bugbot for commit cef7e7d.
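
For the "early UTF-8 validation" step mentioned in the note above, a simplified, self-contained validator might look like the sketch below (illustrative only: it checks lead and continuation bytes but does not reject overlong encodings or surrogates, and the patch's actual helper may differ).

```cpp
#include <cstdint>
#include <string_view>

// Simplified UTF-8 validity check: every lead byte must be well-formed and be
// followed by the right number of 10xxxxxx continuation bytes.
static bool is_valid_utf8(std::string_view s) {
    size_t i = 0;
    while (i < s.size()) {
        uint8_t lead = static_cast<uint8_t>(s[i]);
        size_t len = lead < 0x80           ? 1
                     : (lead >> 5) == 0x06 ? 2
                     : (lead >> 4) == 0x0E ? 3
                     : (lead >> 3) == 0x1E ? 4
                                           : 0;
        if (len == 0 || i + len > s.size()) return false;  // bad lead byte or truncated sequence
        for (size_t k = 1; k < len; ++k) {
            if ((static_cast<uint8_t>(s[i + k]) & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}
```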

@eshishki eshishki requested review from a team as code owners December 24, 2025 15:00
@wanpengfei-git wanpengfei-git requested a review from a team December 24, 2025 15:00
@eshishki eshishki force-pushed the feature/ngram-utf8-upstream branch from da6c111 to 026c4d1 on December 24, 2025 16:38
@github-actions

[BE Incremental Coverage Report]

pass : 5 / 5 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 src/util/utf8.h | 5 | 5 | 100.00% | [] |

@alvin-celerdata (Contributor)

@cursor review

} else {
    buf.assign(haystack.get_data(), haystack.get_size());
    std::transform(buf.begin(), buf.end(), buf.begin(), [](unsigned char c) { return std::tolower(c); });
}

Redundant UTF-8 lowercase conversion in const+const code path

In UTF-8 case-insensitive mode, when both haystack and needle are constant, haystack_const_and_needle_const calls tolower_utf8 on the haystack, then passes the result to calculateDistanceWithHaystack, which calls tolower_utf8 again on the already-lowercased input. This double conversion is wasteful since ICU calls have overhead. The haystack_vector_and_needle_const function correctly avoids pre-conversion for UTF-8 mode (relying on calculateDistanceWithHaystack to handle it), but haystack_const_and_needle_const doesn't follow the same pattern.

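A minimal sketch of the pattern being suggested, with hypothetical names and an ASCII-only stand-in for the ICU-backed folding so the example stays self-contained: fold each constant input exactly once up front, then call a distance routine that assumes pre-folded input.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <string_view>

// Stand-in for the patch's utf8_tolower (ASCII-only here, for brevity).
static void fold_lower(std::string_view src, std::string& dst) {
    dst.assign(src.data(), src.size());
    std::transform(dst.begin(), dst.end(), dst.begin(),
                   [](unsigned char c) { return std::tolower(c); });
}

// Toy distance: assumes both inputs are *already* case-folded, so the caller
// controls how often folding happens.
static double distance_prefolded(std::string_view lowered_haystack, std::string_view lowered_needle) {
    return lowered_haystack.find(lowered_needle) != std::string_view::npos ? 1.0 : 0.0;
}

// const haystack + const needle: one fold per side, no re-fold inside the distance call.
double const_const_case_insensitive(std::string_view haystack, std::string_view needle) {
    std::string lowered_haystack, lowered_needle;
    fold_lower(haystack, lowered_haystack);
    fold_lower(needle, lowered_needle);
    return distance_prefolded(lowered_haystack, lowered_needle);
}
```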


void utf8_tolower(const char* src, size_t src_len, std::string& dst) {
    UErrorCode err_code = U_ZERO_ERROR;
    UCaseMap* case_map = ucasemap_open("", U_FOLD_CASE_DEFAULT, &err_code);

Is this UCaseMap a real map like std::map?
The map should be read-only, so it can be loaded only once before use.
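
For reference, caching the case map once (as a later commit in this PR does) could look roughly like the following. A minimal sketch with bare-bones error handling, assuming the UCaseMap is only read after initialization; whether the patch calls ucasemap_utf8ToLower or ucasemap_utf8FoldCase is not visible in the excerpt above.

```cpp
#include <string>
#include <unicode/ucasemap.h>

// Open the ICU case map once; a C++11 function-local static is initialized
// thread-safely, and the map is treated as read-only afterwards.
static const UCaseMap* get_case_map() {
    static UCaseMap* case_map = [] {
        UErrorCode err = U_ZERO_ERROR;
        return ucasemap_open("", U_FOLD_CASE_DEFAULT, &err);
    }();
    return case_map;
}

void utf8_tolower(const char* src, size_t src_len, std::string& dst) {
    UErrorCode err = U_ZERO_ERROR;
    dst.resize(src_len * 2 + 4);  // lowercased UTF-8 can be longer than the input
    int32_t n = ucasemap_utf8ToLower(get_case_map(), dst.data(),
                                     static_cast<int32_t>(dst.size()),
                                     src, static_cast<int32_t>(src_len), &err);
    if (err == U_BUFFER_OVERFLOW_ERROR) {
        // n holds the required length; retry once with an exact-sized buffer.
        err = U_ZERO_ERROR;
        dst.resize(n);
        n = ucasemap_utf8ToLower(get_case_map(), dst.data(), n,
                                 src, static_cast<int32_t>(src_len), &err);
    }
    dst.resize(U_SUCCESS(err) ? n : 0);
}
```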

// Use UTF-8 aware tolower for proper Unicode case folding
std::string lower_ngram;
Slice lower_ngram_slice = cur_ngram.tolower(lower_ngram);
utf8_tolower(cur_ngram.get_data(), cur_ngram.get_size(), lower_ngram);

utf8_tolower costs more time than the ASCII version, so we should check whether the string is ASCII and, if so, use the ASCII to_lower instead.

slice_gram_num from LN#215 can be used to judge whether cur_slice is an ASCII string: if slice_gram_num == cur_slice.size, then it is.
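
A rough sketch of such a fast path (is_ascii and tolower_dispatch are illustrative names, not from the patch; where a character count like slice_gram_num is already available, comparing it against the byte size can replace the byte scan):

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <string_view>

// Declared in the patch's util/utf8.h (ICU-backed Unicode case folding).
void utf8_tolower(const char* src, size_t src_len, std::string& dst);

// A string is pure ASCII iff no byte has the high bit set.
static bool is_ascii(std::string_view s) {
    return std::all_of(s.begin(), s.end(),
                       [](unsigned char c) { return c < 0x80; });
}

// Lowercase via the cheap per-byte path when possible, fall back to ICU otherwise.
static void tolower_dispatch(std::string_view s, std::string& out) {
    if (is_ascii(s)) {
        out.assign(s.data(), s.size());
        std::transform(out.begin(), out.end(), out.begin(),
                       [](unsigned char c) { return std::tolower(c); });
    } else {
        utf8_tolower(s.data(), s.size(), out);
    }
}
```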

  • Use ASCII tolower fast-path where possible in ngram bloom filter paths and ngram search
  • Cache UCaseMap once in be/src/util/utf8.cpp
  • Add ASCII detection fast-paths in ngram bloom filter and ngram code

@github-actions

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[FE Incremental Coverage Report]

pass : 2 / 2 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 com/starrocks/qe/SessionVariable.java | 2 | 2 | 100.00% | [] |

@alvin-celerdata (Contributor)

@cursor review



Development

Successfully merging this pull request may close these issues.

[Feature] Add UTF-8 support for ngram_search functions
