[Feature] Add UTF-8 support for ngram_search functions #67207
Conversation
This patch adds proper UTF-8 character-based n-gram computation for the ngram_search and ngram_search_case_insensitive functions. Previously, n-grams were computed byte by byte, which produced incorrect results for non-ASCII text (Cyrillic, Chinese, etc.). Now n-grams are computed over UTF-8 characters using the existing UTF8_BYTE_LENGTH_TABLE.

Changes:
- Add session variable `ngram_search_support_utf8` (default: false)
- Add `utf8_tolower()` utility function using ICU for proper Unicode case folding
- Fix ngram_search to iterate by UTF-8 characters when enabled
- Fix bloom filter index writer to use UTF-8 case folding
- Fix bloom filter query to use UTF-8 case folding

Note: The bloom filter index already used UTF-8 for n-gram extraction, but its case-insensitive mode used ASCII tolower. This is now fixed.
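To illustrate the character-based approach described above, here is a minimal, standalone sketch of n-gram extraction that steps by UTF-8 character rather than by byte. This is not the StarRocks implementation: `utf8_char_length` and `utf8_ngrams` are hypothetical names, and the lead-byte decoding stands in for the `UTF8_BYTE_LENGTH_TABLE` lookup mentioned in the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a UTF8_BYTE_LENGTH_TABLE lookup: byte length of a
// UTF-8 sequence, decoded from its lead byte.
static size_t utf8_char_length(unsigned char lead) {
    if (lead < 0x80) return 1;          // ASCII
    if ((lead >> 5) == 0x6) return 2;   // 110xxxxx
    if ((lead >> 4) == 0xE) return 3;   // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;                           // invalid lead byte: treat as one byte
}

// Split a string into n-grams of `n` UTF-8 characters (not bytes).
static std::vector<std::string> utf8_ngrams(const std::string& s, size_t n) {
    // Record the byte offset where each character starts.
    std::vector<size_t> starts;
    for (size_t i = 0; i < s.size(); i += utf8_char_length((unsigned char)s[i])) {
        starts.push_back(i);
    }
    starts.push_back(s.size()); // sentinel: one past the last character
    std::vector<std::string> grams;
    for (size_t i = 0; i + n < starts.size(); ++i) {
        grams.push_back(s.substr(starts[i], starts[i + n] - starts[i]));
    }
    return grams;
}
```

With byte-wise iteration, a 4-gram of Cyrillic text would split multi-byte sequences in half; the sketch above instead yields whole-character grams such as "прив", "риве", "ивет" for "привет".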
Force-pushed da6c111 to 026c4d1
[BE Incremental Coverage Report] ✅ pass: 5 / 5 (100.00%)
@cursor review
```cpp
} else {
    buf.assign(haystack.get_data(), haystack.get_size());
    std::transform(buf.begin(), buf.end(), buf.begin(),
                   [](unsigned char c) { return std::tolower(c); });
}
```
Redundant UTF-8 lowercase conversion in const+const code path
In UTF-8 case-insensitive mode, when both haystack and needle are constant, haystack_const_and_needle_const calls tolower_utf8 on the haystack, then passes the result to calculateDistanceWithHaystack, which calls tolower_utf8 again on the already-lowercased input. This double conversion is wasteful since ICU calls have overhead. The haystack_vector_and_needle_const function correctly avoids pre-conversion for UTF-8 mode (relying on calculateDistanceWithHaystack to handle it), but haystack_const_and_needle_const doesn't follow the same pattern.
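The redundancy flagged here can be demonstrated with a small standalone sketch, assuming the callee lowercases its input itself (as `calculateDistanceWithHaystack` does). All names and the call counter are hypothetical; in the real code the wasted work is an extra ICU conversion, not an ASCII loop.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Counts how many times the (expensive, ICU-backed in the real code)
// lowercase conversion runs.
static int g_tolower_calls = 0;

static std::string tolower_copy(const std::string& s) {
    ++g_tolower_calls;
    std::string out = s;
    for (char& c : out) c = (char)std::tolower((unsigned char)c);
    return out;
}

// Stand-in for calculateDistanceWithHaystack: lowercases its input itself.
static std::string distance_input(const std::string& haystack) {
    return tolower_copy(haystack);
}

// Redundant pattern: pre-lowercase, then the callee lowercases again.
static void const_const_redundant(const std::string& haystack) {
    std::string pre = tolower_copy(haystack); // wasted work
    distance_input(pre);
}

// Suggested pattern: pass through unchanged; the callee converts exactly once.
static void const_const_fixed(const std::string& haystack) {
    distance_input(haystack);
}
```

The fixed path matches what `haystack_vector_and_needle_const` already does: no pre-conversion, one conversion inside the distance computation.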
Additional Locations (1)
be/src/util/utf8.cpp
```cpp
void utf8_tolower(const char* src, size_t src_len, std::string& dst) {
    UErrorCode err_code = U_ZERO_ERROR;
    UCaseMap* case_map = ucasemap_open("", U_FOLD_CASE_DEFAULT, &err_code);
```
Is this UCaseMap a real map like std::map?
The map should be read-only, so it can be loaded just once before use.
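The "load once" suggestion can be sketched as a function-local static (a Meyers singleton, thread-safe since C++11). In the real patch the cached object would be the ICU `UCaseMap` returned by `ucasemap_open`; here a counter stands in for that expensive open call, and all names are hypothetical.

```cpp
#include <cassert>

// Counts how many times the expensive "open" runs; stands in for
// ucasemap_open(...) in the real code.
static int g_open_calls = 0;

struct CaseMapHandle {
    CaseMapHandle() { ++g_open_calls; } // expensive one-time setup
};

// Meyers singleton: constructed on first use, exactly once, thread-safely.
static const CaseMapHandle& get_case_map() {
    static const CaseMapHandle handle;
    return handle;
}
```

Since the case map is only ever read after construction, sharing one const instance across calls is safe and avoids re-opening it on every `utf8_tolower` invocation.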
```diff
  // Use UTF-8 aware tolower for proper Unicode case folding
  std::string lower_ngram;
- Slice lower_ngram_slice = cur_ngram.tolower(lower_ngram);
+ utf8_tolower(cur_ngram.get_data(), cur_ngram.get_size(), lower_ngram);
```
utf8_tolower costs more time than its ASCII version, so we should check whether the string is pure ASCII; if so, use the ASCII to_lower instead.
slice_gram_num from LN#215 can be used to judge whether cur_slice is an ASCII string: if slice_gram_num == cur_slice.size, then it is.
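The suggested fast path can be sketched standalone as follows: detect pure-ASCII input (here by a direct byte scan rather than the `slice_gram_num` comparison, which relies on surrounding code) and dispatch to byte-wise tolower, falling back to the UTF-8 path otherwise. `is_ascii` and `tolower_dispatch` are hypothetical names, and the UTF-8 branch is stubbed out.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// A string is pure ASCII iff no byte has the high bit set; any byte >= 0x80
// belongs to a multi-byte UTF-8 sequence.
static bool is_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) return false;
    }
    return true;
}

static std::string tolower_dispatch(const std::string& s) {
    std::string out = s;
    if (is_ascii(s)) {
        // Cheap path: byte-wise tolower is correct for ASCII.
        for (char& c : out) c = (char)std::tolower((unsigned char)c);
    } else {
        // In the real code this branch would call the ICU-backed
        // utf8_tolower(); left as a pass-through stub here.
    }
    return out;
}
```

This keeps the common ASCII workload off the ICU path while still handling Cyrillic/Chinese input correctly in the fallback.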
- Use ASCII tolower fast-path where possible in ngram bloom filter paths and ngram search
- Cache the UCaseMap once in be/src/util/utf8.cpp
- Add ASCII detection fast-paths in the ngram bloom filter and ngram code
[Java-Extensions Incremental Coverage Report] ✅ pass: 0 / 0 (0%)
[FE Incremental Coverage Report] ✅ pass: 2 / 2 (100.00%)

@cursor review



Fixes #67208
Note
Adds optional UTF-8 character-based n-gram processing and Unicode case-folding for string similarity and indexing.
- Adds session variable `ngram_search_support_utf8` (FE SessionVariable, thrift, BE RuntimeState) to enable UTF-8 mode
- Updates `ngram_search` (and the case-insensitive variant) to iterate by UTF-8 characters, preserve ASCII fast paths, and use the ICU-based `utf8_tolower`; dispatches based on the flag
- Adds `util/utf8.cpp` (ICU-backed) and wires it in CMake; adds SQL tests covering Cyrillic/Chinese cases

Written by Cursor Bugbot for commit cef7e7d. This will update automatically on new commits.