
Conversation

@eshishki (Contributor) commented Dec 24, 2025

This patch adds proper UTF-8 character-based n-gram computation for
ngram_search and ngram_search_case_insensitive functions.

Previously, n-grams were computed byte-by-byte, which produced incorrect
results for non-ASCII text (Cyrillic, Chinese, etc.). Now n-grams are
computed based on UTF-8 characters using the existing UTF8_BYTE_LENGTH_TABLE.
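
As a minimal illustration of the character-based approach (not the patch's code; the byte-length helper below stands in for the UTF8_BYTE_LENGTH_TABLE lookup):

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Length of a UTF-8 sequence from its lead byte (stand-in for UTF8_BYTE_LENGTH_TABLE).
static size_t utf8_char_len(uint8_t lead) {
    if (lead < 0x80) return 1;          // 0xxxxxxx: ASCII
    if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;                           // invalid lead byte: fall back to one byte
}

// Collect n-grams of n *characters* (not bytes): "привет" with n = 4 yields
// "прив", "риве", "ивет" instead of byte fragments that split code points.
static std::vector<std::string_view> char_ngrams(std::string_view s, size_t n) {
    std::vector<size_t> starts;  // byte offset of every character start
    for (size_t i = 0; i < s.size(); i += utf8_char_len(static_cast<uint8_t>(s[i]))) {
        starts.push_back(i);
    }
    starts.push_back(s.size());

    std::vector<std::string_view> grams;
    for (size_t i = 0; i + n < starts.size(); ++i) {
        grams.push_back(s.substr(starts[i], starts[i + n] - starts[i]));
    }
    return grams;
}
```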

Changes:

  • Add session variable ngram_search_support_utf8 (default: false)
  • Add utf8_tolower() utility function using ICU for proper Unicode case folding
  • Fix ngram_search to iterate by UTF-8 characters when enabled
  • Fix bloom filter index writer to use UTF-8 case folding
  • Fix bloom filter query to use UTF-8 case folding

Note: Bloom filter index already used UTF-8 for n-gram extraction,
but case-insensitive mode used ASCII tolower. This is now fixed.


Fixes #67208

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.0
    • 3.5
    • 3.4
    • 3.3

Note

Adds optional UTF-8 character-based n-gram processing and Unicode case-folding for string similarity and indexing.

  • Introduces session/runtime flag ngram_search_support_utf8 (FE SessionVariable, thrift, BE RuntimeState) to enable UTF-8 mode
  • Reworks ngram_search (and case-insensitive variant) to iterate by UTF-8 characters, preserve ASCII fast paths, and use ICU-based utf8_tolower; dispatches based on the flag
  • Updates n-gram Bloom filter writer to perform UTF-8-aware lowercasing and detect ASCII for fast paths
  • Adjusts function-call bloom filter preprocessing: early UTF-8 validation and Unicode lowercasing for case-insensitive queries (a validation sketch follows this note)
  • Adds util/utf8.cpp (ICU-backed) and wires it in CMake; adds SQL tests covering Cyrillic/Chinese cases

Written by Cursor Bugbot for commit cef7e7d.
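
For the "early UTF-8 validation" step mentioned in the note above, a simplified, self-contained validator might look like the sketch below (illustrative only: it checks lead and continuation bytes but does not reject overlong encodings or surrogates, and the patch's actual helper may differ).

```cpp
#include <cstdint>
#include <string_view>

// Simplified UTF-8 validity check: every lead byte must be well-formed and be
// followed by the right number of 10xxxxxx continuation bytes.
static bool is_valid_utf8(std::string_view s) {
    size_t i = 0;
    while (i < s.size()) {
        uint8_t lead = static_cast<uint8_t>(s[i]);
        size_t len = lead < 0x80           ? 1
                     : (lead >> 5) == 0x06 ? 2
                     : (lead >> 4) == 0x0E ? 3
                     : (lead >> 3) == 0x1E ? 4
                                           : 0;
        if (len == 0 || i + len > s.size()) return false;  // bad lead byte or truncated sequence
        for (size_t k = 1; k < len; ++k) {
            if ((static_cast<uint8_t>(s[i + k]) & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}
```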

@eshishki eshishki requested review from a team as code owners December 24, 2025 15:00
@wanpengfei-git wanpengfei-git requested a review from a team December 24, 2025 15:00
@eshishki eshishki force-pushed the feature/ngram-utf8-upstream branch from da6c111 to 026c4d1 on December 24, 2025 16:38
@github-actions

[BE Incremental Coverage Report]

pass : 5 / 5 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 src/util/utf8.h | 5 | 5 | 100.00% | [] |

@alvin-celerdata (Contributor)

@cursor review

} else {
    buf.assign(haystack.get_data(), haystack.get_size());
    std::transform(buf.begin(), buf.end(), buf.begin(), [](unsigned char c) { return std::tolower(c); });
}

Redundant UTF-8 lowercase conversion in const+const code path

In UTF-8 case-insensitive mode, when both haystack and needle are constant, haystack_const_and_needle_const calls tolower_utf8 on the haystack, then passes the result to calculateDistanceWithHaystack, which calls tolower_utf8 again on the already-lowercased input. This double conversion is wasteful since ICU calls have overhead. The haystack_vector_and_needle_const function correctly avoids pre-conversion for UTF-8 mode (relying on calculateDistanceWithHaystack to handle it), but haystack_const_and_needle_const doesn't follow the same pattern.

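A minimal sketch of the pattern being suggested, with hypothetical names and an ASCII-only stand-in for the ICU-backed folding so the example stays self-contained: fold each constant input exactly once up front, then call a distance routine that assumes pre-folded input.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <string_view>

// Stand-in for the patch's utf8_tolower (ASCII-only here, for brevity).
static void fold_lower(std::string_view src, std::string& dst) {
    dst.assign(src.data(), src.size());
    std::transform(dst.begin(), dst.end(), dst.begin(),
                   [](unsigned char c) { return std::tolower(c); });
}

// Toy distance: assumes both inputs are *already* case-folded, so the caller
// controls how often folding happens.
static double distance_prefolded(std::string_view lowered_haystack, std::string_view lowered_needle) {
    return lowered_haystack.find(lowered_needle) != std::string_view::npos ? 1.0 : 0.0;
}

// const haystack + const needle: one fold per side, no re-fold inside the distance call.
double const_const_case_insensitive(std::string_view haystack, std::string_view needle) {
    std::string lowered_haystack, lowered_needle;
    fold_lower(haystack, lowered_haystack);
    fold_lower(needle, lowered_needle);
    return distance_prefolded(lowered_haystack, lowered_needle);
}
```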


void utf8_tolower(const char* src, size_t src_len, std::string& dst) {
    UErrorCode err_code = U_ZERO_ERROR;
    UCaseMap* case_map = ucasemap_open("", U_FOLD_CASE_DEFAULT, &err_code);

Is this UCaseMap a real map like std::map?
The map should be read-only, so it can be loaded only once before use.
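
For reference, caching the case map once (as a later commit in this PR does) could look roughly like the following. A minimal sketch with bare-bones error handling, assuming the UCaseMap is only read after initialization; whether the patch calls ucasemap_utf8ToLower or ucasemap_utf8FoldCase is not visible in the excerpt above.

```cpp
#include <string>
#include <unicode/ucasemap.h>

// Open the ICU case map once; a C++11 function-local static is initialized
// thread-safely, and the map is treated as read-only afterwards.
static const UCaseMap* get_case_map() {
    static UCaseMap* case_map = [] {
        UErrorCode err = U_ZERO_ERROR;
        return ucasemap_open("", U_FOLD_CASE_DEFAULT, &err);
    }();
    return case_map;
}

void utf8_tolower(const char* src, size_t src_len, std::string& dst) {
    UErrorCode err = U_ZERO_ERROR;
    dst.resize(src_len * 2 + 4);  // lowercased UTF-8 can be longer than the input
    int32_t n = ucasemap_utf8ToLower(get_case_map(), dst.data(),
                                     static_cast<int32_t>(dst.size()),
                                     src, static_cast<int32_t>(src_len), &err);
    if (err == U_BUFFER_OVERFLOW_ERROR) {
        // n holds the required length; retry once with an exact-sized buffer.
        err = U_ZERO_ERROR;
        dst.resize(n);
        n = ucasemap_utf8ToLower(get_case_map(), dst.data(), n,
                                 src, static_cast<int32_t>(src_len), &err);
    }
    dst.resize(U_SUCCESS(err) ? n : 0);
}
```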

// Use UTF-8 aware tolower for proper Unicode case folding
std::string lower_ngram;
Slice lower_ngram_slice = cur_ngram.tolower(lower_ngram);
utf8_tolower(cur_ngram.get_data(), cur_ngram.get_size(), lower_ngram);

utf8_tolower costs more time than the ASCII version, so we should check whether the string is ASCII and, if so, use the ASCII to_lower instead.

slice_gram_num from LN#215 can be used to judge whether cur_slice is an ASCII string: if slice_gram_num == cur_slice.size, then it is.
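
A rough sketch of such a fast path (is_ascii and tolower_dispatch are illustrative names, not from the patch; where a character count like slice_gram_num is already available, comparing it against the byte size can replace the byte scan):

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <string_view>

// Declared in the patch's util/utf8.h (ICU-backed Unicode case folding).
void utf8_tolower(const char* src, size_t src_len, std::string& dst);

// A string is pure ASCII iff no byte has the high bit set.
static bool is_ascii(std::string_view s) {
    return std::all_of(s.begin(), s.end(),
                       [](unsigned char c) { return c < 0x80; });
}

// Lowercase via the cheap per-byte path when possible, fall back to ICU otherwise.
static void tolower_dispatch(std::string_view s, std::string& out) {
    if (is_ascii(s)) {
        out.assign(s.data(), s.size());
        std::transform(out.begin(), out.end(), out.begin(),
                       [](unsigned char c) { return std::tolower(c); });
    } else {
        utf8_tolower(s.data(), s.size(), out);
    }
}
```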

  • Use ASCII tolower fast-path where possible in ngram bloom filter paths and ngram search
  • Cache UCaseMap once in be/src/util/utf8.cpp
  • Add ASCII detection fast-paths in ngram bloom filter and ngram code

@github-actions

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

[FE Incremental Coverage Report]

pass : 2 / 2 (100.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 com/starrocks/qe/SessionVariable.java | 2 | 2 | 100.00% | [] |

@alvin-celerdata (Contributor)

@cursor review



Development

Successfully merging this pull request may close these issues.

[Feature] Add UTF-8 support for ngram_search functions
