feat: impl `NgramIndex` for `FuseTable`, improve like query performance #17852

KKould · 2025-04-25T03:16:56Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

part of: #17724

Implement Ngram Index to speed up LIKE queries.

It works by splitting String data into ngram substrings and inserting each of them into a BloomFilter. When a LIKE query is evaluated, the pattern is split into ngrams in the same way; if any of those ngrams is absent from a Block's BloomFilter, that Block cannot contain a matching row and is filtered out in advance.

As a result, insertion takes longer when Ngram Index is used, because of the ngram generation (the cost depends on the length of each string and the total number of rows).
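The principle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in this PR: `TinyBloom`, `build_block_filter` and `block_may_match` are hypothetical names, the hashing uses SHA-256 for simplicity, and only the literal text between the `%` wildcards of a LIKE pattern is considered.

```python
import hashlib


def ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous n-character substring of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class TinyBloom:
    """Toy Bloom filter (hypothetical, for illustration only)."""

    def __init__(self, num_bits: int, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive one bit position per seed from a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_block_filter(rows: list[str], n: int, num_bits: int) -> TinyBloom:
    """Insert every ngram of every row of one block into a single filter."""
    bloom = TinyBloom(num_bits)
    for row in rows:
        for gram in ngrams(row, n):
            bloom.add(gram)
    return bloom


def block_may_match(bloom: TinyBloom, like_literal: str, n: int) -> bool:
    """Return False only when the block certainly has no matching row."""
    grams = ngrams(like_literal, n)
    if not grams:
        return True  # literal shorter than n: the index cannot prune
    return all(bloom.might_contain(g) for g in grams)
```

A block is skipped only when `block_may_match` returns `False`; a `True` result may still be a false positive, so the block has to be scanned as usual.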

Storage

Ngram Index is essentially a Bloom Index built over data that has been segmented into ngrams. It therefore shares Meta with Bloom Index and is stored in the same file.

Benchmark

The benchmark uses the amazon_reviews dataset. The total data size is 39.2 GB, of which the review_body column is 17 GB.

CREATE OR REPLACE TABLE `amazon_reviews_ngram` (
    `review_date` int(11) NULL,
    `marketplace` varchar(20) NULL,
    `customer_id` bigint(20) NULL,
    `review_id` varchar(40) NULL,
    `product_id` varchar(10) NULL,
    `product_parent` bigint(20) NULL,
    `product_title` varchar(500) NULL,
    `product_category` varchar(50) NULL,
    `star_rating` smallint(6) NULL,
    `helpful_votes` int(11) NULL,
    `total_votes` int(11) NULL,
    `vine` boolean NULL,
    `verified_purchase` boolean NULL,
    `review_headline` varchar(500) NULL,
    `review_body` string NULL,
    NGRAM INDEX idx1 (review_body) gram_size = 10 bloom_size = 2097152
) Engine = Fuse bloom_index_columns='review_body';

copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2010.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2011.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2012.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2013.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2014.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2015.snappy.parquet file_format = (type = PARQUET);

With the table and data loaded as above, the BloomFilter files total 1.5 GB.

Query:

SELECT
    product_id,
    any(product_title),
    AVG(star_rating) AS rating,
    COUNT() AS count
FROM
    amazon_reviews_ngram
WHERE
    review_body LIKE '%The first track with Chris Botti is beautiful%'
GROUP BY
    product_id
ORDER BY
    count DESC,
    rating DESC,
    product_id
LIMIT 5;

Ngram:

1 row read in 1.126 sec. Processed 786.43 thousand row, 444.15 MiB (698.43 thousand rows/s, 394.45 MiB/s)

Without Ngram:

1 row read in 13.045 sec. Processed 135.59 million row, 52.91 GiB (10.39 million rows/s, 4.06 GiB/s)

Insert:

Ngram:

2010: 38.227 sec
2011: 46.212 sec
2012: 67.140 sec
2013: 112.430 sec
2014: 132.978 sec
2015: 102.655 sec

Without Ngram:

2010: 6.090 sec
2011: 6.468 sec
2012: 9.751 sec
2013: 15.562 sec
2014: 23.374 sec
2015: 14.587 sec
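To make the trade-off in these numbers easier to read, the ratios can be computed directly from the timings listed above. The snippet below is just arithmetic on the reported figures; the variable names are made up for illustration. On these numbers the query is roughly 11.6 times faster with Ngram, while each insert is roughly 5.7 to 7.2 times slower.

```python
# Timings copied from the benchmark above, in seconds.
query_with_ngram = 1.126
query_without_ngram = 13.045

insert_with_ngram = {
    2010: 38.227, 2011: 46.212, 2012: 67.140,
    2013: 112.430, 2014: 132.978, 2015: 102.655,
}
insert_without_ngram = {
    2010: 6.090, 2011: 6.468, 2012: 9.751,
    2013: 15.562, 2014: 23.374, 2015: 14.587,
}

# How many times faster the LIKE query runs with the index.
query_speedup = query_without_ngram / query_with_ngram

# How many times slower each yearly COPY INTO runs with the index.
insert_slowdown = {
    year: insert_with_ngram[year] / insert_without_ngram[year]
    for year in insert_with_ngram
}

# Slowdown over all six files taken together.
total_insert_slowdown = (
    sum(insert_with_ngram.values()) / sum(insert_without_ngram.values())
)

print(f"query speedup: {query_speedup:.2f}x")
for year, ratio in insert_slowdown.items():
    print(f"{year} insert slowdown: {ratio:.2f}x")
print(f"total insert slowdown: {total_insert_slowdown:.2f}x")
```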

Tips: The factors that affect the insertion time are as follows:

The length of each row of data
Number of data rows
BloomFilter Bitmap Size
N (gram_size) of Ngram

The parameters used in this benchmark were therefore chosen to favor query performance. In real applications, users need to weigh insertion speed against filtering effectiveness.
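One way to reason about that trade-off is the textbook false-positive estimate for a classic Bloom filter, p ≈ (1 − e^(−kn/m))^k, where m is the number of bits, n the number of inserted items and k the number of hash functions. The sketch below applies it to ngram counts. It is an approximation under assumptions: the function names are hypothetical, every ngram is treated as distinct, and the filter that was finally merged in this PR may behave differently.

```python
import math


def ngram_count(row_length: int, gram_size: int) -> int:
    """Number of ngrams produced by one string of row_length characters."""
    return max(row_length - gram_size + 1, 0)


def bloom_false_positive_rate(num_bits: int, num_items: int, num_hashes: int) -> float:
    """Textbook estimate for a classic Bloom filter: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def estimated_block_fpr(rows_per_block: int, avg_row_length: int,
                        gram_size: int, num_bits: int, num_hashes: int) -> float:
    """Worst case for one block: assume every ngram in it is distinct."""
    items = rows_per_block * ngram_count(avg_row_length, gram_size)
    return bloom_false_positive_rate(num_bits, items, num_hashes)
```

Longer rows and more rows per block push more ngrams into the same bitmap and raise the estimate, while a larger bitmap lowers it at the cost of more data to write.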

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

KKould · 2025-04-28T06:17:17Z

Please note that the filter has been adjusted: the original BloomFilter has been removed, and the size is controlled by taking the remainder using Xor8Filter. This may have a significant impact on the benchmark, and it still needs to be tested.

Updated to Readme

dantengsky · 2025-05-06T06:48:47Z

LGTM. One concern to consider for future optimization work:

Hash derivation strategy in BloomFilter::find/add methods

Perhaps the Kirsch-Mitzenmacher approach:
"Given two hash functions h_1(x) and h_2(x), an i-th additional hash function g_i(x) can be simulated as g_i(x) = h_1(x) + i * h_2(x)"
could provide a lower false positive rate in the ngram scenario?
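For reference, the derivation quoted above can be sketched as follows. This is a generic illustration of Kirsch-Mitzenmacher double hashing, not the `BloomFilter::find/add` code in this PR: `base_hashes` and `km_positions` are hypothetical names, and taking both base hashes from one SHA-256 digest is an assumption made to keep the example self-contained.

```python
import hashlib


def base_hashes(item: bytes) -> tuple[int, int]:
    """Derive two 64-bit base hashes h1 and h2 from a single digest."""
    digest = hashlib.sha256(item).digest()
    h1 = int.from_bytes(digest[:8], "little")
    # Force h2 to be odd so that, when num_bits is a power of two,
    # the derived positions for different i never coincide.
    h2 = int.from_bytes(digest[8:16], "little") | 1
    return h1, h2


def km_positions(item: bytes, num_hashes: int, num_bits: int) -> list[int]:
    """g_i(x) = h1(x) + i * h2(x), reduced modulo the bitmap size."""
    h1, h2 = base_hashes(item)
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]
```

Only two real hash computations are needed per item regardless of `num_hashes`; whether this lowers the false positive rate for ngram data would still need to be measured.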

github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Apr 25, 2025

KKould force-pushed the feat/ngram_index branch 5 times, most recently from ba2213e to 8aabb9e Compare April 25, 2025 07:10

KKould commented Apr 25, 2025

tests/sqllogictests/suites/ee/04_ee_inverted_index/04_0000_inverted_index_base.test (outdated, resolved)

KKould force-pushed the feat/ngram_index branch from 8aabb9e to c88200b Compare April 25, 2025 07:36

KKould commented Apr 25, 2025

tests/sqllogictests/suites/base/09_fuse_engine/09_0006_func_fuse_history.test (outdated, resolved)

KKould force-pushed the feat/ngram_index branch from c88200b to 69d798f Compare April 25, 2025 08:54

KKould marked this pull request as ready for review April 25, 2025 09:27

b41sh self-requested a review April 25, 2025 11:22

b41sh reviewed Apr 25, 2025

src/query/sql/src/planner/binder/ddl/index.rs (outdated, resolved)

src/query/storages/common/index/src/bloom_index.rs (outdated, resolved)

src/query/storages/common/index/src/bloom_index.rs (outdated, resolved)

KKould force-pushed the feat/ngram_index branch 3 times, most recently from 28f2ae4 to af65547 Compare April 25, 2025 17:03

KKould added 5 commits April 26, 2025 23:34

feat: impl NgramIndex for FuseTable, improve like query performance

7e7d81d

Signed-off-by: Kould <[email protected]>

test: add explain test for ngram index

c652737

Signed-off-by: Kould <[email protected]>

chore: fix ci fail

61e0111

Signed-off-by: Kould <[email protected]>

chore: fix ci fail

cf5b818

Signed-off-by: Kould <[email protected]>

chore: add ngram index options check

67e6f07

Signed-off-by: Kould <[email protected]>

KKould force-pushed the feat/ngram_index branch from af65547 to 67e6f07 Compare April 26, 2025 15:34

sundy-li reviewed Apr 27, 2025

src/query/storages/common/index/src/bloom_index.rs (outdated, resolved)

sundy-li reviewed Apr 27, 2025

src/query/storages/common/index/src/bloom_index.rs (outdated, resolved)

chore: Logic for distinguishing partial ngrams from bloom indexes

6ed13ee

Signed-off-by: Kould <[email protected]>

KKould force-pushed the feat/ngram_index branch from ef734e0 to 762d575 Compare April 28, 2025 06:13

chore: FilterImpl uses Xor8Filter instead

ef24126

Signed-off-by: Kould <[email protected]>

KKould force-pushed the feat/ngram_index branch from 762d575 to ef24126 Compare April 28, 2025 06:19

chore: fix filter size on logic test

272c69f

Signed-off-by: Kould <[email protected]>

b41sh reviewed Apr 29, 2025

refactor: impl new BloomFilter

42cc778

Signed-off-by: Kould <[email protected]>

KKould force-pushed the feat/ngram_index branch 2 times, most recently from 6c72b5c to 88f3cd4 Compare April 29, 2025 12:16

chore: codefmt

9d37617

Signed-off-by: Kould <[email protected]>

KKould force-pushed the feat/ngram_index branch from 88f3cd4 to 9d37617 Compare April 29, 2025 15:05

test: add unit test for BloomFilter

79f3354

Signed-off-by: Kould <[email protected]>

b41sh reviewed Apr 30, 2025

src/query/storages/common/index/src/bloom_index.rs (resolved)

src/query/storages/common/index/src/filters/xor8/bloom_filter.rs (outdated, resolved)

KKould force-pushed the feat/ngram_index branch 2 times, most recently from b806534 to 1d304e8 Compare April 30, 2025 04:09

chore: codefmt

07623b4

Signed-off-by: Kould <[email protected]>

KKould force-pushed the feat/ngram_index branch from 1d304e8 to 07623b4 Compare April 30, 2025 05:54

sundy-li reviewed Apr 30, 2025

src/query/storages/common/index/src/filters/xor8/mod.rs (outdated, resolved)

chore: FilterImpl::to_bytes determine the Filter type by the first byte

6278283

Signed-off-by: Kould <[email protected]>

sundy-li approved these changes Apr 30, 2025

dantengsky approved these changes May 6, 2025

dantengsky merged commit 23f254f into databendlabs:main May 6, 2025
76 checks passed

KKould mentioned this pull request May 7, 2025

chore: remove useless logic test on explain_ngram_index.test #17887

Merged

11 tasks
