feat: impl NgramIndex for FuseTable, improve like query performance #17852

Draft
wants to merge 4 commits into
base: main

KKould
Member

@KKould KKould commented Apr 25, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

part of: #17724

Implement an Ngram Index to speed up LIKE queries.

It works by splitting each String value into n-gram substrings and inserting those substrings into a BloomFilter. For a LIKE query, the pattern is split into n-grams in the same way. If any of the pattern's n-grams is absent from a Block's BloomFilter, that Block cannot contain a matching row and is filtered out in advance.
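The principle above can be sketched in a few lines of Rust. This is a minimal illustration, not the code in this PR: the `Bloom` struct, the double hashing built on `DefaultHasher`, the three probes per key, and the use of character (rather than byte) n-grams are all assumptions made for the example, and only the `%needle%` form of LIKE is handled.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter (hypothetical, for illustration only).
struct Bloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl Bloom {
    fn new(bits: usize, hashes: u64) -> Self {
        Bloom { bits: vec![false; bits], hashes }
    }

    /// Probe positions for `key`, derived from two hashes (double hashing).
    fn positions(&self, key: &str) -> Vec<usize> {
        let mut h1 = DefaultHasher::new();
        key.hash(&mut h1);
        let a = h1.finish();
        let mut h2 = DefaultHasher::new();
        (key, 0x9e37u32).hash(&mut h2);
        let b = h2.finish() | 1; // keep the step odd so probes differ
        let m = self.bits.len() as u64;
        (0..self.hashes)
            .map(|i| (a.wrapping_add(i.wrapping_mul(b)) % m) as usize)
            .collect()
    }

    fn insert(&mut self, key: &str) {
        for p in self.positions(key) {
            self.bits[p] = true;
        }
    }

    fn may_contain(&self, key: &str) -> bool {
        self.positions(key).into_iter().all(|p| self.bits[p])
    }
}

/// All character n-grams of `s`; empty if `s` has fewer than `n` characters.
fn ngrams(s: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

/// Build one filter covering every string value in a block.
fn build_block_filter(rows: &[&str], n: usize, bits: usize) -> Bloom {
    let mut bloom = Bloom::new(bits, 3);
    for row in rows {
        for g in ngrams(row, n) {
            bloom.insert(&g);
        }
    }
    bloom
}

/// Returns false only when the block certainly has no row matching `%needle%`.
/// A needle shorter than `n` yields no n-grams, so the block cannot be pruned.
fn block_may_match(bloom: &Bloom, needle: &str, n: usize) -> bool {
    ngrams(needle, n).iter().all(|g| bloom.may_contain(g))
}
```

Calling `build_block_filter` on a block's rows and then `block_may_match` with the LIKE needle gives the pruning decision. A `true` result only means the block has to be scanned, since Bloom filters can report false positives but never false negatives.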

As a result, inserts take longer when an Ngram Index is used, because every string has to be split into n-grams. The overhead depends on the length of each string and the total number of rows.

Storage

Ngram Index is essentially the Bloom Index applied to data that has been segmented into n-grams. It therefore shares Meta with the Bloom Index and uses the same storage file.

Benchmark

The benchmark uses the amazon_reviews dataset. The total data size is 39.2 GB, and review_body accounts for 17 GB.

CREATE OR REPLACE TABLE `amazon_reviews_ngram` (
                                  `review_date` int(11) NULL,
                                  `marketplace` varchar(20) NULL,
                                  `customer_id` bigint(20) NULL,
                                  `review_id` varchar(40) NULL,
                                  `product_id` varchar(10) NULL,
                                  `product_parent` bigint(20) NULL,
                                  `product_title` varchar(500) NULL,
                                  `product_category` varchar(50) NULL,
                                  `star_rating` smallint(6) NULL,
                                  `helpful_votes` int(11) NULL,
                                  `total_votes` int(11) NULL,
                                  `vine` boolean NULL,
                                  `verified_purchase` boolean NULL,
                                  `review_headline` varchar(500) NULL,
                                  `review_body` string NULL,
                                  NGRAM INDEX idx1 (review_body) gram_size = 10 bitmap_size = 2097152
) Engine = Fuse bloom_index_columns='review_body';

copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2010.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2011.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2012.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2013.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2014.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2015.snappy.parquet file_format = (type = PARQUET);

With the table created and loaded by the SQL above, the total size of the BloomFilter files is 1.5 GB.

Query:

SELECT
    product_id,
    any(product_title),
    AVG(star_rating) AS rating,
    COUNT() AS count
FROM
    amazon_reviews_ngram
WHERE
    review_body LIKE '%The first track with Chris Botti is beautiful%'
GROUP BY
    product_id
ORDER BY
    count DESC,
    rating DESC,
    product_id
LIMIT 5;

Ngram:

  • 1 row read in 1.162 sec. Processed 3.22 million row, 1.70 GiB (2.77 million rows/s, 1.47 GiB/s)

Not Ngram:

  • 1 row read in 13.045 sec. Processed 135.59 million row, 52.91 GiB (10.39 million rows/s, 4.06 GiB/s)

Insert:

Ngram:

  • 2010: 45.310 sec
  • 2011: 52.052 sec
  • 2012: 77.412 sec
  • 2013: 129.136 sec
  • 2014: 156.244 sec
  • 2015: 104.428 sec

Not Ngram:

  • 2010: 6.090 sec
  • 2011: 6.468 sec
  • 2012: 9.751 sec
  • 2013: 15.562 sec
  • 2014: 23.374 sec
  • 2015: 14.587 sec
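To put the timings above in proportion, the ratios can be computed directly from the reported numbers. The helper names below are hypothetical and the snippet only does arithmetic on the figures already listed; it does not re-run the benchmark.

```rust
/// Ratio `a / b`, rounded to two decimals for display.
fn ratio(a: f64, b: f64) -> f64 {
    (a / b * 100.0).round() / 100.0
}

/// Per-year insert slowdown (Ngram time divided by non-Ngram time),
/// using the COPY timings reported in this PR description.
fn insert_slowdowns() -> Vec<(u32, f64)> {
    let ngram = [45.310, 52.052, 77.412, 129.136, 156.244, 104.428];
    let plain = [6.090, 6.468, 9.751, 15.562, 23.374, 14.587];
    (2010..=2015)
        .zip(ngram.iter().zip(plain.iter()))
        .map(|(year, (n, p))| (year, ratio(*n, *p)))
        .collect()
}

/// Query-side ratios (non-Ngram divided by Ngram): elapsed time and rows processed.
fn query_ratios() -> (f64, f64) {
    (ratio(13.045, 1.162), ratio(135.59, 3.22))
}
```

On these figures, inserts are roughly 6.7x to 8.3x slower with the Ngram Index, while the query runs about 11x faster and processes about 42x fewer rows.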

Tips: The factors that affect the insertion time are as follows:

  • The length of each row of data
  • Number of data rows
  • BloomFilter Bitmap Size
  • N (gram_size) of Ngram

The parameters in this benchmark were therefore chosen to favor query performance. In practice, users need to weigh insertion speed against filtering effectiveness.
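A rough way to see how row length, row count, and gram_size interact is to count the n-grams that must be hashed. The sketch below assumes one BloomFilter insertion per n-gram with no deduplication, which may not match what this PR does, and it does not model the effect of the bitmap size. The function names are hypothetical.

```rust
/// Number of n-grams produced by one string of `len` characters.
fn ngram_count(len: usize, n: usize) -> usize {
    if n == 0 || len < n {
        0
    } else {
        len - n + 1
    }
}

/// Total filter insertions for a block, given the character length of each row.
fn total_insertions(row_lens: &[usize], n: usize) -> usize {
    row_lens.iter().map(|&len| ngram_count(len, n)).sum()
}
```

Under this model a 1,000-character review with gram_size = 10 contributes 991 insertions, so long text columns dominate the cost, and rows shorter than gram_size contribute nothing to the index.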

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Apr 25, 2025
@KKould KKould force-pushed the feat/ngram_index branch 3 times, most recently from 63f8325 to b107792 Compare April 25, 2025 04:27
@KKould KKould force-pushed the feat/ngram_index branch from b107792 to e330d25 Compare April 25, 2025 04:34
Signed-off-by: Kould <[email protected]>
@KKould KKould force-pushed the feat/ngram_index branch from ba2213e to 8aabb9e Compare April 25, 2025 07:10
@@ -476,7 +476,7 @@ idx2 INVERTED books(title, author, description)index_record='"basic"' tokenizer=
query III
select row_count, bloom_filter_size, inverted_index_size from fuse_block('test_index', 't1')
----
-10 438 2390
+10 439 2390
Member Author

Add a bit to distinguish whether the filter is Xor8 or Bloom.

Signed-off-by: Kould <[email protected]>
@KKould KKould force-pushed the feat/ngram_index branch from 8aabb9e to c88200b Compare April 25, 2025 07:36
@@ -21,7 +21,7 @@ insert into t values (1)
query III
select block_count, row_count, index_size from fuse_snapshot('db_09_0006', 't') order by row_count desc limit 1
----
1 1 0
Member Author

Why is the index size 0? The bloom filter is created by default, so after inserting a row of data, the index size will always be greater than 1 in theory.
