[docs] Add document for Prefix Lookup (#504)

swuferhong · web-flow · commit 0626b16464a6 · 2025-03-01T22:41:30.000+08:00
diff --git a/website/docs/engine-flink/ddl.md b/website/docs/engine-flink/ddl.md
@@ -58,7 +58,7 @@ DROP DATABASE my_db;
 
 ### PrimaryKey Table
 
-The following SQL statement will create a [PrimaryKey Table](table-design/table-types/pk-table.md) with a primary key consisting of shop_id and user_id.
+The following SQL statement will create a [PrimaryKey Table](table-design/table-types/pk-table/index.md) with a primary key consisting of shop_id and user_id.
 ```sql title="Flink SQL"
 CREATE TABLE my_pk_table (
   shop_id BIGINT,
diff --git a/website/docs/engine-flink/lookups.md b/website/docs/engine-flink/lookups.md
@@ -7,11 +7,13 @@ sidebar_position: 5
 Flink lookup joins are important because they enable efficient, real-time enrichment of streaming data with reference data, a common requirement in many real-time analytics and processing scenarios.
 
 
-## Instructions
-- Use a primary key table as a dimension table,  and the join condition must include all primary keys of the dimension table.
+## Lookup
+
+### Instructions
+- Use a primary key table as a dimension table, and the join condition must include all primary keys of the dimension table.
 - Fluss lookup join is in asynchronous mode by default for higher throughput. You can change the mode of lookup join as synchronous mode by setting the SQL Hint `'lookup.async' = 'false'`.
 
-## Examples
+### Examples
 1. Create two tables.
 ```sql title="Flink SQL"
 CREATE TABLE `fluss_catalog`.`my_db`.`orders` (
@@ -24,6 +26,7 @@ CREATE TABLE `fluss_catalog`.`my_db`.`orders` (
   `o_clerk` CHAR(15) NOT NULL,
   `o_shippriority` INT NOT NULL,
   `o_comment` STRING NOT NULL,
+  `o_dt` STRING NOT NULL,
   PRIMARY KEY (o_orderkey) NOT ENFORCED
 );
 ```
@@ -83,7 +86,45 @@ FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c`
 ON `o`.`o_custkey` = `c`.`c_custkey`;
 ```
 
-## Lookup Options
+### Examples (Partitioned Table)
+
+Continuing from the previous example, suppose the dimension table is a Fluss partitioned primary key table, defined as follows:
+
+```sql title="Flink SQL"
+CREATE TABLE `fluss_catalog`.`my_db`.`customer_partitioned` (
+  `c_custkey` INT NOT NULL,
+  `c_name` STRING NOT NULL,
+  `c_address` STRING NOT NULL,
+  `c_nationkey` INT NOT NULL,
+  `c_phone` CHAR(15) NOT NULL,
+  `c_acctbal` DECIMAL(15, 2) NOT NULL,
+  `c_mktsegment` CHAR(10) NOT NULL,
+  `c_comment` STRING NOT NULL,
+  `dt` STRING NOT NULL,
+  PRIMARY KEY (`c_custkey`, `dt`) NOT ENFORCED
+) 
+PARTITIONED BY (`dt`)
+WITH (
+    'table.auto-partition.enabled' = 'true',
+    'table.auto-partition.time-unit' = 'year'
+);
+```
+
+To do a lookup join with a Fluss partitioned primary key table, specify all
+primary keys (including the partition key) in the join condition.
+```sql title="Flink SQL"
+INSERT INTO lookup_join_sink
+SELECT `o`.`o_orderkey`, `o`.`o_totalprice`, `c`.`c_name`, `c`.`c_address`
+FROM 
+(SELECT `orders`.*, proctime() AS ptime FROM `orders`) AS `o`
+LEFT JOIN `customer_partitioned`
+FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c`
+ON `o`.`o_custkey` = `c`.`c_custkey` AND `o`.`o_dt` = `c`.`dt`;
+```
+
+For more details about Fluss partitioned tables, see [Partitioned Tables](/docs/table-design/data-distribution/partitioning.md).
+
+### Lookup Options
 
 
 | Option                                          | Type     | Required | Default     | Description                                                                                                                                                                                                                                                                                                                                             |
@@ -94,3 +135,135 @@ ON `o`.`o_custkey` = `c`.`c_custkey`;
 | lookup.partial-cache.expire-after-write         | Duration | optional | (none)      | Duration to expire an entry in the cache after writing.                                                                                                                                                                                                                                                                                                 |
 | lookup.partial-cache.cache-missing-key          | Boolean  | optional | true        | Whether to store an empty value into the cache if the lookup key doesn't match any rows in the table.                                                                                                                                                                                                                                                   |
 | lookup.partial-cache.max-rows                   | Long     | optional | true        | The maximum number of rows to store in the cache.                                                                                                                                                                                                                                                                                                       |
+
+
+## Prefix Lookup
+
+### Instructions
+
+- Use a primary key table as a dimension table, and the join condition must be a prefix subset of the primary keys of the dimension table.
+- The bucket key of the Fluss dimension table must be set to the join key when creating the table.
+- Fluss prefix lookup join is in asynchronous mode by default for higher throughput. You can switch prefix lookup join to synchronous mode by setting the SQL hint `'lookup.async' = 'false'`.
+
+
+### Examples
+1. Create two tables.
+```sql title="Flink SQL"
+CREATE TABLE `fluss_catalog`.`my_db`.`orders` (
+  `o_orderkey` INT NOT NULL,
+  `o_custkey` INT NOT NULL,
+  `o_orderstatus` CHAR(1) NOT NULL,
+  `o_totalprice` DECIMAL(15, 2) NOT NULL,
+  `o_orderdate` DATE NOT NULL,
+  `o_orderpriority` CHAR(15) NOT NULL,
+  `o_clerk` CHAR(15) NOT NULL,
+  `o_shippriority` INT NOT NULL,
+  `o_comment` STRING NOT NULL,
+  `o_dt` STRING NOT NULL,
+  PRIMARY KEY (o_orderkey) NOT ENFORCED
+);
+```
+
+```sql title="Flink SQL"
+-- primary keys are (c_custkey, c_nationkey)
+-- bucket key is (c_custkey)
+CREATE TABLE `fluss_catalog`.`my_db`.`customer` (
+  `c_custkey` INT NOT NULL,
+  `c_name` STRING NOT NULL,
+  `c_address` STRING NOT NULL,
+  `c_nationkey` INT NOT NULL,
+  `c_phone` CHAR(15) NOT NULL,
+  `c_acctbal` DECIMAL(15, 2) NOT NULL,
+  `c_mktsegment` CHAR(10) NOT NULL,
+  `c_comment` STRING NOT NULL,
+  PRIMARY KEY (`c_custkey`, `c_nationkey`) NOT ENFORCED
+) WITH (
+  'bucket.key' = 'c_custkey' 
+);
+```
+
+2. Perform prefix lookup.
+```sql title="Flink SQL"
+USE CATALOG fluss_catalog;
+```
+
+```sql title="Flink SQL"
+USE my_db;
+```
+
+```sql title="Flink SQL"
+CREATE TEMPORARY TABLE lookup_join_sink
+(
+   order_key INT NOT NULL,
+   order_totalprice DECIMAL(15, 2) NOT NULL,
+   customer_name STRING NOT NULL,
+   customer_address STRING NOT NULL
+) WITH ('connector' = 'blackhole');
+```
+
+```sql title="Flink SQL"
+-- prefix lookup join in asynchronous mode.
+INSERT INTO lookup_join_sink
+SELECT `o`.`o_orderkey`, `o`.`o_totalprice`, `c`.`c_name`, `c`.`c_address`
+FROM 
+(SELECT `orders`.*, proctime() AS ptime FROM `orders`) AS `o`
+LEFT JOIN `customer`
+FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c`
+ON `o`.`o_custkey` = `c`.`c_custkey`;
+
+-- join key is a prefix subset of the dimension table's primary keys.
+```
+
+```sql title="Flink SQL"
+-- prefix lookup join in synchronous mode.
+INSERT INTO lookup_join_sink
+SELECT `o`.`o_orderkey`, `o`.`o_totalprice`, `c`.`c_name`, `c`.`c_address`
+FROM 
+(SELECT `orders`.*, proctime() AS ptime FROM `orders`) AS `o`
+LEFT JOIN `customer` /*+ OPTIONS('lookup.async' = 'false') */
+FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c`
+ON `o`.`o_custkey` = `c`.`c_custkey`;
+```
+
+### Examples (Partitioned Table)
+
+Continuing from the previous prefix lookup example, suppose the dimension table is a Fluss partitioned primary key table, defined as follows:
+
+```sql title="Flink SQL"
+-- primary keys are (c_custkey, c_nationkey, dt)
+-- bucket key is (c_custkey)
+CREATE TABLE `fluss_catalog`.`my_db`.`customer_partitioned` (
+  `c_custkey` INT NOT NULL,
+  `c_name` STRING NOT NULL,
+  `c_address` STRING NOT NULL,
+  `c_nationkey` INT NOT NULL,
+  `c_phone` CHAR(15) NOT NULL,
+  `c_acctbal` DECIMAL(15, 2) NOT NULL,
+  `c_mktsegment` CHAR(10) NOT NULL,
+  `c_comment` STRING NOT NULL,
+  `dt` STRING NOT NULL,
+  PRIMARY KEY (`c_custkey`, `c_nationkey`, `dt`) NOT ENFORCED
+) 
+PARTITIONED BY (`dt`)
+WITH (
+    'bucket.key' = 'c_custkey',
+    'table.auto-partition.enabled' = 'true',
+    'table.auto-partition.time-unit' = 'year'
+);
+```
+
+To do a prefix lookup with a Fluss partitioned primary key table, the join key must follow the pattern
+`a prefix subset of primary keys (excluding partition key)` + `partition key`.
+```sql title="Flink SQL"
+INSERT INTO lookup_join_sink
+SELECT `o`.`o_orderkey`, `o`.`o_totalprice`, `c`.`c_name`, `c`.`c_address`
+FROM 
+(SELECT `orders`.*, proctime() AS ptime FROM `orders`) AS `o`
+LEFT JOIN `customer_partitioned`
+FOR SYSTEM_TIME AS OF `o`.`ptime` AS `c`
+ON `o`.`o_custkey` = `c`.`c_custkey` AND `o`.`o_dt` = `c`.`dt`;
+
+-- join key is a prefix subset of the dimension table's primary keys (excluding the partition key) + the partition key.
+```
+
+For more details about Fluss partitioned tables, see [Partitioned Tables](/docs/table-design/data-distribution/partitioning.md).
diff --git a/website/docs/table-design/table-types/pk-table/index.md b/website/docs/table-design/table-types/pk-table/index.md
@@ -85,10 +85,6 @@ The following merge engines are supported:
 1. [FirstRow Merge Engine](/docs/table-design/table-types/pk-table/merge-engines/first-row)
 2. [Versioned Merge Engine](/docs/table-design/table-types/pk-table/merge-engines/versioned)
 
-## Data Queries
-
-For primary key tables, Fluss supports querying data directly based on the key. Please refer to
-the [Flink Reads](../../../engine-flink/reads.md) for detailed instructions.
 
 ## Changelog Generation
 
@@ -117,10 +113,22 @@ be generated.
 -D(1, 4.0, 'banana')
 ```
 
-## Data Consumption
+## Data Queries
+
+For primary key tables, Fluss supports several ways to query data.
+
+### Reads
 
-For a primary key table, the default consumption method is a full snapshot followed by incremental data. First, the
+For a primary key table, the default read method is a full snapshot followed by incremental data. First, the
 snapshot data of the table is consumed, followed by the binlog data of the table.
 
-It is also possible to only consume the binlog data of the table. For more details, please refer to
-the [Flink Reads](../../../engine-flink/reads.md)
+It is also possible to consume only the binlog data of the table. For more details, please refer to [Flink Reads](/docs/engine-flink/reads.md).
+
+### Lookup
+
+A Fluss primary key table can look up data by its primary keys. If the key exists in Fluss, the lookup returns a unique row. It is typically used in [Flink Lookup Join](/docs/engine-flink/lookups.md#lookup).
+
+### Prefix Lookup
+
+A Fluss primary key table can also do a prefix lookup by a prefix subset of its primary keys. Unlike lookup, prefix lookup
+scans data based on the prefix of the primary keys and may return multiple rows. It is typically used in [Flink Prefix Lookup Join](/docs/engine-flink/lookups.md#prefix-lookup).