Skip to content

Commit 5197c62

Browse files
authored
[docs] Add documentation for column pruning and partition pruning (#1080)
1 parent b2f5a6e commit 5197c62

File tree

1 file changed

+104
-0
lines changed

1 file changed

+104
-0
lines changed

website/docs/engine-flink/reads.md

Lines changed: 104 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -51,6 +51,110 @@ You can also do streaming read without reading the snapshot data, you can use `l
5151
SELECT * FROM my_table /*+ OPTIONS('scan.startup.mode' = 'latest') */;
5252
```
5353

54+
### Column Pruning
55+
56+
Column pruning minimizes I/O by reading only the columns used in a query and ignoring unused ones at the storage layer.
57+
In Fluss, column pruning is implemented using [Apache Arrow](https://arrow.apache.org/) as the default log format to optimize streaming reads from Log Tables and change logs of PrimaryKey Tables.
58+
Benchmark results show that column pruning can reach 10x read performance improvement, and reduce unnecessary network traffic (reduce 80% I/O if 80% columns are not used).
59+
60+
:::note
61+
1. Column pruning is only available when the table uses the Arrow log format (`'table.log.format' = 'arrow'`), which is enabled by default.
62+
2. Reading log data from remote storage currently does not support column pruning.
63+
:::
64+
65+
#### Example
66+
67+
**1. Create a table**
68+
```sql title="Flink SQL"
69+
CREATE TABLE `testcatalog`.`testdb`.`log_table` (
70+
`c_custkey` INT NOT NULL,
71+
`c_name` STRING NOT NULL,
72+
`c_address` STRING NOT NULL,
73+
`c_nationkey` INT NOT NULL,
74+
`c_phone` STRING NOT NULL,
75+
`c_acctbal` DECIMAL(15, 2) NOT NULL,
76+
`c_mktsegment` STRING NOT NULL,
77+
`c_comment` STRING NOT NULL
78+
);
79+
```
80+
81+
**2. Query a single column:**
82+
```sql title="Flink SQL"
83+
SELECT `c_name` FROM `testcatalog`.`testdb`.`log_table`;
84+
```
85+
86+
**3. Verify with `EXPLAIN`:**
87+
```sql title="Flink SQL"
88+
EXPLAIN SELECT `c_name` FROM `testcatalog`.`testdb`.`log_table`;
89+
```
90+
91+
**Output:**
92+
93+
```
94+
== Optimized Execution Plan ==
95+
TableSourceScan(table=[[testcatalog, testdb, log_table, project=[c_name]]], fields=[c_name])
96+
```
97+
98+
This confirms that only the `c_name` column is being read from storage.
99+
100+
### Partition Pruning
101+
102+
Partition pruning is an optimization technique for Fluss partitioned tables. It reduces the number of partitions scanned during a query by filtering based on partition keys.
103+
This optimization is especially useful in streaming scenarios for [Multi-Field Partitioned Tables](table-design/data-distribution/partitioning.md#multi-field-partitioned-tables) that has many partitions.
104+
The partition pruning also supports dynamically pruning new created partitions during streaming read.
105+
106+
:::note
107+
1. Currently, **only equality conditions** (e.g., `c_nationkey = 'US'`) are supported for partition pruning. Operators like `<`, `>`, `OR`, and `IN` are not yet supported.
108+
:::
109+
110+
#### Example
111+
112+
**1. Create a partitioned table:**
113+
```sql title="Flink SQL"
114+
CREATE TABLE `testcatalog`.`testdb`.`log_partitioned_table` (
115+
`c_custkey` INT NOT NULL,
116+
`c_name` STRING NOT NULL,
117+
`c_address` STRING NOT NULL,
118+
`c_nationkey` INT NOT NULL,
119+
`c_phone` STRING NOT NULL,
120+
`c_acctbal` DECIMAL(15, 2) NOT NULL,
121+
`c_mktsegment` STRING NOT NULL,
122+
`c_comment` STRING NOT NULL,
123+
`dt` STRING NOT NULL
124+
) PARTITIONED BY (`c_nationkey`,`dt`);
125+
```
126+
127+
**2. Query with partition filter:**
128+
```sql title="Flink SQL"
129+
SELECT * FROM `testcatalog`.`testdb`.`log_partitioned_table` WHERE `c_nationkey` = 'US';
130+
```
131+
132+
Fluss source will scan only the partitions where `c_nationkey = 'US'`.
133+
For example, if the following partitions exist:
134+
- `US,2025-06-13`
135+
- `China,2025-06-13`
136+
- `US,2025-06-14`
137+
- `China,2025-06-14`
138+
139+
Only `US,2025-06-13` and `US,2025-06-14` will be read.
140+
141+
As new partitions like `US,2025-06-15`, `China,2025-06-15` are created, partition `US,2025-06-15` will be automatically included in the stream, while `China,2025-06-15` will be dynamically filtered out based on the partition pruning condition.
142+
143+
**3. Verify with `EXPLAIN`:**
144+
145+
```sql title="Flink SQL"
146+
EXPLAIN SELECT * FROM `testcatalog`.`testdb`.`log_partitioned_table` WHERE `c_nationkey` = 'US';
147+
```
148+
149+
**Output:**
150+
151+
```text
152+
== Optimized Execution Plan ==
153+
TableSourceScan(table=[[testcatalog, testdb, log_partitioned_table, filter=[=(c_nationkey, _UTF-16LE'US':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")]]], fields=[c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment, dt])
154+
```
155+
156+
This confirms that only partitions matching `c_nationkey = 'US'` will be scanned.
157+
54158
## Batch Read
55159

56160
### Limit Read

0 commit comments

Comments
 (0)