
Commit f29018e

[docs] Add document for Column Pruning
1 parent ae08d34 commit f29018e

File tree

1 file changed: +41 -12 lines changed


website/docs/table-design/table-types/log-table.md

Lines changed: 41 additions & 12 deletions
Log Tables in Fluss allow real-time data consumption, preserving the order of data writes:

- For two data records from the same partition but different buckets, the consumption order is not guaranteed because different buckets may be processed concurrently by different data consumption jobs.
## Column Pruning
Column pruning is a technique that reduces the amount of data read from storage by eliminating the columns a query does not need.
Fluss supports column pruning for Log Tables and for the changelog of PrimaryKey Tables, which can significantly improve query performance by reducing both the data read from storage and the networking costs.

What sets Fluss apart is its ability to apply **column pruning during streaming reads**, a capability that is both unique and industry-leading. This ensures that even in real-time streaming scenarios, only the required columns are processed, minimizing resource usage and maximizing efficiency.

In Fluss, Log Tables are stored in a columnar format (Apache Arrow) by default.
This format stores data column-by-column rather than row-by-row, making it highly efficient for column pruning: when a query specifies only a subset of columns, Fluss skips reading the irrelevant columns entirely, minimizing I/O overhead and improving overall system efficiency.

During query execution, query engines like Flink analyze the query to identify the columns required for processing and tell Fluss to read only the necessary columns.
For example, consider the following streaming query:
```sql
SELECT id, name FROM log_table WHERE `timestamp` > '2023-01-01';
```
In this query, only the `id`, `name`, and `timestamp` columns are accessed. Other columns (e.g., `address`, `status`) are pruned and never read from storage.
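As a point of reference, here is a minimal sketch of what the `log_table` schema above might look like; the column types are illustrative assumptions, not taken from an actual dataset:

```sql
-- Hypothetical schema for the pruning example above.
-- With no PRIMARY KEY clause, Fluss creates an append-only Log Table,
-- stored in the columnar Arrow log format by default.
CREATE TABLE log_table (
  id BIGINT,
  name STRING,
  address STRING,
  status STRING,
  `timestamp` TIMESTAMP(3)
);
```

A streaming `SELECT id, name, ...` against this table transfers only three of the five columns from the server, which is where the I/O and networking savings come from.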
## Log Compression
**Log Table** supports end-to-end compression for the Arrow log format. Fluss leverages [Arrow native compression](https://arrow.apache.org/docs/format/Columnar.html#compression) to implement this feature, ensuring that compressed data remains compliant with the Arrow format. As a result, the compressed data can be seamlessly decompressed by any Arrow-compatible library. Additionally, compression is applied to each column independently, preserving the ability to perform column pruning on the compressed data without performance degradation.
When compression is enabled:

- For **Log Tables**, data is compressed by the writer on the client side, written in a compressed format, and decompressed by the log scanner on the client side.
- For **PrimaryKey Table changelogs**, compression is performed on the server side, since the changelog is generated there (see the sketch after this list).
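The following is a hedged sketch (the table name and columns are assumptions for illustration) of a PrimaryKey Table; because such a table has a primary key, its changelog is generated on the server, which is also where it is compressed, using the same table properties described below:

```sql title="Flink SQL"
-- Hypothetical PrimaryKey Table: the changelog it produces is
-- compressed on the server side, where the changelog is generated.
CREATE TABLE pk_table (
  id INT,
  name STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'table.log.arrow.compression.type' = 'ZSTD'
);
```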
Log compression significantly reduces networking and storage costs. Benchmark results show that ZSTD compression at level 3 achieves a compression ratio of approximately **5x** (e.g., reducing 5 GB of data to 1 GB). Furthermore, read/write throughput improves substantially due to the reduced networking overhead.
By default, Log Tables use the `ZSTD` compression codec with a compression level of `3`. You can change the codec by setting the `table.log.arrow.compression.type` property to `NONE`, `LZ4_FRAME`, or `ZSTD`, and you can adjust the `ZSTD` compression level by setting the `table.log.arrow.compression.zstd.level` property to a value between `1` and `22`.

For example:
```sql title="Flink SQL"
-- Set the compression codec to LZ4_FRAME
-- and the ZSTD compression level to 2
CREATE TABLE log_table (
  ...
) WITH (
  'table.log.arrow.compression.type' = 'LZ4_FRAME',
  'table.log.arrow.compression.zstd.level' = '2'
);
```
In the above example, we set the compression codec to `LZ4_FRAME` and the ZSTD compression level to `2`.

:::note
1. Currently, the compression codec and compression level are only supported for the arrow log format. If you set `'table.log.format'='indexed'`, the compression codec and compression level are ignored (see the sketch below).
2. The valid range of `table.log.arrow.compression.zstd.level` is 1 to 22.
:::
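To make the first note concrete, here is a brief sketch (the table name and columns are illustrative assumptions) of a table on the `indexed` log format, for which the Arrow compression options have no effect:

```sql title="Flink SQL"
-- The 'indexed' format is row-oriented, so Arrow compression settings are ignored.
CREATE TABLE indexed_log_table (
  id BIGINT,
  name STRING
) WITH (
  'table.log.format' = 'indexed',
  'table.log.arrow.compression.type' = 'ZSTD'  -- ignored for the indexed format
);
```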
## Log Tiering
Log Table supports tiering data to different storage tiers. See more details about [Remote Log](/docs/maintenance/tiered-storage/remote-storage/).
