**File changed:** `website/docs/table-design/table-types/log-table.md` (+41 −12)
Log Tables in Fluss allow real-time data consumption, preserving the order of data:
- For two data records from the same partition but different buckets, the consumption order is not guaranteed because different buckets may be processed concurrently by different data consumption jobs.
## Column Pruning
Column pruning reduces the amount of data read from storage by eliminating unnecessary columns from a query. Fluss supports column pruning for Log Tables and for the changelog of PrimaryKey Tables, which can significantly improve query performance by cutting both storage reads and networking costs.
What sets Fluss apart is its ability to apply **column pruning during streaming reads**, a capability that is both unique and industry-leading. This ensures that even in real-time streaming scenarios, only the required columns are processed, minimizing resource usage and maximizing efficiency.
In Fluss, Log Tables are stored in a columnar format (Apache Arrow) by default. This format stores data column by column rather than row by row, making it highly efficient for column pruning: when a query specifies only a subset of columns, Fluss skips the irrelevant columns entirely and reads just the required ones, minimizing I/O overhead and improving overall system efficiency.
During query execution, query engines like Flink analyze the query to identify the columns required for processing and tell Fluss to read only those columns. For example, consider the following streaming query:
```sql
SELECT id, name FROM log_table WHERE `timestamp` > '2023-01-01';
```
In this query, only the `id`, `name`, and `timestamp` columns are accessed. Other columns (e.g., `address`, `status`) are pruned and not read from storage.
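For reference, the query above assumes a table whose schema has more columns than the query needs. The sketch below is a hypothetical schema (the column types and the exact DDL are illustrative, not taken from the Fluss docs) showing which columns get pruned:

```sql title="Flink SQL"
-- Hypothetical table for the query above. Only `id`, `name`, and
-- `timestamp` are read from storage; `address` and `status` are pruned.
CREATE TABLE log_table (
  `id` BIGINT,
  `name` STRING,
  `address` STRING,
  `status` STRING,
  `timestamp` TIMESTAMP(3)
);
```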
## Log Compression
**Log Table** supports end-to-end compression for the Arrow log format. Fluss leverages [Arrow native compression](https://arrow.apache.org/docs/format/Columnar.html#compression) to implement this feature, ensuring that compressed data remains compliant with the Arrow format; as a result, it can be seamlessly decompressed by any Arrow-compatible library. Additionally, compression is applied to each column independently, preserving the ability to perform column pruning on compressed data without performance degradation.
When compression is enabled:
- For **Log Tables**, data is compressed by the writer on the client side, written in a compressed format, and decompressed by the log scanner on the client side.
- For **PrimaryKey Table changelogs**, compression is performed on the server side, since the changelog is generated on the server (see the sketch after this list).
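For illustration, the sketch below sets a non-default compression level on a PrimaryKey Table. It assumes, beyond what this page states, that the `table.log.arrow.compression.*` properties introduced below also govern the server-side compression of the changelog:

```sql title="Flink SQL"
-- Hypothetical PrimaryKey Table. Assumption: the log compression
-- properties also apply to its server-generated changelog.
CREATE TABLE pk_table (
  `id` BIGINT,
  `name` STRING,
  PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
  'table.log.arrow.compression.type' = 'ZSTD',
  'table.log.arrow.compression.zstd.level' = '9'
);
```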
Log compression significantly reduces networking and storage costs. Benchmark results show that ZSTD compression at level 3 achieves a compression ratio of approximately **5x** (e.g., reducing 5 GB of data to 1 GB), and read/write throughput improves substantially thanks to the reduced networking overhead.
By default, Log Tables use the `ZSTD` compression codec with a compression level of `3`. You can change the codec by setting the `table.log.arrow.compression.type` property to `NONE`, `LZ4_FRAME`, or `ZSTD`, and you can adjust the `ZSTD` compression level by setting the `table.log.arrow.compression.zstd.level` property to a value between `1` and `22`.
For example:
```sql title="Flink SQL"
-- Set the compression codec to LZ4_FRAME
CREATE TABLE log_table (
  -- column definitions elided in this diff
  ...
) WITH (
  'table.log.arrow.compression.type' = 'LZ4_FRAME',
  'table.log.arrow.compression.zstd.level' = '2'
);
```
In the above example, we set the compression codec to `LZ4_FRAME` and the compression level to `2`.
:::note
1. Currently, the compression codec and compression level are only supported for the Arrow log format. If you set `'table.log.format'='indexed'`, the compression settings are ignored.
2. The valid range of `table.log.arrow.compression.zstd.level` is 1 to 22.
:::
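As a concrete, hypothetical illustration of the first note, the Arrow compression setting below would be silently ignored because the table uses the row-oriented `indexed` log format:

```sql title="Flink SQL"
CREATE TABLE indexed_log_table (
  `id` BIGINT,
  `name` STRING
) WITH (
  'table.log.format' = 'indexed',
  -- ignored: compression only applies to the 'arrow' log format
  'table.log.arrow.compression.type' = 'ZSTD'
);
```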
## Log Tiering
Log Table supports tiering data to different storage tiers. See [Remote Log](/docs/maintenance/tiered-storage/remote-storage/) for more details.