`docs/en/engines/table-engines/integrations/iceberg.md`
sidebar_position: 90
sidebar_label: Iceberg
---

# Iceberg Table Engine {#iceberg-table-engine}

:::warning
We recommend using the [Iceberg Table Function](/docs/sql-reference/table-functions/iceberg.md) for working with Iceberg data in ClickHouse. The Iceberg Table Function currently provides sufficient functionality, offering a partial read-only interface for Iceberg tables.
The table engine `Iceberg` is currently an alias for `IcebergS3`.
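
For illustration, a table using the engine might be created roughly like this (a minimal sketch; the bucket URL and credentials are placeholders, and optional arguments such as format are omitted):

```sql
CREATE TABLE iceberg_table
    ENGINE = IcebergS3('https://bucket.s3.amazonaws.com/path/to/iceberg_table/', 'access_key_id', 'secret_access_key');
```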

## Schema Evolution {#schema-evolution}

At the moment, ClickHouse can read Iceberg tables whose schema has changed over time. We currently support reading tables where columns have been added or removed and where their order has changed. You can also change a column from one where a value is required to one where NULL is allowed. Additionally, we support permitted type casting for simple types, namely:
* int -> long
* float -> double

Currently, it is not possible to change nested structures or the types of elements within arrays and maps.

To read a table where the schema has changed after its creation with dynamic schema inference, set `allow_dynamic_metadata_for_data_lakes = true` when creating the table.
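
A minimal sketch of passing this setting at table creation (the S3 URL and credentials are placeholders):

```sql
CREATE TABLE iceberg_evolving_table
    ENGINE = IcebergS3('https://bucket.s3.amazonaws.com/path/to/iceberg_table/', 'access_key_id', 'secret_access_key')
    SETTINGS allow_dynamic_metadata_for_data_lakes = true;
```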

## Partition Pruning {#partition-pruning}

ClickHouse supports partition pruning during SELECT queries for Iceberg tables, which helps optimize query performance by skipping irrelevant data files. Currently, it works only with identity transforms and time-based transforms (hour, day, month, year). To enable partition pruning, set `use_iceberg_partition_pruning = 1`.
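
For example (a sketch with an illustrative table and partition column), the setting can be enabled per query:

```sql
SELECT count()
FROM iceberg_table
WHERE event_date >= '2024-01-01'
SETTINGS use_iceberg_partition_pruning = 1;
```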

## Time Travel {#time-travel}

ClickHouse supports time travel for Iceberg tables, allowing you to query historical data with a specific timestamp or snapshot ID.

### Basic usage {#basic-usage}

```sql
SELECT * FROM example_table ORDER BY 1
SETTINGS iceberg_timestamp_ms = 1714636800000
```

```sql
SELECT * FROM example_table ORDER BY 1
SETTINGS iceberg_snapshot_id = 3547395809148285433
```

Note: You cannot specify both `iceberg_timestamp_ms` and `iceberg_snapshot_id` parameters in the same query.

### Important considerations {#important-considerations}

- **Snapshots** are typically created when:
  - New data is written to the table
  - Some kind of data compaction is performed

- **Schema changes typically don't create snapshots** - This leads to important behaviors when using time travel with tables that have undergone schema evolution.
112
+
113
+
### Example scenarios {#example-scenarios}

All scenarios are written in Spark because ClickHouse doesn't support writing to Iceberg tables yet.
116
+
117
+
#### Scenario 1: Schema Changes Without New Snapshots {#scenario-1}

Consider this sequence of operations:

```sql
-- Create a table with two columns
CREATE TABLE IF NOT EXISTS spark_catalog.db.time_travel_example (
    order_number int,
    product_code string
)
USING iceberg
OPTIONS ('format-version'='2');
```

This happens because `ALTER TABLE` doesn't create a new snapshot; for the current table, Spark takes the value of `schema_id` from the latest metadata file, not from a snapshot.

#### Scenario 3: Historical vs. Current Schema Differences {#scenario-3}

Another important behavior is that, when doing time travel, you cannot get the state of a table from before any data was written to it:

```sql
-- Create a table
CREATE TABLE IF NOT EXISTS spark_catalog.db.time_travel_example_3 (
    order_number int,
    product_code string
)
USING iceberg
OPTIONS ('format-version'='2');

ts = now();

-- Query the table at a specific timestamp
SELECT * FROM spark_catalog.db.time_travel_example_3 TIMESTAMP AS OF ts; -- Finishes with error: Cannot find a snapshot older than ts.
```

In ClickHouse, the behavior is consistent with Spark. You can mentally replace Spark SELECT queries with ClickHouse SELECT queries and they will work the same way.

## Data cache {#data-cache}

The `Iceberg` table engine and table function support data caching in the same way as the `S3`, `AzureBlobStorage`, and `HDFS` storages. See [here](../../../engines/table-engines/integrations/s3.md#data-cache).
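
As a sketch, assuming a filesystem cache named `cache_for_iceberg` has been defined in the server configuration, it could be used per query like this:

```sql
SELECT count()
FROM iceberg_table
SETTINGS filesystem_cache_name = 'cache_for_iceberg', enable_filesystem_cache = 1;
```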

## Metadata cache {#metadata-cache}

The `Iceberg` table engine and table function support a metadata cache that stores the information from manifest files, the manifest list, and the metadata JSON. The cache is stored in memory. This feature is controlled by the setting `use_iceberg_metadata_files_cache`, which is enabled by default.
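
A sketch of turning the cache off, assuming `use_iceberg_metadata_files_cache` can be set at the query level and a table named `iceberg_table`:

```sql
SELECT count()
FROM iceberg_table
SETTINGS use_iceberg_metadata_files_cache = 0;
```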