
Commit 389519b

[Doc] Add documents related to compression codec in connector sink modules
Signed-off-by: Gavin <[email protected]>
1 parent 438b9e5 commit 389519b

File tree

6 files changed, +19 -10 lines changed


docs/en/data_source/catalog/hive_catalog.md

Lines changed: 5 additions & 3 deletions
@@ -38,7 +38,9 @@ To ensure successful SQL workloads on your Hive cluster, your StarRocks cluster
  - Parquet and ORC files support the following compression formats: NO_COMPRESSION, SNAPPY, LZ4, ZSTD, and GZIP.
  - Textfile files support the NO_COMPRESSION compression format.

- You can use the session variable [`connector_sink_compression_codec`](../../sql-reference/System_variable.md#connector_sink_compression_codec) to specify the compression algorithm used for sinking data to Hive tables.
+ You can use the table property [`compression_codec`](../../data_source/catalog/hive_catalog.md#properties) or the session variable [`connector_sink_compression_codec`](../../sql-reference/System_variable.md#connector_sink_compression_codec) to specify the compression algorithm used for sinking data to Hive tables.
+
+ When writing to a Hive table, if the table's properties include a compression codec, StarRocks preferentially uses that algorithm to compress the written data. Otherwise, it uses the compression algorithm set in the session variable `connector_sink_compression_codec`.

  ## Integration preparations
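The precedence described above can be sketched with a short SQL session; the catalog, database, and table names below are hypothetical:

```sql
-- Hypothetical names; a sketch of the codec precedence when sinking to Hive.
SET connector_sink_compression_codec = 'gzip';

-- The table property wins: data is compressed with ZSTD, not GZIP.
CREATE TABLE hive_catalog.sales_db.events_zstd (id BIGINT, payload STRING)
PROPERTIES ("compression_codec" = "zstd");
INSERT INTO hive_catalog.sales_db.events_zstd VALUES (1, 'a');

-- No table-level codec: the session variable applies, so GZIP is used.
CREATE TABLE hive_catalog.sales_db.events_plain (id BIGINT, payload STRING)
PROPERTIES ("file_format" = "parquet");
INSERT INTO hive_catalog.sales_db.events_plain VALUES (1, 'a');
```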

@@ -1021,7 +1023,7 @@ The following table describes a few key properties.
  | ----------------- | ------------------------------------------------------------ |
  | location | The file path in which you want to create the managed table. When you use HMS as metastore, you do not need to specify the `location` parameter, because StarRocks will create the table in the default file path of the current Hive catalog. When you use AWS Glue as metadata service:<ul><li>If you have specified the `location` parameter for the database in which you want to create the table, you do not need to specify the `location` parameter for the table. As such, the table defaults to the file path of the database to which it belongs. </li><li>If you have not specified the `location` for the database in which you want to create the table, you must specify the `location` parameter for the table.</li></ul> |
  | file_format | The file format of the managed table. Supported file formats are Parquet, ORC, and Textfile. ORC and Textfile formats are supported from v3.3 onwards. Valid values: `parquet`, `orc`, and `textfile`. Default value: `parquet`. |
- | compression_codec | The compression algorithm used for the managed table. This property is deprecated in v3.2.3, since which version the compression algorithm used for sinking data to Hive tables is uniformly controlled by the session variable [connector_sink_compression_codec](../../sql-reference/System_variable.md#connector_sink_compression_codec). |
+ | compression_codec | The compression algorithm used for the managed table. |
### Examples
@@ -1068,7 +1070,7 @@ Note that sinking data to external tables is disabled by default. To sink data t
  :::note

  - You can grant and revoke privileges by using [GRANT](../../sql-reference/sql-statements/account-management/GRANT.md) and [REVOKE](../../sql-reference/sql-statements/account-management/REVOKE.md).
- - You can use the session variable [connector_sink_compression_codec](../../sql-reference/System_variable.md#connector_sink_compression_codec) to specify the compression algorithm used for sinking data to Hive tables.
+ - You can use the table property [`compression_codec`](../../data_source/catalog/hive_catalog.md#properties) or the session variable [`connector_sink_compression_codec`](../../sql-reference/System_variable.md#connector_sink_compression_codec) to specify the compression algorithm used for sinking data to Hive tables. StarRocks prioritizes the compression codec specified in the table property.

  :::

docs/en/data_source/catalog/iceberg/iceberg_catalog.md

Lines changed: 1 addition & 1 deletion
@@ -1432,7 +1432,7 @@ Description: The file format of the Iceberg table. Only the Parquet format is su
  ###### compression_codec

- Description: The compression algorithm used for the Iceberg table. The supported compression algorithms are SNAPPY, GZIP, ZSTD, and LZ4. Default value: `gzip`. This property is deprecated in v3.2.3, since which version the compression algorithm used for sinking data to Iceberg tables is uniformly controlled by the session variable [connector_sink_compression_codec](../../../sql-reference/System_variable.md#connector_sink_compression_codec).
+ Description: The compression algorithm used for the Iceberg table. The supported compression algorithms are SNAPPY, GZIP, ZSTD, and LZ4. Default value: `zstd`.

  ---
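A sketch of setting the codec per table rather than relying on the `zstd` default; the catalog and table names are hypothetical:

```sql
-- Hypothetical names; overrides the default codec (zstd) for one Iceberg table.
CREATE TABLE iceberg_catalog.analytics_db.clicks (
    ts  DATETIME,
    url STRING
)
PROPERTIES ("compression_codec" = "lz4");
```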

docs/en/sql-reference/System_variable.md

Lines changed: 4 additions & 1 deletion
@@ -349,7 +349,10 @@ Used for MySQL client compatibility. No practical usage.
  ### connector_sink_compression_codec

- * **Description**: Specifies the compression algorithm used for writing data into Hive tables or Iceberg tables, or exporting data with Files().
+ * **Description**: Specifies the compression algorithm used for writing data into Hive tables or Iceberg tables, or exporting data with Files(). This parameter takes effect only in the following situations:
+   * The `compression_codec` property does not exist in the Hive table.
+   * The `compression_codec` and `write.parquet.compression-codec` properties do not exist in the Iceberg table.
+   * The `compression` property is not set in `INSERT INTO FILES`.
  * **Valid values**: `uncompressed`, `snappy`, `lz4`, `zstd`, and `gzip`.
  * **Default**: uncompressed
  * **Data type**: String

docs/zh/data_source/catalog/hive_catalog.md

Lines changed: 4 additions & 3 deletions
@@ -38,7 +38,8 @@ Hive Catalog is a type of external catalog, supported from v2.3 onwards. Through Hi
  - Parquet and ORC files support the NO_COMPRESSION, SNAPPY, LZ4, ZSTD, and GZIP compression formats.
  - Textfile files support the NO_COMPRESSION compression format.

- You can set the compression algorithm used when writing to Hive tables through the session variable [`connector_sink_compression_codec`](../../sql-reference/System_variable.md#connector_sink_compression_codec).
+ You can set the compression algorithm used when writing to Hive tables through the table property [`compression_codec`](../../data_source/catalog/hive_catalog.md#properties) or the session variable [`connector_sink_compression_codec`](../../sql-reference/System_variable.md#connector_sink_compression_codec).
+ When writing to a Hive table, if the table's properties include a compression codec, StarRocks preferentially uses that algorithm to compress the written data. Otherwise, it uses the compression algorithm set in the session variable `connector_sink_compression_codec`.

  ## Preparations
@@ -1030,7 +1031,7 @@ PARTITION BY (par_col1[, par_col2...])
  | ----------------- | ------------------------------------------------------------ |
  | location | The file path of the managed table. When you use HMS as the metadata service, you do not need to specify the `location` parameter. When you use AWS Glue as the metadata service:<ul><li>If you specified the `location` parameter when creating the current database, you do not need to specify it again when creating tables in that database; StarRocks creates the table in the file path of the current database by default.</li><li>If you did not specify the `location` parameter when creating the current database, you must specify it when creating tables in that database.</li></ul> |
  | file_format | The file format of the managed table. Parquet, ORC, and Textfile formats are supported; ORC and Textfile are supported from v3.3 onwards. Valid values: `parquet`, `orc`, and `textfile`. Default value: `parquet`. |
- | compression_codec | The compression format of the managed table. This property is deprecated in v3.2.3, since which version the compression algorithm used when writing to Hive tables is uniformly controlled by the session variable [connector_sink_compression_codec](../../sql-reference/System_variable.md#connector_sink_compression_codec). |
+ | compression_codec | The compression format of the managed table. |

  ### Examples

@@ -1077,7 +1078,7 @@ PARTITION BY (par_col1[, par_col2...])
  :::note

  - You can grant and revoke privileges for users and roles by using [GRANT](../../sql-reference/sql-statements/account-management/GRANT.md) and [REVOKE](../../sql-reference/sql-statements/account-management/REVOKE.md).
- - You can specify the compression algorithm used when writing to Hive tables through the session variable [connector_sink_compression_codec](../../sql-reference/System_variable.md#connector_sink_compression_codec).
+ - You can set the compression algorithm used when writing to Hive tables through the table property [`compression_codec`](../../data_source/catalog/hive_catalog.md#properties) or the session variable [`connector_sink_compression_codec`](../../sql-reference/System_variable.md#connector_sink_compression_codec). StarRocks prioritizes the compression codec specified in the table property.

  :::

docs/zh/data_source/catalog/iceberg/iceberg_catalog.md

Lines changed: 1 addition & 1 deletion
@@ -1470,7 +1470,7 @@ ORDER BY (column_name [sort_direction] [nulls_order], ...)
  ###### compression_codec

- Description: The compression algorithm used for the Iceberg table. The supported compression algorithms are SNAPPY, GZIP, ZSTD, and LZ4. Default value: `gzip`. This property is deprecated in v3.2.3, since which version the compression algorithm used when loading data into Iceberg tables is uniformly controlled by the session variable [connector_sink_compression_codec](../../../sql-reference/System_variable.md#connector_sink_compression_codec).
+ Description: The compression algorithm used for the Iceberg table. The supported compression algorithms are SNAPPY, GZIP, ZSTD, and LZ4. Default value: `zstd`.

  ---

docs/zh/sql-reference/System_variable.md

Lines changed: 4 additions & 1 deletion
@@ -299,7 +299,10 @@ ALTER USER 'jack' SET PROPERTIES ('session.query_timeout' = '600');
  ### connector_sink_compression_codec

- * Description: Specifies the compression algorithm used when writing to Hive tables or Iceberg tables, and when exporting data with Files(). Valid values: `uncompressed`, `snappy`, `lz4`, `zstd`, and `gzip`.
+ * Description: Specifies the compression algorithm used when writing to Hive tables or Iceberg tables, and when exporting data with Files(). Valid values: `uncompressed`, `snappy`, `lz4`, `zstd`, and `gzip`. This parameter takes effect only in the following situations:
+   * The `compression_codec` property is not specified in the Hive table.
+   * The Iceberg table does not contain the `compression_codec` or `write.parquet.compression-codec` property.
+   * The `compression` property is not set in `INSERT INTO FILES`.
  * Default: uncompressed
  * Type: String
  * Introduced in: v3.2.3
