Commit d159fbe

[Feature][Connectors-V2][File]support assign encoding for file source/sink (#6489)

1 parent aded562 commit d159fbe

File tree: 50 files changed (+1078, −24 lines)


Diff for: docs/en/connector-v2/sink/CosFile.md (+6)

@@ -61,6 +61,7 @@ By default, we use 2PC commit to ensure `exactly-once`
 | xml_root_tag | string | no | RECORDS | Only used when file_format is xml. |
 | xml_row_tag | string | no | RECORD | Only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Only used when file_format is xml. |
+| encoding | string | no | "UTF-8" | Only used when file_format_type is json,text,csv,xml. |

 ### path [string]

@@ -205,6 +206,11 @@ Specifies the tag name of the data rows within the XML file.

 Specifies Whether to process data using the tag attribute format.

+### encoding [string]
+
+Only used when file_format_type is json,text,csv,xml.
+The encoding of the file to write. This param will be parsed by `Charset.forName(encoding)`.
+
 ## Example

 For text file format with `have_partition` and `custom_filename` and `sink_columns`
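The new `encoding` option is documented as being resolved with `Charset.forName(encoding)`. A minimal standalone Java sketch of what a sink-side write with a non-default encoding looks like (the file path, class name, and sample text are illustrative, not taken from the connector code):

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodedWriteSketch {
    public static void main(String[] args) throws IOException {
        // The docs above say the option is resolved with Charset.forName(encoding),
        // so any charset name the JVM knows (including aliases) is accepted.
        Charset charset = Charset.forName("GBK");

        Path out = Files.createTempFile("encoding-demo", ".txt");
        try (Writer writer = Files.newBufferedWriter(out, charset)) {
            writer.write("中文内容\n"); // the bytes on disk are GBK, not the UTF-8 default
        }

        // 2 bytes per CJK character in GBK (9 bytes total here) vs 3 in UTF-8 (13 total).
        System.out.println(Files.readAllBytes(out).length);
    }
}
```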

Diff for: docs/en/connector-v2/sink/FtpFile.md (+6)

@@ -60,6 +60,7 @@ By default, we use 2PC commit to ensure `exactly-once`
 | xml_root_tag | string | no | RECORDS | Only used when file_format is xml. |
 | xml_row_tag | string | no | RECORD | Only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Only used when file_format is xml. |
+| encoding | string | no | "UTF-8" | Only used when file_format_type is json,text,csv,xml. |

 ### host [string]

@@ -210,6 +211,11 @@ Specifies the tag name of the data rows within the XML file.

 Specifies Whether to process data using the tag attribute format.

+### encoding [string]
+
+Only used when file_format_type is json,text,csv,xml.
+The encoding of the file to write. This param will be parsed by `Charset.forName(encoding)`.
+
 ## Example

 For text file format simple config

Diff for: docs/en/connector-v2/sink/HdfsFile.md (+1)

@@ -67,6 +67,7 @@ Output data to hdfs file
 | xml_root_tag | string | no | RECORDS | Only used when file_format is xml, specifies the tag name of the root element within the XML file. |
 | xml_row_tag | string | no | RECORD | Only used when file_format is xml, specifies the tag name of the data rows within the XML file |
 | xml_use_attr_format | boolean | no | - | Only used when file_format is xml, specifies Whether to process data using the tag attribute format. |
+| encoding | string | no | "UTF-8" | Only used when file_format_type is json,text,csv,xml. |

 ### Tips

Diff for: docs/en/connector-v2/sink/LocalFile.md (+18)

@@ -56,6 +56,7 @@ By default, we use 2PC commit to ensure `exactly-once`
 | xml_row_tag | string | no | RECORD | Only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Only used when file_format is xml. |
 | enable_header_write | boolean | no | false | Only used when file_format_type is text,csv.<br/> false:don't write header,true:write header. |
+| encoding | string | no | "UTF-8" | Only used when file_format_type is json,text,csv,xml. |

 ### path [string]

@@ -188,6 +189,11 @@ Specifies Whether to process data using the tag attribute format.

 Only used when file_format_type is text,csv.false:don't write header,true:write header.

+### encoding [string]
+
+Only used when file_format_type is json,text,csv,xml.
+The encoding of the file to write. This param will be parsed by `Charset.forName(encoding)`.
+
 ## Example

 For orc file format simple config
@@ -201,6 +207,18 @@ LocalFile {

 ```

+For json, text, csv or xml file format with `encoding`
+
+```hocon
+LocalFile {
+    path = "/tmp/hive/warehouse/test2"
+    file_format_type = "text"
+    encoding = "gbk"
+}
+```
+
 For parquet file format with `sink_columns`

 ```bash

Diff for: docs/en/connector-v2/sink/OssFile.md (+6)

@@ -112,6 +112,7 @@ If write to `csv`, `text` file type, All column will be string.
 | xml_root_tag | string | no | RECORDS | Only used when file_format is xml. |
 | xml_row_tag | string | no | RECORD | Only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Only used when file_format is xml. |
+| encoding | string | no | "UTF-8" | Only used when file_format_type is json,text,csv,xml. |

 ### path [string]

@@ -256,6 +257,11 @@ Specifies the tag name of the data rows within the XML file.

 Specifies Whether to process data using the tag attribute format.

+### encoding [string]
+
+Only used when file_format_type is json,text,csv,xml.
+The encoding of the file to write. This param will be parsed by `Charset.forName(encoding)`.
+
 ## How to Create an Oss Data Synchronization Jobs

 The following example demonstrates how to create a data synchronization job that reads data from Fake Source and writes it to the Oss:
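Since every one of these docs states that the option "will be parsed by `Charset.forName(encoding)`", the lookup semantics of that JDK method determine what config values are valid. A short sketch (class name is illustrative) of the behavior a user can rely on: names are case-insensitive, aliases resolve to canonical names, and unknown names fail fast:

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetLookupSketch {
    public static void main(String[] args) {
        // Lookup is case-insensitive and resolves aliases to canonical names,
        // so "gbk" in a config file is equivalent to "GBK".
        System.out.println(Charset.forName("gbk").name());   // GBK
        System.out.println(Charset.forName("utf-8").name()); // UTF-8

        // A legally-formed name the JVM does not know throws
        // UnsupportedCharsetException, which is how a typo in the
        // `encoding` option would surface at job startup.
        try {
            Charset.forName("not-a-charset");
        } catch (UnsupportedCharsetException e) {
            System.out.println("unsupported: " + e.getCharsetName());
        }
    }
}
```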

Diff for: docs/en/connector-v2/sink/OssJindoFile.md (+6)

@@ -65,6 +65,7 @@ By default, we use 2PC commit to ensure `exactly-once`
 | xml_root_tag | string | no | RECORDS | Only used when file_format is xml. |
 | xml_row_tag | string | no | RECORD | Only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Only used when file_format is xml. |
+| encoding | string | no | "UTF-8" | Only used when file_format_type is json,text,csv,xml. |

 ### path [string]

@@ -209,6 +210,11 @@ Specifies the tag name of the data rows within the XML file.

 Specifies Whether to process data using the tag attribute format.

+### encoding [string]
+
+Only used when file_format_type is json,text,csv,xml.
+The encoding of the file to write. This param will be parsed by `Charset.forName(encoding)`.
+
 ## Example

 For text file format with `have_partition` and `custom_filename` and `sink_columns`

Diff for: docs/en/connector-v2/sink/S3File.md (+6)

@@ -123,6 +123,7 @@ If write to `csv`, `text` file type, All column will be string.
 | hadoop_s3_properties | map | no | | If you need to add a other option, you could add it here and refer to this [link](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html) |
 | schema_save_mode | Enum | no | CREATE_SCHEMA_WHEN_NOT_EXIST | Before turning on the synchronous task, do different treatment of the target path |
 | data_save_mode | Enum | no | APPEND_DATA | Before opening the synchronous task, the data file in the target path is differently processed |
+| encoding | string | no | "UTF-8" | Only used when file_format_type is json,text,csv,xml. |

 ### path [string]

@@ -278,6 +279,11 @@ Option introduction:
 `APPEND_DATA`:use the path, and add new files in the path for write data.
 `ERROR_WHEN_DATA_EXISTS`:When there are some data files in the path, an error will is reported.

+### encoding [string]
+
+Only used when file_format_type is json,text,csv,xml.
+The encoding of the file to write. This param will be parsed by `Charset.forName(encoding)`.
+
 ## Example

 ### Simple:

Diff for: docs/en/connector-v2/sink/SftpFile.md (+6)

@@ -59,6 +59,7 @@ By default, we use 2PC commit to ensure `exactly-once`
 | xml_root_tag | string | no | RECORDS | Only used when file_format is xml. |
 | xml_row_tag | string | no | RECORD | Only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Only used when file_format is xml. |
+| encoding | string | no | "UTF-8" | Only used when file_format_type is json,text,csv,xml. |

 ### host [string]

@@ -203,6 +204,11 @@ Specifies the tag name of the data rows within the XML file.

 Specifies Whether to process data using the tag attribute format.

+### encoding [string]
+
+Only used when file_format_type is json,text,csv,xml.
+The encoding of the file to write. This param will be parsed by `Charset.forName(encoding)`.
+
 ## Example

 For text file format with `have_partition` and `custom_filename` and `sink_columns`

Diff for: docs/en/connector-v2/source/CosFile.md (+6)

@@ -65,6 +65,7 @@ To use this connector you need put hadoop-cos-{hadoop.version}-{version}.jar and
 | xml_use_attr_format | boolean | no | - |
 | file_filter_pattern | string | no | - |
 | compress_codec | string | no | none |
+| encoding | string | no | UTF-8 |
 | common-options | | no | - |

 ### path [string]
@@ -277,6 +278,11 @@ The compress codec of files and the details that supported as the following show
 - orc/parquet:
 automatically recognizes the compression type, no additional settings required.

+### encoding [string]
+
+Only used when file_format_type is json,text,csv,xml.
+The encoding of the file to read. This param will be parsed by `Charset.forName(encoding)`.
+
 ### common options

 Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.

Diff for: docs/en/connector-v2/source/FtpFile.md (+6)

@@ -59,6 +59,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
 | xml_use_attr_format | boolean | no | - |
 | file_filter_pattern | string | no | - |
 | compress_codec | string | no | none |
+| encoding | string | no | UTF-8 |
 | common-options | | no | - |

 ### host [string]
@@ -258,6 +259,11 @@ The compress codec of files and the details that supported as the following show
 - orc/parquet:
 automatically recognizes the compression type, no additional settings required.

+### encoding [string]
+
+Only used when file_format_type is json,text,csv,xml.
+The encoding of the file to read. This param will be parsed by `Charset.forName(encoding)`.
+
 ### common options

 Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.

Diff for: docs/en/connector-v2/source/HdfsFile.md (+6)

@@ -62,6 +62,7 @@ Read data from hdfs file system.
 | xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file, only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
 | compress_codec | string | no | none | The compress codec of files |
+| encoding | string | no | UTF-8 |
 | common-options | | no | - | Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details. |

 ### delimiter/field_delimiter [string]
@@ -78,6 +79,11 @@ The compress codec of files and the details that supported as the following show
 - orc/parquet:
 automatically recognizes the compression type, no additional settings required.

+### encoding [string]
+
+Only used when file_format_type is json,text,csv,xml.
+The encoding of the file to read. This param will be parsed by `Charset.forName(encoding)`.
+
 ### Tips

 > If you use spark/flink, In order to use this connector, You must ensure your spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x. If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
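On the source side, the option matters because decoding a non-UTF-8 file with the UTF-8 default either garbles the text or fails outright. A hedged sketch of the read-side effect (class name and sample text are illustrative; `new String(bytes, charset)` is used here because it substitutes replacement characters for malformed input rather than throwing):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodedReadSketch {
    public static void main(String[] args) throws IOException {
        // Simulate a GBK-encoded source file, e.g. a legacy export.
        Charset gbk = Charset.forName("GBK");
        Path file = Files.createTempFile("encoding-demo", ".txt");
        Files.write(file, "编码测试".getBytes(gbk));

        byte[] raw = Files.readAllBytes(file);
        // Decoding with the configured charset recovers the original text;
        // decoding the same bytes as UTF-8 yields replacement characters instead,
        // which is the failure mode the `encoding` option exists to prevent.
        System.out.println(new String(raw, gbk).equals("编码测试"));                    // true
        System.out.println(new String(raw, StandardCharsets.UTF_8).equals("编码测试")); // false
    }
}
```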

Diff for: docs/en/connector-v2/source/LocalFile.md (+18)

@@ -59,6 +59,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
 | xml_use_attr_format | boolean | no | - |
 | file_filter_pattern | string | no | - |
 | compress_codec | string | no | none |
+| encoding | string | no | UTF-8 |
 | common-options | | no | - |
 | tables_configs | list | no | used to define a multiple table task |

@@ -256,6 +257,11 @@ The compress codec of files and the details that supported as the following show
 - orc/parquet:
 automatically recognizes the compression type, no additional settings required.

+### encoding [string]
+
+Only used when file_format_type is json,text,csv,xml.
+The encoding of the file to read. This param will be parsed by `Charset.forName(encoding)`.
+
 ### common options

 Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details
@@ -292,6 +298,18 @@ LocalFile {

 ```

+For json, text or csv file format with `encoding`
+
+```hocon
+LocalFile {
+    path = "/tmp/hive/warehouse/test2"
+    file_format_type = "text"
+    encoding = "gbk"
+}
+```
+
 ### Multiple Table

 ```hocon
