Commit c140178

[Feature][Connector-V2] Supports the transfer of any file (#6826)
1 parent 1da9bd6 commit c140178

30 files changed

+520
-22
lines changed

Diff for: docs/en/connector-v2/sink/CosFile.md

+2-1
@@ -30,6 +30,7 @@ By default, we use 2PC commit to ensure `exactly-once`
3030
- [x] json
3131
- [x] excel
3232
- [x] xml
33+
- [x] binary
3334

3435
## Options
3536

@@ -115,7 +116,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file
115116

116117
We supported as the following file types:
117118

118-
`text` `json` `csv` `orc` `parquet` `excel` `xml`
119+
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
119120

120121
Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`.
121122

Diff for: docs/en/connector-v2/sink/FtpFile.md

+2-1
@@ -28,6 +28,7 @@ By default, we use 2PC commit to ensure `exactly-once`
2828
- [x] json
2929
- [x] excel
3030
- [x] xml
31+
- [x] binary
3132

3233
## Options
3334

@@ -120,7 +121,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file
120121

121122
We supported as the following file types:
122123

123-
`text` `json` `csv` `orc` `parquet` `excel` `xml`
124+
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
124125

125126
Please note that, The final file name will end with the file_format_type's suffix, the suffix of the text file is `txt`.
126127

Diff for: docs/en/connector-v2/sink/HdfsFile.md

+2-1
@@ -22,6 +22,7 @@ By default, we use 2PC commit to ensure `exactly-once`
2222
- [x] json
2323
- [x] excel
2424
- [x] xml
25+
- [x] binary
2526
- [x] compress codec
2627
- [x] lzo
2728

@@ -46,7 +47,7 @@ Output data to hdfs file
4647
| custom_filename | boolean | no | false | Whether you need custom the filename |
4748
| file_name_expression | string | no | "${transactionId}" | Only used when `custom_filename` is `true`.`file_name_expression` describes the file expression which will be created into the `path`. We can add the variable `${now}` or `${uuid}` in the `file_name_expression`, like `test_${uuid}_${now}`,`${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`.Please note that, If `is_enable_transaction` is `true`, we will auto add `${transactionId}_` in the head of the file. |
4849
| filename_time_format | string | no | "yyyy.MM.dd" | Only used when `custom_filename` is `true`.When the format in the `file_name_expression` parameter is `xxxx-${now}` , `filename_time_format` can specify the time format of the path, and the default value is `yyyy.MM.dd` . The commonly used time formats are listed as follows:[y:Year,M:Month,d:Day of month,H:Hour in day (0-23),m:Minute in hour,s:Second in minute] |
49-
| file_format_type | string | no | "csv" | We supported as the following file types:`text` `json` `csv` `orc` `parquet` `excel` `xml`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
50+
| file_format_type | string | no | "csv" | We supported as the following file types:`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
5051
| field_delimiter | string | no | '\001' | Only used when file_format is text,The separator between columns in a row of data. Only needed by `text` file format. |
5152
| row_delimiter | string | no | "\n" | Only used when file_format is text,The separator between rows in a file. Only needed by `text` file format. |
5253
| have_partition | boolean | no | false | Whether you need processing partitions. |

Diff for: docs/en/connector-v2/sink/LocalFile.md

+2-1
@@ -28,6 +28,7 @@ By default, we use 2PC commit to ensure `exactly-once`
2828
- [x] json
2929
- [x] excel
3030
- [x] xml
31+
- [x] binary
3132

3233
## Options
3334

@@ -94,7 +95,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file
9495

9596
We supported as the following file types:
9697

97-
`text` `json` `csv` `orc` `parquet` `excel` `xml`
98+
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
9899

99100
Please note that, The final file name will end with the file_format_type's suffix, the suffix of the text file is `txt`.
100101

Diff for: docs/en/connector-v2/sink/OssFile.md

+2-1
@@ -33,6 +33,7 @@ By default, we use 2PC commit to ensure `exactly-once`
3333
- [x] json
3434
- [x] excel
3535
- [x] xml
36+
- [x] binary
3637

3738
## Data Type Mapping
3839

@@ -166,7 +167,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${Now}` , `file
166167

167168
We supported as the following file types:
168169

169-
`text` `json` `csv` `orc` `parquet` `excel` `xml`
170+
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
170171

171172
Please note that, The final file name will end with the file_format_type's suffix, the suffix of the text file is `txt`.
172173

Diff for: docs/en/connector-v2/sink/OssJindoFile.md

+2-1
@@ -34,6 +34,7 @@ By default, we use 2PC commit to ensure `exactly-once`
3434
- [x] json
3535
- [x] excel
3636
- [x] xml
37+
- [x] binary
3738

3839
## Options
3940

@@ -119,7 +120,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file
119120

120121
We supported as the following file types:
121122

122-
`text` `json` `csv` `orc` `parquet` `excel` `xml`
123+
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
123124

124125
Please note that, The final file name will end with the file_format_type's suffix, the suffix of the text file is `txt`.
125126

Diff for: docs/en/connector-v2/sink/S3File.md

+2-1
@@ -23,6 +23,7 @@ By default, we use 2PC commit to ensure `exactly-once`
2323
- [x] json
2424
- [x] excel
2525
- [x] xml
26+
- [x] binary
2627

2728
## Description
2829

@@ -172,7 +173,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file
172173

173174
We supported as the following file types:
174175

175-
`text` `json` `csv` `orc` `parquet` `excel` `xml`
176+
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
176177

177178
Please note that, The final file name will end with the file_format_type's suffix, the suffix of the text file is `txt`.
178179

Diff for: docs/en/connector-v2/sink/SftpFile.md

+2-1
@@ -28,6 +28,7 @@ By default, we use 2PC commit to ensure `exactly-once`
2828
- [x] json
2929
- [x] excel
3030
- [x] xml
31+
- [x] binary
3132

3233
## Options
3334

@@ -113,7 +114,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file
113114

114115
We supported as the following file types:
115116

116-
`text` `json` `csv` `orc` `parquet` `excel` `xml`
117+
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
117118

118119
Please note that, The final file name will end with the file_format_type's suffix, the suffix of the text file is `txt`.
119120

Diff for: docs/en/connector-v2/source/CosFile.md

+40-1
@@ -27,6 +27,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
2727
- [x] json
2828
- [x] excel
2929
- [x] xml
30+
- [x] binary
3031

3132
## Description
3233

@@ -76,7 +77,7 @@ The source file path.
7677

7778
File type, supported as the following file types:
7879

79-
`text` `csv` `parquet` `orc` `json` `excel` `xml`
80+
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
8081

8182
If you assign file type to `json`, you should also assign schema option to tell connector how to parse data to the row you want.
8283

@@ -160,6 +161,11 @@ connector will generate data as the following:
160161
|---------------|-----|--------|
161162
| tyrantlucifer | 26 | male |
162163

If you assign file type to `binary`, SeaTunnel can synchronize files in any format,
such as compressed packages, pictures, etc. In short, any files can be synchronized to the target place.
Under this requirement, you need to ensure that the source and sink use `binary` format for file synchronization
at the same time. You can find the specific usage in the example below.

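The requirement that both ends use `binary` can be checked before a job is submitted. The following is a minimal sketch and not part of SeaTunnel: it assumes the job config has already been parsed into a nested dict (`section -> plugin name -> options`), the function name `binary_on_both_sides` is hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior.

```python
from typing import Any, Dict

# Hypothetical in-memory form of a job config: section -> plugin name -> options.
JobConfig = Dict[str, Dict[str, Dict[str, Any]]]


def binary_on_both_sides(job: JobConfig) -> bool:
    """Return True only if every source and every sink sets file_format_type to binary."""

    def all_binary(section: str) -> bool:
        plugins = job.get(section, {})
        # An empty or missing section cannot satisfy the requirement.
        return bool(plugins) and all(
            str(options.get("file_format_type", "")).lower() == "binary"
            for options in plugins.values()
        )

    return all_binary("source") and all_binary("sink")


job = {
    "source": {"CosFile": {"path": "/seatunnel/read/binary/", "file_format_type": "binary"}},
    "sink": {"CosFile": {"path": "/seatunnel/read/binary2/", "file_format_type": "binary"}},
}
print(binary_on_both_sides(job))
```

A config whose sink is set to another format, such as `text`, makes the function return `False`, which mirrors the mismatch the paragraph warns about.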
163169
### bucket [string]
164170

165171
The bucket address of Cos file system, for example: `Cos://tyrantlucifer-image-bed`
@@ -321,6 +327,39 @@ Source plugin common parameters, please refer to [Source Common Options](common-
321327
322328
```
323329

### Transfer Binary File

```hocon

env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  CosFile {
    bucket = "cosn://seatunnel-test-1259587829"
    secret_id = "xxxxxxxxxxxxxxxxxxx"
    secret_key = "xxxxxxxxxxxxxxxxxxx"
    region = "ap-chengdu"
    path = "/seatunnel/read/binary/"
    file_format_type = "binary"
  }
}
sink {
  // you can transfer local file to s3/hdfs/oss etc.
  CosFile {
    bucket = "cosn://seatunnel-test-1259587829"
    secret_id = "xxxxxxxxxxxxxxxxxxx"
    secret_key = "xxxxxxxxxxxxxxxxxxx"
    region = "ap-chengdu"
    path = "/seatunnel/read/binary2/"
    file_format_type = "binary"
  }
}

```

324363
## Changelog
325364

326365
### next version

Diff for: docs/en/connector-v2/source/FtpFile.md

+40-1
@@ -22,6 +22,7 @@
2222
- [x] json
2323
- [x] excel
2424
- [x] xml
25+
- [x] binary
2526

2627
## Description
2728

@@ -86,7 +87,7 @@ The source file path.
8687

8788
File type, supported as the following file types:
8889

89-
`text` `csv` `parquet` `orc` `json` `excel` `xml`
90+
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
9091

9192
If you assign file type to `json` , you should also assign schema option to tell connector how to parse data to the row you want.
9293

@@ -159,6 +160,11 @@ connector will generate data as the following:
159160
|---------------|-----|--------|
160161
| tyrantlucifer | 26 | male |
161162

If you assign file type to `binary`, SeaTunnel can synchronize files in any format,
such as compressed packages, pictures, etc. In short, any files can be synchronized to the target place.
Under this requirement, you need to ensure that the source and sink use `binary` format for file synchronization
at the same time. You can find the specific usage in the example below.

162168
### connection_mode [string]
163169

164170
The target ftp connection mode , default is active mode, supported as the following modes:
@@ -288,6 +294,39 @@ Source plugin common parameters, please refer to [Source Common Options](common-
288294
289295
```
290296

### Transfer Binary File

```hocon

env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  FtpFile {
    host = "192.168.31.48"
    port = 21
    user = tyrantlucifer
    password = tianchao
    path = "/seatunnel/read/binary/"
    file_format_type = "binary"
  }
}
sink {
  // you can transfer local file to s3/hdfs/oss etc.
  FtpFile {
    host = "192.168.31.48"
    port = 21
    user = tyrantlucifer
    password = tianchao
    path = "/seatunnel/read/binary2/"
    file_format_type = "binary"
  }
}

```

291330
## Changelog
292331

293332
### 2.2.0-beta 2022-09-26

Diff for: docs/en/connector-v2/source/HdfsFile.md

+2-1
@@ -27,6 +27,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
2727
- [x] json
2828
- [x] excel
2929
- [x] xml
30+
- [x] binary
3031

3132
## Description
3233

@@ -43,7 +44,7 @@ Read data from hdfs file system.
4344
| Name | Type | Required | Default | Description |
4445
|---------------------------|---------|----------|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
4546
| path | string | yes | - | The source file path. |
46-
| file_format_type | string | yes | - | We supported as the following file types:`text` `json` `csv` `orc` `parquet` `excel` `xml`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
47+
| file_format_type | string | yes | - | We supported as the following file types:`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
4748
| fs.defaultFS | string | yes | - | The hadoop cluster address that start with `hdfs://`, for example: `hdfs://hadoopcluster` |
4849
| read_columns | list | yes | - | The read column list of the data source, user can use it to implement field projection.The file type supported column projection as the following shown:[text,json,csv,orc,parquet,excel,xml].Tips: If the user wants to use this feature when reading `text` `json` `csv` files, the schema option must be configured. |
4950
| hdfs_site_path | string | no | - | The path of `hdfs-site.xml`, used to load ha configuration of namenodes |

Diff for: docs/en/connector-v2/source/LocalFile.md

+32-1
@@ -27,6 +27,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
2727
- [x] json
2828
- [x] excel
2929
- [x] xml
30+
- [x] binary
3031

3132
## Description
3233

@@ -71,7 +72,7 @@ The source file path.
7172

7273
File type, supported as the following file types:
7374

74-
`text` `csv` `parquet` `orc` `json` `excel` `xml`
75+
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
7576

7677
If you assign file type to `json`, you should also assign schema option to tell connector how to parse data to the row you want.
7778

@@ -155,6 +156,11 @@ connector will generate data as the following:
155156
|---------------|-----|--------|
156157
| tyrantlucifer | 26 | male |
157158

If you assign file type to `binary`, SeaTunnel can synchronize files in any format,
such as compressed packages, pictures, etc. In short, any files can be synchronized to the target place.
Under this requirement, you need to ensure that the source and sink use `binary` format for file synchronization
at the same time. You can find the specific usage in the example below.

158164
### read_columns [list]
159165

160166
The read column list of the data source, user can use it to implement field projection.
@@ -363,6 +369,31 @@ LocalFile {
363369
364370
```
365371

### Transfer Binary File

```hocon

env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  LocalFile {
    path = "/seatunnel/read/binary/"
    file_format_type = "binary"
  }
}
sink {
  // you can transfer local file to s3/hdfs/oss etc.
  LocalFile {
    path = "/seatunnel/read/binary2/"
    file_format_type = "binary"
  }
}

```

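As a rough mental model of what a `binary` transfer between two local directories should achieve, the sketch below copies every file byte for byte and compares SHA-256 digests afterwards. It is an illustration only and not SeaTunnel's implementation: the helper names `sha256` and `copy_tree_verbatim` are hypothetical, and keeping each file's relative path under the target directory is an assumption made for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Hex digest of a file's full contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copy_tree_verbatim(src_dir: Path, dst_dir: Path) -> int:
    """Copy every file under src_dir into dst_dir unchanged; return the file count."""
    copied = 0
    # Materialise the listing first so newly created files are never re-scanned.
    for src in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        dst = dst_dir / src.relative_to(src_dir)  # assumption: relative paths are kept
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        if sha256(src) != sha256(dst):
            raise IOError(f"checksum mismatch for {src}")
        copied += 1
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source, target = Path(tmp, "binary"), Path(tmp, "binary2")
        (source / "nested").mkdir(parents=True)
        (source / "nested" / "blob.bin").write_bytes(bytes(range(256)))
        print(copy_tree_verbatim(source, target))
```

A digest comparison like this is also a simple way to confirm, after a real job has run, that files under the sink `path` match the ones under the source `path`.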
366397
## Changelog
367398

368399
### 2.2.0-beta 2022-09-26
