Commit 74db1cb

[Feature][File] Support extract CSV files with different columns in different order (#9064)
1 parent 53325aa commit 74db1cb

17 files changed: +228 -52 lines changed
Diff for: docs/en/connector-v2/source/CosFile.md

+5
@@ -66,6 +66,7 @@ To use this connector you need put hadoop-cos-{hadoop.version}-{version}.jar and
 | sheet_name | string | no | - |
 | xml_row_tag | string | no | - |
 | xml_use_attr_format | boolean | no | - |
+| csv_use_header_line | boolean | no | false |
 | file_filter_pattern | string | no | - |
 | filename_extension | string | no | - |
 | compress_codec | string | no | none |
@@ -274,6 +275,10 @@ Only need to be configured when file_format is xml.
 
 Specifies Whether to process data using the tag attribute format.
 
+### csv_use_header_line [boolean]
+
+Whether to use the header line to parse the file, only used when the file_format is `csv` and the file contains the header line that match RFC 4180
+
 ### file_filter_pattern [string]
 
 Filter pattern, which used for filtering files.
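
The `csv_use_header_line` description above is brief, so a small sketch of header-driven parsing may help show why it lets files with different columns, or the same columns in a different order, share one schema. This is not SeaTunnel's implementation: the function `read_by_header`, the sample data, and the choice to fill missing columns with `None` are all assumptions made for illustration.

```python
import csv
import io


def read_by_header(text, wanted):
    """Yield one tuple per data row, ordered by `wanted`.

    Hypothetical helper: columns are located by the names in the CSV
    header line rather than by position.
    """
    reader = csv.DictReader(io.StringIO(text))
    for record in reader:
        # Assumption: a column absent from this file is returned as None.
        yield tuple(record.get(name) for name in wanted)


# Two files describing the same entity with different column layouts.
file_a = "id,name,age\n1,Ann,30\n"
file_b = "age,id\n41,2\n"

schema = ["id", "name", "age"]
rows = list(read_by_header(file_a, schema)) + list(read_by_header(file_b, schema))
print(rows)
```

With positional parsing, the second file's `age` value would land in the `id` slot; matching on header names avoids that.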

Diff for: docs/en/connector-v2/source/FtpFile.md

+5
@@ -60,6 +60,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
 | sheet_name | string | no | - |
 | xml_row_tag | string | no | - |
 | xml_use_attr_format | boolean | no | - |
+| csv_use_header_line | boolean | no | - |
 | file_filter_pattern | string | no | - |
 | filename_extension | string | no | - |
 | compress_codec | string | no | none |
@@ -317,6 +318,10 @@ Only need to be configured when file_format is xml.
 
 Specifies Whether to process data using the tag attribute format.
 
+### csv_use_header_line [boolean]
+
+Whether to use the header line to parse the file, only used when the file_format is `csv` and the file contains the header line that match RFC 4180
+
 ### compress_codec [string]
 
 The compress codec of files and the details that supported as the following shown:

Diff for: docs/en/connector-v2/source/HdfsFile.md

+1
@@ -64,6 +64,7 @@ Read data from hdfs file system.
 | sheet_name | string | no | - | Reader the sheet of the workbook,Only used when file_format is excel. |
 | xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file, only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
+| csv_use_header_line | boolean | no | false | Whether to use the header line to parse the file, only used when the file_format is `csv` and the file contains the header line that match RFC 4180 |
 | file_filter_pattern | string | no | | Filter pattern, which used for filtering files. |
 | filename_extension | string | no | - | Filter filename extension, which used for filtering files with specific extension. Example: `csv` `.txt` `json` `.xml`. |
 | compress_codec | string | no | none | The compress codec of files |

Diff for: docs/en/connector-v2/source/LocalFile.md

+5
@@ -61,6 +61,7 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
 | excel_engine | string | no | POI |
 | xml_row_tag | string | no | - |
 | xml_use_attr_format | boolean | no | - |
+| csv_use_header_line | boolean | no | false |
 | file_filter_pattern | string | no | - |
 | filename_extension | string | no | - |
 | compress_codec | string | no | none |
@@ -265,6 +266,10 @@ Only need to be configured when file_format is xml.
 
 Specifies Whether to process data using the tag attribute format.
 
+### csv_use_header_line [boolean]
+
+Whether to use the header line to parse the file, only used when the file_format is `csv` and the file contains the header line that match RFC 4180
+
 ### file_filter_pattern [string]
 
 Filter pattern, which used for filtering files.

Diff for: docs/en/connector-v2/source/OssFile.md

+2
@@ -194,10 +194,12 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
 | time_format | string | no | HH:mm:ss | Time type format, used to tell connector how to convert string to time, supported as the following formats:`HH:mm:ss` `HH:mm:ss.SSS` |
 | filename_extension | string | no | - | Filter filename extension, which used for filtering files with specific extension. Example: `csv` `.txt` `json` `.xml`. |
 | skip_header_row_number | long | no | 0 | Skip the first few lines, but only for the txt and csv. For example, set like following:`skip_header_row_number = 2`. Then SeaTunnel will skip the first 2 lines from source files |
+| csv_use_header_line | boolean | no | false | Whether to use the header line to parse the file, only used when the file_format is `csv` and the file contains the header line that match RFC 4180 |
 | schema | config | no | - | The schema of upstream data. |
 | sheet_name | string | no | - | Reader the sheet of the workbook,Only used when file_format is excel. |
 | xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file, only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
+| csv_use_header_line | boolean | no | false | Whether to use the header line to parse the file, only used when the file_format is `csv` and the file contains the header line that match RFC 4180 |
 | compress_codec | string | no | none | Which compress codec the files used. |
 | encoding | string | no | UTF-8 |
 | null_format | string | no | - | Only used when file_format_type is text. null_format to define which strings can be represented as null. e.g: `\N` |

Diff for: docs/en/connector-v2/source/OssJindoFile.md

+1
@@ -70,6 +70,7 @@ It only supports hadoop version **2.9.X+**.
 | sheet_name | string | no | - |
 | xml_row_tag | string | no | - |
 | xml_use_attr_format | boolean | no | - |
+| csv_use_header_line | boolean | no | false |
 | file_filter_pattern | string | no | |
 | compress_codec | string | no | none |
 | archive_compress_codec | string | no | none |

Diff for: docs/en/connector-v2/source/S3File.md

+2
@@ -201,10 +201,12 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
 | datetime_format | string | no | yyyy-MM-dd HH:mm:ss | Datetime type format, used to tell connector how to convert string to datetime, supported as the following formats:`yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss` |
 | time_format | string | no | HH:mm:ss | Time type format, used to tell connector how to convert string to time, supported as the following formats:`HH:mm:ss` `HH:mm:ss.SSS` |
 | skip_header_row_number | long | no | 0 | Skip the first few lines, but only for the txt and csv. For example, set like following:`skip_header_row_number = 2`. Then SeaTunnel will skip the first 2 lines from source files |
+| csv_use_header_line | boolean | no | false | Whether to use the header line to parse the file, only used when the file_format is `csv` and the file contains the header line that match RFC 4180 |
 | schema | config | no | - | The schema of upstream data. |
 | sheet_name | string | no | - | Reader the sheet of the workbook,Only used when file_format is excel. |
 | xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file, only valid for XML files. |
 | xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only valid for XML files. |
+| csv_use_header_line | boolean | no | false | Whether to use the header line to parse the file, only used when the file_format is `csv` and the file contains the header line that match RFC 4180 |
 | compress_codec | string | no | none | |
 | archive_compress_codec | string | no | none | |
 | encoding | string | no | UTF-8 | |

Diff for: docs/en/connector-v2/source/SftpFile.md

+1
@@ -93,6 +93,7 @@ The File does not have a specific type list, and we can indicate which SeaTunnel
 | sheet_name | String | No | - | Reader the sheet of the workbook,Only used when file_format is excel. |
 | xml_row_tag | string | no | - | Specifies the tag name of the data rows within the XML file, only used when file_format is xml. |
 | xml_use_attr_format | boolean | no | - | Specifies whether to process data using the tag attribute format, only used when file_format is xml. |
+| csv_use_header_line | boolean | no | false | Whether to use the header line to parse the file, only used when the file_format is `csv` and the file contains the header line that match RFC 4180 |
 | schema | Config | No | - | Please check #schema below |
 | compress_codec | String | No | None | The compress codec of files and the details that supported as the following shown: <br/> - txt: `lzo` `None` <br/> - json: `lzo` `None` <br/> - csv: `lzo` `None` <br/> - orc: `lzo` `snappy` `lz4` `zlib` `None` <br/> - parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `None` <br/> Tips: excel type does Not support any compression format |
 | archive_compress_codec | string | no | none |
