Commit ec533ec

Add support for XML file type to various file connectors such as SFTP, FTP, LocalFile, HdfsFile, and more. (apache#6327)
1 parent e1a81ac commit ec533ec

54 files changed, +1421 -53 lines changed

Diff for: docs/en/connector-v2/sink/CosFile.md

+17 -1
@@ -29,6 +29,7 @@ By default, we use 2PC commit to ensure `exactly-once`
 - [x] orc
 - [x] json
 - [x] excel
+- [x] xml
 
 ## Options
 
@@ -57,6 +58,9 @@ By default, we use 2PC commit to ensure `exactly-once`
 | common-options | object | no | - | |
 | max_rows_in_memory | int | no | - | Only used when file_format is excel. |
 | sheet_name | string | no | Sheet${Random number} | Only used when file_format is excel. |
+| xml_root_tag | string | no | RECORDS | Only used when file_format is xml. |
+| xml_row_tag | string | no | RECORD | Only used when file_format is xml. |
+| xml_use_attr_format | boolean | no | - | Only used when file_format is xml. |
 
 ### path [string]
 
@@ -110,7 +114,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file
 
 We supported as the following file types:
 
-`text` `json` `csv` `orc` `parquet` `excel`
+`text` `json` `csv` `orc` `parquet` `excel` `xml`
 
 Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`.
 
@@ -189,6 +193,18 @@ When File Format is Excel,The maximum number of data items that can be cached in
 
 Writer the sheet of the workbook
 
+### xml_root_tag [string]
+
+Specifies the tag name of the root element within the XML file.
+
+### xml_row_tag [string]
+
+Specifies the tag name of the data rows within the XML file.
+
+### xml_use_attr_format [boolean]
+
+Specifies whether to process data using the tag attribute format.
+
 ## Example
 
 For text file format with `have_partition` and `custom_filename` and `sink_columns`
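
To make the three new options concrete, here is a minimal Python sketch of how `xml_root_tag`, `xml_row_tag`, and `xml_use_attr_format` could shape the written file. It is not the connector's code: the function `rows_to_xml` is hypothetical, and it assumes that the attribute format writes each field as an attribute of the row element, while the default writes each field as a child element. The diff itself only says the option controls "the tag attribute format".

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, root_tag="RECORDS", row_tag="RECORD", use_attr_format=False):
    """Serialize a list of dicts to an XML string (illustrative only).

    root_tag / row_tag mirror the documented defaults of xml_root_tag and
    xml_row_tag. The meaning given to use_attr_format is an assumption.
    """
    root = ET.Element(root_tag)
    for row in rows:
        record = ET.SubElement(root, row_tag)
        for name, value in row.items():
            if use_attr_format:
                # Assumed attribute format: <RECORD id="1" name="a" />
                record.set(name, str(value))
            else:
                # Assumed element format: <RECORD><id>1</id>...</RECORD>
                ET.SubElement(record, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [{"id": 1, "name": "a"}]
    print(rows_to_xml(sample))
    print(rows_to_xml(sample, use_attr_format=True))
```

Running the script prints the same sample row in both layouts, which is a quick way to see what a downstream XML reader would need to handle under each setting.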

Diff for: docs/en/connector-v2/sink/FtpFile.md

+17 -1
@@ -27,6 +27,7 @@ By default, we use 2PC commit to ensure `exactly-once`
 - [x] orc
 - [x] json
 - [x] excel
+- [x] xml
 
 ## Options
 
@@ -56,6 +57,9 @@ By default, we use 2PC commit to ensure `exactly-once`
 | common-options | object | no | - | |
 | max_rows_in_memory | int | no | - | Only used when file_format_type is excel. |
 | sheet_name | string | no | Sheet${Random number} | Only used when file_format_type is excel. |
+| xml_root_tag | string | no | RECORDS | Only used when file_format is xml. |
+| xml_row_tag | string | no | RECORD | Only used when file_format is xml. |
+| xml_use_attr_format | boolean | no | - | Only used when file_format is xml. |
 
 ### host [string]
 
@@ -115,7 +119,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file
 
 We supported as the following file types:
 
-`text` `json` `csv` `orc` `parquet` `excel`
+`text` `json` `csv` `orc` `parquet` `excel` `xml`
 
 Please note that, The final file name will end with the file_format_type's suffix, the suffix of the text file is `txt`.
 
@@ -194,6 +198,18 @@ When File Format is Excel,The maximum number of data items that can be cached in
 
 Writer the sheet of the workbook
 
+### xml_root_tag [string]
+
+Specifies the tag name of the root element within the XML file.
+
+### xml_row_tag [string]
+
+Specifies the tag name of the data rows within the XML file.
+
+### xml_use_attr_format [boolean]
+
+Specifies whether to process data using the tag attribute format.
+
 ## Example
 
 For text file format simple config

Diff for: docs/en/connector-v2/sink/HdfsFile.md

+5 -1
@@ -21,6 +21,7 @@ By default, we use 2PC commit to ensure `exactly-once`
 - [x] orc
 - [x] json
 - [x] excel
+- [x] xml
 - [x] compress codec
 - [x] lzo
 
@@ -45,7 +46,7 @@ Output data to hdfs file
 | custom_filename | boolean | no | false | Whether you need custom the filename |
 | file_name_expression | string | no | "${transactionId}" | Only used when `custom_filename` is `true`.`file_name_expression` describes the file expression which will be created into the `path`. We can add the variable `${now}` or `${uuid}` in the `file_name_expression`, like `test_${uuid}_${now}`,`${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`.Please note that, If `is_enable_transaction` is `true`, we will auto add `${transactionId}_` in the head of the file. |
 | filename_time_format | string | no | "yyyy.MM.dd" | Only used when `custom_filename` is `true`.When the format in the `file_name_expression` parameter is `xxxx-${now}` , `filename_time_format` can specify the time format of the path, and the default value is `yyyy.MM.dd` . The commonly used time formats are listed as follows:[y:Year,M:Month,d:Day of month,H:Hour in day (0-23),m:Minute in hour,s:Second in minute] |
-| file_format_type | string | no | "csv" | We supported as the following file types:`text` `json` `csv` `orc` `parquet` `excel`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
+| file_format_type | string | no | "csv" | We supported as the following file types:`text` `json` `csv` `orc` `parquet` `excel` `xml`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
 | field_delimiter | string | no | '\001' | Only used when file_format is text,The separator between columns in a row of data. Only needed by `text` file format. |
 | row_delimiter | string | no | "\n" | Only used when file_format is text,The separator between rows in a file. Only needed by `text` file format. |
 | have_partition | boolean | no | false | Whether you need processing partitions. |
@@ -63,6 +64,9 @@ Output data to hdfs file
 | common-options | object | no | - | Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details |
 | max_rows_in_memory | int | no | - | Only used when file_format is excel.When File Format is Excel,The maximum number of data items that can be cached in the memory. |
 | sheet_name | string | no | Sheet${Random number} | Only used when file_format is excel.Writer the sheet of the workbook |
+| xml_root_tag | string | no | RECORDS | Only used when file_format is xml, specifies the tag name of the root element within the XML file. |
+| xml_row_tag | string | no | RECORD | Only used when file_format is xml, specifies the tag name of the data rows within the XML file |
+| xml_use_attr_format | boolean | no | - | Only used when file_format is xml, specifies whether to process data using the tag attribute format. |
 
 ### Tips
 
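
The `file_name_expression` and `filename_time_format` rows in the HdfsFile table describe a small templating scheme, sketched below in Python. Everything here is illustrative rather than taken from the connector: `format_now` and `expand_file_name` are hypothetical helpers, the pattern translation covers only the six letters the table lists (y, M, d, H, m, s) in their usual two- or four-letter forms, and the `${uuid}` rendering is an assumption.

```python
import re
import uuid
from datetime import datetime

# Translation of the Java-style pattern letters listed in the table to
# strftime codes. Only these six tokens are handled in this sketch.
_JAVA_TO_STRFTIME = {
    "yyyy": "%Y", "MM": "%m", "dd": "%d",
    "HH": "%H", "mm": "%M", "ss": "%S",
}


def format_now(now, java_pattern="yyyy.MM.dd"):
    """Format `now` with a Java-style pattern such as the default yyyy.MM.dd."""
    py_pattern = re.sub(
        "yyyy|MM|dd|HH|mm|ss",
        lambda m: _JAVA_TO_STRFTIME[m.group()],
        java_pattern,
    )
    return now.strftime(py_pattern)


def expand_file_name(expression, transaction_id, now,
                     time_format="yyyy.MM.dd", enable_transaction=True):
    """Expand ${now} and ${uuid}, then prepend the transaction id if enabled."""
    name = expression.replace("${now}", format_now(now, time_format))
    name = name.replace("${uuid}", uuid.uuid4().hex)  # uuid format is assumed
    if enable_transaction:
        # The table says "${transactionId}_" is added at the head of the file name.
        name = f"{transaction_id}_{name}"
    return name


if __name__ == "__main__":
    print(expand_file_name("test_${uuid}_${now}", "T_1", datetime.now()))
```

The single-pass `re.sub` matters here: chaining `str.replace` calls could rewrite the `m` inside an already inserted `%m`.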

Diff for: docs/en/connector-v2/sink/LocalFile.md

+17 -1
@@ -27,6 +27,7 @@ By default, we use 2PC commit to ensure `exactly-once`
 - [x] orc
 - [x] json
 - [x] excel
+- [x] xml
 
 ## Options
 
@@ -51,6 +52,9 @@ By default, we use 2PC commit to ensure `exactly-once`
 | common-options | object | no | - | |
 | max_rows_in_memory | int | no | - | Only used when file_format_type is excel. |
 | sheet_name | string | no | Sheet${Random number} | Only used when file_format_type is excel. |
+| xml_root_tag | string | no | RECORDS | Only used when file_format is xml. |
+| xml_row_tag | string | no | RECORD | Only used when file_format is xml. |
+| xml_use_attr_format | boolean | no | - | Only used when file_format is xml. |
 | enable_header_write | boolean | no | false | Only used when file_format_type is text,csv.<br/> false:don't write header,true:write header. |
 
 ### path [string]
@@ -89,7 +93,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file
 
 We supported as the following file types:
 
-`text` `json` `csv` `orc` `parquet` `excel`
+`text` `json` `csv` `orc` `parquet` `excel` `xml`
 
 Please note that, The final file name will end with the file_format_type's suffix, the suffix of the text file is `txt`.
 
@@ -168,6 +172,18 @@ When File Format is Excel,The maximum number of data items that can be cached in
 
 Writer the sheet of the workbook
 
+### xml_root_tag [string]
+
+Specifies the tag name of the root element within the XML file.
+
+### xml_row_tag [string]
+
+Specifies the tag name of the data rows within the XML file.
+
+### xml_use_attr_format [boolean]
+
+Specifies whether to process data using the tag attribute format.
+
 ### enable_header_write [boolean]
 
 Only used when file_format_type is text,csv.false:don't write header,true:write header.

Diff for: docs/en/connector-v2/sink/OssFile.md

+17 -1
@@ -32,6 +32,7 @@ By default, we use 2PC commit to ensure `exactly-once`
 - [x] orc
 - [x] json
 - [x] excel
+- [x] xml
 
 ## Data Type Mapping
 
@@ -108,6 +109,9 @@ If write to `csv`, `text` file type, All column will be string.
 | common-options | object | no | - | |
 | max_rows_in_memory | int | no | - | Only used when file_format_type is excel. |
 | sheet_name | string | no | Sheet${Random number} | Only used when file_format_type is excel. |
+| xml_root_tag | string | no | RECORDS | Only used when file_format is xml. |
+| xml_row_tag | string | no | RECORD | Only used when file_format is xml. |
+| xml_use_attr_format | boolean | no | - | Only used when file_format is xml. |
 
 ### path [string]
 
@@ -161,7 +165,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${Now}` , `file
 
 We supported as the following file types:
 
-`text` `json` `csv` `orc` `parquet` `excel`
+`text` `json` `csv` `orc` `parquet` `excel` `xml`
 
 Please note that, The final file name will end with the file_format_type's suffix, the suffix of the text file is `txt`.
 
@@ -240,6 +244,18 @@ When File Format is Excel,The maximum number of data items that can be cached in
 
 Writer the sheet of the workbook
 
+### xml_root_tag [string]
+
+Specifies the tag name of the root element within the XML file.
+
+### xml_row_tag [string]
+
+Specifies the tag name of the data rows within the XML file.
+
+### xml_use_attr_format [boolean]
+
+Specifies whether to process data using the tag attribute format.
+
 ## How to Create an Oss Data Synchronization Jobs
 
 The following example demonstrates how to create a data synchronization job that reads data from Fake Source and writes it to the Oss:
