
Commit a8a514c

Merge branch 'apache:dev' into dev-doris-redirect
2 parents e9fd761 + 93bba1a commit a8a514c

57 files changed

Lines changed: 3924 additions & 56 deletions


docs/en/connectors/sink/Iceberg.md

Lines changed: 53 additions & 0 deletions
@@ -80,6 +80,23 @@ libfb303-xxx.jar
| data_save_mode | Enum | no | APPEND_DATA | the data save mode, please refer to `data_save_mode` below |
| custom_sql | string | no | - | Custom `delete` data sql for data save mode. e.g: `delete from ... where ...` |
| iceberg.table.commit-branch | string | no | - | Default branch for commits |
| krb5_path | string | no | /etc/krb5.conf | The path of `krb5.conf`, used for Kerberos authentication. |
| kerberos_principal | string | no | - | The principal for Kerberos authentication. |
| kerberos_keytab_path | string | no | - | The keytab file path for Kerberos authentication. |

## Sink Option descriptions

### krb5_path [string]

The path of `krb5.conf`, used for Kerberos authentication.

### kerberos_principal [string]

The principal for Kerberos authentication.

### kerberos_keytab_path [string]

The keytab file path for Kerberos authentication.
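The `primary/instance@REALM` shape expected for `kerberos_principal` (e.g. `hive/your_host@EXAMPLE.COM`) can be sanity-checked before a job is submitted. A minimal sketch; the `looksLikePrincipal` helper and its regex are illustrative assumptions, not part of the connector:

```java
import java.util.regex.Pattern;

public class PrincipalCheck {
    // primary "/" instance "@" REALM, e.g. hive/your_host@EXAMPLE.COM.
    // Hypothetical validation, not the connector's own parsing.
    private static final Pattern PRINCIPAL =
            Pattern.compile("^[^/@]+/[^/@]+@[A-Za-z0-9.-]+$");

    public static boolean looksLikePrincipal(String s) {
        return s != null && PRINCIPAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePrincipal("hive/your_host@EXAMPLE.COM")); // true
        System.out.println(looksLikePrincipal("hive@EXAMPLE.COM"));           // false: no instance part
    }
}
```

Failing fast on a malformed principal is cheaper than diagnosing a Kerberos login error at runtime.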
## Task Example

@@ -234,6 +251,42 @@ sink {
}
```

### Kerberos Authentication

The following example demonstrates how to configure the Iceberg sink with Kerberos authentication when using a Hadoop catalog with HDFS:

```hocon
sink {
  Iceberg {
    catalog_name = "seatunnel_test"
    iceberg.catalog.config = {
      type = "hadoop"
      warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    }
    namespace = "seatunnel_namespace"
    table = "iceberg_sink_table"
    iceberg.table.write-props = {
      write.format.default = "parquet"
      write.target-file-size-bytes = 536870912
    }
    krb5_path = "/etc/krb5.conf"
    kerberos_principal = "hive/your_host@EXAMPLE.COM"
    kerberos_keytab_path = "/path/to/your.keytab"
    iceberg.table.primary-keys = "id"
    iceberg.table.partition-keys = "f_datetime"
    iceberg.table.upsert-mode-enabled = true
    iceberg.table.schema-evolution-enabled = true
    case_sensitive = true
  }
}
```

Description:

- `krb5_path`: The path to the `krb5.conf` file used for Kerberos authentication.
- `kerberos_principal`: The principal for Kerberos authentication, in the format `primary/instance@REALM`.
- `kerberos_keytab_path`: The keytab file path for Kerberos authentication.
### Multiple table

#### example1

docs/en/connectors/source/FtpFile.md

Lines changed: 69 additions & 1 deletion
@@ -76,6 +76,9 @@ If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you
| null_format | string | no | - |
| binary_chunk_size | int | no | 1024 |
| binary_complete_file_mode | boolean | no | false |
| discovery_mode | string | no | once |
| scan_interval | string | no | 10S |
| start_mode | string | no | earliest |
| sync_mode | string | no | full |
| target_path | string | no | - |
| target_hadoop_conf | map | no | - |
@@ -452,6 +455,26 @@ Only used when file_format_type is binary.

Whether to read the complete file as a single chunk instead of splitting into chunks. When enabled, the entire file content will be read into memory at once. Default is false.

### discovery_mode [string]

File discovery mode. Supported values: `once` (default), `continuous`.

- `once`: enumerate current files once and finish (bounded).
- `continuous`: keep scanning the path and processing new/changed files at runtime (unbounded).

In the current implementation, `discovery_mode=continuous` requires `sync_mode=update` (binary only) to avoid repeated transfers.

### scan_interval [string]

Only used when `discovery_mode=continuous`. Scan interval for periodic discovery; the value must be greater than `0`. Recommended shorthand format `10S`, `30S` (case-insensitive, e.g. `10s`); ISO-8601 format `PT10S`, `PT30S` is also supported. Default is `10S`.

### start_mode [string]

Only used when `discovery_mode=continuous`. Supported values: `earliest` (default), `latest`.

- `earliest`: read existing files on startup.
- `latest`: only process files modified after the job starts.
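The two accepted `scan_interval` notations can be normalized to one representation by prefixing shorthand values with `PT` and delegating to `java.time.Duration`. A sketch under that assumption; it mirrors the documented formats but is not the connector's actual parser:

```java
import java.time.Duration;
import java.time.format.DateTimeParseException;

public class ScanInterval {
    /**
     * Sketch of scan_interval handling, assuming the documented formats:
     * shorthand like "10S"/"10s" and ISO-8601 like "PT10S".
     * Hypothetical helper, not the connector's implementation.
     */
    public static Duration parseScanInterval(String value) {
        String v = value.trim().toUpperCase();
        Duration d;
        try {
            // ISO-8601 values already start with "P"; shorthand "10S" becomes "PT10S".
            d = v.startsWith("P") ? Duration.parse(v) : Duration.parse("PT" + v);
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException("invalid scan_interval: " + value, e);
        }
        if (d.isZero() || d.isNegative()) {
            throw new IllegalArgumentException("scan_interval must be greater than 0");
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(parseScanInterval("10s"));   // PT10S
        System.out.println(parseScanInterval("PT30S")); // PT30S
    }
}
```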
### sync_mode [string]

File sync mode. Supported values: `full` (default), `update`.
@@ -669,6 +692,52 @@ sink {
}
```

### Continuous Discovery (discovery_mode=continuous)

`discovery_mode=continuous` keeps the job running and periodically scans the path for new/changed files (long-running job, recommended to run with `job.mode="STREAMING"`).

**Note:** `discovery_mode=continuous` currently requires `sync_mode="update"` (binary-only) to avoid repeated transfers without keeping an unbounded "seen" state. `target_path` should align with the sink `path` on the same filesystem.

```hocon
env {
  parallelism = 1
  job.mode = "STREAMING"
}

source {
  FtpFile {
    host = "192.168.31.48"
    port = 21
    user = tyrantlucifer
    password = tianchao

    path = "/seatunnel/watch/src/"
    file_format_type = "binary"

    discovery_mode = "continuous"
    scan_interval = "10S"
    start_mode = "latest"

    sync_mode = "update"
    target_path = "/seatunnel/watch/dst/"
    update_strategy = "distcp"
    compare_mode = "len_mtime"
  }
}

sink {
  FtpFile {
    host = "192.168.31.48"
    port = 21
    user = tyrantlucifer
    password = tianchao

    path = "/seatunnel/watch/dst/"
    tmp_path = "/seatunnel/watch/dst-tmp/"
    file_format_type = "binary"
  }
}
```
### Filter File

```hocon
@@ -699,4 +768,3 @@ sink {
## Changelog

<ChangeLog />
docs/en/connectors/source/HdfsFile.md

Lines changed: 96 additions & 0 deletions
@@ -80,6 +80,9 @@ Read data from hdfs file system.
| null_format | string | no | - | Only used when file_format_type is text. null_format to define which strings can be represented as null. e.g: `\N` |
| binary_chunk_size | int | no | 1024 | Only used when file_format_type is binary. The chunk size (in bytes) for reading binary files. Default is 1024 bytes. Larger values may improve performance for large files but use more memory. |
| binary_complete_file_mode | boolean | no | false | Only used when file_format_type is binary. Whether to read the complete file as a single chunk instead of splitting into chunks. When enabled, the entire file content will be read into memory at once. Default is false. |
| discovery_mode | string | no | once | File discovery mode. Supported values: `once` (default), `continuous`. When `continuous`, the source keeps scanning the path and processes new/changed files at runtime (unbounded). In the current implementation, `continuous` requires `sync_mode=update` (binary only). |
| scan_interval | string | no | 10S | Only used when `discovery_mode=continuous`. Scan interval for periodic discovery; recommended shorthand format `10S`, `30S`; ISO-8601 format `PT10S`, `PT30S` is also supported. |
| start_mode | string | no | earliest | Only used when `discovery_mode=continuous`. Supported values: `earliest` (default), `latest`. |
| sync_mode | string | no | full | File sync mode. Supported values: `full`, `update`. When `update`, the source compares files between source/target and only reads new/changed files (currently only supports `file_format_type=binary`). |
| target_path | string | no | - | Only used when `sync_mode=update`. Target base path used for comparison (it should usually be the same as sink `path`). |
| target_hadoop_conf | map | no | - | Only used when `sync_mode=update`. Extra Hadoop configuration for the target filesystem. You can set `fs.defaultFS` in this map to override the target defaultFS. |
@@ -220,6 +223,26 @@ Only used when file_format_type is binary.

Whether to read the complete file as a single chunk instead of splitting into chunks. When enabled, the entire file content will be read into memory at once. Default is false.

### discovery_mode [string]

File discovery mode. Supported values: `once` (default), `continuous`.

- `once`: enumerate current files once and finish (bounded).
- `continuous`: keep scanning the path and processing new/changed files at runtime (unbounded).

In the current implementation, `discovery_mode=continuous` requires `sync_mode=update` (binary only) to avoid repeated transfers.

### scan_interval [string]

Only used when `discovery_mode=continuous`. Scan interval for periodic discovery; the value must be greater than `0`. Recommended shorthand format `10S`, `30S` (case-insensitive, e.g. `10s`); ISO-8601 format `PT10S`, `PT30S` is also supported. Default is `10S`.

### start_mode [string]

Only used when `discovery_mode=continuous`. Supported values: `earliest` (default), `latest`.

- `earliest`: read existing files on startup.
- `latest`: only process files modified after the job starts.
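The `earliest`/`latest` semantics above amount to a modification-time filter against the job start time. A minimal sketch, assuming a hypothetical `DiscoveredFile` record; this is not the connector's implementation:

```java
import java.util.List;
import java.util.stream.Collectors;

public class StartModeFilter {
    // Minimal stand-in for a discovered file: path plus last-modified millis.
    // Hypothetical type, for illustration only.
    record DiscoveredFile(String path, long modifiedAt) {}

    /**
     * Sketch of the documented start_mode semantics: "earliest" keeps all
     * existing files, "latest" keeps only files modified after job start.
     */
    static List<DiscoveredFile> applyStartMode(
            List<DiscoveredFile> scanned, String startMode, long jobStartMillis) {
        if ("latest".equalsIgnoreCase(startMode)) {
            return scanned.stream()
                    .filter(f -> f.modifiedAt() > jobStartMillis)
                    .collect(Collectors.toList());
        }
        return scanned; // "earliest": read existing files on startup
    }

    public static void main(String[] args) {
        var files = List.of(
                new DiscoveredFile("/seatunnel/watch/src/a.bin", 1_000L),
                new DiscoveredFile("/seatunnel/watch/src/b.bin", 3_000L));
        System.out.println(applyStartMode(files, "latest", 2_000L).size());   // 1
        System.out.println(applyStartMode(files, "earliest", 2_000L).size()); // 2
    }
}
```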
### sync_mode [string]

File sync mode. Supported values: `full` (default), `update`.
@@ -338,6 +361,79 @@ sink {
}
```

### Incremental Sync (sync_mode=update, binary)

`sync_mode=update` compares files between source and `target_path`, then only reads new/changed files (currently only supports `file_format_type=binary`).
In most cases, `target_path` should be aligned with sink `path` (same filesystem and same relative paths).

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  HdfsFile {
    path = "/seatunnel/update/src/"
    file_format_type = "binary"
    fs.defaultFS = "hdfs://namenode001"

    sync_mode = "update"
    target_path = "/seatunnel/update/dst/"
    update_strategy = "distcp"
    compare_mode = "len_mtime"
  }
}

sink {
  HdfsFile {
    fs.defaultFS = "hdfs://namenode001"
    path = "/seatunnel/update/dst/"
    tmp_path = "/seatunnel/update/tmp/"
    file_format_type = "binary"
  }
}
```
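A `len_mtime` comparison, as named by `compare_mode` in the example above, plausibly treats a file as unchanged when its length and modification time both match the target copy. The sketch below encodes that assumption with a hypothetical `FileMeta` record; the connector's exact rule may differ:

```java
public class LenMtimeCompare {
    // Minimal file-metadata stand-in (hypothetical, for illustration).
    record FileMeta(long length, long modifiedAt) {}

    /**
     * Sketch of a "len_mtime"-style decision for sync_mode=update:
     * transfer when the target is missing, or when length/mtime differ.
     * The actual comparison rule is an assumption here.
     */
    static boolean needsTransfer(FileMeta source, FileMeta target) {
        if (target == null) {
            return true; // new file: no counterpart under target_path
        }
        return source.length() != target.length()
                || source.modifiedAt() != target.modifiedAt();
    }

    public static void main(String[] args) {
        FileMeta src = new FileMeta(1024, 1_700_000_000_000L);
        System.out.println(needsTransfer(src, null));                                   // true
        System.out.println(needsTransfer(src, new FileMeta(1024, 1_700_000_000_000L))); // false
    }
}
```

Under this rule, only files failing the comparison are re-read, which is what keeps repeated runs (and continuous discovery) from transferring everything again.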
### Continuous Discovery (discovery_mode=continuous)

`discovery_mode=continuous` keeps the job running and periodically scans the path for new/changed files (long-running job, recommended to run with `job.mode="STREAMING"`).

**Note:** `discovery_mode=continuous` currently requires `sync_mode="update"` (binary-only) to avoid repeated transfers without keeping an unbounded "seen" state. `target_path` should align with the sink `path` on the same filesystem.

```hocon
env {
  parallelism = 1
  job.mode = "STREAMING"
}

source {
  HdfsFile {
    path = "/seatunnel/watch/src/"
    file_format_type = "binary"
    fs.defaultFS = "hdfs://namenode001"

    discovery_mode = "continuous"
    scan_interval = "10S"
    start_mode = "latest"

    sync_mode = "update"
    target_path = "/seatunnel/watch/dst/"
    update_strategy = "distcp"
    compare_mode = "len_mtime"
  }
}

sink {
  HdfsFile {
    fs.defaultFS = "hdfs://namenode001"
    path = "/seatunnel/watch/dst/"
    tmp_path = "/seatunnel/watch/tmp/"
    file_format_type = "binary"
  }
}
```
### Filter File

```hocon