Search before asking
- I have searched the existing feature requests and found no similar request.
Description
SeaTunnel currently focuses on structured and semi-structured data integration (e.g., reading text/CSV/JSON files from SFTP and writing their content to S3/Ceph). It cannot pass files through unmodified (whole-file transmission) between different file systems or storage protocols. The key limitations are:
- Binary file incompatibility: SeaTunnel parses files as text or structured data by default, which corrupts or garbles binary files (e.g., ZIP archives, images, videos, executables).
- Loss of original file attributes: SeaTunnel cannot retain the original file name, modification time, access permissions, file size, and other metadata during transmission.
- No whole-file transmission: The current pipeline processes data line by line or in batches instead of transmitting the entire file as a single unit. This is inefficient for large files.
- Limited support for file system protocols: For storage such as Ceph (CephFS/RGW), SeaTunnel relies on S3-compatible sinks and cannot interact directly with CephFS or other file system protocols for passthrough.
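To illustrate the first limitation, here is a minimal sketch of what happens when binary content is pushed through a text-oriented read/write path. It is not SeaTunnel code: `roundtrip_as_text` and `passthrough` are hypothetical helpers, and UTF-8 decoding with replacement of invalid bytes is only an assumed stand-in for a text parser.

```python
def roundtrip_as_text(data: bytes) -> bytes:
    """Decode bytes as UTF-8 text and re-encode, as a text-oriented reader might.

    Hypothetical stand-in for a text parser; invalid bytes become U+FFFD.
    """
    text = data.decode("utf-8", errors="replace")
    return text.encode("utf-8")


def passthrough(data: bytes) -> bytes:
    """Return the bytes unchanged, which is what whole-file passthrough requires."""
    return bytes(data)


if __name__ == "__main__":
    png_signature = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
    print(roundtrip_as_text(png_signature) == png_signature)  # False: 0x89 was replaced
    print(passthrough(png_signature) == png_signature)        # True
```

Plain ASCII content survives the text round trip, which is why the problem only shows up with binary files.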
Expected Feature (File System Direct Passthrough)
We propose adding a File System Passthrough feature to SeaTunnel that enables direct, whole-file transmission between different storage protocols without parsing or modifying file content. The core capabilities should include:
- Support for multiple storage protocols:
  - Source: SFTP, FTP, Local File System, HDFS, S3 (including Ceph RGW), CephFS, etc.
  - Sink: Ceph (RGW/CephFS), S3, Local File System, HDFS, SFTP, OSS, COS, etc.
- Whole-file transmission: Transmit the entire file as a single unit (no line-by-line parsing) so that binary files and large files are handled correctly and efficiently.
- Preserve file attributes:
  - Retain original file names (critical for business scenarios).
  - Preserve metadata (modification time, access time, file permissions, file size, etc.).
  - Support custom file name mapping (e.g., adding prefixes/suffixes, renaming rules) if needed.
- Batch and incremental transmission:
  - Support batch transmission of all files in a specified directory (including subdirectories).
  - Support incremental transmission (e.g., only transmit files that are new or modified since the last sync).
- Filter and control capabilities:
  - Support file filtering via wildcards (e.g., `*.log`, `data_*.zip`) or regular expressions.
  - Support skipping empty files, hidden files, or files larger/smaller than a specified size.
  - Support configurable overwrite policies (e.g., overwrite existing files, skip, or append).
- Seamless integration with existing SeaTunnel pipelines:
  - Provide a dedicated FilePassthrough Source/Sink plugin (or extend existing file connectors with a "passthrough mode").
  - Allow optional integration with Transform steps (e.g., adding file metadata as tags before transmission) for flexible customization.
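The batch, incremental, and filter capabilities listed above could be sketched roughly as follows. This is only an illustration against the local file system using the Python standard library, not a proposed SeaTunnel API: `SelectionRules`, `select_files`, and all field names are hypothetical, and "incremental" is assumed here to mean "modification time later than the last sync timestamp".

```python
import fnmatch
import os
from dataclasses import dataclass
from typing import Iterator, Optional, Sequence


@dataclass
class SelectionRules:
    """Hypothetical filter options; names are illustrative only."""
    patterns: Sequence[str] = ("*",)        # wildcard patterns matched against file names
    skip_hidden: bool = True                # skip names starting with "."
    skip_empty: bool = True                 # skip zero-byte files
    min_size: int = 0                       # bytes, inclusive lower bound
    max_size: Optional[int] = None          # bytes, inclusive upper bound
    modified_after: Optional[float] = None  # epoch seconds of the last sync


def select_files(root: str, rules: SelectionRules) -> Iterator[str]:
    """Yield paths (relative to root) of files that pass every rule."""
    for dirpath, dirnames, filenames in os.walk(root):
        if rules.skip_hidden:
            # Prune hidden directories in place so os.walk does not descend into them.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in sorted(filenames):
            if rules.skip_hidden and name.startswith("."):
                continue
            if not any(fnmatch.fnmatch(name, p) for p in rules.patterns):
                continue
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if rules.skip_empty and st.st_size == 0:
                continue
            if st.st_size < rules.min_size:
                continue
            if rules.max_size is not None and st.st_size > rules.max_size:
                continue
            if rules.modified_after is not None and st.st_mtime <= rules.modified_after:
                continue  # unchanged since the last sync
            yield os.path.relpath(path, root)
```

For example, `select_files("/data/in", SelectionRules(patterns=("*.log", "data_*.zip")))` would walk `/data/in` recursively and yield only non-hidden, non-empty files matching either pattern. A real connector would need an equivalent listing step for each remote protocol.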
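In summary: add a file system passthrough feature that supports whole-file transfer between protocols such as SFTP, local files, HDFS, Ceph (RGW/CephFS), and S3. The core capabilities are:
- Support binary file transfer: file content is not parsed and is passed through directly.
- Preserve the original file name, modification time, permissions, and other metadata.
- Support batch directory synchronization, incremental transfer, and file filtering.
- Integrate seamlessly with existing SeaTunnel pipelines, with optional processing of file metadata.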
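As a rough sketch of the whole-file, metadata-preserving transfer requested here, the following copies a file byte for byte in chunks and then carries over its timestamps and permission bits. It is local-file-system only and uses the Python standard library; `passthrough_copy`, `CHUNK_SIZE`, the `.part` temporary suffix, and the two policy names are assumptions for illustration, and the "append" policy mentioned in the request is deliberately left out.

```python
import os
import shutil

CHUNK_SIZE = 1024 * 1024  # copy in 1 MiB chunks so large files are never fully in memory


def passthrough_copy(src: str, dst: str, overwrite: str = "skip") -> bool:
    """Copy src to dst without parsing it; return True if dst was written.

    overwrite: "skip" leaves an existing dst untouched, "overwrite" replaces it.
    """
    if overwrite not in ("skip", "overwrite"):
        raise ValueError(f"unsupported overwrite policy: {overwrite!r}")
    if overwrite == "skip" and os.path.exists(dst):
        return False

    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    tmp = dst + ".part"  # write to a temporary name so readers never see a partial file
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout, CHUNK_SIZE)  # raw bytes, no decoding
    shutil.copystat(src, tmp)  # copies permission bits, access time, modification time
    os.replace(tmp, dst)       # rename into place; the original file name is kept by the caller
    return True
```

Writing to a temporary name and renaming is one possible design choice for avoiding half-written files at the destination. Object stores such as S3 have no direct equivalent of `copystat`, so a real sink would need some other way to carry metadata, for instance as object metadata.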
Usage Scenario
No response
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct