[Feature] Support File System Direct Passthrough (Whole File/Binary Transmission) Between Different Storage Protocols (folder-level passthrough of binary files across multiple storage systems) #10192

@h499154897-cmyk

Search before asking

  • I have searched the existing feature requests and found no similar one.

Description

SeaTunnel currently focuses on structured and semi-structured data integration (e.g., reading text/CSV/JSON files from SFTP and writing their content to S3/Ceph). It cannot pass files through directly, as whole files, between different file systems or storage protocols. The key limitations are:

  1. Binary file incompatibility: SeaTunnel parses files as text/structured data by default, which corrupts or garbles binary files (e.g., ZIP archives, images, videos, executables).
  2. Loss of original file attributes: The original file name, modification time, access permissions, file size, and other metadata cannot be retained during transmission.
  3. No whole file transmission: The current pipeline processes data line by line or in batches rather than transmitting the entire file as a single unit, which is inefficient for large files.
  4. Limited support for file system protocols: For storage such as Ceph (CephFS/RGW), SeaTunnel relies on S3-compatible sinks and cannot interact directly with CephFS or other file system protocols for passthrough.
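
To make limitation 1 concrete, here is a minimal sketch in plain Python (not SeaTunnel code) of what happens when binary content goes through a text decode/encode round trip, compared with handling it as opaque bytes. The sample payload is made up for illustration.

```python
# Illustration only: a text-oriented read/write path versus byte passthrough.
# The payload is a made-up sample: ZIP magic bytes followed by bytes that are
# not valid UTF-8.
payload = bytes([0x50, 0x4B, 0x03, 0x04, 0xFF, 0xFE, 0x00, 0x80])

# Text-style handling: decode to a string on read, encode again on write.
# Invalid sequences are replaced with U+FFFD, so information is lost.
as_text = payload.decode("utf-8", errors="replace")
round_tripped = as_text.encode("utf-8")

# Passthrough-style handling: the bytes are never interpreted.
passed_through = bytes(payload)

print(round_tripped == payload)   # False: the content was altered
print(passed_through == payload)  # True: byte-for-byte identical
```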

Expected Feature (File System Direct Passthrough)
We propose adding a File System Passthrough feature to SeaTunnel that enables direct, whole-file transmission between different storage protocols without parsing or modifying the file content. The core capabilities should include:

  1. Support for multiple storage protocols:
     • Source: SFTP, FTP, Local File System, HDFS, S3 (including Ceph RGW), CephFS, etc.
     • Sink: Ceph (RGW/CephFS), S3, Local File System, HDFS, SFTP, OSS, COS, etc.
  2. Whole file transmission: Transmit the entire file as a single unit (no line-by-line parsing) to support binary files and large files efficiently.
  3. Preserve file attributes:
     • Retain original file names (critical for business scenarios).
     • Preserve metadata (modification time, access time, file permissions, file size, etc.).
     • Support custom file name mapping (e.g., adding prefixes/suffixes, renaming rules) if needed.
  4. Batch and incremental transmission:
     • Support batch transmission of all files in a specified directory (including subdirectories).
     • Support incremental transmission (e.g., only transmit files that are new or modified since the last sync).
  5. Filter and control capabilities:
     • Support file filtering via wildcards (e.g., *.log, data_*.zip) or regular expressions.
     • Support skipping empty files, hidden files, or files larger/smaller than a specified size.
     • Support configurable overwrite policies (e.g., overwrite existing files, skip, or append).
  6. Seamless integration with existing SeaTunnel pipelines:
     • Provide a dedicated FilePassthrough Source/Sink plugin (or extend existing file connectors with a "passthrough mode").
     • Allow optional integration with Transform steps (e.g., adding file metadata as tags before transmission) for flexible customization.
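
As a rough illustration of whole-file transmission with attribute preservation and an overwrite policy, the following sketch copies one file between two local paths. It is not SeaTunnel code: `passthrough_copy`, the `overwrite` policy names, and `CHUNK_SIZE` are hypothetical, and it only covers the local file system. Remote protocols (SFTP, S3, CephFS, and so on) would each need their own way of mapping metadata.

```python
import shutil
from pathlib import Path

CHUNK_SIZE = 8 * 1024 * 1024  # assumed buffer size, not a SeaTunnel setting


def passthrough_copy(src: Path, dst: Path, overwrite: str = "skip") -> bool:
    """Copy src to dst byte for byte; return True if a copy was made.

    `overwrite` is a hypothetical policy name: "overwrite" or "skip".
    """
    if dst.exists() and overwrite == "skip":
        return False
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Stream in fixed-size chunks so a large file is never held fully in memory.
    with src.open("rb") as fin, dst.open("wb") as fout:
        shutil.copyfileobj(fin, fout, CHUNK_SIZE)
    # Carry over permission bits and access/modification times.
    shutil.copystat(src, dst)
    return True
```

The file is opened in binary mode and never decoded, so the content is not interpreted at any point. The "append" policy mentioned above is left out of this sketch.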

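In summary: add a file system passthrough feature that supports whole-file transfer between protocols such as SFTP, local files, HDFS, Ceph (RGW/CephFS), and S3. The core capabilities are:

  • Transfer binary files by passing them through directly, without parsing the file content.
  • Preserve the original file name, modification time, permissions, and other metadata.
  • Support batch directory synchronization, incremental transfer, and file filtering.
  • Integrate seamlessly with existing SeaTunnel pipelines, with optional processing of file metadata.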
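
The batch, incremental, and filtering behaviour requested here could look roughly like the sketch below for a local directory. Everything in it is hypothetical (the `select_files` name and its parameters are not an existing SeaTunnel API). It also assumes the wildcard examples are meant as `*.log` and `data_*.zip`, and that "incremental" means comparing each file's modification time against the time of the last sync.

```python
import fnmatch
from pathlib import Path
from typing import Iterator, Sequence


def select_files(
    root: Path,
    patterns: Sequence[str] = ("*",),
    since_mtime: float = 0.0,
    skip_empty: bool = True,
    skip_hidden: bool = True,
) -> Iterator[Path]:
    """Yield files under root (including subdirectories) that pass the filters."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Treat a file as hidden if it, or any parent below root, starts with ".".
        if skip_hidden and any(part.startswith(".") for part in rel.parts):
            continue
        # Wildcard match on the file name only.
        if not any(fnmatch.fnmatch(path.name, p) for p in patterns):
            continue
        stat = path.stat()
        if skip_empty and stat.st_size == 0:
            continue
        # Incremental mode: keep only files modified after the last sync.
        if stat.st_mtime <= since_mtime:
            continue
        yield path
```

A caller would persist the timestamp of each successful run and pass it as `since_mtime` on the next one. Relying on modification time alone can miss files whose timestamps are rewritten, so a real implementation might also compare sizes or checksums.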

Usage Scenario

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
