[Feature][File] Add word parser for RAG support #9898

joonseolee · 2025-09-26T07:30:02Z

Purpose of this pull request

Add and refine Word (.docx) reading via WordReadStrategy.
Output schema (10 fields): element_id, element_type, text_content, font_style, underline_style, font_size, font_family, text_color, alignment, hyperlink_url.
Process document elements in natural order (paragraphs and tables). Footnote text is included within the referencing paragraph’s text_content.
Due to Apache POI limitations, the minimal extractable unit is a paragraph. Run-level styles are aggregated at the paragraph level:
- font_style: NORMAL/BOLD/ITALIC/BOLD_ITALIC
- underline_style: null or concrete style (e.g., SINGLE)
- font_size, font_family: first encountered values or null
- text_color: defaults to "000000" when absent
- hyperlink_url: all links in a paragraph concatenated with commas
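
The aggregation rules above can be sketched without Apache POI. This is only an illustration, not the code in WordReadStrategy: `Run` is a hypothetical stand-in for a Word run, and three choices are assumptions the description does not confirm: a paragraph counts as bold or italic if any run is, the first explicit run color wins, and hyperlink_url is null when the paragraph has no links.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphStyleSketch {

    /** Hypothetical stand-in for a Word run; this is not a POI class. */
    static final class Run {
        final boolean bold;
        final boolean italic;
        final String color;     // hex RGB, or null when the run sets no color
        final String hyperlink; // target URL, or null for plain text

        Run(boolean bold, boolean italic, String color, String hyperlink) {
            this.bold = bold;
            this.italic = italic;
            this.color = color;
            this.hyperlink = hyperlink;
        }
    }

    /** Assumption: one bold (or italic) run marks the whole paragraph. */
    static String fontStyle(List<Run> runs) {
        boolean bold = false;
        boolean italic = false;
        for (Run run : runs) {
            bold |= run.bold;
            italic |= run.italic;
        }
        if (bold && italic) {
            return "BOLD_ITALIC";
        }
        if (bold) {
            return "BOLD";
        }
        return italic ? "ITALIC" : "NORMAL";
    }

    /** Assumption: first explicit color wins; "000000" when no run sets one. */
    static String textColor(List<Run> runs) {
        for (Run run : runs) {
            if (run.color != null) {
                return run.color;
            }
        }
        return "000000";
    }

    /** Joins every link in the paragraph with commas; null when there are none (assumption). */
    static String hyperlinkUrl(List<Run> runs) {
        List<String> urls = new ArrayList<>();
        for (Run run : runs) {
            if (run.hyperlink != null) {
                urls.add(run.hyperlink);
            }
        }
        return urls.isEmpty() ? null : String.join(",", urls);
    }
}
```

Because the links are joined with commas, a URL that itself contains a comma cannot be split back out unambiguously by a downstream consumer.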

Does this PR introduce any user-facing change?

Yes. The Word reader’s output schema is simplified to 10 fields above. Some formatting attributes now return null when not explicitly present; text_color defaults to "000000". Elements are emitted in document order, and hyperlinks are aggregated per paragraph.

How was this patch tested?

Added WordReadStrategyTest to validate all 10 fields against a sample .docx.
Verified:
- Paragraph rows contain text and aggregated formatting/links.
- Table rows produce a single text blob per table; formatting-related fields are null.
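
For illustration, the "single text blob per table" behavior could look like the sketch below. The separators are assumptions (tab between cells, newline between rows); the PR description does not state which ones the reader actually uses, and `TableTextSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

public class TableTextSketch {

    /**
     * Flattens a table, given as rows of cell texts, into one string.
     * Tab and newline separators are assumptions made for this sketch.
     */
    static String flatten(List<List<String>> rows) {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(String.join("\t", row));
        }
        return String.join("\n", lines);
    }
}
```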

Check list

- If your PR adds any new Jar binary package, please add a License Notice according to the New License Guide.
- If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
- If you are contributing connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config

Related Issue

#9715

Hisoka-X · 2025-09-26T14:19:47Z

cc @liugddx could you help to review it?

Hisoka-X · 2025-09-26T14:25:38Z

...ile-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/FileFormat.java

        }
    },
-    ;
+    WORD("docx") {


how about .doc?

joonseolee · 2025-10-12

I forgot about it, so I will add .doc too :)
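
One point worth keeping in mind for this follow-up: .docx is the OOXML format and .doc is the older binary format, and Apache POI reads them through separate components (XWPF and HWPF respectively), so a reader supporting both needs to branch on the file type. A minimal sketch of suffix-based detection follows; `WordSuffixSketch` and `WordKind` are hypothetical names and are not part of FileFormat.

```java
import java.util.Locale;
import java.util.Optional;

public class WordSuffixSketch {

    /** Hypothetical enum; the real FileFormat entry in this PR is WORD("docx"). */
    enum WordKind {
        DOCX("docx"),
        DOC("doc");

        final String suffix;

        WordKind(String suffix) {
            this.suffix = suffix;
        }
    }

    /** Returns the Word flavor for a path, or empty if the suffix is not a Word suffix. */
    static Optional<WordKind> detect(String path) {
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot == path.length() - 1) {
            return Optional.empty();
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        for (WordKind kind : WordKind.values()) {
            if (kind.suffix.equals(ext)) {
                return Optional.of(kind);
            }
        }
        return Optional.empty();
    }
}
```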

Hisoka-X · 2025-09-26T14:29:08Z

...main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/WordReadStrategy.java

+    @Override
+    public SeaTunnelRowType getSeaTunnelRowTypeInfo(String path) throws FileConnectorException {
+        return new SeaTunnelRowType(
+                new String[] {


We need these fields too:
"heading_level",
"text",
"page_number",
"position_index",
"parent_id",
"child_ids"
Just like Markdown. We should keep most of the fields the same.

joonseolee · 2025-10-12

@Hisoka-X

I wanted to add those fields as well, but I couldn't because of the technical issues below.

heading_level: Not reliably derivable. Style names/IDs vary by template, language, and custom styles; there is no stable, standard mapping. Outline level may be absent, so consistent extraction isn’t guaranteed.

text: Same as text_content

page_number: Not available. Pagination is a rendering/layout result (paper size, margins, fonts, images, printer settings); Apache POI does not perform layout computation.

position_index: A hierarchical index (Markdown-style) isn’t feasible without a reliable document tree and heading detection; Word’s body is effectively flat and heading detection is inconsistent. A simple global ordinal is possible.

parent_id: Not available. WordprocessingML does not provide an explicit AST with parent-child relationships for paragraphs.

child_ids: Not available for the same reason as parent_id; there’s no stable logical child list per element in Word.
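
The "simple global ordinal" mentioned under position_index could be as small as the sketch below: each body element gets its zero-based position in document order, with no hierarchy implied. `Element`, the `element-N` id format, and zero-based numbering are all assumptions made for illustration, not what the PR implements.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionIndexSketch {

    /** Hypothetical output element; field names mirror the schema discussion only loosely. */
    static final class Element {
        final String elementId;
        final int positionIndex;
        final String text;

        Element(String elementId, int positionIndex, String text) {
            this.elementId = elementId;
            this.positionIndex = positionIndex;
            this.text = text;
        }
    }

    /** Assigns a flat, zero-based ordinal to each body element in document order. */
    static List<Element> number(List<String> bodyTexts) {
        List<Element> out = new ArrayList<>();
        int ordinal = 0;
        for (String text : bodyTexts) {
            out.add(new Element("element-" + ordinal, ordinal, text));
            ordinal++;
        }
        return out;
    }
}
```

A flat ordinal keeps ordering information for downstream chunking, but it cannot stand in for parent_id or child_ids, since it carries no nesting.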

liugddx

Please add documentation.

liugddx · 2025-09-29T05:45:36Z

Please fix the test case issue.

[Feature][File] Add word parser for RAG support

ce2a470

github-actions bot added connectors-v2 file labels Sep 26, 2025

Hisoka-X reviewed Sep 26, 2025

liugddx reviewed Sep 29, 2025
