joonseolee
Contributor

Purpose of this pull request

  • Add and refine Word (.docx) reading via WordReadStrategy.
  • Output schema (10 fields): element_id, element_type, text_content, font_style, underline_style, font_size, font_family, text_color, alignment, hyperlink_url.
  • Process document elements in natural order (paragraphs and tables). Footnote text is included within the referencing paragraph’s text_content.
  • Due to Apache POI limitations, the minimal extractable unit is a paragraph. Run-level styles are aggregated at the paragraph level:
    • font_style: NORMAL/BOLD/ITALIC/BOLD_ITALIC
    • underline_style: null or concrete style (e.g., SINGLE)
    • font_size, font_family: first encountered values or null
    • text_color: defaults to "000000" when absent
    • hyperlink_url: all links in a paragraph concatenated with commas
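
The aggregation rules above can be sketched roughly as follows. This is a minimal illustration, not the code in WordReadStrategy: `Run` is a hypothetical stand-in for a text run (not an Apache POI type), and several details are assumptions not stated in this description, namely that a paragraph counts as bold or italic if any run is, that text_color takes the first color found, and that hyperlink_url is null when a paragraph has no links.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphStyleSketch {

    /** Hypothetical stand-in for one text run; this is not an Apache POI type. */
    static final class Run {
        final boolean bold;
        final boolean italic;
        final Integer fontSize;  // null when the run does not set a size
        final String fontFamily; // null when the run does not set a family
        final String color;      // hex RGB such as "FF0000", or null
        final String link;       // hyperlink target, or null

        Run(boolean bold, boolean italic, Integer fontSize,
            String fontFamily, String color, String link) {
            this.bold = bold;
            this.italic = italic;
            this.fontSize = fontSize;
            this.fontFamily = fontFamily;
            this.color = color;
            this.link = link;
        }
    }

    // Assumption: the paragraph is bold (or italic) if at least one run is.
    static String fontStyle(List<Run> runs) {
        boolean bold = false;
        boolean italic = false;
        for (Run r : runs) {
            bold |= r.bold;
            italic |= r.italic;
        }
        if (bold && italic) return "BOLD_ITALIC";
        if (bold) return "BOLD";
        if (italic) return "ITALIC";
        return "NORMAL";
    }

    // First encountered value, or null when no run sets a size.
    static Integer fontSize(List<Run> runs) {
        for (Run r : runs) {
            if (r.fontSize != null) return r.fontSize;
        }
        return null;
    }

    // First encountered value, or null when no run sets a family.
    static String fontFamily(List<Run> runs) {
        for (Run r : runs) {
            if (r.fontFamily != null) return r.fontFamily;
        }
        return null;
    }

    // Assumption: first color found wins; "000000" when no run sets one.
    static String textColor(List<Run> runs) {
        for (Run r : runs) {
            if (r.color != null) return r.color;
        }
        return "000000";
    }

    // All links in the paragraph joined with commas (null if none: an assumption).
    static String hyperlinkUrl(List<Run> runs) {
        List<String> links = new ArrayList<>();
        for (Run r : runs) {
            if (r.link != null) links.add(r.link);
        }
        return links.isEmpty() ? null : String.join(",", links);
    }
}
```

Taking the first encountered value keeps the output to one row per paragraph, at the cost of hiding mixed formatting inside that paragraph.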

Does this PR introduce any user-facing change?

Yes. The Word reader’s output schema is simplified to the 10 fields listed above. Formatting attributes that are not explicitly set return null, except text_color, which defaults to "000000". Elements are emitted in document order, and hyperlinks are aggregated per paragraph.

How was this patch tested?

  • Added WordReadStrategyTest to validate all 10 fields against a sample .docx.
  • Verified:
    • Paragraph rows contain text and aggregated formatting/links.
    • Table rows produce a single text blob per table; formatting-related fields are null.
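
As a rough sketch of the table case verified above, a table could be flattened into one row like this. The field order follows the schema in this description; the tab and newline separators, the "table" element_type value, and the choice to leave every field after text_content null are assumptions made for illustration, not details confirmed by the patch.

```java
import java.util.List;
import java.util.StringJoiner;

public class TableRowSketch {

    // Field order as listed in the PR description.
    static final String[] FIELDS = {
        "element_id", "element_type", "text_content", "font_style", "underline_style",
        "font_size", "font_family", "text_color", "alignment", "hyperlink_url"
    };

    // Assumption: cells are joined with tabs and table rows with newlines.
    static String flatten(List<List<String>> table) {
        StringJoiner rows = new StringJoiner("\n");
        for (List<String> cells : table) {
            rows.add(String.join("\t", cells));
        }
        return rows.toString();
    }

    // Builds one output row per table; untouched slots stay null.
    static Object[] tableRow(String elementId, List<List<String>> table) {
        Object[] row = new Object[FIELDS.length];
        row[0] = elementId;
        row[1] = "table"; // hypothetical element_type value
        row[2] = flatten(table);
        return row;
    }
}
```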

Check list

Related Issue

#9715

@Hisoka-X
Member

cc @liugddx, could you help review this?

}
},
;
WORD("docx") {
Member

How about .doc?

Contributor Author

I forgot about it. I'll add .doc too :)

@Override
public SeaTunnelRowType getSeaTunnelRowTypeInfo(String path) throws FileConnectorException {
    return new SeaTunnelRowType(
            new String[] {
Member

We need these too:
"heading_level",
"text",
"page_number",
"position_index",
"parent_id",
"child_ids"
Just like Markdown. We should keep most fields the same.

Contributor Author

@Hisoka-X

I wanted to add those values as well, but I couldn't due to technical issues like the ones below.

  • heading_level: Not reliably derivable. Style names/IDs vary by template, language, and custom styles; there is no stable, standard mapping. Outline level may be absent, so consistent extraction isn’t guaranteed.
  • text: Same as text_content
  • page_number: Not available. Pagination is a rendering/layout result (paper size, margins, fonts, images, printer settings); Apache POI does not perform layout computation.
  • position_index: A hierarchical index (Markdown-style) isn’t feasible without a reliable document tree and heading detection; Word’s body is effectively flat and heading detection is inconsistent. A simple global ordinal is possible.
  • parent_id: Not available. WordprocessingML does not provide an explicit AST with parent-child relationships for paragraphs.
  • child_ids: Not available for the same reason as parent_id; there’s no stable logical child list per element in Word.
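
For the one option noted above as feasible, a simple global ordinal for position_index, a minimal sketch could look like the following. The class and method names are hypothetical, and starting the count at 0 is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class GlobalOrdinalSketch {

    // Pairs each body element with a running ordinal in document order.
    // Each result entry is {position_index, element_type}.
    static List<Object[]> withPositionIndex(List<String> elementTypesInDocumentOrder) {
        List<Object[]> rows = new ArrayList<>();
        int ordinal = 0; // assumption: zero-based
        for (String elementType : elementTypesInDocumentOrder) {
            rows.add(new Object[] {ordinal, elementType});
            ordinal++;
        }
        return rows;
    }
}
```

An ordinal like this records only reading order; it carries no hierarchy, so it does not replace parent_id or child_ids.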

Member

@liugddx liugddx left a comment

Please add documentation.

@liugddx
Member

liugddx commented Sep 29, 2025

Please fix the test case issue.
