[Feature][File] Add word parser for RAG support #9898
base: dev
cc @liugddx, could you help review it?
        }
    },
    ;
    WORD("docx") {
How about `.doc`?
I forgot it, so I will add `doc` too :)
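For context on why `.doc` is more than an extra suffix: legacy `.doc` files are OLE2 compound documents, while `.docx` files are ZIP packages, so a reader usually needs a separate parsing path for each. The sketch below is a hypothetical helper (the class name `WordContainerSniffer` and its enum are illustrative, not part of this PR or of SeaTunnel) that tells the two containers apart by their leading bytes. A ZIP signature only shows that the file is a ZIP container, not that it is a valid `.docx`.

```java
// Hypothetical helper for illustration only; not code from this PR.
final class WordContainerSniffer {

    enum Container { OLE2, ZIP, UNKNOWN }

    // OLE2 compound file signature (used by legacy .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\u0003\u0004" (used by .docx packages).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Classifies a file by the first bytes of its content.
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A reader could use a check like this to choose between a `.doc` and a `.docx` parsing path instead of trusting the file suffix alone.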
@Override
public SeaTunnelRowType getSeaTunnelRowTypeInfo(String path) throws FileConnectorException {
    return new SeaTunnelRowType(
            new String[] {
We need these too:
"heading_level",
"text",
"page_number",
"position_index",
"parent_id",
"child_ids"
Just like Markdown. We should keep most fields the same.
I wanted to add those fields as well, but I couldn't because of the technical limitations below.
- heading_level: Not reliably derivable. Style names/IDs vary by template, language, and custom styles; there is no stable, standard mapping. Outline level may be absent, so consistent extraction isn’t guaranteed.
- text: Same as text_content
- page_number: Not available. Pagination is a rendering/layout result (paper size, margins, fonts, images, printer settings); Apache POI does not perform layout computation.
- position_index: A hierarchical index (Markdown-style) isn’t feasible without a reliable document tree and heading detection; Word’s body is effectively flat and heading detection is inconsistent. A simple global ordinal is possible.
- parent_id: Not available. WordprocessingML does not provide an explicit AST with parent-child relationships for paragraphs.
- child_ids: Not available for the same reason as parent_id; there’s no stable logical child list per element in Word.
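The "simple global ordinal" mentioned for position_index could look roughly like the sketch below: each body element gets a running counter in document order, with no hierarchy implied. The class and field names are hypothetical and the input is a plain list of strings standing in for extracted paragraphs; this is not the PR's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a flat, global position_index; not code from this PR.
final class OrdinalIndexSketch {

    // Minimal stand-in for an emitted element (assumed shape).
    static final class Element {
        final int positionIndex;
        final String text;

        Element(int positionIndex, String text) {
            this.positionIndex = positionIndex;
            this.text = text;
        }
    }

    // Assigns 0, 1, 2, ... to the blocks in the order they appear in the document.
    static List<Element> assignOrdinals(List<String> blocksInDocumentOrder) {
        List<Element> out = new ArrayList<>();
        int next = 0;
        for (String text : blocksInDocumentOrder) {
            out.add(new Element(next, text));
            next++;
        }
        return out;
    }
}
```

Unlike the Markdown-style hierarchical index, this ordinal only records reading order, so it does not depend on heading detection.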
Please add documentation.
Please fix the test case issue.
Purpose of this pull request
- `WordReadStrategy`
- `element_id`, `element_type`, `text_content`, `font_style`, `underline_style`, `font_size`, `font_family`, `text_color`, `alignment`, `hyperlink_url`
- `text_content`
- `font_style`: NORMAL/BOLD/ITALIC/BOLD_ITALIC
- `underline_style`: null or concrete style (e.g., SINGLE)
- `font_size`, `font_family`: first encountered values or null
- `text_color`: defaults to "000000" when absent
- `hyperlink_url`: all links in a paragraph concatenated with commas

Does this PR introduce any user-facing change?
Yes. The Word reader’s output schema is simplified to the 10 fields above. Some formatting attributes now return `null` when not explicitly present; `text_color` defaults to `"000000"`. Elements are emitted in document order, and hyperlinks are aggregated per paragraph.

How was this patch tested?
- `WordReadStrategyTest` to validate all 10 fields against a sample.docx.
- `null`.

Check list
New License Guide
Related Issue
#9715
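
The per-field rules listed in the PR description (font_style, text_color, hyperlink_url, and the "first encountered value or null" behavior) can be sketched in plain Java as below. This is an illustration under assumptions, not the PR's code: the class and method names are hypothetical, inputs are plain values rather than Apache POI objects, and returning null for a paragraph with no links is an assumption the description does not confirm.

```java
import java.util.List;

// Hypothetical illustration of the field rules described in the PR text; not code from this PR.
final class WordFieldRulesSketch {

    // font_style: NORMAL / BOLD / ITALIC / BOLD_ITALIC
    static String fontStyle(boolean bold, boolean italic) {
        if (bold && italic) {
            return "BOLD_ITALIC";
        }
        if (bold) {
            return "BOLD";
        }
        if (italic) {
            return "ITALIC";
        }
        return "NORMAL";
    }

    // text_color: defaults to "000000" when absent
    static String textColor(String rawColor) {
        return (rawColor == null || rawColor.isEmpty()) ? "000000" : rawColor;
    }

    // hyperlink_url: all links in a paragraph concatenated with commas.
    // Returning null when there are no links is an assumption.
    static String hyperlinkUrl(List<String> urlsInParagraph) {
        if (urlsInParagraph == null || urlsInParagraph.isEmpty()) {
            return null;
        }
        return String.join(",", urlsInParagraph);
    }

    // font_size / font_family: first encountered value, or null if none is set.
    static <T> T firstEncountered(List<T> perRunValues) {
        for (T value : perRunValues) {
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

For example, `WordFieldRulesSketch.fontStyle(true, true)` yields `BOLD_ITALIC`, and `WordFieldRulesSketch.textColor(null)` yields `000000`.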