fix: resolving issue 3655 for rich tables in docx #3657

Merged
PeterStaar-IBM merged 2 commits into main from dev/fix-issue-3655 on Jun 19, 2026

@PeterStaar-IBM

@PeterStaar-IBM PeterStaar-IBM commented Jun 19, 2026

Member

Summary

Fix DOCX table extraction when tables are wrapped in block-level SDT content controls.

Some DOCX files store real w:tbl elements inside w:sdtContent. The MS Word backend previously handled
block-level w:sdt by extracting every nested paragraph directly, which preserved the table text but
flattened the table structure. This change recursively walks the direct children of w:sdtContent so that
nested tables reach the existing table parser.

Changes

  • Recursively process block-level w:sdtContent in MsWordDocumentBackend._walk_linear.
  • Add regression coverage for docx_rich_tables_01.docx.
  • Verify the fixture extracts two 24x3 tables with headers:
    • Feature
    • Action Needed
    • Comment/Links
  • Verify the tables appear after the Phase 1 and Phase 2 text markers.
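
The difference between the old flattening behaviour and the recursive walk can be illustrated with a minimal, self-contained sketch. This is not the code from MsWordDocumentBackend._walk_linear; the names `walk_blocks`, `flatten_paragraphs`, `cell`, and `SAMPLE` are hypothetical, and the standard-library `xml.etree.ElementTree` stands in for whatever XML layer the backend actually uses.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by w:p, w:tbl, w:sdt, ...
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tag(name):
    """Return the namespace-qualified tag for a w:* element."""
    return f"{{{W}}}{name}"


def text_of(paragraph):
    """Concatenate all w:t runs inside one w:p."""
    return "".join(t.text or "" for t in paragraph.iter(tag("t")))


def flatten_paragraphs(container):
    """Old-style behaviour (illustrative): grab every nested paragraph,
    which keeps the cell text but loses the table structure."""
    return [text_of(p) for p in container.iter(tag("p"))]


def walk_blocks(container, out):
    """Walk direct children in order; recurse into w:sdtContent so a
    wrapped w:tbl is still handled as a table."""
    for child in container:
        if child.tag == tag("sdt"):
            content = child.find(tag("sdtContent"))
            if content is not None:
                walk_blocks(content, out)
        elif child.tag == tag("tbl"):
            rows = child.findall(tag("tr"))
            cols = len(rows[0].findall(tag("tc"))) if rows else 0
            out.append(("table", len(rows), cols))
        elif child.tag == tag("p"):
            out.append(("paragraph", text_of(child)))
    return out


def cell(text):
    """Hypothetical helper that builds a one-paragraph table cell."""
    return f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


# A tiny body: one text marker, then a 2x2 table wrapped in a block-level SDT.
SAMPLE = (
    f'<w:body xmlns:w="{W}">'
    "<w:p><w:r><w:t>Phase 1</w:t></w:r></w:p>"
    "<w:sdt><w:sdtContent><w:tbl>"
    f'<w:tr>{cell("Feature")}{cell("Action Needed")}</w:tr>'
    f'<w:tr>{cell("Tables")}{cell("Review")}</w:tr>'
    "</w:tbl></w:sdtContent></w:sdt>"
    "</w:body>"
)

body = ET.fromstring(SAMPLE)
flat = flatten_paragraphs(body)
blocks = walk_blocks(body, [])
```

Here `flat` holds five loose strings, while `blocks` keeps the marker paragraph followed by a single table entry, which mirrors the ordering check described in the last bullet.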

resolves: #3655

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
PeterStaar-IBM requested a review from ceberam June 19, 2026 07:33
@github-actions

github-actions Bot commented Jun 19, 2026

Contributor

DCO Check Passed

Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented Jun 19, 2026

Contributor

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
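
As a rough illustration of what this rule checks, the pattern can be tried locally against a PR title. This is only a sketch that assumes the pattern behaves the same under Python's `re` module; `is_conventional` is a hypothetical helper, not part of Mergify.

```python
import re

# Pattern copied from the merge protection rule above.
PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return PATTERN.match(title) is not None


ok = is_conventional("fix: resolving issue 3655 for rich tables in docx")
scoped = is_conventional("fix(docx)!: handle tables inside content controls")
bad = is_conventional("resolving issue 3655 for rich tables in docx")
```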

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@codecov

codecov Bot commented Jun 19, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.


PeterStaar-IBM changed the title from "fix: resolving issue 3665 for rich tables in docx" to "fix: resolving issue 3655 for rich tables in docx" Jun 19, 2026

@ceberam ceberam left a comment

Member

Looks good.
@PeterStaar-IBM can you just add the groundtruth files for the new test file docx_rich_tables_01.docx? They should be:

tests/data/groundtruth/docling_v2/docx_rich_tables_01.docx.itxt
tests/data/groundtruth/docling_v2/docx_rich_tables_01.docx.json
tests/data/groundtruth/docling_v2/docx_rich_tables_01.docx.md

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

@ceberam ceberam left a comment

Member

🚀

PeterStaar-IBM merged commit 6fe4fc3 into main Jun 19, 2026
26 checks passed
PeterStaar-IBM deleted the dev/fix-issue-3655 branch June 19, 2026 08:41

Development

Successfully merging this pull request may close these issues.

Table is not extracted correctly in DOCX

3 participants