fix(docx): split multiple OMML equations into separate formula items#3123
Conversation
|
✅ DCO Check Passed Thanks @giulio-leone, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updatesWonderful, this rule succeeded.When test data is updated, we require two reviewers
|
20430c3 to
f98d86f
Compare
|
@giulio-leone can you please add the document attached to the linked issue as a test? |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
@dolfim-ibm Done — added The test document validates that the fix correctly splits the equations into two separate |
|
Pushed a follow-up CI fix. The new fixture itself was fine, but I had generated the |
|
Thanks for putting this together. The fix direction looks right, but I don’t think the new fixture is covering the exact failing shape from #3121. The issue is specificlly about a single paragraph made up only of sibling I’d suggest adding a regression fixture that matches the issue attachment more directly: one paragraph, multiple sibling PSA: I am new to this coedbase so I could be wrong, in which case please feel free to discard this comment. |
29eab68 to
f0c772c
Compare
|
Hi @dolfim-ibm, @M-Hassan-Raza — thanks for the detailed feedback! I'll add the test document from the linked issue and update the fixture to properly cover the exact failing shape (sibling |
|
Thanks @dolfim-ibm @M-Hassan-Raza for the feedback! I've now:
The conversion correctly produces three separate equation blocks: Ready for re-review! |
2375409 to
277b980
Compare
|
Hi team! 👋 The mergify bot indicates this PR requires two reviewers for test updates. Could a second reviewer (@PeterStaar-IBM, @cau-git, or @ceberam) take a look when convenient? The DCO check is now passing and all groundtruth files have been regenerated. Thank you! |
|
@giulio-leone Thanks for taking care of the feedback. Could you please re-run your pre-commit toolchain to ensure the tests pass? |
|
I reran the requested formatting/tooling pass and pushed the result. Local verification
Real DOCX proof on the issue fixtureI also re-ran the actual conversion on
The markdown export matches that behavior as well:
The only new commit on the branch is the formatter rerun requested by CI / review:
|
7f3faed to
57d9d35
Compare
|
Rebased this PR onto current Validation (clean passes on rebased
Exact-source proof on the same real DOCX fixture (
So the refreshed branch still fixes the actual regression on top of current |
When a DOCX paragraph contains multiple sibling <m:oMath> elements (e.g. separate equations on one line), the converter previously concatenated them into a single LaTeX string because element.iter() walks all descendants depth-first. Fix: iterate direct children of the paragraph element first to correctly identify sibling <m:oMath> elements, converting each independently. Falls back to deep iteration only when oMath elements are nested inside wrapper elements. Also splits standalone multi-equation paragraphs into individual FORMULA document items instead of merging them into one. Closes #3121 Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Add a minimal DOCX file containing two separate oMath elements in one paragraph with a text separator, along with groundtruth output files for markdown, json, and plain text export. Requested-by: @dolfim-ibm Signed-off-by: Giulio Leone <giulioleone10@gmail.com> Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Use the real Word document from the issue reporter (smroels) instead of the minimal programmatic fixture. The new document contains three sibling <m:oMath> elements in one paragraph, matching the exact failing shape described in #3121. Regenerate groundtruth to match the richer document structure. Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Re-run document conversion with current code to update .itxt and .json groundtruth files. The .itxt had stale structure from the previous programmatic fixture; the new real-document conversion produces the correct output with three separate formula items. Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
eca2359 to
84cc70b
Compare
|
PR refresh — 2026-03-23 Cherry-picked onto current Test validation (double-pass):
All tests pass. PR is ready for review. |
✅ Validation EvidenceBranch: Double-pass test results (strict identical runs, no code changes between passes): Test file: Branch pushed to fork. CI is the authoritative gate. |
|
@giulio-leone please apply the DCO fix commit, then we can finalize the PR. |
Remove the unused local in the direct oMath iteration path so the code reads clearly and the outstanding review comment is fully addressed without changing equation-handling behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 84cc70b Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
|
Addressed the remaining contributor-side blockers on refreshed head What changed
This is the smallest possible change to close the still-open review thread without altering the equation-splitting logic. Double-pass local gateRan these twice back-to-back with no code changes between passes:
Both passes were clean:
Real DOCX proof on the issue fixtureUsing the same real DOCX fixture (
So the refreshed head still fixes the reported multi-equation paragraph bug on top of current |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
ceberam
left a comment
There was a problem hiding this comment.
@giulio-leone please check my comments on the other PR
#3122 (review)
Do you plan to add more commits or do you consider the PR final?
|
Pushed a small follow-up in What changed:
Validation (run twice, no code changes between passes):
No additional contributor-side changes are planned on this PR from my side unless new review feedback appears. |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
46ddaec to
8ebb44d
Compare
|
Small DCO-only refresh on top of the fixture-reuse follow-up: the live PR head is now signed commit No code-path changes beyond the already-described fixture reuse; validation evidence from the earlier comment is still the relevant proof. |
|
@ceberam Thanks — from my side I consider I already checked and answered the question on So no further contributor-side commits are planned on |
ceberam
left a comment
There was a problem hiding this comment.
Thanks @giulio-leone for your contribution 🏆
|
Documentation Updates 1 document(s) were updated by changes in this PR: What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?View Changes@@ -80,6 +80,11 @@
- **Key Options**:
- Enrichment options (code, formula, chart, image description)
- **Header/Footer Export**: Only supported via Python API by setting `included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE}`; default export excludes header/footer
+- **Processing**:
+ - **Multiple Equations in Paragraphs**: When a DOCX paragraph contains multiple sibling OMML equations (e.g., multiple `<m:oMath>` elements), each equation is extracted as a separate `FORMULA` item in the document structure. This applies to both:
+ - **Standalone equation paragraphs**: Paragraphs containing only equations (no surrounding text) produce multiple separate `FORMULA` items, one for each equation
+ - **Inline equations**: Multiple equations within text-containing paragraphs are preserved as distinct formula items
+ - Previously, multiple sibling equations in a single paragraph were concatenated into a single LaTeX string, but this has been fixed to maintain each equation as a separate document item
- **Notes**: Header/footer are automatically detected as FURNITURE layer. CLI/Serve API exports only BODY. [Example](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/documents/e596ee79-fc7f-43a4-90e2-74891e0cf12f).
--- |
Summary
When a DOCX paragraph contains multiple sibling
<m:oMath>elements (e.g. two separate equations on one line), the converter concatenated them into a single LaTeX string becauseelement.iter()walks all descendants depth-first, mixing children from differentoMathnodes.Root Cause
_handle_equations_in_text()usedelement.iter()(deep iteration) to collect both text runs and math elements. With multiple sibling<m:oMath>elements:iter()would visit the children of the firstoMathAND the secondoMathand its children — all interleaved. The result was a single concatenated equation string.Fix
Direct-children-first iteration: Check for
oMathelements at the direct child level. If found, iterate direct children only, converting eachoMathsibling independently. Falls back to the original deep iteration whenoMathelements are nested inside wrapper elements likeoMathPara.Split standalone multi-equation paragraphs: When a paragraph contains only equations (no surrounding text) and has more than one equation, each is now emitted as a separate
FORMULAdocument item instead of merging into one.Before / After
Before: A paragraph with equations
E = mc^2andF = maproduced:After: Two separate items:
Closes #3121