fix(docx): split multiple OMML equations into separate formula items#3123

Merged
cau-git merged 10 commits into
docling-project:mainfrom
giulio-leone:fix/omml-multi-equation-paragraph
Mar 24, 2026

@giulio-leone
Contributor

Summary

When a DOCX paragraph contains multiple sibling <m:oMath> elements (e.g. two separate equations on one line), the converter concatenated them into a single LaTeX string because element.iter() walks all descendants depth-first, mixing children from different oMath nodes.

Root Cause

_handle_equations_in_text() used element.iter() (deep iteration) to collect both text runs and math elements. With multiple sibling <m:oMath> elements:

<w:p>
</w:p>

iter() would visit the children of the first oMath AND the second oMath and its children — all interleaved. The result was a single concatenated equation string.
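The difference between deep and direct-child iteration can be reproduced on a tiny paragraph. The XML snippet in the description above lost its contents, so the paragraph below is an assumed minimal shape (two sibling `<m:oMath>` elements), and the standard-library `xml.etree.ElementTree` stands in for whatever XML library the backend uses. It illustrates only the flattening, not the exact output of the converter.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
OMATH = f"{{{M_NS}}}oMath"
M_TEXT = f"{{{M_NS}}}t"

# Assumed shape: one paragraph, two sibling oMath elements, no text runs.
PARAGRAPH_XML = (
    f'<w:p xmlns:w="{W_NS}" xmlns:m="{M_NS}">'
    "<m:oMath><m:r><m:t>E=mc^2</m:t></m:r></m:oMath>"
    "<m:oMath><m:r><m:t>F=ma</m:t></m:r></m:oMath>"
    "</w:p>"
)
paragraph = ET.fromstring(PARAGRAPH_XML)


def local_name(element):
    """Strip the '{namespace}' prefix from an element tag."""
    return element.tag.split("}", 1)[1]


# Deep iteration: descendants of both equations arrive in one flat stream.
deep_tags = [local_name(el) for el in paragraph.iter()]
deep_text = "".join(el.text or "" for el in paragraph.iter(M_TEXT))

# Direct-child iteration: each oMath sibling is handled on its own.
per_equation = [
    "".join(t.text or "" for t in child.iter(M_TEXT))
    for child in paragraph
    if child.tag == OMATH
]

print(deep_tags)
print(deep_text)
print(per_equation)
```

Once the stream is flat, nothing marks where one equation ends and the next begins, which is why the boundary has to be taken from the paragraph's direct children.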

Fix

  1. Direct-children-first iteration: Check for oMath elements at the direct child level. If found, iterate direct children only, converting each oMath sibling independently. Falls back to the original deep iteration when oMath elements are nested inside wrapper elements like oMathPara.

  2. Split standalone multi-equation paragraphs: When a paragraph contains only equations (no surrounding text) and has more than one equation, each is now emitted as a separate FORMULA document item instead of merging into one.
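The two steps above can be sketched roughly as follows. This is not the code from `msword_backend.py`: `collect_parts`, `to_items`, and `equation_text` are hypothetical names, `equation_text` is only a placeholder for the real OMML-to-LaTeX conversion, and wrapping inline equations in `$...$` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
OMATH = f"{{{M_NS}}}oMath"
M_TEXT = f"{{{M_NS}}}t"
W_TEXT = f"{{{W_NS}}}t"


def parse(inner_xml):
    """Wrap a fragment in a w:p element with both namespaces declared."""
    return ET.fromstring(
        f'<w:p xmlns:w="{W_NS}" xmlns:m="{M_NS}">{inner_xml}</w:p>'
    )


def equation_text(omath):
    """Placeholder for OMML-to-LaTeX conversion: join the m:t text."""
    return "".join(t.text or "" for t in omath.iter(M_TEXT))


def collect_parts(paragraph):
    """Return ordered ('eq' | 'text', value) pairs for one paragraph."""
    has_direct_math = any(child.tag == OMATH for child in paragraph)
    # Step 1: direct children when sibling oMath elements exist,
    # otherwise fall back to deep iteration (e.g. oMath inside oMathPara).
    nodes = list(paragraph) if has_direct_math else list(paragraph.iter())
    parts = []
    for node in nodes:
        if node.tag == OMATH:
            parts.append(("eq", equation_text(node)))
        elif has_direct_math:
            parts.extend(("text", t.text or "") for t in node.iter(W_TEXT))
        elif node.tag == W_TEXT:
            parts.append(("text", node.text or ""))
    return parts


def to_items(paragraph):
    """Step 2: equation-only paragraphs yield one FORMULA item each."""
    parts = collect_parts(paragraph)
    equations = [value for kind, value in parts if kind == "eq"]
    has_text = any(kind == "text" and value.strip() for kind, value in parts)
    if equations and not has_text:
        return [("FORMULA", eq) for eq in equations]
    inline = "".join(f"${v}$" if k == "eq" else v for k, v in parts)
    return [("TEXT", inline)]


siblings = parse(
    "<m:oMath><m:r><m:t>E=mc^2</m:t></m:r></m:oMath>"
    "<m:oMath><m:r><m:t>F=ma</m:t></m:r></m:oMath>"
)
wrapped = parse(
    "<m:oMathPara><m:oMath><m:r><m:t>a=b</m:t></m:r></m:oMath></m:oMathPara>"
)
mixed = parse(
    "<m:oMath><m:r><m:t>x</m:t></m:r></m:oMath>"
    "<w:r><w:t> and </w:t></w:r>"
    "<m:oMath><m:r><m:t>y</m:t></m:r></m:oMath>"
)
print(to_items(siblings))
print(to_items(wrapped))
print(to_items(mixed))
```

Keeping the deep-iteration fallback means paragraphs whose equations sit inside a wrapper such as `oMathPara` take the same path as before, so only the sibling case changes behavior.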

Before / After

Before: A paragraph with equations E = mc^2 and F = ma produced:

FORMULA: "E = mc^2 F = ma"

After: Two separate items:

FORMULA: "E = mc^2"
FORMULA: "F = ma"

Closes #3121

@github-actions
Contributor

github-actions Bot commented Mar 13, 2026

DCO Check Passed

Thanks @giulio-leone, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify Bot commented Mar 13, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2
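For contributors who want to check a title locally before opening a PR, the pattern from the "Enforce conventional commit" rule can be tried with Python's `re` module. This is a minimal sketch that assumes the `~=` operator behaves like a regular-expression search; `is_conventional` is a hypothetical helper, not part of Mergify or docling.

```python
import re

# Pattern copied from the "Enforce conventional commit" rule above.
TITLE_PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return TITLE_PATTERN.search(title) is not None


print(is_conventional(
    "fix(docx): split multiple OMML equations into separate formula items"
))
print(is_conventional("Split multiple OMML equations"))
```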

@giulio-leone giulio-leone force-pushed the fix/omml-multi-equation-paragraph branch from 20430c3 to f98d86f Compare March 13, 2026 05:31
@dolfim-ibm
Member

@giulio-leone can you please add the document attached to the linked issue as a test?

@codecov

codecov Bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@giulio-leone
Contributor Author

@dolfim-ibm Done — added tests/data/docx/omml_multi_equation_paragraph.docx (a minimal DOCX with two sibling oMath elements separated by a text run) along with matching groundtruth files (md, json, itxt).

The test document validates that the fix correctly splits the equations into two separate FormulaItem entries instead of concatenating them.

@giulio-leone
Contributor Author

Pushed a follow-up CI fix. The new fixture itself was fine, but I had generated the .itxt snapshot with the wrong exporter. I regenerated omml_multi_equation_paragraph.docx.itxt using the same _export_to_indented_text(max_text_len=70, explicit_tables=False) path that test_backend_msword.py actually validates.

@M-Hassan-Raza
Contributor

Thanks for putting this together. The fix direction looks right, but I don’t think the new fixture is covering the exact failing shape from #3121.

The issue is specifically about a single paragraph made up only of sibling m:oMath elements, with no text runs between them. On current main, that case still collapses into one display block and the equation order gets scrambled. The fixture added here looks more like formula-text-formula inline content, which current main already seems to handle correctly.

I’d suggest adding a regression fixture that matches the issue attachment more directly: one paragraph, multiple sibling m:oMath nodes, no intervening text. That would make it much clearer that this PR is locking down the reported bug and not a nearby case.
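One way to confirm that a candidate fixture has this shape is to inspect its paragraph XML directly. The sketch below is an assumption-laden illustration: `matches_issue_shape` is a hypothetical helper, the two sample paragraphs are hand-written rather than taken from the issue attachment, and the standard-library `xml.etree.ElementTree` is used for parsing.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
OMATH = f"{{{M_NS}}}oMath"
W_TEXT = f"{{{W_NS}}}t"


def matches_issue_shape(paragraph):
    """True for a paragraph with 2+ sibling oMath children and no run text."""
    sibling_equations = [child for child in paragraph if child.tag == OMATH]
    has_run_text = any((t.text or "").strip() for t in paragraph.iter(W_TEXT))
    return len(sibling_equations) >= 2 and not has_run_text


def make_paragraph(inner_xml):
    """Build a w:p element around the given fragment."""
    return ET.fromstring(
        f'<w:p xmlns:w="{W_NS}" xmlns:m="{M_NS}">{inner_xml}</w:p>'
    )


EQ_A = "<m:oMath><m:r><m:t>a=b</m:t></m:r></m:oMath>"
EQ_C = "<m:oMath><m:r><m:t>c=d</m:t></m:r></m:oMath>"

# Shape described in the issue: only sibling equations.
equations_only = make_paragraph(EQ_A + EQ_C)
# Shape of the first fixture: formula, text run, formula.
formula_text_formula = make_paragraph(
    EQ_A + "<w:r><w:t> and </w:t></w:r>" + EQ_C
)

print(matches_issue_shape(equations_only))
print(matches_issue_shape(formula_text_formula))
```

For a real `.docx` file, the paragraph XML lives in `word/document.xml` inside the archive, so the same check could be applied after reading that member with `zipfile.ZipFile`.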

PSA: I am new to this codebase so I could be wrong, in which case please feel free to discard this comment.

@giulio-leone giulio-leone force-pushed the fix/omml-multi-equation-paragraph branch from 29eab68 to f0c772c Compare March 15, 2026 16:18
@giulio-leone
Contributor Author

Hi @dolfim-ibm, @M-Hassan-Raza — thanks for the detailed feedback! I'll add the test document from the linked issue and update the fixture to properly cover the exact failing shape (sibling m:oMath elements with no text runs between them). Will push an update shortly.

@giulio-leone
Contributor Author

Thanks @dolfim-ibm @M-Hassan-Raza for the feedback!

I've now:

  1. Replaced the test document with the real Word file from issue #3121, "Multiple OMML equations in one paragraph concatenated into a single display block" (the ~37 KB document from @smroels containing three sibling <m:oMath> elements in one paragraph)
  2. Regenerated all groundtruth files for the new document

The conversion correctly produces three separate equation blocks:

$$a=b$$
$$c=d$$
$$e=f$$

Ready for re-review!

@giulio-leone giulio-leone force-pushed the fix/omml-multi-equation-paragraph branch from 2375409 to 277b980 Compare March 15, 2026 20:43
@giulio-leone
Contributor Author

Hi team! 👋 The mergify bot indicates this PR requires two reviewers for test updates. Could a second reviewer (@PeterStaar-IBM, @cau-git, or @ceberam) take a look when convenient? The DCO check is now passing and all groundtruth files have been regenerated. Thank you!

@cau-git cau-git changed the title fix(msword): split multiple OMML equations into separate formula items fix(docx): split multiple OMML equations into separate formula items Mar 16, 2026
@cau-git
Member

cau-git commented Mar 17, 2026

@giulio-leone Thanks for taking care of the feedback. Could you please re-run your pre-commit toolchain to ensure the tests pass?

uv run pre-commit install # only once in your dev setup
uv run pre-commit run --all-files # you can make a new commit and it will do this for you automatically.

@giulio-leone
Contributor Author

I reran the requested formatting/tooling pass and pushed the result.

Local verification

  • uv run --python 3.12 pre-commit run --all-files
  • uv run --python 3.12 pytest tests/test_backend_msword.py -q ✅ (17 passed, 1 xfailed, 1 xpassed)

Real DOCX proof on the issue fixture

I also re-ran the actual conversion on tests/data/docx/omml_multi_equation_paragraph.docx and compared origin/main against this PR branch.

  • origin/main produced 1 concatenated formula item: c=de=fa=b
  • this PR branch produced 3 separate formula items: a=b, c=d, e=f

The markdown export matches that behavior as well:

  • origin/main => one $$c=de=fa=b$$ block
  • this PR => three separate formula blocks

The only new commit on the branch is the formatter rerun requested by CI / review:

  • 7f3faed style(docx): rerun ruff formatter for msword backend

@giulio-leone giulio-leone force-pushed the fix/omml-multi-equation-paragraph branch from 7f3faed to 57d9d35 Compare March 21, 2026 15:18
@giulio-leone
Contributor Author

Rebased this PR onto current main and reran the requested tooling/test path on the refreshed head.

Validation (clean passes on rebased 57d9d35):

  • uv run pre-commit run --files docling/backend/msword_backend.py tests/test_backend_msword.py
  • uv run pytest tests/test_backend_msword.py::test_e2e_docx_conversions -q
  • repeated the same no-diff pass a second time

Exact-source proof on the same real DOCX fixture (tests/data/docx/omml_multi_equation_paragraph.docx):

  • rebased branch 57d9d35: emits 3 separate formulas: a=b, c=d, e=f
  • current origin/main 4e650af: still emits 1 malformed concatenated formula: c=de=fa=b

So the refreshed branch still fixes the actual regression on top of current main, and the worktree is clean after validation.

dolfim-ibm
dolfim-ibm previously approved these changes Mar 23, 2026
Member

@dolfim-ibm dolfim-ibm left a comment

lgtm

giulio-leone and others added 6 commits March 23, 2026 11:27
When a DOCX paragraph contains multiple sibling <m:oMath> elements
(e.g. separate equations on one line), the converter previously
concatenated them into a single LaTeX string because element.iter()
walks all descendants depth-first.

Fix: iterate direct children of the paragraph element first to
correctly identify sibling <m:oMath> elements, converting each
independently. Falls back to deep iteration only when oMath
elements are nested inside wrapper elements.

Also splits standalone multi-equation paragraphs into individual
FORMULA document items instead of merging them into one.

Closes #3121

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

Add a minimal DOCX file containing two separate oMath elements
in one paragraph with a text separator, along with groundtruth
output files for markdown, json, and plain text export.

Requested-by: @dolfim-ibm
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

Use the real Word document from the issue reporter (smroels)
instead of the minimal programmatic fixture. The new document
contains three sibling <m:oMath> elements in one paragraph,
matching the exact failing shape described in #3121.

Regenerate groundtruth to match the richer document structure.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

Re-run document conversion with current code to update .itxt and .json
groundtruth files. The .itxt had stale structure from the previous
programmatic fixture; the new real-document conversion produces the
correct output with three separate formula items.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@giulio-leone giulio-leone force-pushed the fix/omml-multi-equation-paragraph branch from eca2359 to 84cc70b Compare March 23, 2026 10:28
@giulio-leone
Contributor Author

PR refresh — 2026-03-23

Cherry-picked onto current main (f0e3d1d) — was previously on 4e650af. Branch fix/omml-multi-equation-paragraph force-pushed to fork.

Test validation (double-pass):

  • tests/test_backend_msword.py
  • Pass 1: 19 passed, 1 xfailed, 1 xpassed ✅
  • Pass 2: 19 passed, 1 xfailed, 1 xpassed ✅

All tests pass. PR is ready for review.

Comment thread docling/backend/msword_backend.py Outdated
@giulio-leone
Contributor Author

✅ Validation Evidence

Branch: fix/omml-multi-equation-paragraph @ 84cc70b
Status: 0 commits behind upstream main

Double-pass test results (strict identical runs, no code changes between passes):

Pass 1: 19 passed, 1 xfailed, 1 xpassed, 1 warning  ✅
Pass 2: 19 passed, 1 xfailed, 1 xpassed, 1 warning  ✅

Test file: tests/test_backend_msword.py

Branch pushed to fork. CI is the authoritative gate.

@dolfim-ibm
Member

@giulio-leone please apply the DCO fix commit, then we can finalize the PR.

giulio-leone and others added 2 commits March 23, 2026 17:14
Remove the unused local in the direct oMath iteration path so the code
reads clearly and the outstanding review comment is fully addressed
without changing equation-handling behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 84cc70b

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
@giulio-leone
Contributor Author

Addressed the remaining contributor-side blockers on refreshed head c78617f.

What changed

  • removed the unused tag_name binding in the direct oMath iteration path of docling/backend/msword_backend.py
  • added the required DCO remediation commit for 84cc70b55e96804f32590215a3eab31a0c280586

This is the smallest possible change to close the still-open review thread without altering the equation-splitting logic.

Double-pass local gate

Ran these twice back-to-back with no code changes between passes:

  • uv run pre-commit run --all-files
  • uv run pytest tests/test_backend_msword.py -q

Both passes were clean:

  • 19 passed, 1 xfailed, 1 xpassed, 1 warning

Real DOCX proof on the issue fixture

Using the same real DOCX fixture (omml_multi_equation_paragraph.docx) with both current main code and the refreshed branch code:

  • branch:
    • a=b
    • c=d
    • e=f
  • main:
    • c=de=fa=b

So the refreshed head still fixes the reported multi-equation paragraph bug on top of current main, and the remaining PR blockers from review + DCO are now addressed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Member

@ceberam ceberam left a comment

@giulio-leone please check my comments on the other PR
#3122 (review)

Do you plan to add more commits or do you consider the PR final?

Comment thread tests/test_backend_msword.py Outdated
@giulio-leone
Contributor Author

Pushed a small follow-up in 46ddaec for the latest test-structure review note on tests/test_backend_msword.py.

What changed:

  • promoted the shared MsWordDocumentBackend setup into a module-scoped pytest fixture
  • rewired the touched tests to reuse that fixture instead of rebuilding the backend each time

Validation (run twice, no code changes between passes):

  • uv run pre-commit run --files tests/test_backend_msword.py
  • uv run pytest tests/test_backend_msword.py -q
  • result on both passes: 24 passed, 1 xfailed, 1 xpassed

No additional contributor-side changes are planned on this PR from my side unless new review feedback appears.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
@giulio-leone giulio-leone force-pushed the fix/omml-multi-equation-paragraph branch from 46ddaec to 8ebb44d Compare March 23, 2026 20:13
@giulio-leone
Copy link
Copy Markdown
Contributor Author

Small DCO-only refresh on top of the fixture-reuse follow-up: the live PR head is now signed commit 8ebb44d, which replaces 46ddaec with the same code change plus the required sign-off so the DCO gate can pass.

No code-path changes beyond the already-described fixture reuse; validation evidence from the earlier comment is still the relevant proof.

@giulio-leone
Contributor Author

@ceberam Thanks — from my side I consider #3123 final now.

I already checked and answered the question on #3122 in a separate PR comment there, and on this PR the latest head 8ebb44d is only the signed replacement for the earlier fixture-reuse follow-up so that DCO can pass again.

So no further contributor-side commits are planned on #3123 unless new review feedback appears.

Member

@ceberam ceberam left a comment

Thanks @giulio-leone for your contribution 🏆

@cau-git cau-git merged commit 90d6dd4 into docling-project:main Mar 24, 2026
25 checks passed
@dosubot

dosubot Bot commented Mar 24, 2026

Documentation Updates

1 document(s) were updated by changes in this PR:

What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
@@ -80,6 +80,11 @@
 - **Key Options**:
     - Enrichment options (code, formula, chart, image description)
     - **Header/Footer Export**: Only supported via Python API by setting `included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE}`; default export excludes header/footer
+- **Processing**:
+    - **Multiple Equations in Paragraphs**: When a DOCX paragraph contains multiple sibling OMML equations (e.g., multiple `<m:oMath>` elements), each equation is extracted as a separate `FORMULA` item in the document structure. This applies to both:
+        - **Standalone equation paragraphs**: Paragraphs containing only equations (no surrounding text) produce multiple separate `FORMULA` items, one for each equation
+        - **Inline equations**: Multiple equations within text-containing paragraphs are preserved as distinct formula items
+    - Previously, multiple sibling equations in a single paragraph were concatenated into a single LaTeX string, but this has been fixed to maintain each equation as a separate document item
 - **Notes**: Header/footer are automatically detected as FURNITURE layer. CLI/Serve API exports only BODY. [Example](https://app.dosu.dev/097760a8-135e-4789-8234-90c8837d7f1c/documents/e596ee79-fc7f-43a4-90e2-74891e0cf12f).
 
 ---

Development

Successfully merging this pull request may close these issues.

Multiple OMML equations in one paragraph concatenated into a single display block

6 participants