fix: chain merge continuations instead of stopping after one pair by Frank-Schruefer · Pull Request #156 · docling-project/docling-ibm-models

Frank-Schruefer · 2026-04-05T21:20:00Z

Problem

predict_merges merges at most one successor per TEXT element. After recording merges[A] = [B] it sets curr_ind = B and moves on, so C is processed as a fresh element rather than being appended to A's merge list.

If A continues into B and B continues into C by the same criteria, all three belong to the same paragraph. The current code produces two short paragraphs (A+B and C+D) instead of one.

This becomes visible with PDFs that have many small orphan clusters — for example scanned pages with a native text layer, where each line of text is its own cluster. The pairwise limit leaves every other line as a separate paragraph, causing unnatural blank lines throughout the output.

Fix

Replace the single if-block with a while loop that keeps extending the merge chain from the current tail (check_ind) as long as the continuation condition holds:

merge_list: List[int] = []
check_ind = ind

while True:
    # find next non-skippable element from check_ind
    ...
    m1 = re.fullmatch(r".+([a-z,\-])(\s*)", sorted_elements[check_ind].text)
    m2 = re.fullmatch(r"(\s*[a-z])(.+)", sorted_elements[ind_p1].text)
    if m1 and m2:
        merge_list.append(sorted_elements[ind_p1].cid)
        curr_ind = ind_p1
        check_ind = ind_p1
    else:
        break

if merge_list:
    merges[elem.cid] = merge_list

The existing _readingorder_elements_to_docling_doc already iterates over the full merge list per element, so no changes are needed there.

Also fixes the skip-loop element-vs-label comparison (sorted_elements[ind_p1].label in skip_labels instead of sorted_elements[ind_p1] in [...] which always evaluated to False — also submitted separately as #153).

predict_merges merged at most one successor per TEXT element: after recording merges[A] = [B] it set curr_ind = B and moved on, so C was processed as a fresh element rather than being appended to A's merge list. If A continues into B and B continues into C by the same criteria, all three belong to the same paragraph. Stopping at A→B and then C→D produces two short paragraphs instead of one. Fix: replace the single if-block with a while loop that keeps extending the merge chain from the current tail as long as the continuation condition holds. Also moves the skip-label list to a local variable and fixes the element-vs-label comparison in the skip loop (sorted_elements [ind_p1].label instead of sorted_elements[ind_p1]). Signed-off-by: stone <frank.schruefer@t-online.de>

github-actions · 2026-04-05T21:20:11Z

✅ DCO Check Passed

Thanks @Frank-Schruefer, all your commits are properly signed off. 🎉

mergify · 2026-04-05T21:20:37Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

nikos-livathinos · 2026-04-07T09:16:12Z

docling_ibm_models/reading_order/reading_order_rb.py

+                        ind_p1 < len(sorted_elements)
+                        and sorted_elements[ind_p1].label == elem.label
+                        and (
+                            elem.page_no != sorted_elements[ind_p1].label


I guess this should be elem.page_no != sorted_elements[ind_p1].page_no

Signed-off-by: stone <frank.schruefer@t-online.de>

nikos-livathinos self-requested a review April 7, 2026 08:11

nikos-livathinos reviewed Apr 7, 2026

View reviewed changes

style: apply black formatting

61479d5

Signed-off-by: stone <frank.schruefer@t-online.de>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: chain merge continuations instead of stopping after one pair#156

fix: chain merge continuations instead of stopping after one pair#156
Frank-Schruefer wants to merge 2 commits intodocling-project:mainfrom
Frank-Schruefer:fix/predict-merges-chain-continuations

Frank-Schruefer commented Apr 5, 2026

Uh oh!

github-actions bot commented Apr 5, 2026 •

edited

Loading

Uh oh!

mergify bot commented Apr 5, 2026

Uh oh!

nikos-livathinos Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Frank-Schruefer commented Apr 5, 2026

Problem

Fix

Uh oh!

github-actions bot commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mergify bot commented Apr 5, 2026

Merge Protections

🟢 Enforce conventional commit

Uh oh!

nikos-livathinos Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions bot commented Apr 5, 2026 •

edited

Loading