Skip to content

fix: chain merge continuations instead of stopping after one pair#156

Open
Frank-Schruefer wants to merge 2 commits intodocling-project:mainfrom
Frank-Schruefer:fix/predict-merges-chain-continuations
Open

fix: chain merge continuations instead of stopping after one pair#156
Frank-Schruefer wants to merge 2 commits intodocling-project:mainfrom
Frank-Schruefer:fix/predict-merges-chain-continuations

Conversation

@Frank-Schruefer
Copy link
Copy Markdown

Problem

predict_merges merges at most one successor per TEXT element. After recording merges[A] = [B] it sets curr_ind = B and moves on, so C is processed as a fresh element rather than being appended to A's merge list.

If A continues into B and B continues into C by the same criteria, all three belong to the same paragraph. The current code produces two short paragraphs (A+B and C+D) instead of one.

This becomes visible with PDFs that have many small orphan clusters — for example scanned pages with a native text layer, where each line of text is its own cluster. The pairwise limit leaves every other line as a separate paragraph, causing unnatural blank lines throughout the output.

Fix

Replace the single if-block with a while loop that keeps extending the merge chain from the current tail (check_ind) as long as the continuation condition holds:

merge_list: List[int] = []
check_ind = ind

while True:
    # find next non-skippable element from check_ind
    ...
    m1 = re.fullmatch(r".+([a-z,\-])(\s*)", sorted_elements[check_ind].text)
    m2 = re.fullmatch(r"(\s*[a-z])(.+)", sorted_elements[ind_p1].text)
    if m1 and m2:
        merge_list.append(sorted_elements[ind_p1].cid)
        curr_ind = ind_p1
        check_ind = ind_p1
    else:
        break

if merge_list:
    merges[elem.cid] = merge_list

The existing _readingorder_elements_to_docling_doc already iterates over the full merge list per element, so no changes are needed there.

Also fixes the skip-loop element-vs-label comparison (sorted_elements[ind_p1].label in skip_labels instead of sorted_elements[ind_p1] in [...] which always evaluated to False — also submitted separately as #153).

predict_merges merged at most one successor per TEXT element: after
recording merges[A] = [B] it set curr_ind = B and moved on, so C was
processed as a fresh element rather than being appended to A's merge
list.

If A continues into B and B continues into C by the same criteria, all
three belong to the same paragraph. Stopping at A→B and then C→D
produces two short paragraphs instead of one.

Fix: replace the single if-block with a while loop that keeps extending
the merge chain from the current tail as long as the continuation
condition holds. Also moves the skip-label list to a local variable and
fixes the element-vs-label comparison in the skip loop (sorted_elements
[ind_p1].label instead of sorted_elements[ind_p1]).

Signed-off-by: stone <frank.schruefer@t-online.de>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 5, 2026

DCO Check Passed

Thanks @Frank-Schruefer, all your commits are properly signed off. 🎉

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Apr 5, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@nikos-livathinos nikos-livathinos self-requested a review April 7, 2026 08:11
ind_p1 < len(sorted_elements)
and sorted_elements[ind_p1].label == elem.label
and (
elem.page_no != sorted_elements[ind_p1].label
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this should be elem.page_no != sorted_elements[ind_p1].page_no

Signed-off-by: stone <frank.schruefer@t-online.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants