fix: generate page_break for skipped pages in export functions #466

Open
jhchoi1182 wants to merge 11 commits into docling-project:main from
jhchoi1182:fix/page-break-for-skipped-pages

Conversation

@jhchoi1182 jhchoi1182 commented Jan 8, 2026

Description

Currently, pages that fail during PDF parsing are silently skipped. Take the test.pdf attached to the issue, which consists of pages 78, 79, 83, and 84: pages 79 and 83 fail to parse, so page 78 is followed immediately by page 84. Before this PR, only a single <page_break> was generated for that transition. Users therefore could not detect that pages 79 and 83 were missing, which can cause problems during post-processing.

This PR modifies the _iterate_items() function to generate individual <page_break> markers for each skipped page when parsing failures result in non-consecutive page transitions. For instance, when jumping from page 78 to 84, the function now generates page breaks for the missing pages (79, 83) in addition to page 84, ensuring all skipped pages are explicitly marked.

Changes

common.py:

  • Added _PageBreakNode class to represent page break nodes with prev_page and next_page fields
  • Added _yield_page_breaks() helper function that generates page break nodes for each page in a non-consecutive range
  • Updated _iterate_items() to use the new helper, ensuring all skipped pages get their own <page_break> marker

html.py:

  • Updated serialize_doc() to handle skipped pages in SPLIT_PAGE mode
  • Skipped pages are now rendered as empty rows in the split page view, maintaining page numbering consistency
  • Added logic to track all physical pages from page breaks and render the full range including missing pages

Affected Export Functions

Function                                         Behavior
export_to_doctags()                              Generates <page_break> tokens for skipped pages
export_to_markdown(page_break_placeholder=str)   Generates placeholder for skipped pages
export_to_html(split_page_view=True)             Renders skipped pages as empty rows in split view

Design Decision

When page breaks are enabled, breaks are now generated for all page transitions, including pages that failed to parse. This is the default behavior without additional parameters, as it provides the most intuitive result: users can detect missing pages in the output. If maintainers prefer to make this configurable, a skip_missing_page_breaks parameter could be added to CommonParams in a follow-up PR.

Example

export_to_doctags()

Before (page 78, 79, 83, 84):

<doctag>
... content ...      78page
<page_break>
... content ...      84page
</doctag>

After (page 78, 79, 83, 84):

<doctag>
... content ...      78page
<page_break>
<page_break>  (79, 83page)
<page_break>
... content ...      84page
</doctag>

export_to_markdown(page_break_placeholder="---PAGE BREAK---")

Before (page 78, 79, 83, 84):

... content ...      78page

---PAGE BREAK---

... content ...      84page

After (page 78, 79, 83, 84):

... content ...      78page

---PAGE BREAK---

---PAGE BREAK---

---PAGE BREAK---

... content ...      84page
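
A consumer of the Markdown export could use the repeated placeholders to locate pages that produced no content. The following is a minimal sketch and not part of this PR: find_empty_pages is a hypothetical helper, and it assumes the caller already knows the ordered page numbers (for example from the document's pages dict) and that the placeholder string never occurs in regular content.

```python
from typing import List, Sequence


def find_empty_pages(
    markdown: str, placeholder: str, page_numbers: Sequence[int]
) -> List[int]:
    """Return the page numbers whose segment between placeholders is empty.

    Hypothetical helper for illustration only; not part of docling-core.
    """
    # Splitting on the placeholder yields one segment per page.
    segments = markdown.split(placeholder)
    if len(segments) != len(page_numbers):
        raise ValueError(
            f"expected {len(page_numbers)} segments, found {len(segments)}"
        )
    return [
        page
        for page, segment in zip(page_numbers, segments)
        if not segment.strip()
    ]


# Same shape as the "After" output above.
exported = "\n\n".join(
    [
        "... content ...",
        "---PAGE BREAK---",
        "---PAGE BREAK---",
        "---PAGE BREAK---",
        "... content ...",
    ]
)
empty = find_empty_pages(exported, "---PAGE BREAK---", [78, 79, 83, 84])
print(empty)  # prints [79, 83]
```

Note that an empty segment only shows that a page produced no content. A page that is genuinely blank in the source PDF would look the same as one that failed to parse.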

export_to_html(split_page_view=True)

Before (page 78, 79, 83, 84):

<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
... content ...      78page

</div>
</td>
</tr>
<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
... content ...      84page

</div>
</td>
</tr>

After (page 78, 79, 83, 84):

<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
... content ...      78page

</div>
</td>
</tr>
<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
                     (79page)

</div>
</td>
</tr>
<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
                     (83page)

</div>
</td>
</tr>
<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
... content ...      84page

</div>
</td>
</tr>
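
The split-view behaviour shown above can be approximated as follows. This is a simplified sketch and not the code in html.py: render_split_rows and ROW_TEMPLATE are hypothetical names, the page content is passed in as pre-rendered strings, and every page is assumed to have no page image.

```python
from typing import List, Mapping

# Row layout copied from the example output above.
ROW_TEMPLATE = (
    "<tr>\n<td>\n<figure>no page-image found</figure>\n</td>\n"
    "<td>\n<div class='page'>\n{content}\n</div>\n</td>\n</tr>"
)


def render_split_rows(page_content: Mapping[int, str]) -> str:
    """Render one table row per page, in page order.

    Pages that failed to parse are expected to be present with an
    empty string, so they still get a (blank) row of their own.
    """
    rows: List[str] = []
    for page_no in sorted(page_content):
        rows.append(ROW_TEMPLATE.format(content=page_content[page_no]))
    return "\n".join(rows)


html = render_split_rows(
    {78: "<p>first</p>", 79: "", 83: "", 84: "<p>last</p>"}
)
print(html.count("<tr>"))  # prints 4
```

Because the row count always equals the number of pages, a reader of the split view can line up each row with its page number even when some pages are empty.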

Testing

Unit tests for this feature are available in test/test_page_break_skipped_pages.py. Run the tests with:

pytest test/test_page_break_skipped_pages.py -v

Issue resolved by this Pull Request:

Resolves #472

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • New and existing unit tests pass locally with my changes

Copilot AI review requested due to automatic review settings January 8, 2026 07:30

mergify bot commented Jan 8, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
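
To check a title locally before pushing, the pattern above can be tried in Python. This is only an approximation: it assumes the ~= operator behaves like a Python regular-expression search, and is_conventional is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(
    is_conventional(
        "fix: generate page_break for skipped pages in export functions"
    )
)  # prints True
print(is_conventional("Update README"))  # prints False
```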


dosubot bot commented Jan 8, 2026

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

Contributor

github-actions bot commented Jan 8, 2026

DCO Check Passed

Thanks @jhchoi1182, all your commits are properly signed off. 🎉

Contributor

Copilot AI left a comment

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


@jhchoi1182 jhchoi1182 force-pushed the fix/page-break-for-skipped-pages branch 4 times, most recently from c8c2c99 to a6102f0 Compare January 12, 2026 05:46
Author

jhchoi1182 commented Jan 16, 2026

After merging the main branch, I applied ruff checks to my changes.
Note that running ruff check --fix and ruff format also modifies 14 other files, but I only applied it to my changes as I think it would be better to handle those separately.

Author

jhchoi1182 commented Jan 23, 2026

Hi @dolfim-ibm, checking in on this PR. Is there anything else needed from my side to get this merged? I'm happy to address any feedback or conflicts.


ezphyki commented Jan 30, 2026

@vagenas

Please process this matter quickly.

Author

I found that my previous PR code conflicted with the .filter() method.

To resolve this, I've submitted a separate PR to the docling repository that tracks failed pages by adding them to DoclingDocument.pages. This PR, its companion in docling-core, modifies _yield_page_breaks() to generate page breaks only for pages present in the pages dict.

How it works:

  • docling: _add_failed_pages_to_document() method adds failed/skipped pages to DoclingDocument.pages with their size information (empty content)
  • docling-core: _yield_page_breaks() now takes page_numbers parameter (extracted from doc.pages.keys()) and only generates page breaks for pages in that set

This approach ensures:

  • ✅ Failed pages get proper page break markers (pages are in pages dict)
  • ✅ filter() method continues to work correctly (excluded pages are removed from pages dict, so no spurious page breaks)
  • ✅ Filter + failed page scenario works (e.g., filtering {2,3,5} with page 3 failing still generates correct breaks)
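
The filtering rule can be illustrated with a small standalone sketch. It is not the code in common.py: PageBreak and yield_page_breaks_sketch are hypothetical stand-ins for _PageBreakNode and _yield_page_breaks(), and the body is an assumption about how the page_numbers check might work. It also assumes next_page is itself a key of the pages dict.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class PageBreak:
    """Hypothetical stand-in for the PR's _PageBreakNode."""

    prev_page: int
    next_page: int


def yield_page_breaks_sketch(
    prev_page: int, next_page: int, page_numbers: Iterable[int]
) -> Iterator[PageBreak]:
    """Yield one break per known page in the interval (prev_page, next_page].

    Pages absent from page_numbers (for example, removed by a filter)
    produce no break of their own.
    """
    current = prev_page
    for page in sorted(p for p in page_numbers if prev_page < p <= next_page):
        yield PageBreak(prev_page=current, next_page=page)
        current = page


# Failed pages 79 and 83 are present in the pages dict: three breaks.
print(len(list(yield_page_breaks_sketch(78, 84, {78, 79, 83, 84}))))  # prints 3
# Pages 2 and 3 were removed by a filter: a single break from 1 to 4.
print(len(list(yield_page_breaks_sketch(1, 4, {1, 4}))))  # prints 1
```

Using the pages dict as the single source of truth means the serializer does not need to know why a page is absent; it only needs to know whether the page is part of the document.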

Related PR: docling-project/docling#2939

When pages fail to parse and are missing from the document,
the serializer now generates page_break markers for each
skipped page number instead of only for the next available page.

This ensures users can detect when pages were skipped during
PDF parsing by checking for non-consecutive page breaks.

Affects: export_to_doctags(), export_to_html(), export_to_markdown()
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Update prev_page_nr after processing ListGroup/InlineGroup to prevent the same page transition from being detected twice.

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Add comprehensive test coverage for the _yield_page_breaks() function
that generates individual page breaks for each skipped page.

Tests added:
- Document page count verification (normal, 1-page skipped, 2-pages skipped)
- DocTags page break count for all scenarios
- Markdown page break count for all scenarios
- Edge cases: skipping 1 page (1->3) vs multiple pages (1->4)

Test data files added:
- normal_4pages.json: 4-page document with all pages present
- skipped_1page.json: 3-page document with page 2 missing (pages 1, 3)
- skipped_2pages.json: 4-page document with pages 2, 3 missing (pages 1, 4)

Signed-off-by: jhchoi1182
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Render pages that failed parsing as empty rows in SPLIT_PAGE mode,
maintaining page numbering consistency and allowing users to detect
missing pages in the output.

Signed-off-by: Jihyeon Choe <choejihyeon@example.com>
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Verify _yield_page_breaks() generates page breaks for non-consecutive
pages and that export functions (doctags, markdown, html split view)
correctly handle skipped pages in their output.

Signed-off-by: Jihyeon Choe <choejihyeon@example.com>
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Only generate page breaks for pages present in DoclingDocument.pages dict.
This enables proper page break markers for failed pages (added by docling)
while maintaining compatibility with filter() method (which removes pages).

Changes:
- Add page_numbers parameter to _yield_page_breaks() function
- Extract page_numbers from doc.pages.keys() in _iterate_items()
- Update test data to include failed pages in pages dict
- Update test expectations for new behavior

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
@jhchoi1182 jhchoi1182 force-pushed the fix/page-break-for-skipped-pages branch from e4766e5 to 2d52ca9 Compare February 19, 2026 02:22
Author

@cau-git

I've rebased on main to clean up the messy commit history.

With the companion docling PR (docling-project/docling#2939) now merged into main, failed pages are guaranteed to be present in doc.pages. This allowed me to simplify the split_page_view logic in html.py by removing defensive code that handled missing pages.

I'd appreciate feedback on whether this overall approach is the right direction. Happy to adjust if needed.

… pages

Since the companion docling PR (_add_failed_pages_to_document) ensures
failed pages are always present in doc.pages, the defensive code for
handling missing pages in split_page_view is no longer needed.

- Remove all_physical_pages tracking (doc.pages.keys() is the source of truth)
- Remove is_skipped_page branching (failed pages exist in doc.pages with image=None)
- Use sorted(doc.pages.keys()) directly for pages_to_render

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
@jhchoi1182 jhchoi1182 force-pushed the fix/page-break-for-skipped-pages branch from be24bba to 4d5f231 Compare February 19, 2026 04:55
Development

Successfully merging this pull request may close these issues.

Export functions fail to generate <page_break> tags for non-consecutive (skipped) pages

3 participants