fix: generate page_break for skipped pages in export functions by jhchoi1182 · Pull Request #466 · docling-project/docling-core

jhchoi1182 · 2026-01-08T07:30:16Z

Description

Currently, pages that fail during PDF parsing are silently skipped. Take the test.pdf attached to the issue, which consists of pages 78, 79, 83, and 84: pages 79 and 83 fail to parse, so page 78 is followed immediately by page 84 and only a single <page_break> is generated, for page 84. Users cannot detect that pages 79 and 83 are missing, which can cause problems during post-processing.

This PR modifies the _iterate_items() function to generate individual <page_break> markers for each skipped page when parsing failures result in non-consecutive page transitions. For instance, when jumping from page 78 to 84, the function now generates page breaks for the missing pages (79, 83) in addition to page 84, ensuring all skipped pages are explicitly marked.

Changes

common.py:

- Added _PageBreakNode class to represent page break nodes with prev_page and next_page fields
- Added _yield_page_breaks() helper function that generates page break nodes for each page in a non-consecutive range
- Updated _iterate_items() to use the new helper, ensuring all skipped pages get their own <page_break> marker
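
The helper described above can be illustrated with a minimal sketch. This is not the PR's code: `PageBreak` and `yield_page_breaks` are hypothetical stand-ins for `_PageBreakNode` and `_yield_page_breaks()`, and the assumption is that one break is emitted per page boundary in the gap.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class PageBreak:
    """Hypothetical stand-in for the PR's _PageBreakNode."""

    prev_page: int
    next_page: int


def yield_page_breaks(prev_page: int, next_page: int) -> Iterator[PageBreak]:
    """Yield one break per page boundary between prev_page and next_page.

    Assumption: every integer page number in the gap counts as a skipped page.
    """
    for page in range(prev_page + 1, next_page + 1):
        yield PageBreak(prev_page=page - 1, next_page=page)


# Jumping from page 1 to page 4 produces breaks for pages 2, 3, and 4.
breaks = list(yield_page_breaks(1, 4))
```

A consecutive transition (for example 2 to 3) yields exactly one break under this sketch, so the normal case is unchanged.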

html.py:

- Updated serialize_doc() to handle skipped pages in SPLIT_PAGE mode
- Skipped pages are now rendered as empty rows in the split page view, maintaining page numbering consistency
- Added logic to track all physical pages from page breaks and render the full range including missing pages
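
As a rough illustration of the split page behavior, the sketch below renders one table row per physical page and leaves the content cell empty for pages without content. `render_split_rows` and its parameters are hypothetical; the row markup only mirrors the example output in this description and is not taken from html.py.

```python
from typing import Dict, Iterable, List


def render_split_rows(
    content_by_page: Dict[int, str], all_pages: Iterable[int]
) -> List[str]:
    """Return one <tr> per page; pages without content get an empty cell."""
    rows = []
    for page_no in sorted(all_pages):
        # Skipped pages have no entry, so their row body stays empty.
        body = content_by_page.get(page_no, "")
        rows.append(
            "<tr>\n<td>\n<figure>no page-image found</figure>\n</td>\n"
            f"<td>\n<div class='page'>\n{body}\n</div>\n</td>\n</tr>"
        )
    return rows


rows = render_split_rows(
    {78: "<p>first</p>", 84: "<p>last</p>"}, [78, 79, 83, 84]
)
```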

Affected Export Functions

| Function | Behavior |
| --- | --- |
| `export_to_doctags()` | Generates `<page_break>` tokens for skipped pages |
| `export_to_markdown(page_break_placeholder=str)` | Generates placeholder for skipped pages |
| `export_to_html(split_page_view=True)` | Renders skipped pages as empty rows in split view |

Design Decision

When page breaks are enabled, breaks are now generated for all page transitions, including pages that failed to parse. This is the default behavior without additional parameters, as it provides the most intuitive result: users can detect missing pages in the output. If maintainers prefer to make this configurable, a skip_missing_page_breaks parameter could be added to CommonParams in a follow-up PR.
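
One way a consumer might use this in post-processing is sketched below: split the Markdown export on the placeholder and flag segments with no content. `empty_page_segments` is a hypothetical helper, not part of docling-core. The indices are zero-based segment positions, not page numbers, and an empty segment could also be a genuinely blank page.

```python
from typing import List


def empty_page_segments(markdown: str, placeholder: str) -> List[int]:
    """Return zero-based indices of segments that contain only whitespace."""
    segments = markdown.split(placeholder)
    return [i for i, seg in enumerate(segments) if not seg.strip()]


placeholder = "---PAGE BREAK---"
exported = "\n\n".join(
    [
        "content of first page",
        placeholder,
        placeholder,
        placeholder,
        "content of last page",
    ]
)
empty = empty_page_segments(exported, placeholder)
```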

Example

export_to_doctags()

Before (page 78, 79, 83, 84):

<doctag>
... content ...      78page
<page_break>
... content ...      84page
</doctag>

After (page 78, 79, 83, 84):

<doctag>
... content ...      78page
<page_break>
<page_break>  (79, 83page)
<page_break>
... content ...      84page
</doctag>

export_to_markdown(page_break_placeholder="---PAGE BREAK---")

Before (page 78, 79, 83, 84):

... content ...      78page

---PAGE BREAK---

... content ...      84page

After (page 78, 79, 83, 84):

... content ...      78page

---PAGE BREAK---

---PAGE BREAK---

---PAGE BREAK---

... content ...      84page

export_to_html(split_page_view=True)

Before (page 78, 79, 83, 84):

<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
... content ...      78page

</div>
</td>
</tr>
<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
... content ...      84page

</div>
</td>
</tr>

After (page 78, 79, 83, 84):

<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
... content ...      78page

</div>
</td>
</tr>
<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
                     (79page)

</div>
</td>
</tr>
<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
                     (83page)

</div>
</td>
</tr>
<tr>
<td>
<figure>no page-image found</figure>
</td>
<td>
<div class='page'>
... content ...      84page

</div>
</td>
</tr>

Testing

Unit tests for this feature are available in test/test_page_break_skipped_pages.py. Run the tests with:

pytest test/test_page_break_skipped_pages.py -v

Issue resolved by this Pull Request:

Resolves #472

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
New and existing unit tests pass locally with my changes

mergify · 2026-01-08T07:30:50Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

dosubot · 2026-01-08T07:30:52Z

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

How can I use granite-docling to process all PDFs in a directory and output doctags?

github-actions · 2026-01-08T07:33:37Z

✅ DCO Check Passed

Thanks @jhchoi1182, all your commits are properly signed off. 🎉

Copilot

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

docling_core/transforms/serializer/common.py

jhchoi1182 · 2026-01-16T14:05:00Z

After merging the main branch, I applied ruff checks to my changes.
Note that running ruff check --fix and ruff format also modifies 14 other files, but I only applied it to my changes as I think it would be better to handle those separately.

jhchoi1182 · 2026-01-23T07:49:03Z

Hi @dolfim-ibm, checking in on this PR. Is there anything else needed from my side to get this merged? I'm happy to address any feedback or conflicts.

ezphyki · 2026-01-30T07:47:25Z

@vagenas

Please process this matter quickly.

jhchoi1182 · 2026-01-31T17:57:29Z

I found that my previous PR code conflicted with the .filter() method.

To resolve this, I've submitted a separate PR to the docling repository that adds code to track failed pages by adding them to DoclingDocument.pages. This companion PR in docling-core modifies _yield_page_breaks() to only generate page breaks for pages present in the pages dict.

How it works:

- docling: _add_failed_pages_to_document() method adds failed/skipped pages to DoclingDocument.pages with their size information (empty content)
- docling-core: _yield_page_breaks() now takes page_numbers parameter (extracted from doc.pages.keys()) and only generates page breaks for pages in that set

This approach ensures:

✅ Failed pages get proper page break markers (pages are in pages dict)
✅ filter() method continues to work correctly (excluded pages are removed from pages dict, so no spurious page breaks)
✅ Filter + failed page scenario works (e.g., filtering {2,3,5} with page 3 failing still generates correct breaks)
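
The revised behavior can be sketched as follows. As before, `PageBreak` and `yield_page_breaks` are hypothetical stand-ins rather than the PR's code. The only assumption taken from the comment above is that breaks are emitted solely for page numbers present in the pages dict.

```python
from dataclasses import dataclass
from typing import AbstractSet, Iterator


@dataclass(frozen=True)
class PageBreak:
    """Hypothetical stand-in for the PR's _PageBreakNode."""

    prev_page: int
    next_page: int


def yield_page_breaks(
    prev_page: int, next_page: int, page_numbers: AbstractSet[int]
) -> Iterator[PageBreak]:
    """Yield breaks only for pages that exist in page_numbers."""
    last = prev_page
    for page in range(prev_page + 1, next_page + 1):
        # Pages removed by filter() are absent from the set and are skipped.
        if page in page_numbers:
            yield PageBreak(prev_page=last, next_page=page)
            last = page


# Filter keeps pages {2, 3, 5}; page 3 failed but is still in the pages dict.
filtered = list(yield_page_breaks(2, 5, {2, 3, 5}))
# Pages 2 and 3 were filtered out entirely, so only one break remains.
excluded = list(yield_page_breaks(1, 4, {1, 4}))
```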

Related PR: docling-project/docling#2939

When pages fail to parse and are missing from the document, the serializer now generates page_break markers for each skipped page number instead of only for the next available page. This ensures users can detect when pages were skipped during PDF parsing by checking for non-consecutive page breaks.

Affects: export_to_doctags(), export_to_html(), export_to_markdown()

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

Add comprehensive test coverage for the _yield_page_breaks() function that generates individual page breaks for each skipped page.

Tests added:
- Document page count verification (normal, 1-page skipped, 2-pages skipped)
- DocTags page break count for all scenarios
- Markdown page break count for all scenarios
- Edge cases: skipping 1 page (1->3) vs multiple pages (1->4)

Test data files added:
- normal_4pages.json: 4-page document with all pages present
- skipped_1page.json: 3-page document with page 2 missing (pages 1, 3)
- skipped_2pages.json: 4-page document with pages 2, 3 missing (pages 1, 4)

Signed-off-by: jhchoi1182
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

Render pages that failed parsing as empty rows in SPLIT_PAGE mode, maintaining page numbering consistency and allowing users to detect missing pages in the output. Signed-off-by: Jihyeon Choe <choejihyeon@example.com> Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

Verify _yield_page_breaks() generates page breaks for non-consecutive pages and that export functions (doctags, markdown, html split view) correctly handle skipped pages in their output. Signed-off-by: Jihyeon Choe <choejihyeon@example.com> Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

Only generate page breaks for pages present in DoclingDocument.pages dict. This enables proper page break markers for failed pages (added by docling) while maintaining compatibility with filter() method (which removes pages).

Changes:
- Add page_numbers parameter to _yield_page_breaks() function
- Extract page_numbers from doc.pages.keys() in _iterate_items()
- Update test data to include failed pages in pages dict
- Update test expectations for new behavior

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

jhchoi1182 · 2026-02-19T04:53:49Z

@cau-git

I've rebased on main to clean up the messy commit history.

With the companion docling PR (docling-project/docling#2939) now merged into main, failed pages are guaranteed to be present in doc.pages. This allowed me to simplify the split_page_view logic in html.py by removing defensive code that handled missing pages.

I'd appreciate feedback on whether this overall approach is the right direction. Happy to adjust if needed.

… pages

Since the companion docling PR (_add_failed_pages_to_document) ensures failed pages are always present in doc.pages, the defensive code for handling missing pages in split_page_view is no longer needed.

- Remove all_physical_pages tracking (doc.pages.keys() is the source of truth)
- Remove is_skipped_page branching (failed pages exist in doc.pages with image=None)
- Use sorted(doc.pages.keys()) directly for pages_to_render

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

Copilot AI review requested due to automatic review settings January 8, 2026 07:30

jhchoi1182 force-pushed the fix/page-break-for-skipped-pages branch from b531c27 to 8c66d28 Compare January 8, 2026 07:37

Copilot started reviewing on behalf of jhchoi1182 January 8, 2026 07:37

Copilot AI reviewed Jan 8, 2026

jhchoi1182 force-pushed the fix/page-break-for-skipped-pages branch 4 times, most recently from c8c2c99 to a6102f0 Compare January 12, 2026 05:46

ezphyki mentioned this pull request Jan 30, 2026

Inquiry regarding the status of issue . docling-project/docling#2931

Open

jhchoi1182 mentioned this pull request Jan 31, 2026

fix: add failed pages to DoclingDocument for page break consistency docling-project/docling#2939

Merged

jhchoi1182 added 10 commits February 19, 2026 11:18

docs: fix misleading example in _yield_page_breaks docstring

3051eba

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

fix: duplicate page breaks for groups starting on new pages

bf5a388

Update prev_page_nr after processing ListGroup/InlineGroup to prevent the same page transition from being detected twice. Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

fix: show 'no page-image found' for skipped pages in split page view

5b1c0e7

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

style: apply black formatting

c7c5d33

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

style: apply ruff formatting to changed files

e32fbf3

Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>

jhchoi1182 force-pushed the fix/page-break-for-skipped-pages branch from e4766e5 to 2d52ca9 Compare February 19, 2026 02:22

jhchoi1182 force-pushed the fix/page-break-for-skipped-pages branch from be24bba to 4d5f231 Compare February 19, 2026 04:55

jhchoi1182 wants to merge 11 commits into docling-project:main from jhchoi1182:fix/page-break-for-skipped-pages
