fix: generate page_break for skipped pages in export functions#466
fix: generate page_break for skipped pages in export functions#466jhchoi1182 wants to merge 11 commits intodocling-project:mainfrom
Conversation
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
Related Documentation 1 document(s) may need updating based on files changed in this PR: Docling |
|
✅ DCO Check Passed Thanks @jhchoi1182, all your commits are properly signed off. 🎉 |
b531c27 to
8c66d28
Compare
There was a problem hiding this comment.
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
c8c2c99 to
a6102f0
Compare
|
After merging the main branch, I applied ruff checks to my changes. |
|
Hi @dolfim-ibm, checking in on this PR. Is there anything else needed from my side to get this merged? I'm happy to address any feedback or conflicts. |
|
Please process this matter quickly. |
|
I found that my previous PR code conflicted with the To resolve this, I've submitted a separate PR to the docling repository that adds code to track failed pages by adding them to How it works:
This approach ensures:
Related PR: docling-project/docling#2939 |
When pages fail to parse and are missing from the document, the serializer now generates page_break markers for each skipped page number instead of only for the next available page. This ensures users can detect when pages were skipped during PDF parsing by checking for non-consecutive page breaks. Affects: export_to_doctags(), export_to_html(), export_to_markdown() Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Update prev_page_nr after processing ListGroup/InlineGroup to prevent the same page transition from being detected twice. Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Add comprehensive test coverage for the _yield_page_breaks() function that generates individual page breaks for each skipped page. Tests added: - Document page count verification (normal, 1-page skipped, 2-pages skipped) - DocTags page break count for all scenarios - Markdown page break count for all scenarios - Edge cases: skipping 1 page (1->3) vs multiple pages (1->4) Test data files added: - normal_4pages.json: 4-page document with all pages present - skipped_1page.json: 3-page document with page 2 missing (pages 1, 3) - skipped_2pages.json: 4-page document with pages 2, 3 missing (pages 1, 4) Signed-off-by: jhchoi1182 Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Render pages that failed parsing as empty rows in SPLIT_PAGE mode, maintaining page numbering consistency and allowing users to detect missing pages in the output. Signed-off-by: Jihyeon Choe <choejihyeon@example.com> Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Verify _yield_page_breaks() generates page breaks for non-consecutive pages and that export functions (doctags, markdown, html split view) correctly handle skipped pages in their output. Signed-off-by: Jihyeon Choe <choejihyeon@example.com> Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
Only generate page breaks for pages present in DoclingDocument.pages dict. This enables proper page break markers for failed pages (added by docling) while maintaining compatibility with filter() method (which removes pages). Changes: - Add page_numbers parameter to _yield_page_breaks() function - Extract page_numbers from doc.pages.keys() in _iterate_items() - Update test data to include failed pages in pages dict - Update test expectations for new behavior Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
e4766e5 to
2d52ca9
Compare
|
I've rebased on main to clean up the messy commit history. With the companion docling PR (docling-project/docling#2939) now merged into main, failed pages are guaranteed to be present in I'd appreciate feedback on whether this overall approach is the right direction. Happy to adjust if needed. |
… pages Since the companion docling PR (_add_failed_pages_to_document) ensures failed pages are always present in doc.pages, the defensive code for handling missing pages in split_page_view is no longer needed. - Remove all_physical_pages tracking (doc.pages.keys() is the source of truth) - Remove is_skipped_page branching (failed pages exist in doc.pages with image=None) - Use sorted(doc.pages.keys()) directly for pages_to_render Signed-off-by: jhchoi1182 <jhchoi1182@gmail.com>
be24bba to
4d5f231
Compare
Description
Currently, pages that fail during PDF parsing are silently skipped. Taking the test.pdf attached to the issue (composed of pages 78, 79, 83, and 84) as an example, pages 79 and 83 failed to parse, causing page 78 to be followed immediately by page 84. Previously, this resulted in the generation of only a single
<page_break>for page 84. This made it impossible for users to detect that pages 79 and 83 were missing, leading to potential issues during post-processing.This PR modifies the
_iterate_items()function to generate individual<page_break>markers for each skipped page when parsing failures result in non-consecutive page transitions. For instance, when jumping from page 78 to 84, the function now generates page breaks for the missing pages (79, 83) in addition to page 84, ensuring all skipped pages are explicitly marked.Changes
common.py:
_PageBreakNodeclass to represent page break nodes withprev_pageandnext_pagefields_yield_page_breaks()helper function that generates page break nodes for each page in a non-consecutive range_iterate_items()to use the new helper, ensuring all skipped pages get their own<page_break>markerhtml.py:
serialize_doc()to handle skipped pages inSPLIT_PAGEmodeAffected Export Functions
export_to_doctags()<page_break>tokens for skipped pagesexport_to_markdown(page_break_placeholder=str)export_to_html(split_page_view=True)Design Decision
When page breaks are enabled, breaks are now generated for all page transitions, including pages that failed to parse. This is the default behavior without additional parameters, as it provides the most intuitive result: users can detect missing pages in the output. If maintainers prefer to make this configurable, a
skip_missing_page_breaksparameter could be added toCommonParamsin a follow-up PR.Example
export_to_doctags()Before (page 78, 79, 83, 84):
After (page 78, 79, 83, 84):
export_to_markdown(page_break_placeholder="---PAGE BREAK---")Before (page 78, 79, 83, 84):
After (page 78, 79, 83, 84):
export_to_html(split_page_view=True)Before (page 78, 79, 83, 84):
After (page 78, 79, 83, 84):
Testing
Unit tests for this feature are available in
test/test_page_break_skipped_pages.py. Run the tests with:Issue resolved by this Pull Request:
Resolves #472
Checklist: