feat(cli): add --page-break-placeholder option for Markdown and Text exports #3184
…exports Expose the existing `page_break_placeholder` parameter from the Python API (`save_as_markdown`) as a CLI option. When set, the specified string is inserted between pages in Markdown and Text outputs. Closes docling-project#3175
Related Documentation: 1 document(s) may need updating based on files changed in this PR: Docling Content Layers
@@ -109,6 +109,15 @@
Currently, the ability to include furniture in exports is only available via the Python API. The `docling-serve` API and CLI exports do not support specifying content layers and will always export with the default (BODY only).
+However, the CLI does support the `page_break_placeholder` parameter for Markdown and Text exports. You can specify a custom page break placeholder string by passing the `--page-break-placeholder` option to the `docling` command:
+
+```bash
+docling my_document.pdf --to md --page-break-placeholder "---"
+docling my_document.pdf --to txt --page-break-placeholder "<!-- page-break -->"
+```
+
+When set, the specified string is inserted between pages in the output, allowing CLI users to control page break formatting in both Markdown and Text exports.
+
## Customization and Post-processing
Headers and footers are detected automatically by Docling’s layout model for `.docx` files. There is currently no rule-based mechanism to customize their detection during processing. However, you can manually remove or further process these elements after extraction if needed.
❌ DCO Check Failed
Hi @Krishnachaitanyakc, your pull request has failed the Developer Certificate of Origin (DCO) check. This repository supports remediation commits, so you can fix this without rewriting history, but you must follow the required message format.

🛠 Quick Fix: Add a remediation commit
Run this command:

    git commit --allow-empty -s -m "DCO Remediation Commit for Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
    I, Krishna Chaitanya Balusu <krishnabkc15@gmail.com>, hereby add my Signed-off-by to this commit: 667a168882f3f169f22d50ad26215339f3e3a12b
    I, Krishna Chaitanya Balusu <krishnabkc15@gmail.com>, hereby add my Signed-off-by to this commit: 9bf673d1601bda3a44fbf4235bbcd16cb35e9b4a"
    git push

🔧 Advanced: Sign off each commit directly
For the latest commit:

    git commit --amend --signoff
    git push --force-with-lease

For multiple commits:

    git rebase --signoff origin/main
    git push --force-with-lease
Codecov Report: ✅ All modified and coverable lines are covered by tests.
@Krishnachaitanyakc Please follow the steps in #3184 (comment) to make sure your contribution is signed off.
…il.com> I, Krishna Chaitanya Balusu <krishnabkc15@gmail.com>, hereby add my Signed-off-by to this commit: 667a168 Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
Great feature! One enhancement that could be valuable: supporting dynamic page numbers in the placeholder string via format variables such as {next_page}. Currently, the placeholder is static (the same string is inserted at every page break), but a common use case is annotating pages with their actual number, e.g.:

    docling --to md --page-break-placeholder '---\n*[Page {next_page}]*\n---' input.pdf

Which would produce:

    ... content from page 1 ...
    ---
    *[Page 2]*
    ---
    ... content from page 2 ...

The docling-core serializer internally tracks page numbers in its markers. I propose the attached patch file.
Add {prev_page} and {next_page} format variables to the
--page-break-placeholder option. When these variables are present,
each page break in the output is replaced with the placeholder
formatted with the actual page numbers for that specific break.
Example usage:
docling --to md --page-break-placeholder '--- Page {next_page} ---' input.pdf
Which produces:
... content from page 1 ...
--- Page 2 ---
... content from page 2 ...
--- Page 3 ---
... content from page 3 ...
Uses a sentinel-based approach: a unique sentinel is passed to
docling-core during serialization, then post-processed to replace
each sentinel occurrence with the formatted placeholder using
sequential page numbers. Static placeholders (without format
variables) continue to work unchanged.
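
The sentinel-based approach described in this commit message could look roughly like the sketch below. It is not the code from the PR: the function name `apply_page_break_placeholder`, the sentinel format, and the joined string standing in for docling-core's serialized output are all assumptions made for illustration. It numbers breaks sequentially, as this commit describes.

```python
import uuid


def apply_page_break_placeholder(text: str, sentinel: str, template: str) -> str:
    """Replace each sentinel in `text` with `template`, numbering breaks sequentially.

    Hypothetical helper: the i-th break is assumed to sit between page i and page i + 1.
    """
    parts = text.split(sentinel)
    # Static placeholders (no format variables) are inserted unchanged.
    if "{prev_page}" not in template and "{next_page}" not in template:
        return template.join(parts)
    out = [parts[0]]
    for i, part in enumerate(parts[1:], start=1):
        # Plain replace (not str.format) so other braces in the template are left alone.
        marker = template.replace("{prev_page}", str(i)).replace("{next_page}", str(i + 1))
        out.append(marker)
        out.append(part)
    return "".join(out)


# A unique sentinel that is unlikely to collide with real document content.
sentinel = f"<!-- page-break-{uuid.uuid4().hex} -->"
# Stand-in for serialized output in which the sentinel marks each page break.
serialized = sentinel.join(["page one\n", "page two\n", "page three\n"])
result = apply_page_break_placeholder(serialized, sentinel, "--- Page {next_page} ---\n")
print(result)
```

With the three stand-in pages above, the breaks are rendered as `--- Page 2 ---` and `--- Page 3 ---`, while a template without format variables is inserted as-is at every break.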
I see you've made the modifications, that's really nice. One comment on my side: you use sequential counting instead of doc.pages. The problem is that if the document has a blank page, the numbers will be wrong. If you have a 5-page document with a blank page in the middle, the serializer only inserts a page break sentinel where there is content, so you will get:

Proposed patch:

Best regards, and thanks again for the responsiveness.
Use document item provenance to determine real page numbers instead of sequential counting. This fixes incorrect numbering when documents contain blank pages: for example, a 5-page doc with a blank page 4 now correctly produces page numbers 1, 2, 3, 5 instead of 1, 2, 3, 4.
Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
@smorand Thanks for the review and the great catch on blank page numbering. One note: rather than using doc.pages (which includes all pages the backend parsed, including blank ones), I'm using doc.iterate_items() to extract page numbers from item provenance.
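
A minimal sketch of that idea follows. It is not the PR's code. The `Prov` and `Item` classes are hypothetical stand-ins so the example runs without docling installed, and it assumes that iteration yields `(item, level)` pairs, that each item exposes a `prov` list whose entries carry a `page_no`, and that one page break is emitted between each pair of consecutive content pages.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class Prov:
    """Hypothetical stand-in for a provenance entry carrying a page number."""
    page_no: int


@dataclass
class Item:
    """Hypothetical stand-in for a document item with provenance."""
    prov: List[Prov] = field(default_factory=list)


def content_pages(items_with_level: Iterable[Tuple[object, int]]) -> List[int]:
    """Collect the sorted page numbers that carry at least one item."""
    pages = set()
    for item, _level in items_with_level:
        # Items without provenance contribute no page number.
        for prov in getattr(item, "prov", None) or []:
            pages.add(prov.page_no)
    return sorted(pages)


def break_pairs(pages: List[int]) -> List[Tuple[int, int]]:
    """Pair consecutive content pages as (prev_page, next_page), one pair per break."""
    return list(zip(pages, pages[1:]))


# A 5-page document whose page 4 is blank: no item has provenance on page 4.
items = [(Item([Prov(n)]), 0) for n in (1, 1, 2, 3, 5)]
pages = content_pages(items)
pairs = break_pairs(pages)
print(pages, pairs)
```

Because the blank page never appears in any item's provenance, the last break is labelled with pages 3 and 5 rather than 3 and 4, which is the behavior the commit above describes.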
Pulled, reviewed, and tested; looks good to me. Thank you, @Krishnachaitanyakc: I was going to work on this feature and propose it myself, so you saved me time!
Summary
- Add `--page-break-placeholder` CLI option to the `convert` command, exposing the existing `page_break_placeholder` parameter from `DoclingDocument.save_as_markdown()` to CLI users.
- When set, the specified string (e.g. `---`, `<!-- page-break -->`) is inserted between pages in Markdown and Text exports.

Closes #3175
Details
The Python API already supports `page_break_placeholder` in `save_as_markdown()` / `export_to_markdown()`, but the CLI did not expose this parameter. This change threads the option through:
- `--page-break-placeholder` typer option on the `convert` command (default: `None`, preserving current behavior)
- `export_documents` helper function
- `save_as_markdown` call sites (Markdown export and Text export)

Usage
Test plan
- `test_cli_page_break_placeholder` test that verifies the CLI accepts the option and produces output
- `test_cli_convert` continues to pass (no regression without the flag)
- Existing behavior (without `--page-break-placeholder`) is unchanged since the default is `None`