feat(cli): add --page-break-placeholder option for Markdown and Text exports #3184

Open
Krishnachaitanyakc wants to merge 4 commits into docling-project:main from Krishnachaitanyakc:feat/cli-page-break-placeholder

@Krishnachaitanyakc

Summary

  • Adds a new --page-break-placeholder CLI option to the convert command, exposing the existing page_break_placeholder parameter from DoclingDocument.save_as_markdown() to CLI users.
  • When set, the specified string (e.g., ---, <!-- page-break -->) is inserted between pages in Markdown and Text exports.
  • Includes a test for the new CLI option.

Closes #3175

Details

The Python API already supports page_break_placeholder in save_as_markdown() / export_to_markdown(), but the CLI did not expose this parameter. This change threads the option through:

  1. A new --page-break-placeholder typer option on the convert command (default: None, preserving current behavior)
  2. The export_documents helper function
  3. Both save_as_markdown call sites (Markdown export and Text export)
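
The three steps above can be sketched roughly as follows. This is not the PR's code: it uses `argparse` from the standard library as a stand-in for Typer, and `StubDocument` is a hypothetical substitute for `DoclingDocument` that records what would be written instead of touching the file system. The signature of `export_documents` is also assumed for illustration.

```python
import argparse
from pathlib import Path
from typing import Dict, Iterable, List, Optional


class StubDocument:
    """Hypothetical stand-in for DoclingDocument: one string per page."""

    def __init__(self, name: str, pages: List[str]) -> None:
        self.name = name
        self.pages = pages
        self.written: Dict[str, str] = {}  # file name -> exported text

    def save_as_markdown(
        self,
        filename: Path,
        strict_text: bool = False,  # accepted but ignored by this stub
        page_break_placeholder: Optional[str] = None,
    ) -> None:
        # Without a placeholder, pages are separated by a blank line only.
        if page_break_placeholder is None:
            separator = "\n\n"
        else:
            separator = f"\n\n{page_break_placeholder}\n\n"
        self.written[Path(filename).name] = separator.join(self.pages)


def export_documents(
    docs: Iterable[StubDocument],
    output_dir: Path,
    export_md: bool,
    export_txt: bool,
    page_break_placeholder: Optional[str] = None,
) -> None:
    """Pass the placeholder to both save_as_markdown call sites."""
    for doc in docs:
        if export_md:
            doc.save_as_markdown(
                output_dir / f"{doc.name}.md",
                page_break_placeholder=page_break_placeholder,
            )
        if export_txt:
            doc.save_as_markdown(
                output_dir / f"{doc.name}.txt",
                strict_text=True,
                page_break_placeholder=page_break_placeholder,
            )


def build_parser() -> argparse.ArgumentParser:
    """CLI stand-in: the option defaults to None, so behavior is unchanged."""
    parser = argparse.ArgumentParser(prog="convert-sketch")
    parser.add_argument("--page-break-placeholder", default=None)
    return parser
```

Because the default is `None` at every layer, omitting the flag reaches `save_as_markdown` exactly as before, which is why the existing tests are expected to pass unchanged.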

Usage

docling my_document.pdf --to md --page-break-placeholder "---"
docling my_document.pdf --to md --page-break-placeholder "<!-- page-break -->"

Test plan

  • Added test_cli_page_break_placeholder test that verifies the CLI accepts the option and produces output
  • Existing test_cli_convert continues to pass (no regression without the flag)
  • Default behavior (no --page-break-placeholder) is unchanged since the default is None

…exports

Expose the existing `page_break_placeholder` parameter from the Python API
(`save_as_markdown`) as a CLI option. When set, the specified string is
inserted between pages in Markdown and Text outputs.

Closes docling-project#3175
@mergify

mergify bot commented Mar 25, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
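
As a quick local check, the title rule above can be tried with Python's `re` module. This is only a sketch: it assumes the rule is applied as an ordinary regular-expression search against the PR title, and `is_conventional_title` is a hypothetical helper name.

```python
import re

# The pattern is copied verbatim from the merge-protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None
```

Under that assumption, this PR's title matches because it begins with the type `feat`, the optional scope `(cli)`, and a colon.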

@dosubot

dosubot bot commented Mar 25, 2026

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

Content Layers
View Suggested Changes
@@ -109,6 +109,15 @@
 
 Currently, the ability to include furniture in exports is only available via the Python API. The `docling-serve` API and CLI exports do not support specifying content layers and will always export with the default (BODY only).
 
+However, the CLI does support the `page_break_placeholder` parameter for Markdown and Text exports. You can specify a custom page break placeholder string when using the `docling convert` command with the `--page-break-placeholder` option:
+
+```bash
+docling my_document.pdf --to md --page-break-placeholder "---"
+docling my_document.pdf --to txt --page-break-placeholder "<!-- page-break -->"
+```
+
+When set, the specified string is inserted between pages in the output, allowing CLI users to control page break formatting in both Markdown and Text exports.
+
 ## Customization and Post-processing
 
 Headers and footers are detected automatically by Docling’s layout model for `.docx` files. There is currently no rule-based mechanism to customize their detection during processing. However, you can manually remove or further process these elements after extraction if needed.

dolfim-ibm
dolfim-ibm previously approved these changes Mar 30, 2026
Member

@dolfim-ibm dolfim-ibm left a comment

lgtm

@github-actions
Contributor

github-actions bot commented Mar 30, 2026

DCO Check Failed

Hi @Krishnachaitanyakc, your pull request has failed the Developer Certificate of Origin (DCO) check.

This repository supports remediation commits, so you can fix this without rewriting history — but you must follow the required message format.


🛠 Quick Fix: Add a remediation commit

Run this command:

git commit --allow-empty -s -m "DCO Remediation Commit for Krishna Chaitanya Balusu <krishnabkc15@gmail.com>

I, Krishna Chaitanya Balusu <krishnabkc15@gmail.com>, hereby add my Signed-off-by to this commit: 667a168882f3f169f22d50ad26215339f3e3a12b
I, Krishna Chaitanya Balusu <krishnabkc15@gmail.com>, hereby add my Signed-off-by to this commit: 9bf673d1601bda3a44fbf4235bbcd16cb35e9b4a"
git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

git commit --amend --signoff
git push --force-with-lease

For multiple commits:

git rebase --signoff origin/main
git push --force-with-lease

More info: DCO check report

@codecov

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@dolfim-ibm
Member

@Krishnachaitanyakc Please follow the steps in #3184 (comment) to make sure your contribution is signed-off.

…il.com>

I, Krishna Chaitanya Balusu <krishnabkc15@gmail.com>, hereby add my Signed-off-by to this commit: 667a168

Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
@smorand

smorand commented Apr 1, 2026

Great feature! One enhancement that could be valuable: supporting dynamic page numbers in the placeholder string via {prev_page} and {next_page} format variables.

Currently, the placeholder is static (the same string is inserted at every page break). But a common use case is annotating pages with their actual number, e.g.:

docling --to md --page-break-placeholder '---\n*[Page {next_page}]*\n---' input.pdf

Which would produce:

... content from page 1 ...

---
*[Page 2]*
---

... content from page 2 ...

The docling-core serializer internally tracks page numbers in its markers (#_#_DOCLING_DOC_PAGE_BREAK_{prev}_{next}_#_#) but discards them during replacement. This could be handled in the CLI layer with a small helper that uses a sentinel placeholder, then post-processes the output to format each break with the correct page numbers from the document's page list.
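
To illustrate what is lost when the markers are replaced, here is a minimal sketch that takes the marker format quoted above at face value and fills `{prev_page}` and `{next_page}` from the numbers embedded in each marker. The function name `fill_from_markers` is hypothetical, and this is not how docling-core or the attached patch does it; whatever numbers the marker carries are used verbatim.

```python
import re

# Marker shape as quoted in the comment above; the two groups are the
# page numbers on either side of the break.
MARKER = re.compile(r"#_#_DOCLING_DOC_PAGE_BREAK_(\d+)_(\d+)_#_#")


def fill_from_markers(text: str, placeholder: str) -> str:
    """Replace each marker with the placeholder, filling in its page numbers."""

    def render(match: "re.Match[str]") -> str:
        prev_page, next_page = match.group(1), match.group(2)
        # str.replace avoids str.format, which would fail on stray braces.
        return placeholder.replace("{prev_page}", prev_page).replace(
            "{next_page}", next_page
        )

    return MARKER.sub(render, text)
```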

I have attached a proposed patch:
0001-feat-cli-add-prev_page-and-next_page-format-variable.patch

  Add {prev_page} and {next_page} format variables to the
  --page-break-placeholder option. When these variables are present,
  each page break in the output is replaced with the placeholder
  formatted with the actual page numbers for that specific break.

  Example usage:
    docling --to md --page-break-placeholder '--- Page {next_page} ---' input.pdf

  Which produces:
    ... content from page 1 ...
    --- Page 2 ---
    ... content from page 2 ...
    --- Page 3 ---
    ... content from page 3 ...

  Uses a sentinel-based approach: a unique sentinel is passed to
  docling-core during serialization, then post-processed to replace
  each sentinel occurrence with the formatted placeholder using
  sequential page numbers. Static placeholders (without format
  variables) continue to work unchanged.
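
The sentinel-based approach described in this commit message can be sketched as below. The sentinel value and the names `has_format_variables` and `apply_dynamic_placeholder` are hypothetical; the real patch may differ. The sketch assumes the exported text was produced with the sentinel passed as the static placeholder and that pages are numbered sequentially starting at `first_page`.

```python
# Hypothetical sentinel; any string unlikely to occur in real content works.
SENTINEL = "#_#_SKETCH_PAGE_BREAK_#_#"


def has_format_variables(placeholder: str) -> bool:
    """Static placeholders skip the post-processing step entirely."""
    return "{prev_page}" in placeholder or "{next_page}" in placeholder


def apply_dynamic_placeholder(
    text: str,
    placeholder: str,
    sentinel: str = SENTINEL,
    first_page: int = 1,
) -> str:
    """Replace the n-th sentinel with the placeholder for break n -> n + 1."""
    parts = text.split(sentinel)
    out = [parts[0]]
    for offset, part in enumerate(parts[1:]):
        prev_page = first_page + offset
        rendered = placeholder.replace("{prev_page}", str(prev_page)).replace(
            "{next_page}", str(prev_page + 1)
        )
        out.append(rendered)
        out.append(part)
    return "".join(out)
```

Sequential counting is simple, but it assumes every page produces a break, which does not hold when a page has no content.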
@smorand

smorand commented Apr 2, 2026

Hi @Krishnachaitanyakc,

I see you've made the modifications, which is really nice.

One comment on my side: you use sequential counting instead of doc.pages. If the document contains a blank page, the numbering will be wrong.

Take a 5-page document whose page 4 is blank: because the serializer only inserts a page break sentinel when there is content, you will get:
1, 2, 3, 4
instead of
1, 2, 3, 5

Proposed patch:
fix-dynamic-page-break-numbering.patch

Best regards, and thanks again for the quick response.

Use document item provenance to determine real page numbers instead of
sequential counting. This fixes incorrect numbering when documents
contain blank pages — e.g. a 5-page doc with a blank page 4 now
correctly produces page numbers 1, 2, 3, 5 instead of 1, 2, 3, 4.

Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com>
@Krishnachaitanyakc
Author

@smorand Thanks for the review and the great catch on blank page numbering.

One note: rather than using doc.pages (which includes all pages the backend parsed, including blank ones), I'm using doc.iterate_items() to extract page numbers from item provenance.
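
A minimal sketch of the provenance-based numbering follows. `StubProv` and `StubItem` are hypothetical stand-ins for the items yielded by `doc.iterate_items()`, assumed here to carry a list of provenance entries with a `page_no` field; the sentinel and function names are also illustrative, not the PR's code.

```python
from dataclasses import dataclass, field
from typing import Iterable, List

SENTINEL = "#_#_SKETCH_PAGE_BREAK_#_#"  # hypothetical sentinel string


@dataclass
class StubProv:
    page_no: int  # page on which the item appears


@dataclass
class StubItem:
    prov: List[StubProv] = field(default_factory=list)


def pages_with_content(items: Iterable[StubItem]) -> List[int]:
    """Sorted page numbers that have at least one item; blank pages drop out."""
    return sorted({prov.page_no for item in items for prov in item.prov})


def number_breaks(
    text: str, placeholder: str, pages: List[int], sentinel: str = SENTINEL
) -> str:
    """Fill each sentinel with the real page numbers on either side of it."""
    parts = text.split(sentinel)
    breaks = list(zip(pages, pages[1:]))  # consecutive content pages
    if len(parts) - 1 != len(breaks):
        raise ValueError("sentinel count does not match the page transitions")
    out = [parts[0]]
    for (prev_page, next_page), part in zip(breaks, parts[1:]):
        out.append(
            placeholder.replace("{prev_page}", str(prev_page)).replace(
                "{next_page}", str(next_page)
            )
        )
        out.append(part)
    return "".join(out)
```

For the 5-page example with a blank page 4, the content pages are 1, 2, 3, 5, so the last break is rendered with `{prev_page}` = 3 and `{next_page}` = 5.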

@smorand

smorand commented Apr 2, 2026

Pulled, reviewed, and tested; it looks good to me. Thank you, @Krishnachaitanyakc: I was about to work on and propose this feature, so you saved me time!

Development

Successfully merging this pull request may close these issues.

Add a page break parameter to the Docling CLI when exporting markdown

3 participants