Heading from pdf #1170

roninio · 2025-03-15T13:07:04Z

roninio
Mar 15, 2025

I have a PDF document containing headings of various sizes and styles. After converting the PDF to a Markdown file using Docling, I've noticed that all headings are uniformly converted to level-2 headings (##), regardless of their original size or importance in the PDF.

I would like to know how to properly configure Docling, or if there's an alternative method, to accurately represent the original heading hierarchy from the PDF in the resulting Markdown file. Specifically, I need the Markdown headings to reflect the relative size and importance of the headings in the original PDF (e.g., larger headings should become #, smaller headings ###, etc.).

Could you please provide information on how to achieve this? Thank you.

kaumnen · 2025-04-21T20:51:36Z

kaumnen
Apr 21, 2025

PyMuPDF has a get_toc [0] method that returns a table of contents

a quick example of returned ToC with simple=True (more info in the linked docs):

[
    [1, "Heading 1", 22],
    [2, "Heading 2", 33],
    [3, "Heading 3", 44],
    [4, "Another heading 4", 55]
]

element format:
[<heading level>, <heading text>, <page it's on>]

what you can do is:

1. convert the pdf and .export_to_markdown() [1]
2. open the same pdf file with PyMuPDF, extract ToC
3. go through the markdown, check and update (add or remove) # where needed based on the <heading level> from the ToC
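
The three steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested recipe: `apply_toc_levels` and `convert_with_toc` are hypothetical helper names, headings are matched to ToC entries by normalized title text (which can fail when the extracted heading differs from the bookmark title), and `import pymupdf` assumes a recent PyMuPDF release (older releases use `import fitz`).

```python
import re


def _normalize(title: str) -> str:
    """Lowercase and collapse whitespace so small extraction differences still match."""
    return re.sub(r"\s+", " ", title).strip().lower()


def apply_toc_levels(markdown: str, toc: list) -> str:
    """Rewrite Markdown heading levels using [level, title, page] ToC entries.

    Headings whose text is not found in the ToC are left unchanged.
    """
    levels = {}
    for level, title, *_ in toc:
        levels.setdefault(_normalize(title), level)  # first occurrence wins

    out = []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*?)\s*$", line)
        if m:
            level = levels.get(_normalize(m.group(2)))
            if level is not None:
                # Markdown only supports six heading levels.
                line = "#" * min(level, 6) + " " + m.group(2)
        out.append(line)
    return "\n".join(out)


def convert_with_toc(pdf_path: str) -> str:
    """Glue code for steps 1-3; requires `docling` and `pymupdf` to be installed."""
    import pymupdf
    from docling.document_converter import DocumentConverter

    markdown = DocumentConverter().convert(pdf_path).document.export_to_markdown()
    doc = pymupdf.open(pdf_path)
    toc = doc.get_toc(simple=True)  # empty list if the PDF has no outline
    doc.close()
    return apply_toc_levels(markdown, toc)
```

Note that the sketch treats every line starting with `#` as a heading, so `#` comment lines inside code blocks in the Markdown would need extra handling.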

[0]: https://pymupdf.readthedocs.io/en/latest/document.html#Document.get_toc

[1]: https://docling-project.github.io/docling/reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_markdown

0 replies

PeterStaar-IBM · 2025-04-22T06:54:52Z

PeterStaar-IBM
Apr 22, 2025
Maintainer

@roninio Note that docling-parse can also provide the TOC and populate it (see here: https://github.com/docling-project/docling-core/blob/763e1364ff0b95388696ccd3d69f150718012a3a/docling_core/types/doc/page.py#L463).

We plan to propagate this info and use it to improve the heading tree.

2 replies

JohannKaspar May 25, 2025

Looking forward to this feature!

mrtj Sep 23, 2025

Both @kaumnen’s approach using PyMuPDF and @PeterStaar-IBM’s suggestion in docling-parse can extract a TOC only if the PDF already contains one, saved as a “hierarchical outline tree” (an optional PDF feature). If the PDF was born digital and the tool that created it supported outlines and saved them to the file, you’re good to go. That typically isn’t the case for scanned pages, or even for born-digital PDFs created without an outline. Docling does recognize headers by other means, but it classifies all of them as second level. Is extending the header detector to support multiple levels on your roadmap?

krrome · 2025-09-26T07:41:12Z

krrome
Sep 26, 2025

Hi @mrtj and @roninio, I have been working on this topic. I'm hoping to open a Docling PR for it, but because integrating the solution is complex, I decided to make a small package, docling-hierarchical-pdf, that is tailored to work with Docling and adds inference of PDF hierarchies. It works with scanned PDFs as well as text-based PDFs. At the moment:

- It is still at an early stage, but please do give it a try and give feedback. I ran a bunch of tests and was happy with the performance. The limitation, if any, seems to be more on the Docling document parsing side.
- I am inferring document hierarchy based on header numbering and header styles. Next I will add PDF bookmark support along the lines of my approach in this issue: Identify table of contents for better chunking Hierarchy Identification #287 (comment)

If there is interest from the maintainers in integrating my solution into Docling, I am happy to work on that too.
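
To illustrate what inferring hierarchy from header numbering can look like, here is a minimal sketch. It is not taken from docling-hierarchical-pdf; `level_from_numbering` is a hypothetical helper, and it assumes headings carry dotted decimal numbers such as "2.3.1".

```python
import re

# A leading dotted number ("2", "2.3", "2.3.1."), then whitespace and some text.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def level_from_numbering(heading: str):
    """Return the depth implied by a leading section number, or None if there is none."""
    m = _NUMBERING.match(heading)
    if m is None:
        return None
    # "2" -> level 1, "2.3" -> level 2, "2.3.1" -> level 3
    return m.group(1).count(".") + 1
```

A heading that merely starts with a number (for example "2024 Annual Report") would be misread as level 1, which is one reason to combine numbering with style cues.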

4 replies

PeterStaar-IBM Sep 26, 2025
Maintainer

Yes, @krrome, we definitely need to look into this!

krrome Oct 8, 2025

Update: v0.1.0 of docling-hierarchical-pdf now also attempts to use the TOC (aka bookmarks) from the PDF metadata, if there is one. Otherwise it falls back to using style-based inference. Please give it a try and report issues if there is any problem.
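
The "TOC first, style-based fallback" idea can be sketched as below. This is an illustration only, not the package's code: `heading_levels` and `levels_from_font_sizes` are hypothetical names, and the input is assumed to be a list of (text, font size) pairs that some earlier extraction step provides.

```python
def levels_from_font_sizes(headings):
    """Rank distinct font sizes: the largest size becomes level 1, the next level 2, ..."""
    sizes = sorted({size for _, size in headings}, reverse=True)
    rank = {size: i + 1 for i, size in enumerate(sizes)}
    return [(text, min(rank[size], 6)) for text, size in headings]


def heading_levels(headings, toc):
    """Prefer levels from the PDF outline; fall back to font-size ranking.

    `toc` uses the [level, title, page] format; an empty list means no outline.
    """
    inferred = levels_from_font_sizes(headings)
    if not toc:
        return inferred
    by_title = {title.strip().lower(): level for level, title, *_ in toc}
    # Headings missing from the outline keep their inferred level.
    return [(text, by_title.get(text.strip().lower(), level)) for text, level in inferred]
```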

Sirnii Jan 14, 2026

Yes, @krrome, we definitely need to look into this!

Hi!
Is this anywhere close on your list? :)

Thanks for the great job,
K

filip-komarzyniec Apr 28, 2026

Hi!

Is anyone actively working on incorporating docling-hierarchical-pdf into docling? Is it somewhere on the product's roadmap at all?

This seems like quite an important feature for improving the quality of information retrieval from PDF files serialized to Markdown.

PeterStaar-IBM · 2026-04-28T12:24:12Z

PeterStaar-IBM
Apr 28, 2026
Maintainer

@filip-komarzyniec It is currently being done at the docling-agent level.

0 replies

zelinyang-create · 2026-05-28T04:09:50Z

zelinyang-create
May 28, 2026

I also encountered this problem when preparing PDFs for RAG, where the converter could detect headings but flattened most of them into the same Markdown level.

I built a small experimental project that tries to address this with a VLM-based PDF-to-Markdown pipeline:

https://github.com/zelinyang-create/DocVisionMD

Instead of relying only on embedded PDF outlines/bookmarks, it renders pages as images, asks a VLM to infer page-level structure and heading levels, then converts pages to Markdown with post-processing for heading normalization, TOC cleanup, table-title handling, and table repair.

It is still early and not meant to replace Docling, but it may help in cases where the PDF has no reliable outline tree or where visual layout is needed to infer hierarchy.
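
For readers wondering what heading normalization as a post-processing step might involve, here is one plausible form, sketched under assumptions. It is not taken from DocVisionMD; `normalize_heading_levels` is a hypothetical helper that renumbers headings so the nesting starts at # and never skips a level.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def normalize_heading_levels(markdown: str) -> str:
    """Renumber headings by nesting depth so that no level is skipped."""
    stack = []  # original levels of the currently open ancestor headings
    out = []
    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m:
            level = len(m.group(1))
            # Close every open heading that is at the same depth or deeper.
            while stack and stack[-1] >= level:
                stack.pop()
            stack.append(level)
            # The stack holds strictly increasing levels in 1..6, so depth <= 6.
            line = "#" * len(stack) + " " + m.group(2)
        out.append(line)
    return "\n".join(out)
```

The stack keeps sibling headings at the same output level, which a simple "at most one level deeper than the previous heading" clamp would not.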

0 replies

PeterStaar-IBM · 2026-05-28T04:17:11Z

PeterStaar-IBM
May 28, 2026
Maintainer

@zelinyang-create Docling already supports VLMs natively if you use the VlmPipeline. For the TOC/outline, I would encourage everyone to start using docling-agent (e.g., with this config: https://github.com/docling-project/docling-agent/blob/main/task-configs/editor.yaml).

0 replies
