PyMuPDF has a get_toc [0] method that returns the PDF's table of contents. With simple=True, each element of the returned ToC has the format [lvl, title, page] (more info in the linked docs).
[0]: https://pymupdf.readthedocs.io/en/latest/document.html#Document.get_toc
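For illustration, here is a minimal sketch of one way to use that ToC: look up each Markdown heading by title and rewrite its level to the one from the outline. The helper names (read_toc, relevel_markdown) are hypothetical and not part of PyMuPDF or Docling, and the matching assumes the heading text in the Markdown equals the outline title apart from case and whitespace, which will not hold for every PDF.

```python
import re

# Matches ATX headings such as "## Some title".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def _norm(title: str) -> str:
    """Normalize a title for matching: collapse whitespace, ignore case."""
    return " ".join(title.split()).casefold()


def read_toc(pdf_path: str) -> list:
    """Read the outline as [lvl, title, page] entries (requires PyMuPDF)."""
    import pymupdf  # imported lazily so the rest works without PyMuPDF

    doc = pymupdf.open(pdf_path)
    try:
        return doc.get_toc(simple=True)
    finally:
        doc.close()


def relevel_markdown(markdown: str, toc: list) -> str:
    """Rewrite heading levels in `markdown` using the levels found in `toc`.

    Headings whose title is not found in the ToC are left unchanged.
    If a title occurs more than once in the ToC, the last entry wins.
    """
    levels = {_norm(title): lvl for lvl, title, _page in toc}
    out = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            lvl = levels.get(_norm(match.group(2)))
            if lvl is not None:
                # Markdown only supports heading levels 1 to 6.
                line = "#" * min(lvl, 6) + " " + match.group(2)
        out.append(line)
    return "\n".join(out)
```

Usage would be something like relevel_markdown(md_text, read_toc("file.pdf")), where md_text is the Markdown exported by the converter. This only helps when the PDF actually has an embedded outline; for PDFs without bookmarks get_toc returns an empty list and the Markdown is left as it is.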
@roninio Note that docling-parse can also provide the TOC and populate it (see here: https://github.com/docling-project/docling-core/blob/763e1364ff0b95388696ccd3d69f150718012a3a/docling_core/types/doc/page.py#L463). We plan to propagate this info and use it to improve the heading tree.
Hi @mrtj and @roninio, I have been working on this topic. I was hoping to open a Docling PR for it, but because integrating the solution is complex, I decided to make a small package, docling-hierarchical-pdf, that is tailored to work with Docling and adds inference of PDF hierarchies. It works with scanned PDFs as well as text-based PDFs.
If the maintainers are interested in integrating my solution into Docling, I am happy to work on that too.
@filip-komarzyniec It is currently being done at the docling-agent level.
I also ran into this problem when preparing PDFs for RAG: the converter could detect headings but flattened most of them to the same Markdown level. I built a small experimental project that tries to address this with a VLM-based PDF-to-Markdown pipeline: https://github.com/zelinyang-create/DocVisionMD. Instead of relying only on embedded PDF outlines/bookmarks, it renders pages as images, asks a VLM to infer page-level structure and heading levels, then converts the pages to Markdown with post-processing for heading normalization, TOC cleanup, table-title handling, and table repair. It is still early and not meant to replace Docling, but it may help in cases where the PDF has no reliable outline tree or where visual layout is needed to infer the hierarchy.
@zelinyang-create Docling already supports VLMs natively if you use the VlmPipeline. For the TOC/outline, I would encourage everyone to start using docling-agent (e.g. with this config: https://github.com/docling-project/docling-agent/blob/main/task-configs/editor.yaml).
I have a PDF document containing headings of various sizes and styles. After converting the PDF to a Markdown file using Docling, I noticed that all headings are converted to level-2 headings (##), regardless of their original size or importance in the PDF.
I would like to know how to configure Docling, or whether there is an alternative method, so that the resulting Markdown file preserves the original heading hierarchy of the PDF. Specifically, I need the Markdown headings to reflect the relative size and importance of the headings in the original PDF (e.g., larger headings should become #, smaller headings ###, etc.).
Could you please explain how to achieve this? Thank you.