feat: infer hierarchical heading levels (H1-H6) for PDFs (#4204) #4325

Open
statxc wants to merge 4 commits into
Unstructured-IO:main from
statxc:statxc/feat-pdf-heading-hierarchy

@statxc statxc commented Apr 7, 2026

Summary

  • Add two-strategy heading level inference for PDF Title elements via category_depth metadata
    • Outline extraction (primary): walks PDF bookmark tree and matches entries to Title elements by page number + text similarity
    • Font-size analysis (fallback): clusters distinct font sizes from pdfminer LTChar data, ranks largest-first to assign depth 0-5
  • Integrates as a post-processing step in partition_pdf_or_image() and works with all strategies (fast, hi_res, ocr_only)
  • Correctly skipped for image partitioning
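
The font-size fallback described above could be sketched roughly as follows. This is an illustration only, not the PR's code: the function name `assign_depths_by_font_size`, the `MAX_HEADING_DEPTH` constant, the rounding precision, and the input shape (one representative font size per Title element) are all assumptions.

```python
from typing import Dict, List

MAX_HEADING_DEPTH = 6  # H1-H6, i.e. depths 0-5 (assumed constant name)


def assign_depths_by_font_size(
    title_font_sizes: List[float], precision: int = 1
) -> List[int]:
    """Map each title's font size to a heading depth, largest size first.

    Sizes are rounded to `precision` decimals so that near-identical sizes
    (for example 13.98 and 14.02) fall into the same cluster.
    """
    rounded = [round(size, precision) for size in title_font_sizes]
    # Distinct sizes, largest first: the largest becomes depth 0 (H1).
    distinct_sizes = sorted(set(rounded), reverse=True)
    # Ranks beyond the sixth distinct size are capped at the deepest level.
    depth_by_size: Dict[float, int] = {
        size: min(rank, MAX_HEADING_DEPTH - 1)
        for rank, size in enumerate(distinct_sizes)
    }
    return [depth_by_size[size] for size in rounded]
```

Called with one font size per Title in document order, the function returns a depth per Title in the same order, which could then be written to `category_depth`. A real implementation would also need to decide how to derive a single size from the many LTChar objects in one heading (for instance, the most common size), which this sketch leaves out.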

Test plan

  • Existing test_document_to_element_list_sets_category_depth_titles passes unchanged
  • 154 existing PDF tests pass (1 pre-existing OCR language failure unrelated)

Closes #4204

Author

statxc commented Apr 10, 2026

Hi @PastelStorm @cragwolfe,
Could you review my PR, please?
Please let me know if anything else needs updating.
Thanks.

@codebymikey

There are currently some merge conflicts that need resolving.

Author

statxc commented Apr 22, 2026

@PastelStorm @codebymikey
Could you review this PR when you have time? I'd appreciate any feedback. Thanks.

@codebymikey

Hi @statxc,

I'm not particularly sure I'll be able to review this effectively as I'm not a maintainer of the project or familiar enough with it to comment. I'm merely a potential end-user who pointed out the missing functionality.

Did you implement this yourself or with help from AI? One issue with the previous PR that tried to address this was that it was hard to review: its author wasn't fully aware of what the code actually did and just made PRs based on the output of the tool.

Also, based on her GitHub activity, I believe @PastelStorm might be away for a while.

If you feel your PR addresses the issue to a sufficiently high standard, then feel free to ping some of the more recently active maintainers/committers for review.

Author

statxc commented May 5, 2026

@qued @cragwolfe Please review this PR when you get a chance. It’s been a while since I submitted it.

matched = 0
for entry_text, depth, page_number in outline_entries:
    capped_depth = min(depth, _MAX_HEADING_DEPTH - 1)
    candidates = titles_by_page.get(page_number, [])

What if the same title text appears multiple times on the same page, but at different heading levels?

Wouldn't it be better to resolve the exact element being pointed to by the outline entry (excluding external links), and check against that instead?
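
One way to act on this suggestion is sketched below: when an outline entry resolves to an in-document destination with a vertical position, prefer the candidate Title closest to that position, and fall back to text similarity otherwise. The `OutlineEntry` and `TitleCandidate` types, their fields, the `pick_title` helper, and the 0.8 threshold are hypothetical and not taken from the PR.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class OutlineEntry:  # hypothetical shape of a resolved bookmark
    text: str
    depth: int
    page_number: int
    dest_top: Optional[float] = None  # y of the destination, if resolvable


@dataclass
class TitleCandidate:  # hypothetical stand-in for a Title element
    text: str
    page_number: int
    top: float  # y of the element's top edge, same coordinate system


def _similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def pick_title(
    entry: OutlineEntry,
    candidates: List[TitleCandidate],
    min_similarity: float = 0.8,
) -> Optional[TitleCandidate]:
    """Choose the Title an outline entry most plausibly points to."""
    plausible = [
        c
        for c in candidates
        if c.page_number == entry.page_number
        and _similarity(entry.text, c.text) >= min_similarity
    ]
    if not plausible:
        return None
    if entry.dest_top is not None:
        # Duplicate titles on one page: the destination position decides.
        return min(plausible, key=lambda c: abs(c.top - entry.dest_top))
    # No usable destination: fall back to the best text match.
    return max(plausible, key=lambda c: _similarity(entry.text, c.text))
```

Entries whose destination is an external link or cannot be resolved would carry `dest_top=None` here, so they degrade to the page-plus-text matching the PR already describes.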

Author

You are absolutely right. I will fix this soon. Thank you for reviewing.

Author

statxc commented May 6, 2026

Hi @codebymikey, I've fixed this. Please review it again.

Author

statxc commented May 11, 2026

Hi @badGarnet @CyMule @claytonlin1110 @vladimir-kivi-ds @codebymikey
How are you? Could you review this PR?

@codebymikey

It looks fine to me from a cursory look. But again, I'm not that familiar with how the codebase works, so my review is of limited value.

I'm not sure why there's been no engagement (or guidance) with the PR from maintainers within the project.

But at this point, I'd probably just leave it and move on to something else, as the issue's clearly not high enough on their priority list.

Development

Successfully merging this pull request may close these issues.

feat/Infer the hierarchical heading/title levels such as H1, H2, H3, H4 for PDFs

2 participants