feat: infer hierarchical heading levels (H1-H6) for PDFs (#4204) by statxc · Pull Request #4325 · Unstructured-IO/unstructured

statxc · 2026-04-07T17:52:20Z

Summary

Add two-strategy heading level inference for PDF Title elements via category_depth metadata
- Outline extraction (primary): walks PDF bookmark tree and matches entries to Title elements by page number + text similarity
- Font-size analysis (fallback): clusters distinct font sizes from pdfminer LTChar data, ranks largest-first to assign depth 0-5
Integrates as a post-processing step in partition_pdf_or_image() and works with all strategies (fast, hi_res, ocr_only)
Correctly skipped for image partitioning
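
The font-size fallback described above can be illustrated with a minimal sketch. This is not the PR's code: the function name `depth_by_font_size`, the `tolerance` parameter, and the constant `MAX_HEADING_DEPTH` are hypothetical, and the clustering shown here (grouping sizes that lie within a small tolerance of a cluster's largest size) is only one plausible reading of "clusters distinct font sizes, ranks largest-first".

```python
from __future__ import annotations

from typing import Iterable

# Assumption: six heading levels (H1-H6) map to category_depth 0-5.
MAX_HEADING_DEPTH = 6


def depth_by_font_size(sizes: Iterable[float], tolerance: float = 0.5) -> dict[float, int]:
    """Map each distinct font size to a depth, largest size first.

    Sizes within `tolerance` points of a cluster's largest size share
    that cluster's depth. Depths are capped at MAX_HEADING_DEPTH - 1.
    """
    clusters: list[float] = []  # largest size seen in each cluster
    mapping: dict[float, int] = {}
    for size in sorted(set(sizes), reverse=True):
        # Open a new cluster when this size is clearly smaller than
        # the current cluster's largest size.
        if not clusters or clusters[-1] - size > tolerance:
            clusters.append(size)
        mapping[size] = min(len(clusters) - 1, MAX_HEADING_DEPTH - 1)
    return mapping


if __name__ == "__main__":
    # Hypothetical title font sizes, e.g. averaged per Title element.
    print(depth_by_font_size([24.0, 23.8, 18.0, 14.0, 14.0]))
```

With more than six distinct clusters, every cluster past the sixth collapses onto depth 5, which matches the 0-5 range stated in the summary.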

Test plan

Existing test_document_to_element_list_sets_category_depth_titles passes unchanged
154 existing PDF tests pass (1 pre-existing OCR language failure unrelated)

statxc · 2026-04-10T14:00:40Z

Hi, @PastelStorm @cragwolfe
Could you review my PR please.
Please let me know if anything else is needed to update.
Thanks.

codebymikey · 2026-04-14T10:29:01Z

There are currently some merge conflicts that need resolving.

statxc · 2026-04-22T15:50:01Z

@PastelStorm @codebymikey
Could you review this PR when you have time? I'd appreciate any feedback from you. Thanks.

codebymikey · 2026-04-23T11:56:05Z

Hi @statxc,

I'm not particularly sure I'll be able to review this effectively as I'm not a maintainer of the project or familiar enough with it to comment. I'm merely a potential end-user who pointed out the missing functionality.

Did you implement this yourself or with help from AI? One of the issues the previous PR trying to address this had was how it was hard to review because the original PR user wasn't fully aware of what the code actually did, and just made PRs based off the output of the tool.

Also, based off her GitHub activity, I believe @PastelStorm might be away for a while.

If you feel your PR addresses the issue to a sufficiently high standard, then feel free to ping some of the more recently active maintainers/committers for review.

statxc · 2026-05-05T23:55:01Z

@qued @cragwolfe Please review this PR when you get a chance. It’s been a while since I submitted it.

codebymikey · 2026-05-06T03:27:41Z

+    matched = 0
+    for entry_text, depth, page_number in outline_entries:
+        capped_depth = min(depth, _MAX_HEADING_DEPTH - 1)
+        candidates = titles_by_page.get(page_number, [])


What if the same title is used multiple times on the same page but at different heading levels?

Wouldn't it be better to resolve the exact element the outline entry points to (excluding external links), and check against that instead?

You are absolutely right. I will fix this soon. Thank you for reviewing.
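
The concern in this exchange can be sketched in a few lines. Matching by page number and text similarity alone cannot tell two identical titles on one page apart, while also comparing against the outline entry's destination position can. Everything below is hypothetical and is not the PR's implementation: the `Title` and `OutlineEntry` classes, the `y` and `dest_y` fields (assumed to share one coordinate system), and the `min_ratio` threshold are illustration-only names.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

MAX_HEADING_DEPTH = 6  # assumption: depths 0-5, as in the PR summary


@dataclass
class Title:  # hypothetical stand-in for a Title element
    text: str
    page_number: int
    y: float  # vertical position on the page
    depth: Optional[int] = None


@dataclass
class OutlineEntry:  # hypothetical stand-in for a resolved bookmark
    text: str
    depth: int
    page_number: int
    dest_y: float  # vertical position of the bookmark destination


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def assign_depths(entries: list[OutlineEntry], titles: list[Title],
                  min_ratio: float = 0.8) -> int:
    """Assign depths to titles; return the number of matched entries."""
    matched = 0
    for entry in entries:
        # Same page, not yet assigned, and textually close enough.
        candidates = [
            t for t in titles
            if t.page_number == entry.page_number
            and t.depth is None
            and similarity(entry.text, t.text) >= min_ratio
        ]
        if not candidates:
            continue
        # Tie-break duplicates by distance to the destination position.
        best = min(candidates, key=lambda t: abs(t.y - entry.dest_y))
        best.depth = min(entry.depth, MAX_HEADING_DEPTH - 1)
        matched += 1
    return matched
```

For two "Overview" titles on the same page, an outline entry whose destination sits near the second one is assigned to the second one, regardless of the order in which the entries are walked.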

statxc · 2026-05-06T11:41:16Z

Hi @codebymikey, I've fixed it. Please review this again.

statxc · 2026-05-11T15:52:21Z

Hi @badGarnet @CyMule @claytonlin1110 @vladimir-kivi-ds @codebymikey
How are you? Could you review this PR?

codebymikey · 2026-05-12T08:56:53Z

It looks fine to me from a cursory look. But again, I'm not that familiar with how the codebase works, so my review is of limited value.

I'm not sure why there's been no engagement (or guidance) with the PR from maintainers within the project.

But at this point, I'd probably just leave it and move on to something else, as the issue's clearly not high enough on their priority list.

feat: infer hierarchical heading levels (H1-H6) for PDFs (Unstructure…

99762b3

…d-IO#4204)

Merge branch 'main' into statxc/feat-pdf-heading-hierarchy

12258f7

codebymikey reviewed May 6, 2026

statxc added 2 commits May 6, 2026 05:48

Merge branch 'main' of https://github.com/statxc/unstructured into st…

6a6f884

…atxc/feat-pdf-heading-hierarchy

fix: match PDF outlines by destination

950780f

statxc wants to merge 4 commits into Unstructured-IO:main from statxc:statxc/feat-pdf-heading-hierarchy

2 participants