feat: infer hierarchical heading levels (H1-H6) for PDFs (#4204) #4325
statxc wants to merge 4 commits into

Hi, @PastelStorm @cragwolfe

There are currently some merge conflicts that need resolving.

@PastelStorm @codebymikey

Hi @statxc, I'm not sure I'll be able to review this effectively, as I'm not a maintainer of the project or familiar enough with it to comment. I'm merely a potential end user who pointed out the missing functionality. Did you implement this yourself or with help from AI? One problem with the previous PR that tried to address this was that it was hard to review: the original author wasn't fully aware of what the code actually did and just made PRs based on the tool's output. Also, based on her GitHub activity, I believe @PastelStorm might be away for a while. If you feel your PR addresses the issue to a sufficiently high standard, feel free to ping some of the more recently active maintainers/committers for review.

@qued @cragwolfe Please review this PR when you get a chance. It's been a while since I submitted it.

    matched = 0
    for entry_text, depth, page_number in outline_entries:
        capped_depth = min(depth, _MAX_HEADING_DEPTH - 1)
        candidates = titles_by_page.get(page_number, [])

What if the same title is used multiple times on the same page but at different heading levels?
Wouldn't it be better to resolve the exact element the outline entry points to (excluding external links) and check against that instead?
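
To make the concern concrete, here is a minimal sketch, not the PR's actual code. It assumes titles are plain dicts with hypothetical `text`, `page`, and `y` keys, that outline entries can carry a hypothetical destination coordinate `dest_y`, and that `_MAX_HEADING_DEPTH` is 6 (inferred from the H1-H6 range in the title). It contrasts text-only matching per page with matching against the position the outline entry points to.

```python
from collections import defaultdict

_MAX_HEADING_DEPTH = 6  # assumption: H1-H6 map to depths 0-5


def match_by_text(outline_entries, titles):
    """Text-only matching: the first title on the page with equal text wins."""
    titles_by_page = defaultdict(list)
    for title in titles:
        titles_by_page[title["page"]].append(title)
    for entry_text, depth, page_number in outline_entries:
        capped_depth = min(depth, _MAX_HEADING_DEPTH - 1)
        for title in titles_by_page.get(page_number, []):
            if title["text"] == entry_text:
                title["depth"] = capped_depth
                break  # a repeated title further down the page is never reached
    return titles


def match_by_position(outline_entries, titles):
    """Position-aware matching: pick the same-text title nearest to dest_y."""
    for entry_text, depth, page_number, dest_y in outline_entries:
        candidates = [
            t for t in titles
            if t["page"] == page_number and t["text"] == entry_text
        ]
        if not candidates:
            continue
        # dest_y is a hypothetical vertical coordinate of the outline target
        nearest = min(candidates, key=lambda t: abs(t["y"] - dest_y))
        nearest["depth"] = min(depth, _MAX_HEADING_DEPTH - 1)
    return titles
```

With two "Overview" titles on one page and two outline entries at depths 0 and 1, the text-only version writes both depths onto the first title and leaves the second unset, while the position-aware version assigns one depth to each.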
You are absolutely right. I will fix this soon. Thank you for reviewing.
…atxc/feat-pdf-heading-hierarchy

Hi @codebymikey, I've fixed this. Please review again.

Hi @badGarnet @CyMule @claytonlin1110 @vladimir-kivi-ds @codebymikey

It looks fine to me from a cursory look. But again, I'm not that familiar with how the codebase works, so my review is of limited value. I'm not sure why there's been no engagement (or guidance) with the PR from maintainers within the project. But at this point, I'd probably just leave it and move on to something else, as the issue's clearly not high enough on their priority list.

Summary
… Title elements via category_depth metadata
… LTChar data, ranks largest-first to assign depth 0-5
… partition_pdf_or_image(), works with all strategies (fast, hi_res, ocr_only)

Test plan
… test_document_to_element_list_sets_category_depth_titles passes unchanged

Closes #4204
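
The "ranks largest-first to assign depth 0-5" step in the summary can be illustrated with a short sketch. This is an assumption about the general technique, not the PR's implementation: the function name `depths_from_font_sizes` is hypothetical, font sizes are taken as plain floats (however they were extracted), and sizes are rounded to one decimal place so that near-identical values share a rank.

```python
_MAX_HEADING_DEPTH = 6  # assumption: H1-H6 map to depths 0-5


def depths_from_font_sizes(font_sizes):
    """Rank distinct font sizes largest-first and return one depth per input size."""
    # Distinct sizes in descending order: the largest size gets rank 0.
    ranked = sorted({round(size, 1) for size in font_sizes}, reverse=True)
    rank_of = {size: rank for rank, size in enumerate(ranked)}
    # Cap the rank so that anything beyond the sixth level collapses to depth 5.
    return [
        min(rank_of[round(size, 1)], _MAX_HEADING_DEPTH - 1)
        for size in font_sizes
    ]
```

For example, sizes `[24.0, 18.0, 18.0, 12.0]` map to depths `[0, 1, 1, 2]`, and any document with more than six distinct title sizes has its smallest sizes all assigned depth 5.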