-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Feat: patch pdfminer and use rendermode to detect invisible text #4158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feat: patch pdfminer and use rendermode to detect invisible text #4158
Conversation
|
@claude review |
|
Claude finished @badGarnet's task —— View job PR Review: Patch pdfminer and use rendermode to detect invisible textReview Progress:
SummaryThis PR addresses a critical issue where the recent pdfminer.six security update (fixing CVE) broke invisible text detection by changing color state handling. The solution implements a custom PDF interpreter that patches Technical Analysis✅ Strong Points
✅ Correct ImplementationThe approach is technically sound:
✅ Good Test CoverageThe test changes properly reflect the new behavior:
Issues & Recommendations🔧 Minor Issues
🚨 Potential RiskPerformance Impact: The patching occurs for every text operation (
📋 Questions/Clarifications
Overall AssessmentThis is a well-implemented solution that correctly addresses the CVE-related regression. The code is clean, follows good practices, and maintains backward compatibility. The custom interpreter approach is elegant and surgical - it only affects what needs to be changed. Recommendation: APPROVE ✅ The implementation properly solves the invisible text detection issue while maintaining code quality and adding appropriate test coverage. |
the
|
…ngest test fixtures update (#4159) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Adjusts test expectations to align with new invisible-text handling. > > - In `test_unstructured_ingest/expected-structured-output/azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json`, multiple elements now have `metadata.is_extracted` set to `"false"` (was `"true"`) > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit a5499d8. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> Co-authored-by: badGarnet <[email protected]>
…nd-use-rendermode-to-detect-invisible-text
| if item.__class__.__name__ == "LTChar": | ||
| item.rendermode = render_mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assuming doing this instead of isinstance to avoid import issue? worth quick comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right
| class CustomPDFPageInterpreter(PDFPageInterpreter): | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a docstring for the class on why this exists...to patch the render attr on those do_tj methods which guarantee we have the render attr when we need it?
ryannikolaidis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes look great
…nd-use-rendermode-to-detect-invisible-text
…tect-invisible-text
…ngest test fixtures update (#4184) This pull request includes updated ingest test fixtures. Please review and merge if appropriate. <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Updates ingest test fixtures to align with new PDF extraction behavior (renderMode/invisible text handling). > > - Many entries change `metadata.is_extracted` from `false` to `partial`; some new items use `is_extracted: "true"` > - Adds additional extracted elements (e.g., author lines, headers, titles, uncategorized text) to the JSON > - Affects `test_unstructured_ingest/expected-structured-output/azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json` only > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit f439c15. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> Co-authored-by: badGarnet <[email protected]>
This PR updates the logic to detect invisible text:
pdfminer(to fix CVE) disabled the route to use color data to determine if a piece of text is invisible or notLTCharobject then use that to determine of a piece of text is invisibleNote on ingest test update:
The file
Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdfcontains invisible white space and line breaks in text. Those are cleaned up by post processing but they do mean that the text we got are not 100% unchanged embedded text in the pdf data itself. Moreover in some other files the post processing many not be able to remove some of the extra invisible white space. Both points justifies the change to the flag ofis_extractedfromTruetopartialfor some of the elements (that post processing removed the invisible white space)To check the invisible text in that fine run