Feat: patch pdfminer and use rendermode to detect invisible text #4158

badGarnet · 2026-01-05T18:49:10Z

This PR updates the logic to detect invisible text:

A recent pdfminer bump (to fix a CVE) disabled the route that used color data to determine whether a piece of text is invisible.
This PR uses a custom PDF interpreter that exposes render mode information on each LTChar object, then uses that to determine whether a piece of text is invisible.
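
The idea described above can be illustrated with a minimal sketch. It deliberately does not import pdfminer: `FakeChar`, `FakeTextState`, and `FakeInterpreter` are hypothetical stand-ins for `LTChar`, the interpreter's text state, and a `PDFPageInterpreter` subclass, and the code is not the PR's actual implementation. It only shows the general technique of tagging the characters produced by one text-showing operator with the render mode active at that moment, then filtering on render mode 3.

```python
INVISIBLE_RENDER_MODE = 3  # PDF text render mode 3: neither fill nor stroke


class FakeChar:
    """Hypothetical stand-in for pdfminer's LTChar."""

    def __init__(self, text):
        self.text = text


class FakeTextState:
    """Hypothetical stand-in for the interpreter's text state."""

    def __init__(self):
        self.render = 0  # default render mode: fill


class FakeInterpreter:
    """Hypothetical stand-in for a PDFPageInterpreter subclass."""

    def __init__(self):
        self.textstate = FakeTextState()
        self.cur_item = []  # characters laid out so far

    def _show_text(self, s):
        # A real interpreter would emit one layout char per glyph here.
        self.cur_item.extend(FakeChar(c) for c in s)

    def do_Tj(self, s):
        start = len(self.cur_item)
        self._show_text(s)
        # Tag only the chars created by this operator with the active mode.
        for item in self.cur_item[start:]:
            item.rendermode = self.textstate.render


def visible_text(chars):
    """Join the chars whose render mode is not the invisible mode."""
    return "".join(
        c.text
        for c in chars
        if getattr(c, "rendermode", None) != INVISIBLE_RENDER_MODE
    )


interp = FakeInterpreter()
interp.do_Tj("Hello")
interp.textstate.render = INVISIBLE_RENDER_MODE
interp.do_Tj(" hidden")
interp.textstate.render = 0
interp.do_Tj(" world")
print(visible_text(interp.cur_item))
```

Recording the start index before the operator runs keeps the tagging limited to newly created characters, so earlier characters keep the mode they were drawn with.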

Note on ingest test update:

The file Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf contains invisible white space and line breaks in its text. Post-processing cleans these up, but that means the text we get is not 100% unchanged from the text embedded in the PDF data itself. Moreover, in some other files the post-processing may not be able to remove some of the extra invisible white space. Both points justify changing the is_extracted flag from True to partial for some of the elements (those where post-processing removed invisible white space).
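
As a rough illustration of the reasoning above, the sketch below labels an element "partial" whenever cleanup changed the embedded text. The function name `is_extracted_flag` and the comparison rule are assumptions made for this example only; the library's real logic for setting is_extracted may differ.

```python
def is_extracted_flag(embedded_text: str, cleaned_text: str) -> str:
    """Hypothetical rule: "true" only when cleanup left the text unchanged."""
    return "true" if embedded_text == cleaned_text else "partial"


# Text that needed no cleanup is reported as fully extracted.
print(is_extracted_flag("Core Skills", "Core Skills"))
# Text where invisible white space was stripped is only partially extracted.
print(is_extracted_flag("Core \n Skills", "Core Skills"))
```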

To check the invisible text in that file, run:

from unstructured.partition.pdf_image.pdfminer_processing import process_file_with_pdfminer

layout, _ = process_file_with_pdfminer("Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf")
layout[0].texts

badGarnet · 2026-01-05T18:50:28Z

@claude review

claude · 2026-01-05T18:50:47Z

Claude finished @badGarnet's task

PR Review: Patch pdfminer and use rendermode to detect invisible text

Review Progress:

Read repository guidelines (CLAUDE.md)
Analyze the problem and proposed solution
Review core implementation changes
Review test changes
Check for potential issues and improvements
Provide final review feedback

Summary

This PR addresses a critical issue where the recent pdfminer.six security update (fixing CVE) broke invisible text detection by changing color state handling. The solution implements a custom PDF interpreter that patches LTChar objects with render mode information to properly detect invisible text.

Technical Analysis

✅ Strong Points

Clean Architecture: The custom PDF interpreter (CustomPDFPageInterpreter) is well-designed with focused responsibility
Precise Targeting: Only patches necessary PDF operators (do_TJ, do_Tj) that handle text showing
Defensive Programming: The _patch_current_chars_with_render_mode() method includes proper bounds checking
Clear Documentation: Code comments clearly explain the render mode 3 detection approach

✅ Correct Implementation

The approach is technically sound:

PDF Spec Compliant: Render mode 3 is indeed "neither fill nor stroke" (invisible) per PDF specification
Efficient Patching: Only adds render mode when not already present (if hasattr(item, "rendermode"): continue)
Proper Integration: Uses the custom interpreter in init_pdfminer() seamlessly

✅ Good Test Coverage

The test changes properly reflect the new behavior:

Fixed test expectations to match render mode detection
Updated mock objects to use rendermode attribute instead of color-based detection
Maintained test for hidden OCR text functionality

Issues & Recommendations

🔧 Minor Issues

unstructured/partition/pdf_image/pdfminer_utils.py:25-27: Missing safety check

if hasattr(item, "rendermode"):
    continue  # Already patched
if item.__class__.__name__ == "LTChar":  # Should also check if item is not None

unstructured/partition/pdf_image/pdfminer_processing.py:397: Consider making render mode check more robust:

# Current:
if hasattr(layout_obj, "rendermode") and layout_obj.rendermode == 3:

# Suggestion:
if getattr(layout_obj, "rendermode", None) == 3:
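
The two forms in this suggestion can be compared with a small self-contained sketch. `SimpleNamespace` objects stand in for layout objects here (an assumption for illustration, not pdfminer types); for these samples both checks agree, and the `getattr` form does it in a single lookup.

```python
from types import SimpleNamespace


def is_invisible_hasattr(obj):
    # Current form: two steps, attribute test then comparison.
    return hasattr(obj, "rendermode") and obj.rendermode == 3


def is_invisible_getattr(obj):
    # Suggested form: a missing attribute falls back to None, which is not 3.
    return getattr(obj, "rendermode", None) == 3


samples = [
    SimpleNamespace(rendermode=3),  # invisible text
    SimpleNamespace(rendermode=0),  # normally filled text
    SimpleNamespace(),              # object that was never patched
]
results = [(is_invisible_hasattr(o), is_invisible_getattr(o)) for o in samples]
print(results)
```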

🚨 Potential Risk

Performance Impact: The patching occurs for every text operation (do_TJ, do_Tj). For PDFs with large amounts of text, this could introduce overhead. Consider:

Adding early exit if no cur_item exists (already done ✓)
Monitoring memory usage during patching
Testing with large PDFs to ensure acceptable performance

📋 Questions/Clarifications

Error Handling: What happens if self.textstate.render is not available or throws an exception?
Memory Management: Are there potential memory leaks from adding attributes to LTChar objects?
Compatibility: Have you tested with various PDF versions and creators to ensure render mode is consistently available?

Overall Assessment

This is a well-implemented solution that correctly addresses the CVE-related regression. The code is clean, follows good practices, and maintains backward compatibility. The custom interpreter approach is elegant and surgical - it only affects what needs to be changed.

Recommendation: APPROVE ✅

The implementation properly solves the invisible text detection issue while maintaining code quality and adding appropriate test coverage.

badGarnet · 2026-01-05T19:03:07Z

Error Handling: What happens if self.textstate.render is not available or throws an exception?

self.textstate.render always exists, because:

textstate is set at init for the interpreter: https://github.com/Jeremy-Hibiki/pdfminer.six/blob/977de923efce753ab648ff7b4ef6b757758919c8/pdfminer/pdfinterp.py#L423
render is an attribute of textstate: https://github.com/Jeremy-Hibiki/pdfminer.six/blob/977de923efce753ab648ff7b4ef6b757758919c8/pdfminer/pdfinterp.py#L80

…ngest test fixtures update (#4159)

This pull request includes updated ingest test fixtures. Please review and merge if appropriate.

Note: Adjusts test expectations to align with new invisible-text handling. In `test_unstructured_ingest/expected-structured-output/azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json`, multiple elements now have `metadata.is_extracted` set to `"false"` (was `"true"`).

Written by Cursor Bugbot for commit a5499d8.

Co-authored-by: badGarnet

ryannikolaidis · 2026-01-08T18:04:01Z

unstructured/partition/pdf_image/pdfminer_utils.py

+                if item.__class__.__name__ == "LTChar":
+                    item.rendermode = render_mode


Assuming this is done instead of isinstance to avoid an import issue? Worth a quick comment?
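
For context on the trade-off raised here, a small sketch of how the two checks differ. The `LTChar` class below is a local stand-in and `StyledChar` is a hypothetical subclass; neither is imported from pdfminer, and the thread does not confirm the author's actual motive for the name comparison.

```python
class LTChar:
    """Local stand-in for pdfminer's LTChar (not the real class)."""


class StyledChar(LTChar):
    """Hypothetical subclass, used only to show where the checks differ."""


def is_char_by_name(item):
    # Needs no import of the class, but matches only that exact class name.
    return item.__class__.__name__ == "LTChar"


def is_char_by_type(item):
    # Needs the class in scope, and also accepts subclasses.
    return isinstance(item, LTChar)


by_name = [is_char_by_name(LTChar()), is_char_by_name(StyledChar())]
by_type = [is_char_by_type(LTChar()), is_char_by_type(StyledChar())]
print(by_name, by_type)
```

The name comparison would also accept an unrelated class that happens to be called LTChar, which is one reason a short explanatory comment in the code can help.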

ryannikolaidis · 2026-01-08T18:06:52Z

unstructured/partition/pdf_image/pdfminer_utils.py

+class CustomPDFPageInterpreter(PDFPageInterpreter):
+


Maybe add a class docstring on why this exists: to patch the render attribute in the do_TJ/do_Tj methods, which guarantees we have the render attribute when we need it?

ryannikolaidis

changes look great

…ngest test fixtures update (#4184)

This pull request includes updated ingest test fixtures. Please review and merge if appropriate.

Note: Updates ingest test fixtures to align with new PDF extraction behavior (renderMode/invisible text handling).

- Many entries change `metadata.is_extracted` from `false` to `partial`; some new items use `is_extracted: "true"`
- Adds additional extracted elements (e.g., author lines, headers, titles, uncategorized text) to the JSON
- Affects `test_unstructured_ingest/expected-structured-output/azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json` only

Written by Cursor Bugbot for commit f439c15.

Co-authored-by: badGarnet

badGarnet added 2 commits January 5, 2026 12:33

patch pdfminer so it exposes render mode to ltchar obj

70deabe

restore test and update changelog

c910529

ryannikolaidis and others added 2 commits January 5, 2026 14:28

Merge remote-tracking branch 'origin/main' into feat/patch-pdfminer-and-use-rendermode-to-detect-invisible-text

8504d30

badGarnet marked this pull request as ready for review January 5, 2026 21:45

ryannikolaidis reviewed Jan 8, 2026

ryannikolaidis approved these changes Jan 8, 2026

badGarnet and others added 3 commits January 9, 2026 11:16

Merge remote-tracking branch 'origin/main' into feat/patch-pdfminer-and-use-rendermode-to-detect-invisible-text

c1c4b3b

Merge branch 'main' into feat/patch-pdfminer-and-use-rendermode-to-detect-invisible-text

80bca1d

bump version

2661234

badGarnet enabled auto-merge January 9, 2026 21:14

badGarnet and others added 2 commits January 9, 2026 17:07

clean up disk space for ingest cache

dde0f68

if item.__class__.__name__ == "LTChar":
    item.rendermode = render_mode
