@badGarnet
Collaborator

@badGarnet badGarnet commented Jan 5, 2026

This PR updates the logic to detect invisible text:

  • a recent pdfminer bump (to fix a CVE) disabled the route that used color data to determine whether a piece of text is invisible
  • this PR uses a custom PDF interpreter that exposes render mode information on each LTChar object, then uses that to determine whether a piece of text is invisible
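
A minimal sketch of the patching idea described above, under stated assumptions: every class here is a simplified stand-in written for illustration so the snippet runs without pdfminer installed, and `RenderModeInterpreter`, `visible_text`, and `demo` are hypothetical names, not the PR's actual code. It only shows the pattern of tagging each newly emitted character with the render mode active when it was drawn.

```python
# Stand-ins for pdfminer types (hypothetical simplifications, not the real classes).
class LTChar:
    def __init__(self, text):
        self.text = text


class TextState:
    def __init__(self):
        self.render = 0  # PDF text rendering mode; 3 means invisible


class Container:
    def __init__(self):
        self._objs = []


class Device:
    def __init__(self):
        self.cur_item = Container()


class BaseInterpreter:
    """Stand-in page interpreter: each text-showing call appends characters."""

    def __init__(self, device):
        self.device = device
        self.textstate = TextState()

    def do_Tj(self, s):
        for ch in s:
            self.device.cur_item._objs.append(LTChar(ch))


class RenderModeInterpreter(BaseInterpreter):
    """Tags each newly emitted char with the render mode active at that moment."""

    def do_Tj(self, s):
        super().do_Tj(s)
        self._patch_chars()

    def _patch_chars(self):
        cur_item = getattr(self.device, "cur_item", None)
        if cur_item is None:
            return  # nothing has been emitted yet
        for item in cur_item._objs:
            if hasattr(item, "rendermode"):
                continue  # tagged by an earlier call; keep its original mode
            if item.__class__.__name__ == "LTChar":
                item.rendermode = self.textstate.render


def visible_text(device):
    """Join the characters whose recorded render mode is not 3 (invisible)."""
    return "".join(
        obj.text
        for obj in device.cur_item._objs
        if getattr(obj, "rendermode", None) != 3
    )


def demo():
    device = Device()
    interp = RenderModeInterpreter(device)
    interp.do_Tj("seen")
    interp.textstate.render = 3  # switch to invisible text
    interp.do_Tj("HIDDEN")
    interp.textstate.render = 0  # back to normal fill
    interp.do_Tj("!")
    return visible_text(device)
```

Because characters are tagged right after each text-showing call, a later change of render mode does not overwrite the mode recorded on earlier characters.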

Note on ingest test update:

The file Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf contains invisible white space and line breaks in its text. Post processing cleans those up, but this means the text we get is not 100% unchanged from the embedded text in the PDF data itself. Moreover, in some other files post processing may not be able to remove some of the extra invisible white space. Both points justify changing the is_extracted flag from True to partial for some of the elements (those where post processing removed the invisible white space).

To check the invisible text in that file, run:

from unstructured.partition.pdf_image.pdfminer_processing import process_file_with_pdfminer

layout, _ = process_file_with_pdfminer("Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf")
layout[0].texts

@badGarnet
Collaborator Author

@claude review

@claude

claude bot commented Jan 5, 2026

Claude finished @badGarnet's task


PR Review: Patch pdfminer and use rendermode to detect invisible text

Review Progress:

  • Read repository guidelines (CLAUDE.md)
  • Analyze the problem and proposed solution
  • Review core implementation changes
  • Review test changes
  • Check for potential issues and improvements
  • Provide final review feedback

Summary

This PR addresses a critical issue where the recent pdfminer.six security update (fixing a CVE) broke invisible text detection by changing color state handling. The solution implements a custom PDF interpreter that patches LTChar objects with render mode information so that invisible text can be detected properly.

Technical Analysis

Strong Points

  1. Clean Architecture: The custom PDF interpreter (CustomPDFPageInterpreter) is well-designed with focused responsibility
  2. Precise Targeting: Only patches necessary PDF operators (do_TJ, do_Tj) that handle text showing
  3. Defensive Programming: The _patch_current_chars_with_render_mode() method includes proper bounds checking
  4. Clear Documentation: Code comments clearly explain the render mode 3 detection approach

Correct Implementation

The approach is technically sound:

  • PDF Spec Compliant: Render mode 3 is indeed "neither fill nor stroke" (invisible) per PDF specification
  • Efficient Patching: Only adds render mode when not already present (if hasattr(item, "rendermode"): continue)
  • Proper Integration: Uses the custom interpreter in init_pdfminer() seamlessly
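
For reference, the text rendering modes defined by the PDF specification can be sketched as an enum. The names `TextRenderMode` and `is_invisible` are illustrative assumptions and are not part of pdfminer or this PR; the check mirrors the PR's rule of treating only mode 3 as invisible.

```python
from enum import IntEnum


class TextRenderMode(IntEnum):
    """Text rendering modes (operand of the Tr operator) in the PDF specification."""

    FILL = 0
    STROKE = 1
    FILL_STROKE = 2
    INVISIBLE = 3          # neither fill nor stroke
    FILL_CLIP = 4
    STROKE_CLIP = 5
    FILL_STROKE_CLIP = 6
    CLIP = 7               # only adds the text to the clipping path


def is_invisible(mode: int) -> bool:
    """Return True for render mode 3, the only mode this sketch treats as invisible."""
    return mode == TextRenderMode.INVISIBLE
```

Mode 3 is commonly used for hidden OCR text layers placed over scanned page images, which is why it is the mode of interest here.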

Good Test Coverage

The test changes properly reflect the new behavior:

  • Fixed test expectations to match render mode detection
  • Updated mock objects to use rendermode attribute instead of color-based detection
  • Maintained test for hidden OCR text functionality

Issues & Recommendations

🔧 Minor Issues

  1. unstructured/partition/pdf_image/pdfminer_utils.py:25-27: Missing safety check

    if hasattr(item, "rendermode"):
        continue  # Already patched
    if item.__class__.__name__ == "LTChar":  # Should also check if item is not None
  2. unstructured/partition/pdf_image/pdfminer_processing.py:397: Consider making render mode check more robust:

    # Current:
    if hasattr(layout_obj, "rendermode") and layout_obj.rendermode == 3:
    
    # Suggestion:
    if getattr(layout_obj, "rendermode", None) == 3:
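
A small sketch comparing the two forms of the check from item 2. The objects are `SimpleNamespace` stand-ins rather than real pdfminer layout objects, and the function names are made up for illustration. Both forms agree whether the attribute is missing, set to another mode, or set to 3; the `getattr` version simply does a single attribute lookup.

```python
from types import SimpleNamespace


def invisible_hasattr(layout_obj) -> bool:
    # Current form: two lookups (hasattr, then attribute access).
    return hasattr(layout_obj, "rendermode") and layout_obj.rendermode == 3


def invisible_getattr(layout_obj) -> bool:
    # Suggested form: one lookup with a default that can never equal 3.
    return getattr(layout_obj, "rendermode", None) == 3


samples = [
    SimpleNamespace(),              # never patched: no rendermode attribute
    SimpleNamespace(rendermode=0),  # ordinary filled text
    SimpleNamespace(rendermode=3),  # invisible text
]
results = [(invisible_hasattr(s), invisible_getattr(s)) for s in samples]
```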

🚨 Potential Risk

Performance Impact: The patching occurs for every text operation (do_TJ, do_Tj). For PDFs with large amounts of text, this could introduce overhead. Consider:

  • Adding early exit if no cur_item exists (already done ✓)
  • Monitoring memory usage during patching
  • Testing with large PDFs to ensure acceptable performance

📋 Questions/Clarifications

  1. Error Handling: What happens if self.textstate.render is not available or throws an exception?
  2. Memory Management: Are there potential memory leaks from adding attributes to LTChar objects?
  3. Compatibility: Have you tested with various PDF versions and creators to ensure render mode is consistently available?

Overall Assessment

This is a well-implemented solution that correctly addresses the CVE-related regression. The code is clean, follows good practices, and maintains backward compatibility. The custom interpreter approach is elegant and surgical - it only affects what needs to be changed.

Recommendation: APPROVE

The implementation properly solves the invisible text detection issue while maintaining code quality and adding appropriate test coverage.


@badGarnet
Collaborator Author

self.textstate.render always exists as

ryannikolaidis and others added 2 commits January 5, 2026 14:28
…ngest test fixtures update (#4159)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Adjusts test expectations to align with new invisible-text handling.
> 
> - In `test_unstructured_ingest/expected-structured-output/azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json`, multiple elements now have `metadata.is_extracted` set to `"false"` (was `"true"`)
> 
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit a5499d8. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: badGarnet <[email protected]>
@badGarnet badGarnet marked this pull request as ready for review January 5, 2026 21:45
Comment on lines 27 to 28
    if item.__class__.__name__ == "LTChar":
        item.rendermode = render_mode
Contributor

Assuming this is done instead of isinstance to avoid an import issue? Worth a quick comment?

Collaborator Author

right
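
To illustrate the trade-off raised in this thread, here is a short sketch with stand-in classes (the local `LTChar`, `LTCharSubclass`, and the helper names are hypothetical, not pdfminer's). Matching on the class name needs no import of the class, but it does not match subclasses and would match an unrelated class that happens to share the name, whereas `isinstance` has the opposite properties.

```python
class LTChar:
    """Stand-in for the real layout character class."""


class LTCharSubclass(LTChar):
    """A subclass, as a library or a user might define one."""


def matches_by_name(item) -> bool:
    # No import of LTChar needed, but an exact name match only.
    return item.__class__.__name__ == "LTChar"


def matches_by_type(item) -> bool:
    # Requires LTChar to be importable here, but follows inheritance.
    return isinstance(item, LTChar)


def make_lookalike():
    # An unrelated class that merely shares the name "LTChar".
    return type("LTChar", (), {})()
```

If the name check is kept for import reasons, a one-line comment next to it saying so would make the intent clear to later readers.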

Comment on lines 16 to 17
class CustomPDFPageInterpreter(PDFPageInterpreter):

Contributor

Maybe add a docstring to the class explaining why it exists: to patch the render attribute in the do_TJ/do_Tj methods, which guarantees we have the render attribute when we need it?

Contributor

@ryannikolaidis ryannikolaidis left a comment

changes look great

@badGarnet badGarnet enabled auto-merge January 9, 2026 21:14
badGarnet and others added 2 commits January 9, 2026 17:07
…ngest test fixtures update (#4184)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Updates ingest test fixtures to align with new PDF extraction behavior (renderMode/invisible text handling).
> 
> - Many entries change `metadata.is_extracted` from `false` to `partial`; some new items use `is_extracted: "true"`
> - Adds additional extracted elements (e.g., author lines, headers, titles, uncategorized text) to the JSON
> - Affects `test_unstructured_ingest/expected-structured-output/azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json` only
> 
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit f439c15. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: badGarnet <[email protected]>