@bittoby bittoby commented Jan 28, 2026

Closes #3864

Summary

  • Fixes an issue where bold text in PDFs is extracted with duplicated characters (e.g., "BOLD" → "BBOOLLDD")
  • Some PDF generators simulate bold by rendering each character twice at slightly offset positions
  • Added character-level deduplication based on position proximity to detect and remove these duplicates

Problem

When extracting text from certain PDFs, bold text appears duplicated:

# Before fix
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60>60" instead of ">60"

Solution

Added character-level deduplication that:

  • Compares consecutive characters' text content and position
  • Removes duplicates where the same character appears again within 3 pixels of the previous one (configurable)
  • Preserves spaces and other non-character elements (LTAnno objects)

# After fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60" ✓
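
For reviewers, here is a minimal sketch of the position-based check described above. It is not the code in this PR: `Glyph`, `dedupe_glyphs`, and `text_of` are hypothetical names, `Glyph` is a simplified stand-in for pdfminer's `LTChar`/`LTAnno` objects, and the strict `<` comparison on both axes is an assumption about how "within 3 pixels" is evaluated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Glyph:
    """Hypothetical stand-in for one extracted character."""
    text: str
    x0: Optional[float] = None  # None models an item without a position (like LTAnno)
    y0: Optional[float] = None


def dedupe_glyphs(glyphs: List[Glyph], threshold: float = 3.0) -> List[Glyph]:
    """Drop a glyph that repeats the previously kept positioned glyph nearby."""
    if threshold <= 0:
        return list(glyphs)  # a threshold of 0 disables deduplication
    kept: List[Glyph] = []
    last: Optional[Glyph] = None  # last positioned glyph that was kept
    for g in glyphs:
        if g.x0 is None or g.y0 is None:
            kept.append(g)  # position-less items (spaces, newlines) are always kept
            continue
        is_duplicate = (
            last is not None
            and g.text == last.text
            and abs(g.x0 - last.x0) < threshold
            and abs(g.y0 - last.y0) < threshold
        )
        if is_duplicate:
            continue  # same character drawn again at a slight offset: fake bold
        kept.append(g)
        last = g
    return kept


def text_of(glyphs: List[Glyph]) -> str:
    return "".join(g.text for g in glyphs)


# Each letter is drawn twice, 0.5 units apart; letters are 10 units apart.
fake_bold = [
    Glyph(c, x0=10.0 * (i // 2) + 0.5 * (i % 2), y0=100.0)
    for i, c in enumerate("BBOOLLDD")
]
print(text_of(dedupe_glyphs(fake_bold)))  # BOLD
```

Because the check requires both matching text and a small offset, legitimate double letters (the "oo" in "book") are kept as long as they sit more than the threshold apart.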

Configuration

# Default: 3.0 pixels (enabled)
export PDF_CHAR_DUPLICATE_THRESHOLD=3.0

# Disable deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=0

# More aggressive deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=5.0
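
As an illustration of how such a setting could be consumed, here is a small sketch that reads the variable with the documented default of 3.0. The helper name `char_duplicate_threshold` is hypothetical, and falling back to the default on an unparsable value and clamping negatives to 0 are assumptions; the PR description does not say how invalid input is handled.

```python
import os


def char_duplicate_threshold(default: float = 3.0) -> float:
    """Read PDF_CHAR_DUPLICATE_THRESHOLD; 0 means deduplication is disabled."""
    raw = os.environ.get("PDF_CHAR_DUPLICATE_THRESHOLD")
    if raw is None or not raw.strip():
        return default  # unset or empty: use the default
    try:
        value = float(raw)
    except ValueError:
        return default  # assumption: ignore values that are not numbers
    return max(value, 0.0)  # assumption: negative values are clamped to 0
```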

bittoby commented Jan 28, 2026

@badGarnet Could you please review this PR? Thanks!

Development

Successfully merging this pull request may close these issues.

bug/Bold characters get repeated while extracting
