fix: remove duplicate characters caused by fake bold rendering in PDFs #4215

bittoby · 2026-01-28T12:23:22Z

Summary

Fixes an issue where bold text in PDFs is extracted with duplicated characters (e.g., "BOLD" → "BBOOLLDD").
Some PDF generators simulate bold by rendering each character twice at slightly offset positions.
Adds character-level deduplication that uses position proximity to detect and remove these duplicates.

Problem

When extracting text from certain PDFs, bold text appears duplicated:

# Before fix
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60>60" instead of ">60"

Solution

Added character-level deduplication that:

Compares the text content and position of consecutive characters
Removes a character when the same character appears within 3 pixels of the previous one (threshold is configurable)
Preserves spaces and other non-character elements (LTAnno objects)

# After fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60" ✓
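
For reviewers who want the idea without reading the diff, the deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Char`, `Anno`, and `dedupe_chars` are hypothetical names, with `Char` and `Anno` standing in for pdfminer's `LTChar` and `LTAnno`, and the sketch assumes the position of a character is its lower-left corner.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Char:
    """Hypothetical stand-in for a positioned glyph (similar in role to LTChar)."""
    text: str
    x0: float
    y0: float


@dataclass
class Anno:
    """Hypothetical stand-in for a non-positioned item such as a space (similar in role to LTAnno)."""
    text: str


Item = Union[Char, Anno]


def dedupe_chars(items: List[Item], threshold: float = 3.0) -> List[Item]:
    """Drop a Char that repeats the previous kept Char within `threshold` units."""
    if threshold <= 0:
        # A threshold of 0 (or less, by assumption) disables deduplication.
        return list(items)

    result: List[Item] = []
    for item in items:
        prev = result[-1] if result else None
        if (
            isinstance(item, Char)
            and isinstance(prev, Char)
            and item.text == prev.text
            and abs(item.x0 - prev.x0) < threshold
            and abs(item.y0 - prev.y0) < threshold
        ):
            # Same glyph drawn again at nearly the same spot: treat as fake bold.
            continue
        # Anno items and all other characters are kept unchanged.
        result.append(item)
    return result


def to_text(items: List[Item]) -> str:
    """Join the text of all items in order."""
    return "".join(item.text for item in items)
```

Because only a repeated character at nearly the same position is dropped, genuine double letters (for example the "LL" in "HELLO") survive as long as they are spaced further apart than the threshold.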

Configuration

# Default: 3.0 pixels (enabled)
export PDF_CHAR_DUPLICATE_THRESHOLD=3.0

# Disable deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=0

# More aggressive deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=5.0
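
As an illustration of how such a setting could be consumed, here is a small sketch that reads `PDF_CHAR_DUPLICATE_THRESHOLD` from the environment. The helper name `read_threshold` is hypothetical, and the handling of unparseable or negative values (fall back to the default, or treat as disabled) is an assumption, not something this PR specifies.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "PDF_CHAR_DUPLICATE_THRESHOLD"


def read_threshold(env: Optional[Mapping[str, str]] = None, default: float = 3.0) -> float:
    """Return the deduplication threshold; 0.0 means deduplication is disabled."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        # Assumption: an unparseable value falls back to the default.
        return default
    # Assumption: negative values are treated the same as 0 (disabled).
    return max(value, 0.0)
```

Passing a plain dictionary as `env` makes the helper easy to exercise in tests without touching the real process environment.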

bittoby · 2026-01-28T12:33:07Z

@badGarnet Could you please review this PR? Thanks!

fix: remove duplicate characters caused by fake bold rendering in PDFs

f8af84b
