fix: correct abnormal bounding box width for short text chunks in ContentFilterProcessor #258

@bundolee

Derived from #249

Problem

Short text chunks (1-3 characters) can have abnormally wide bounding boxes in certain PDFs. For example, a single character "4" with height 10 may have a bounding box width of 42, about six times the expected width of ~7.

This causes the text to span across table cell boundaries, leading to incorrect cell assignment in TableBorderProcessor.getTextChunkPartForTableCell().

Reproduction

Document: odl-test-fixtures/documents/pdf/1218_간질환정보집_최종본.pdf, page 31
Table: [표 4] (Table 4) Child-Pugh classification, 프로트롬빈 시간 (prothrombin time) row

Same table as #257 — the abnormal width contributes to the misassignment.

Existing test

ContentFilterProcessorTest.testShortTextWithAbnormallyWideBoundingBox() documents the current abnormal-width behavior and includes instructions for flipping its assertions once the fix is in place.

Fix direction

Add fixAbnormalTextChunkBoundingBoxes() in ContentFilterProcessor, called before mergeCloseTextChunks():

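  • Detect: width > charCount * height * 0.7 * 3 (more than 3 times the expected width, taking height * 0.7 as the expected width per character)
  • Correct: Set rightX = leftX + charCount * height * 0.7 (clamp the box to the expected width)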
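
The detect and correct rules above can be sketched roughly as follows. This is only an illustration under assumptions: Chunk is a hypothetical stand-in for the project's real text chunk type, the constant names are invented here, and the guard for empty chunks is not part of the proposal. Only the 0.7 ratio, the factor of 3, and the method name fixAbnormalTextChunkBoundingBoxes() come from this issue.

```java
import java.util.List;

// Sketch only: Chunk is a hypothetical stand-in for the project's text chunk type.
final class BoundingBoxFixSketch {

    // Expected glyph width as a fraction of text height (0.7 in the issue).
    static final double CHAR_WIDTH_RATIO = 0.7;
    // A box wider than this multiple of the expected width counts as abnormal (3 in the issue).
    static final double ABNORMAL_WIDTH_FACTOR = 3.0;

    static final class Chunk {
        final String text;
        final double leftX;
        double rightX; // mutable so the fix can clamp it
        final double height;

        Chunk(String text, double leftX, double rightX, double height) {
            this.text = text;
            this.leftX = leftX;
            this.rightX = rightX;
            this.height = height;
        }

        double width() {
            return rightX - leftX;
        }
    }

    // charCount * height * 0.7, counting code points rather than UTF-16 units.
    static double expectedWidth(Chunk chunk) {
        int charCount = chunk.text.codePointCount(0, chunk.text.length());
        return charCount * chunk.height * CHAR_WIDTH_RATIO;
    }

    // Detect: width > charCount * height * 0.7 * 3.
    static boolean hasAbnormalWidth(Chunk chunk) {
        double expected = expectedWidth(chunk);
        if (expected <= 0) {
            return false; // assumption: leave empty or zero-height chunks untouched
        }
        return chunk.width() > expected * ABNORMAL_WIDTH_FACTOR;
    }

    // Correct: rightX = leftX + charCount * height * 0.7. Returns the number of chunks changed.
    static int fixAbnormalTextChunkBoundingBoxes(List<Chunk> chunks) {
        int corrected = 0;
        for (Chunk chunk : chunks) {
            if (hasAbnormalWidth(chunk)) {
                chunk.rightX = chunk.leftX + expectedWidth(chunk);
                corrected++;
            }
        }
        return corrected;
    }
}
```

For the "4" example (height 10, width 42), the expected width is 1 * 10 * 0.7 = 7 and the detection threshold is 21, so the box would be clamped to leftX + 7. A three-character chunk of height 10 with width 30 stays below its threshold of 63 and is left alone.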

Affected file: ContentFilterProcessor.java
