Derived from #249
Problem
Short text chunks (1-3 characters) can have abnormally wide bounding boxes in certain PDFs. For example, a single character "4" with height 10 may have a bounding box width of 42 (expected ~7).
This causes the text to span across table cell boundaries, leading to incorrect cell assignment in TableBorderProcessor.getTextChunkPartForTableCell().
Reproduction
Document: odl-test-fixtures/documents/pdf/1218_간질환정보집_최종본.pdf, page 31
Table: [표 4] (Table 4) Child-Pugh classification, 프로트롬빈 시간 (prothrombin time) row
Same table as #257; the abnormal width contributes to the misassignment.
Existing test
ContentFilterProcessorTest.testShortTextWithAbnormallyWideBoundingBox() documents the current abnormal-width behavior and includes instructions for flipping it once the fix is in place.
Fix direction
Add fixAbnormalTextChunkBoundingBoxes() in ContentFilterProcessor, called before mergeCloseTextChunks():
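- Detect: width > charCount * height * 0.7 * 3, i.e. the width is more than 3 times the expected width (charCount * height * 0.7).
- Correct: set rightX = leftX + charCount * height * 0.7, which shrinks the box to the expected width.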
Affected file: ContentFilterProcessor.java
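
A minimal sketch of the detect-and-correct rule above, not the actual implementation. It assumes a simplified, hypothetical Chunk class in place of the project's real text chunk type, and the constant names CHAR_WIDTH_RATIO and ABNORMAL_FACTOR are made up for illustration. Only the 0.7 and 3 values and the method name fixAbnormalTextChunkBoundingBoxes() come from this issue.

```java
import java.util.Arrays;
import java.util.List;

final class AbnormalWidthSketch {
    // Estimated glyph width as a fraction of text height (0.7, from this issue).
    static final double CHAR_WIDTH_RATIO = 0.7;
    // A box is treated as abnormal above 3x the expected width (from this issue).
    static final double ABNORMAL_FACTOR = 3.0;

    /** Hypothetical stand-in for the project's text chunk type. */
    static final class Chunk {
        final String text;
        final double leftX;
        double rightX;
        final double height;

        Chunk(String text, double leftX, double rightX, double height) {
            this.text = text;
            this.leftX = leftX;
            this.rightX = rightX;
            this.height = height;
        }

        double width() {
            return rightX - leftX;
        }
    }

    /** Expected width: charCount * height * 0.7. */
    static double expectedWidth(Chunk c) {
        int charCount = c.text.codePointCount(0, c.text.length());
        return charCount * c.height * CHAR_WIDTH_RATIO;
    }

    /** Detect: width > charCount * height * 0.7 * 3. */
    static boolean isAbnormallyWide(Chunk c) {
        return c.width() > expectedWidth(c) * ABNORMAL_FACTOR;
    }

    /** Correct: rightX = leftX + charCount * height * 0.7 for each abnormal chunk. */
    static void fixAbnormalTextChunkBoundingBoxes(List<Chunk> chunks) {
        for (Chunk c : chunks) {
            if (isAbnormallyWide(c)) {
                c.rightX = c.leftX + expectedWidth(c);
            }
        }
    }

    public static void main(String[] args) {
        // The "4" example from the Problem section: height 10, width 42.
        Chunk wide = new Chunk("4", 100.0, 142.0, 10.0);
        // A two-character chunk with a plausible width of 14.
        Chunk normal = new Chunk("12", 200.0, 214.0, 10.0);

        fixAbnormalTextChunkBoundingBoxes(Arrays.asList(wide, normal));

        System.out.println("wide chunk width after fix: " + wide.width());
        System.out.println("normal chunk width after fix: " + normal.width());
    }
}
```

For the "4" example, the expected width is 1 * 10 * 0.7 = 7 and the detection threshold is 21, so a width of 42 is flagged and reduced to 7. The sketch applies the check to every chunk; the formulas in this issue do not say whether the check should be limited to 1-3 character chunks.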