bug/Two-column PDF partition results in incorrect text. #3325

Open
@pfcharles

Describe the bug
When running partition on a two-column PDF, text extraction places some of the text in the wrong position.
To Reproduce
two_col.pdf

Provide a code snippet that reproduces the issue.
from unstructured.partition.auto import partition
elements = partition("two_col.pdf", strategy="fast")

text attribute of elements[2] = '1. Exchange of Information. The parties agree to exchange Confidential Information for the purpose of (the evaluating a potential business "Purpose") in accordance with this Agreement.'
text attribute of elements[3] = 'relationship'

Actual text from the PDF = '1.Exchange of Information. The parties agree to exchange Confidential Information for the purpose of evaluating a potential business relationship (the "Purpose") in accordance with this Agreement.'
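
To make the displacement easier to see, here is a minimal sketch that diffs the extracted text against the actual text word by word, using only the standard library. The strings are copied from the report above. `word_diff` is a hypothetical helper name, not part of unstructured, and joining the two elements with a space is an assumption about how they should be compared.

```python
import difflib

# Text of elements[2] and elements[3] as reported above.
extracted = [
    '1. Exchange of Information. The parties agree to exchange Confidential '
    'Information for the purpose of (the evaluating a potential business '
    '"Purpose") in accordance with this Agreement.',
    'relationship',
]

# Actual text from the PDF as reported above.
expected = (
    '1.Exchange of Information. The parties agree to exchange Confidential '
    'Information for the purpose of evaluating a potential business '
    'relationship (the "Purpose") in accordance with this Agreement.'
)


def word_diff(extracted_parts, expected_text):
    """Return (tag, extracted_words, expected_words) for each mismatching span.

    Hypothetical helper: joins the element texts with spaces, splits both
    sides on whitespace, and keeps only the non-equal opcodes.
    """
    got = " ".join(extracted_parts).split()
    want = expected_text.split()
    matcher = difflib.SequenceMatcher(a=got, b=want, autojunk=False)
    return [
        (tag, got[i1:i2], want[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    for tag, got_words, want_words in word_diff(extracted, expected):
        print(f"{tag:8} extracted={got_words} expected={want_words}")
```

Comparing at word level rather than character level keeps the output short and points directly at the tokens that were reordered, such as "(the" and "relationship".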

two_col.json

Expected behavior
Extracted text matches the actual text

Screenshots
image

Environment Info
Please run python scripts/collect_env.py and paste the output here.
OS version: macOS-14.5-arm64-arm-64bit
Python version: 3.9.6
unstructured version: 0.14.9
unstructured-inference version: 0.7.36
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
LibreOffice version: ==> libreoffice: 24.2.4

Labels: bug, pdf
