Description
Describe the bug
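When running partition on a two-column PDF with strategy="fast", text extraction puts some words in the wrong position: words are reordered within an element, and one word is split off into a separate element.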
To Reproduce
two_col.pdf
from unstructured.partition.auto import partition
elements = partition("two_col.pdf", strategy="fast")
elements[2].text = '1. Exchange of Information. The parties agree to exchange Confidential Information for the purpose of (the evaluating a potential business "Purpose") in accordance with this Agreement.'
elements[3].text = 'relationship'
Actual text in the PDF = '1.Exchange of Information. The parties agree to exchange Confidential Information for the purpose of evaluating a potential business relationship (the "Purpose") in accordance with this Agreement.'
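To make the displacement easier to see, here is a minimal sketch that diffs the extracted text against the actual text word by word. It uses only the standard library and the strings quoted above; the helper name word_diff is hypothetical and is not part of unstructured.

```python
import difflib

# Strings copied verbatim from this report (including the missing space
# after "1." in the actual text, which shows up as its own difference).
extracted = (
    '1. Exchange of Information. The parties agree to exchange Confidential '
    'Information for the purpose of (the evaluating a potential business '
    '"Purpose") in accordance with this Agreement.'
)
expected = (
    '1.Exchange of Information. The parties agree to exchange Confidential '
    'Information for the purpose of evaluating a potential business '
    'relationship (the "Purpose") in accordance with this Agreement.'
)


def word_diff(actual: str, wanted: str) -> list:
    """Return one line per word-level difference between two strings.

    Hypothetical helper for illustration only.
    """
    a, b = actual.split(), wanted.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(f"{tag}: got {a[i1:i2]!r}, expected {b[j1:j2]!r}")
    return changes


for line in word_diff(extracted, expected):
    print(line)
```

The word 'relationship' is reported as missing from elements[2] because it was emitted separately as elements[3], and '(the' appears at a different position than in the actual text.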
Expected behavior
The extracted text should match the actual text in the PDF, with the words in reading order.
Environment Info
OS version: macOS-14.5-arm64-arm-64bit
Python version: 3.9.6
unstructured version: 0.14.9
unstructured-inference version: 0.7.36
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
LibreOffice version: 24.2.4