Open
Labels
bug (Something isn't working)
Description
Describe the bug
partition_pdf with strategy='fast' returns an empty list of elements for a PDF that already has embedded text, so OCR should not be needed. Other libraries such as PyMuPDF return the text from the same file instantly.
Reproduction
from unstructured.partition.pdf import partition_pdf
import pymupdf
fname = 'file.PDF'
elements = partition_pdf(filename=fname, strategy='fast')
elements
Out[18]: []
with pymupdf.open(fname) as doc:
    text = chr(12).join([page.get_text() for page in doc])
Out: ...many pages of text
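To help narrow this down, here is a minimal sketch (not part of the original report) that lists which pages carry an embedded text layer according to PyMuPDF. The helper names pages_with_text and embedded_text_pages and the min_chars threshold are hypothetical, introduced only for illustration; the only PyMuPDF calls used are pymupdf.open and page.get_text, as in the reproduction above.

```python
from typing import Iterable, List


def pages_with_text(page_texts: Iterable[str], min_chars: int = 1) -> List[int]:
    """Return 0-based indices of pages whose text has at least
    min_chars non-whitespace characters."""
    return [
        i
        for i, text in enumerate(page_texts)
        # Drop all whitespace before counting so blank pages are not reported.
        if len("".join(text.split())) >= min_chars
    ]


def embedded_text_pages(fname: str) -> List[int]:
    """Hypothetical helper: list the pages of a PDF that PyMuPDF can read text from."""
    import pymupdf  # imported lazily so pages_with_text works without PyMuPDF

    with pymupdf.open(fname) as doc:
        return pages_with_text(page.get_text() for page in doc)
```

If embedded_text_pages('file.PDF') reports most pages while partition_pdf(filename=fname, strategy='fast') still returns [], that would suggest the text layer is present and the problem lies in how the 'fast' strategy reads it, rather than in the file itself.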
Expected behavior
partition_pdf should return chunks of text without running OCR when the PDF has embedded text.
Environment Info
Please run python scripts/collect_env.py and paste the output here.
OS version:  Linux-5.14.0-427.26.1.el9_4.x86_64-x86_64-with-glibc2.34
Python version:  3.12.8
unstructured version:  0.16.15
unstructured-inference version:  0.8.1
pytesseract is not installed
Torch version:  2.5.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.39
magic file from /etc/magic:/usr/share/misc/magic
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'
spartan-tridoan