bug/Partition-PDF-empty-elements #3885

Open
@MackBlackburn

Describe the bug
partition_pdf with strategy='fast' returns an empty list of elements for a PDF that has embedded text and does not need OCR. Other libraries such as PyMuPDF return the text from the same file instantly.

Reproduction

from unstructured.partition.pdf import partition_pdf
import pymupdf

fname = 'file.PDF'

# 'fast' strategy: no elements come back
elements = partition_pdf(filename=fname, strategy='fast')
print(elements)
# Out: []

# PyMuPDF reads the embedded text from the same file
with pymupdf.open(fname) as doc:
    text = chr(12).join([page.get_text() for page in doc])
print(text)
# Out: ...many pages of text
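
To narrow this down, it might help to see whether other strategies also return nothing for the same file. The following is only a sketch and has not been run against the failing PDF. `count_elements_by_strategy` is a hypothetical helper, not part of unstructured; it assumes the callable passed in accepts `filename` and `strategy` keyword arguments the way `partition_pdf` does. The `"auto"` strategy may need OCR or model dependencies that are missing here, so failures are recorded rather than raised.

```python
def count_elements_by_strategy(partition, fname, strategies=("fast", "auto")):
    """Call `partition` once per strategy and record how many elements come back.

    `partition` is expected to behave like unstructured's `partition_pdf`
    (keyword arguments `filename` and `strategy`). A failing strategy is
    recorded by its exception class name instead of aborting the comparison.
    """
    results = {}
    for strategy in strategies:
        try:
            results[strategy] = len(partition(filename=fname, strategy=strategy))
        except Exception as exc:  # e.g. a missing OCR dependency
            results[strategy] = type(exc).__name__
    return results


# Usage (requires unstructured to be installed):
# from unstructured.partition.pdf import partition_pdf
# print(count_elements_by_strategy(partition_pdf, "file.PDF"))
```

If only 'fast' reports 0 elements, the problem is likely specific to the text-extraction path rather than to the file itself.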

Expected behavior
partition_pdf should return chunks of text without running OCR when the PDF has embedded text.
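
To back up the claim that the file has embedded text on every page, a quick per-page check with PyMuPDF could look like the sketch below. Both function names are hypothetical helpers written for this report; only `pymupdf.open` and `page.get_text()` come from PyMuPDF, as in the reproduction above.

```python
def summarize_text_layer(page_texts):
    """Return (pages_with_text, total_pages, total_chars) for per-page strings.

    Whitespace-only pages count as having no text.
    """
    stripped = [text.strip() for text in page_texts]
    pages_with_text = sum(1 for text in stripped if text)
    total_chars = sum(len(text) for text in stripped)
    return pages_with_text, len(stripped), total_chars


def read_page_texts(fname):
    """Extract the embedded text of each page with PyMuPDF (no OCR involved)."""
    import pymupdf  # imported lazily so summarize_text_layer works without it

    with pymupdf.open(fname) as doc:
        return [page.get_text() for page in doc]


# Usage (requires pymupdf to be installed):
# with_text, total, chars = summarize_text_layer(read_page_texts("file.PDF"))
# print(f"{with_text}/{total} pages have embedded text ({chars} characters)")
```

If every page reports embedded text, OCR should not be required for this file.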

Environment Info

OS version:  Linux-5.14.0-427.26.1.el9_4.x86_64-x86_64-with-glibc2.34
Python version:  3.12.8
unstructured version:  0.16.15
unstructured-inference version:  0.8.1
pytesseract is not installed
Torch version:  2.5.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.39
magic file from /etc/magic:/usr/share/misc/magic
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'

Labels: bug
