Description:
Installing unstructured[pdf]==0.16.11 pulls in the latest pdfminer.six (20250327), which causes an ImportError when a PDF is loaded.
Steps to reproduce:
- pip install "unstructured[pdf]==0.16.11"
- Run the following code:
from langchain_community.document_loaders import UnstructuredPDFLoader

doc = './IF10244.pdf'
loader = UnstructuredPDFLoader(
    file_path=doc,
    strategy='hi_res',
    extract_images_in_pdf=True,
    infer_table_structure=True,
    mode='elements',
    image_output_dir_path='./figures',
)
data = loader.load()
Error:
ImportError: cannot import name 'PSSyntaxError' from 'pdfminer.pdfparser' (/usr/local/lib/python3.11/dist-packages/pdfminer/pdfparser.py)
The issue appears to be that unstructured imports PSSyntaxError from pdfminer.pdfparser, but that name can no longer be imported from that module in the newest version of pdfminer.six.
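To narrow down where the name went, a small diagnostic like the one below may help. It is only a sketch: locate_name is a hypothetical helper, and the candidate module list is my guess at where pdfminer.six might define or re-export PSSyntaxError, not something confirmed from its source. Modules that are missing from the environment are skipped.

```python
import importlib


def locate_name(name, module_names):
    """Return the entries of module_names that can be imported and expose `name`."""
    found = []
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or its parent package) is not installed here; skip it.
            continue
        if hasattr(module, name):
            found.append(module_name)
    return found


# Assumption: these are plausible locations, not a verified list.
CANDIDATES = ["pdfminer.pdfparser", "pdfminer.psparser", "pdfminer.psexceptions"]

if __name__ == "__main__":
    print(locate_name("PSSyntaxError", CANDIDATES))
```

Running this once under each pdfminer.six version should show whether the class was removed entirely or is just no longer re-exported from pdfminer.pdfparser.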
Workaround:
Downgrading pdfminer.six resolves the issue:
pip install pdfminer.six==20240706
Environment:
Python version: 3.11
OS: Linux
unstructured: 0.16.11
pdfminer.six: 20250327 (fails), 20240706 (works)
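To check which of the two versions above an environment actually has, something like the following can be run before loading a PDF. This is a sketch: installed_version and classify are hypothetical helpers, and the "works"/"fails" labels only cover the two versions tested in this report. Any other version is reported as untested.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from this report; nothing else has been tested.
KNOWN_GOOD = "20240706"
KNOWN_BAD = "20250327"


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def classify(pdfminer_version):
    """Map a pdfminer.six version string to the status observed in this report."""
    if pdfminer_version is None:
        return "not installed"
    if pdfminer_version == KNOWN_BAD:
        return "fails"
    if pdfminer_version == KNOWN_GOOD:
        return "works"
    return "untested"


if __name__ == "__main__":
    found = installed_version("pdfminer.six")
    print(f"pdfminer.six {found}: {classify(found)}")
```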