Description
Describe the bug
After updating to version 0.17.0, I am experiencing the same issue as #3400. When supplying languages=["en"] to partition_pdf with a strategy of either "auto" or "ocr_only", the OCR agent is not passed through, which causes the following error:
Traceback (most recent call last):
File "/<MY_DIR>/main.py", line 20, in <module>
elements = partition_pdf(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 581, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 816, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 774, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 228, in partition_pdf
return partition_pdf_or_image(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 379, in partition_pdf_or_image
elements = _partition_pdf_or_image_with_ocr(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 934, in _partition_pdf_or_image_with_ocr
page_elements = _partition_pdf_or_image_with_ocr_from_image(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 962, in _partition_pdf_or_image_with_ocr_from_image
ocr_agent = OCRAgent.get_agent(language=ocr_languages)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 34, in get_agent
return cls.get_instance(ocr_agent_cls_qname, language)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 49, in get_instance
return loaded_class(language)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 23, in __init__
self.agent = self.load_agent(language)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 45, in load_agent
paddle_ocr = PaddleOCR(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 610, in __init__
lang, det_lang = parse_lang(params.lang)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 479, in parse_lang
lang in MODEL_URLS["OCR"][DEFAULT_OCR_MODEL_VERSION]["rec"]
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
To Reproduce
Setup:
Run pip install "unstructured[pdf]"==0.17.0 paddlepaddle unstructured.paddleocr. I also had to run pip uninstall torch -y and then pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Script:
import os

import unstructured.partition.utils.ocr_models.paddle_ocr as paddle_ocr_module
from unstructured.partition.pdf import partition_pdf
from unstructured_inference.inference.layoutelement import LayoutElements

paddle_ocr_module.LayoutElements = LayoutElements  # workaround for #3931

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"  # found in an old issue
os.environ["DEFAULT_PADDLE_LANG"] = "en"  # found in an old issue

filename = "path_to_your_file.pdf"
elements = partition_pdf(
    filename=filename,
    strategy="ocr_only",
    languages=["en"],
    table_ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
)
Expected behavior
The script should run without errors and return the partitioned elements.
Environment Info
Broken Env:
OS version: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version: 3.10.12
unstructured version: 0.17.0
unstructured-inference version: 0.8.9
pytesseract is not installed
Torch version: 2.6.0+cpu
Detectron2 is not installed
PaddleOCR version: 2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed
Working Env:
OS version: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version: 3.10.12
unstructured version: 0.16.25
unstructured-inference version: 0.8.9
pytesseract is not installed
Torch version: 2.6.0+cpu
Detectron2 is not installed
PaddleOCR version: 2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed
Additional context
From what I can tell, the ocr_agent isn't being passed to _partition_pdf_or_image_with_ocr_from_image. The tesseract_to_paddle_language call is only made inside _partition_pdf_or_image_local > process_file_with_ocr > supplement_page_layout_with_ocr, which is not called by _partition_pdf_or_image_with_ocr. As a result, languages=["en"] is converted to the Tesseract language format by ocr_languages = prepare_languages_for_tesseract(languages), and the later call ocr_agent = OCRAgent.get_agent(language=ocr_languages) breaks because PaddleOCR receives "eng" instead of "en".
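
To illustrate the conversion step that appears to be skipped on the "ocr_only" path, here is a minimal standalone sketch. It is not the library's actual tesseract_to_paddle_language implementation: the function name to_paddle_lang and the mapping table are hypothetical and only partial, and the set of accepted keys is copied from the AssertionError in the traceback above.

```python
# Hypothetical sketch, NOT the actual unstructured implementation.

# Language keys PaddleOCR accepts, copied from the AssertionError message.
PADDLE_LANGS = {
    "ch", "en", "korean", "japan", "chinese_cht", "ta", "te", "ka",
    "latin", "arabic", "cyrillic", "devanagari",
}

# Assumed, partial mapping from Tesseract codes to PaddleOCR keys.
TESSERACT_TO_PADDLE = {
    "eng": "en",
    "kor": "korean",
    "jpn": "japan",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
}


def to_paddle_lang(tesseract_lang: str) -> str:
    """Convert a Tesseract language string to a single PaddleOCR key."""
    # Tesseract joins several languages with "+"; PaddleOCR takes one key,
    # so this sketch simply keeps the first language.
    first = tesseract_lang.split("+")[0]
    # Fall back to the input itself so values that are already valid pass through.
    paddle_lang = TESSERACT_TO_PADDLE.get(first, first)
    if paddle_lang not in PADDLE_LANGS:
        raise ValueError(f"unsupported PaddleOCR language: {paddle_lang!r}")
    return paddle_lang


if __name__ == "__main__":
    # "eng" is what reaches PaddleOCR in the traceback; "en" is what it expects.
    print(to_paddle_lang("eng"))
```

If a conversion like this ran before OCRAgent.get_agent in _partition_pdf_or_image_with_ocr_from_image, the value handed to PaddleOCR would presumably be "en" rather than "eng". Where exactly the fix belongs is for the maintainers to decide.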