Skip to content

bug/PaddleOCR language specification issue #3957

Open
@joshrbarcodefactory

Description

@joshrbarcodefactory

Describe the bug
After updating to version 0.17.0, I am experiencing the same issue as #3400. When supplying languages=["en"] to partition_pdf with a strategy of either "auto" or "ocr_only", the OCR Agent is not passed through, which causes the following error to occur:

Traceback (most recent call last):
  File "/<MY_DIR>/main.py", line 20, in <module>
    elements = partition_pdf(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 581, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 816, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 774, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 228, in partition_pdf
    return partition_pdf_or_image(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 379, in partition_pdf_or_image
    elements = _partition_pdf_or_image_with_ocr(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 934, in _partition_pdf_or_image_with_ocr
    page_elements = _partition_pdf_or_image_with_ocr_from_image(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 962, in _partition_pdf_or_image_with_ocr_from_image
    ocr_agent = OCRAgent.get_agent(language=ocr_languages)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 34, in get_agent
    return cls.get_instance(ocr_agent_cls_qname, language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 49, in get_instance
    return loaded_class(language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 23, in __init__
    self.agent = self.load_agent(language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 45, in load_agent
    paddle_ocr = PaddleOCR(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 610, in __init__
    lang, det_lang = parse_lang(params.lang)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 479, in parse_lang
    lang in MODEL_URLS["OCR"][DEFAULT_OCR_MODEL_VERSION]["rec"]
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng

To Reproduce
Setup:
Run pip install "unstructured[pdf]"==0.17.0 paddlepaddle unstructured.paddleocr. I also had to run pip uninstall torch -y and then pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

from unstructured.partition.pdf import partition_pdf
import unstructured.partition.utils.ocr_models.paddle_ocr as paddle_ocr_module
from unstructured_inference.inference.layoutelement import LayoutElements

paddle_ocr_module.LayoutElements = LayoutElements # workaround for #3931

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle" # found in an old issue
os.environ["DEFAULT_PADDLE_LANG"] = "en" # found in an old issue
filename = "path_to_your_file.pdf"

elements = partition_pdf(
    filename=filename,
    strategy="ocr_only",
    languages=["en"],
    table_ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
)

Expected behavior
Script would run without errors and return elements.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment Info
Please run python scripts/collect_env.py and paste the output here.
Broken Env:

OS version:  Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version:  3.10.12
unstructured version:  0.17.0
unstructured-inference version:  0.8.9
pytesseract is not installed
Torch version:  2.6.0+cpu
Detectron2 is not installed
PaddleOCR version:  2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed

Working Env:

OS version:  Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version:  3.10.12
unstructured version:  0.16.25
unstructured-inference version:  0.8.9
pytesseract is not installed
Torch version:  2.6.0+cpu
Detectron2 is not installed
PaddleOCR version:  2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed

Additional context
From what I can tell, the ocr_agent isn't being passed to _partition_pdf_or_image_with_ocr_from_image. Since the tesseract_to_paddle_language call is only being done inside _partition_pdf_or_image_local > process_file_with_ocr > supplement_page_layout_with_ocr (which doesn't get called by _partition_pdf_or_image_with_ocr) passing languages=["en"] has the languages changed to the tesseract language structure ocr_languages = prepare_languages_for_tesseract(languages) here, which causes the call to ocr_agent = OCRAgent.get_agent(language=ocr_languages) to break here

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions