bug/PaddleOCR language specification issue #3957

**Describe the bug**
After updating to version 0.17.0, I am experiencing the same issue as #3400. When supplying `languages=["en"]` to `partition_pdf` with a strategy of either `"auto"` or `"ocr_only"`, the OCR Agent is not passed through, which causes the following error to occur:
```
Traceback (most recent call last):
  File "/<MY_DIR>/main.py", line 20, in <module>
    elements = partition_pdf(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 581, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 816, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 774, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 228, in partition_pdf
    return partition_pdf_or_image(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 379, in partition_pdf_or_image
    elements = _partition_pdf_or_image_with_ocr(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 934, in _partition_pdf_or_image_with_ocr
    page_elements = _partition_pdf_or_image_with_ocr_from_image(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 962, in _partition_pdf_or_image_with_ocr_from_image
    ocr_agent = OCRAgent.get_agent(language=ocr_languages)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 34, in get_agent
    return cls.get_instance(ocr_agent_cls_qname, language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 49, in get_instance
    return loaded_class(language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 23, in __init__
    self.agent = self.load_agent(language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 45, in load_agent
    paddle_ocr = PaddleOCR(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 610, in __init__
    lang, det_lang = parse_lang(params.lang)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 479, in parse_lang
    lang in MODEL_URLS["OCR"][DEFAULT_OCR_MODEL_VERSION]["rec"]
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
```

**To Reproduce**
Setup:
Run `pip install "unstructured[pdf]"==0.17.0 paddlepaddle unstructured.paddleocr`. I also had to run `pip uninstall torch -y` and then `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu`.

```
import os  # needed for the os.environ assignments below

from unstructured.partition.pdf import partition_pdf
import unstructured.partition.utils.ocr_models.paddle_ocr as paddle_ocr_module
from unstructured_inference.inference.layoutelement import LayoutElements

paddle_ocr_module.LayoutElements = LayoutElements # workaround for #3931

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle" # found in an old issue
os.environ["DEFAULT_PADDLE_LANG"] = "en" # found in an old issue
filename = "path_to_your_file.pdf"

elements = partition_pdf(
    filename=filename,
    strategy="ocr_only",
    languages=["en"],
    table_ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
)
```

**Expected behavior**
The script should run without errors and return elements.

**Environment Info**
Broken Env:
```
OS version:  Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version:  3.10.12
unstructured version:  0.17.0
unstructured-inference version:  0.8.9
pytesseract is not installed
Torch version:  2.6.0+cpu
Detectron2 is not installed
PaddleOCR version:  2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed
```

Working Env:
```
OS version:  Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version:  3.10.12
unstructured version:  0.16.25
unstructured-inference version:  0.8.9
pytesseract is not installed
Torch version:  2.6.0+cpu
Detectron2 is not installed
PaddleOCR version:  2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed
```

**Additional context**
From what I can tell, the `ocr_agent` isn't being passed to `_partition_pdf_or_image_with_ocr_from_image`. The `tesseract_to_paddle_language` conversion only happens inside `_partition_pdf_or_image_local` > `process_file_with_ocr` > `supplement_page_layout_with_ocr`, which `_partition_pdf_or_image_with_ocr` never calls. So when `languages=["en"]` is passed, the languages are converted to the Tesseract format by `ocr_languages = prepare_languages_for_tesseract(languages)` [here](https://github.com/Unstructured-IO/unstructured/blob/66bf4b01984e75898ea256a6c00a541956fbbf5e/unstructured/partition/pdf.py#L335), and the resulting Tesseract code (`eng`) reaches `ocr_agent = OCRAgent.get_agent(language=ocr_languages)` unconverted, which breaks [here](https://github.com/Unstructured-IO/unstructured/blob/66bf4b01984e75898ea256a6c00a541956fbbf5e/unstructured/partition/pdf.py#L962).
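
To illustrate the missing step, here is a minimal standalone sketch of the kind of conversion that would need to happen before the Paddle agent is created on the `ocr_only` path. It is not the library's code: `to_paddle_language` and `TESSERACT_TO_PADDLE` are hypothetical names, the mapping table is a small illustrative subset, and I'm assuming Tesseract language strings are joined with `+`. The accepted Paddle codes are copied from the `AssertionError` above.

```python
# Paddle language codes accepted by parse_lang, copied from the AssertionError.
PADDLE_LANGS = {
    "ch", "en", "korean", "japan", "chinese_cht", "ta", "te", "ka",
    "latin", "arabic", "cyrillic", "devanagari",
}

# Hypothetical, partial Tesseract -> Paddle mapping (illustration only;
# not the table used by unstructured's tesseract_to_paddle_language).
TESSERACT_TO_PADDLE = {
    "eng": "en",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
    "kor": "korean",
    "jpn": "japan",
}


def to_paddle_language(code: str, default: str = "en") -> str:
    """Return a Paddle-compatible code for a Tesseract-style language string."""
    if code in PADDLE_LANGS:
        # Already a valid Paddle code; pass it through unchanged.
        return code
    # Assumption: multi-language Tesseract strings look like "eng+kor".
    # Paddle takes a single language, so only the first entry is used.
    first = code.split("+")[0]
    return TESSERACT_TO_PADDLE.get(first, default)


print(to_paddle_language("eng"))
```

With a conversion like this applied to `ocr_languages` before `OCRAgent.get_agent(...)` is called for the Paddle agent, the `eng` value from the traceback would become `en`, which is in the accepted list.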
