Describe the bug
The user gets a TesseractError when processing a particular document.
To Reproduce
The code was an API call that submitted a particular image-based document.
Expected behavior
Document processed successfully.
Environment Info
Running in self-hosted open-source API.
Unstructured version 0.12.3.
Tesseract version 5.3.3
Additional context
The user was able to process the document successfully with Tesseract version 4.1.1.
Stack trace:
  File "/home/notebook-user/unstructured/partition/pdf.py", line 213, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/unstructured/partition/pdf.py", line 298, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf.py", line 494, in _partition_pdf_or_image_local
    final_document_layout = process_data_with_ocr(
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 82, in process_data_with_ocr
    merged_layouts = process_file_with_ocr(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 178, in process_file_with_ocr
    raise e
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 166, in process_file_with_ocr
    merged_page_layout = supplement_page_layout_with_ocr(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 202, in supplement_page_layout_with_ocr
    ocr_layout = ocr_agent.get_layout_from_image(
  File "/home/notebook-user/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 48, in get_layout_from_image
    ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 591, in image_to_data
    return {
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 593, in <lambda>
    Output.DATAFRAME: lambda: get_pandas_output(
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 568, in get_pandas_output
    return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 347, in run_and_get_output
    run_tesseract(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 279, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 252')
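A possible reading of the error tuple, offered as a diagnostic sketch and not a confirmed root cause: the last frame raises TesseractError with proc.returncode as its first element. If that value comes from Python's subprocess module (an assumption based on the traceback alone), a negative code -N means the child process was terminated by signal N on POSIX systems. Under that assumption, -8 would mean the tesseract binary was killed by signal 8 (SIGFPE on Linux), and 'Estimating resolution as 252' would be an informational line Tesseract wrote to stderr before it died, not the cause itself. The helper name describe_returncode is hypothetical and is not part of unstructured or pytesseract.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Describe a subprocess return code using the POSIX convention.

    Hypothetical helper for triage; assumes the code came from
    subprocess.Popen.returncode, where -N means "killed by signal N".
    """
    if returncode < 0:
        try:
            # Map the signal number to its symbolic name, e.g. 8 -> SIGFPE.
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {-returncode}"
        return f"terminated by {name}"
    if returncode == 0:
        return "exited successfully"
    return f"exited with status {returncode}"


# First element of the TesseractError tuple in the stack trace above.
print(describe_returncode(-8))
```

If this reading holds, the failure would be a crash inside the Tesseract 5.3.3 binary on this particular image, which would be consistent with the same document processing successfully under 4.1.1.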