Commit 0f1576a

bug: PDF file upload failed - Could not initialize tesseract
Uploading a PDF failed with the following error:
unstructured_pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /usr/share/tessdata/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
1 parent b9ba6e3 commit 0f1576a

1 file changed: +3 -0 lines changed

lib/shared/file-import-dockerfile

@@ -10,6 +10,9 @@ RUN pip uninstall -y `pip freeze | grep torch` && pip uninstall -y `pip freeze |
 # Torch is needed for image analysis in pdfs (using CPU version)
 RUN pip install torch==2.3.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
 
+# This is required to process the pdf files produced by 'Microsoft: Print to PDF'
+RUN apk add --no-cache tesseract-eng
+
 # Remove previous layers to create a smaller image
 FROM scratch
 COPY --from=source / /
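
The error in the commit message means Tesseract could not find `eng.traineddata` in `/usr/share/tessdata`. The change above fixes this by installing the English language data. A small check like the one below could confirm that the file is present in the built image before any PDF is processed. This is a minimal sketch and not part of the commit: `find_traineddata` is a hypothetical helper, and its lookup order (`TESSDATA_PREFIX` first, then the default directory from the error message) only approximates Tesseract's own search logic.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Directory named in the error message from the commit description.
DEFAULT_TESSDATA_DIR = "/usr/share/tessdata"


def find_traineddata(
    lang: str = "eng",
    env: Optional[Mapping[str, str]] = None,
    default_dir: str = DEFAULT_TESSDATA_DIR,
) -> Optional[Path]:
    """Return the path to <lang>.traineddata, or None if it is missing.

    Hypothetical helper: it looks in TESSDATA_PREFIX (if set) and then in
    default_dir. Tesseract itself may search additional locations.
    """
    env = os.environ if env is None else env

    # Build the list of directories to check, in priority order.
    candidates = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        candidates.append(Path(prefix))
    candidates.append(Path(default_dir))

    # Return the first directory that actually contains the language file.
    for directory in candidates:
        path = directory / f"{lang}.traineddata"
        if path.is_file():
            return path
    return None
```

Calling `find_traineddata("eng")` inside the container and failing the build or startup when it returns `None` would surface a missing language pack earlier than the first PDF upload.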
