This repository was archived by the owner on Jun 14, 2018. It is now read-only.

pyocr with latest Tesseract fails with pyocr.error.TesseractError: "Error, unknown command line argument '-psm'\n") #99

@ddddavidmartin

Good day,

I'm using pyocr through Paperless on an Ubuntu setup, with Tesseract installed from the tesseract-ocr PPA [0]. With the latest Tesseract version from that PPA [1], pyocr throws an error.

[0]

cat /etc/apt/sources.list.d/alex-p-ubuntu-tesseract-ocr-artful.list
deb http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu artful main

[1]

tesseract --version
tesseract 4.0.0-beta.1-302-g3aa9
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.3.0

Traceback:

littlebig@littlebig:~/Dev/paperless$ python3 /home/littlebig/Dev/paperless/src/manage.py document_consumer
Starting document consumer at /home/littlebig/paperless_consumption_dir with inotify
Parsers available: RasterisedDocumentParser
Consuming /home/littlebig/paperless_consumption_dir/BRW90CDB68D60F5_000798.pdf
Processing sheet #1: /tmp/paperless/paperless-b5bgnwtm/convert-0000.pnm -> /tmp/paperless/paperless-b5bgnwtm/convert-0000.unpaper.pnm
[pgm_pipe @ 0x55cbcbdfb980] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55cbcbe00140] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55cbcbe00140] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for eng
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 290, in image_to_string
    return ocr.image_to_string(f, lang=lang)
  File "/home/littlebig/.local/lib/python3.6/site-packages/pyocr/tesseract.py", line 367, in image_to_string
    raise TesseractError(status, errors)
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\n")
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/littlebig/Dev/paperless/src/manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 98, in handle
    self.loop_inotify(mail_delta)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 131, in loop_inotify
    self.loop_step(mail_delta)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 123, in loop_step
    self.file_consumer.consume_new_files()
  File "/home/littlebig/Dev/paperless/src/documents/consumer.py", line 107, in consume_new_files
    if not self.try_consume_file(file):
  File "/home/littlebig/Dev/paperless/src/documents/consumer.py", line 145, in try_consume_file
    date = parsed_document.get_date()
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
    text = self.get_text()
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
    self._text = self._get_ocr(images)
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 140, in _get_ocr
    raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
    r = pool.map(image_to_string, itertools.product(imgs, [lang]))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\n")
littlebig@littlebig:~/Dev/paperless$
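
The error suggests that this Tesseract build no longer accepts the single-dash `-psm` spelling of the page segmentation option. As a rough sketch only, assuming that Tesseract 4.x expects `--psm` while older 3.x releases take `-psm`, a caller could pick the spelling from the `tesseract --version` output. The helper name `psm_flag` is hypothetical and is not part of pyocr:

```python
import re


def psm_flag(version_output):
    """Pick the page-segmentation flag spelling from `tesseract --version` text.

    Hypothetical helper, not pyocr API. Assumes Tesseract 4.x and later
    expect `--psm`, and that older 3.x releases take `-psm`.
    """
    # The first line looks like "tesseract 4.0.0-beta.1-302-g3aa9".
    match = re.search(r"tesseract\s+(\d+)\.", version_output)
    if match is None:
        # Unknown version string: fall back to the newer spelling.
        return "--psm"
    major = int(match.group(1))
    return "--psm" if major >= 4 else "-psm"


# The version string reported above selects the double-dash form.
print(psm_flag("tesseract 4.0.0-beta.1-302-g3aa9\n leptonica-1.76.0"))
```

Whether pyocr should detect the version this way or simply switch to the double-dash form everywhere is a separate question; the sketch only illustrates the suspected mismatch.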

Has anyone else come across this? Thanks!
