This repository was archived by the owner on Dec 18, 2019. It is now read-only.

OCR language

jflesch edited this page Sep 24, 2014 · 10 revisions

By default, Paperwork uses Tesseract for OCR. If Tesseract is unavailable, it falls back to Cuneiform.

To get better results, OCR tools need to know the language used in the document(s).

The languages available in Paperwork's settings dialog are those understood by the automatically selected OCR tool (Tesseract or Cuneiform). If your language is not in the list, the OCR tool doesn't have the data required to read it.

Note that Paperwork also automatically uses the available spellcheckers (aspell, ispell, myspell, etc.) to improve detection of the page orientation. This means your spellchecker must have the dictionary for your language installed. Warning: if no spellchecker is installed, or if it lacks the required dictionary, Paperwork will try to detect the orientation without spellchecking, and no error dialog is displayed.
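Since a missing dictionary produces no error dialog, it can help to check by hand what is installed. The following is only a rough sketch, not part of Paperwork: `check_ocr_lang` is a hypothetical helper name, and it assumes Tesseract is the OCR tool and aspell is the spellchecker in use (Cuneiform, ispell and myspell are not covered). It relies on `tesseract --list-langs` and `aspell dicts`, which each print one installed language per line.

```shell
#!/bin/sh
# Hypothetical helper (not part of Paperwork): report whether the OCR data
# and the spelling dictionary for a language appear to be installed.
#   $1: Tesseract language code, e.g. "fra"
#   $2: aspell dictionary code, e.g. "fr"
check_ocr_lang() {
    ocr_lang=$1
    dict_lang=$2

    # Some Tesseract releases print the language list on stderr,
    # so both streams are merged before matching whole lines.
    if command -v tesseract >/dev/null 2>&1 &&
       tesseract --list-langs 2>&1 | grep -qx "$ocr_lang"; then
        echo "tesseract: '$ocr_lang' data installed"
    else
        echo "tesseract: '$ocr_lang' data missing"
    fi

    # Assumption: aspell is the spellchecker; other backends are not checked.
    if command -v aspell >/dev/null 2>&1 &&
       aspell dicts 2>/dev/null | grep -qx "$dict_lang"; then
        echo "aspell: '$dict_lang' dictionary installed"
    else
        echo "aspell: '$dict_lang' dictionary missing"
    fi
}

# Example call: check_ocr_lang fra fr
```

Note that the two tools name languages differently: Tesseract uses three-letter codes such as `fra`, while aspell dictionaries use codes such as `fr`, which is why the helper takes both.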

Debian

# OCR (Tesseract)
$ sudo apt-get install tesseract-ocr tesseract-ocr-<lang>

# Spell checking (myspell)
$ sudo apt-get install myspell myspell-<lang>

Fedora

# OCR (Tesseract)
$ sudo yum install tesseract tesseract-langpack-<lang>

# Spell checking (aspell)
$ sudo yum install aspell aspell-<lang>

Ubuntu

# OCR (Tesseract)
$ sudo apt-get install tesseract-ocr tesseract-ocr-<lang>

# Spell checking (myspell)
$ sudo apt-get install myspell myspell-<lang>
