Tesseract language file for SuperMarket receipts :)
Ubuntu
You will have to build the latest version of tesseract from source (version 3), since the aptitude repositories hold an old version (v2.04).
- Download tesseract-3.01.tar.gz from http://code.google.com/p/tesseract-ocr/downloads/list
- Unpack it:
tar -zvxf tesseract-3.01.tar.gz
- Before proceeding to the next step, make sure you have the following packages:
sudo apt-get install autoconf automake libtool
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install zlib1g-dev
sudo apt-get install libleptonica-dev
- Compile tesseract:
./autogen.sh
./configure
NOTE: If you get an error about leptonica, do the following:
sudo apt-get remove libleptonica-dev
wget http://www.leptonica.com/source/leptonica-1.68.tar.gz
- After installing leptonica, try running ./configure for tesseract again.
make
sudo make install
sudo ldconfig
- Download the English language file from http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.01.eng.tar.gz and unpack it into /usr/local/share/tessdata
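A quick way to confirm the language data ended up where tesseract looks for it is to check for the eng.traineddata file, which is the naming tesseract 3 uses for language data. The sketch below assumes the file should sit directly inside the tessdata directory; has_lang is a hypothetical helper name, not part of tesseract.

```shell
# Hypothetical helper (not part of tesseract): succeed if the trained
# data file for a language exists in the given tessdata directory.
has_lang() {
  tessdata_dir="$1"   # e.g. /usr/local/share/tessdata
  lang="$2"           # e.g. eng
  [ -f "$tessdata_dir/$lang.traineddata" ]
}

# Report whether the English data is in place.
if has_lang /usr/local/share/tessdata eng; then
  echo "eng language data found"
else
  echo "eng language data missing"
fi
```

If the check reports the data as missing, the tarball may have unpacked into a subdirectory, in which case moving eng.traineddata up into tessdata should be enough.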
Mac OS X
Oddly enough, Homebrew hosts a newer version of tesseract (v3.0), which is good enough for us.
brew install tesseract
The workflow is divided into two phases; the first is manual and the second is automated:
- Phase 1
  - Create an image of your receipt
  - Convert the image to a b&w tif file
  - Save the image under the receipts directory in its proper place
- Phase 2
  - Generate a box file by running:
    node create_box_file.js tif_file_name
  - Fix the boxing results
  - Run:
    coffee runner_command.coffee [receipts dir path]
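The internals of create_box_file.js are not shown here. For orientation, tesseract 3 itself can emit a box file when run with its batch.nochop makebox config, and the sketch below only prints that command for a given TIFF. That the script wraps something similar is an assumption, and box_cmd is a hypothetical name.

```shell
# Hypothetical helper: print the tesseract command that would write
# <base>.box next to the given <base>.tif. It only prints the command
# for inspection (no quoting of unusual file names); it does not run it.
box_cmd() {
  tif="$1"
  base="${tif%.tif}"   # strip the .tif extension to get the output base
  printf 'tesseract %s %s batch.nochop makebox\n' "$tif" "$base"
}

box_cmd receipt.tif
# prints: tesseract receipt.tif receipt batch.nochop makebox
```

Running the printed command by hand on one receipt is a reasonable way to see what a raw box file looks like before fixing the boxing results.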