Tesseract as a microservice for use with API-X.
- Install
tesseract. On Ubuntu, this can be done withsudo apt-get install tesseract-ocr. If you want to install extra languages, they are available as separate packages in Ubuntu. You can use apt's autocomplete to get a quick list of them. - Install
pdftotext. - Clone this repository somewhere in your web root.
- Install
composer. Install instructions here. $ cd /path/to/Hypercubeand run$ composer install- For production, configure your web server appropriately (e.g. add a VirtualHost for Hypercube in Apache).
To use Hypercube with Apache you need to configure your Virtualhost with a few options:
- Redirect all requests to the Hypercube index.php file
- Make sure Hypercube has access to Authorization headers
Here is an example configuration for Apache 2.4:
Alias "/hypercube" "/path/to/Crayfish/Hypercube/public"
<Directory "/path/to/Crayfish/Hypercube/public">
FallbackResource /hypercube/index.php
Require all granted
DirectoryIndex index.php
SetEnvIf Authorization "(.*)" HTTP_AUTHORIZATION=$1
</Directory>This will put the Hypercube at the /hypercube endpoint on the web server.
Steps for upgrading Hypercube can be found in UPGRADE.md
Symfony uses .dotenv to set environment variables. You can check the .env in the root of the Hypercube directory.
To alter any settings, create a file called .env.local to store your specific changes. You can also set an actual environment
variable.
For production use make sure to set the add APP_ENV=prod environment variable.
If your tesseract installation is not on your path, then you can configure Hypercube to use a specific executable by editing
the app.tesseract_executable parameter in /path/to/Hypercube/config/services.yaml.
If your pdftotext installation is not on your path, then you can configure Hypercube to use a specific executable by editing
the app.pdftotext_executable parameter in /path/to/Hypercube/config/services.yaml.
You also need to set your Fedora Base Url to allow the Fedora Resource to be pulled in automatically. This is done in the
/path/to/Hypercube/config/packages/crayfish_commons.yaml.
In order to work on larger images, be sure post_max_size is sufficiently large and max_execution_time is set to 0 in your PHP
installation's ini file. You can determine which ini file is getting used by running the command $ php --ini.
To change your log settings, edit the /path/to/Hypercube/config/packages/monolog.yaml file.
You can also copy the file into one of the /path/to/Hypercube/config/packages/<environment> directories.
Where <environment> is dev, test, or prod based on the APP_ENV variable (see above). The files in the specific
environment directory will take precedence over those in the /path/to/Hypercube/config/packages directory.
The location specified in the configuration file for the log must be writable by the web server.
There are instructions in the /path/to/Hypercube/config/packages/security.yaml file describing what to change and what lines
to comment out to enable authentication.
We use the Lexik JWT Authentication Bundle for Symfony, more information here https://github.com/lexik/LexikJWTAuthenticationBundle
Hypercube is meant for use with API-X. It accepts only accepts one request, a GET with the URI of a Fedora resource in the Apix-Ldp-Resource header..
For example, suppose if you have a TIFF in Fedora at http://localhost:8080/fcrepo/rest/foo/bar. If running the PHP built-in server command described in the Installation section:
$ curl -H "Authorization: Bearer blabhlahblah" -H "Apix-Ldp-Resource: http://localhost:8080/fcrepo/rest/foo/bar" "http://localhost:8888"
This will return the OCR generated from the TIFF in Fedora. Additional arguments to tesseract can be provided using the X-Islandora-Args header. For example, to change the page layout:
$ curl -H "Authorization: Bearer blabhlahblah" -H "Apix-Ldp-Resource: http://localhost:8080/fcrepo/rest/foo/bar" -H "X-Islandora-Args: -psm 9" "http://localhost:8888"
But you're probably going to use Hypercube through API-X, which exposes it as svc:ocr. Assuming your API-X proxy is on port 8081, you can access the service with
$ curl -H "Authorization: Bearer blabhlahblah" "http://localhost:8081/services/foo/bar/svc:ocr"
Current maintainers:
If you would like to contribute, please get involved by attending our weekly Tech Call. We love to hear from you!
If you would like to contribute code to the project, you need to be covered by an Islandora Foundation Contributor License Agreement or Corporate Contributor License Agreement. Please see the Contributors pages on Islandora.ca for more information.
