A prototype application to support TGV bid
The project consists of several parts:
- a set of utilities to get OCR data and related metadata from a variety of holding libraries (in fetcher/)
- an instance of the Typesense instant search service (docker/Dockerfile.typesense)
- a single entry point that retrieves data from remote resources, gathers it into a large JSONL file, and inserts it into Typesense (fetcher/fetcher.py)
- a demo frontend against the application API (frontend/)
It is supported by a number of utility functions (fetcher/utils.py).
Most functionality in the Python scripts can be used either as a module or from the command-line.
It is accompanied by files which specify a containerised application, split over ./compose.yml and the Dockerfiles in docker/.
Supported sources
- anno.onb.ac.at (anno.py)
- api.digitale-sammlungen.de (mdz.py)
- iiif.onb.ac.at/ABO (abo.py)
We also support gathering digitale-sammlungen.de item IDs from the BSB calendar pages (e.g. https://digipress.digitale-sammlungen.de/calendar/newspaper/bsbmult00000129). This functionality is in bsb.py and produces item IDs that can be used with the mdz.py script. It is currently not working correctly.
fetcher/fetcher.py takes the .txt files produced by each of these retrievers and produces a single newline-delimited JSON (JSONL) file, which can then be loaded into the search backend.
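The merge step described above can be sketched as follows. This is a minimal illustration, not the code in fetcher/fetcher.py: the function name build_jsonl and the record fields id, source and text are assumptions, as is the idea that each retriever writes its .txt files into its own subdirectory.

```python
import json
from pathlib import Path


def build_jsonl(txt_dir: Path, out_path: Path) -> int:
    """Write one JSON object per .txt file under txt_dir; return the count."""
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for txt_file in sorted(txt_dir.rglob("*.txt")):
            record = {
                "id": txt_file.stem,             # hypothetical field: file name without suffix
                "source": txt_file.parent.name,  # hypothetical field: retriever directory name
                "text": txt_file.read_text(encoding="utf-8"),
            }
            # json.dumps escapes embedded newlines, so each record stays on one line
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling build_jsonl(Path("data"), Path("data/all.jsonl")) would then yield a file with one document per line, which is the shape Typesense expects for bulk imports.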
We used requests_cache during development to help reduce the number of requests to remote servers.
fetcher/fetcher.py supports a number of command-line flags, which can be used to skip key steps in the data ingestion process.
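As a rough illustration of how such skip flags could be wired up with argparse: the flag names --skip-fetch, --skip-merge and --skip-insert below are hypothetical and may not match the flags that fetcher/fetcher.py actually defines (run it with --help to see the real ones, assuming it uses a standard argument parser).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Fetch, merge and index OCR data")
    # Hypothetical flag names, for illustration only.
    parser.add_argument("--skip-fetch", action="store_true",
                        help="do not contact the remote sources")
    parser.add_argument("--skip-merge", action="store_true",
                        help="do not rebuild the JSONL file")
    parser.add_argument("--skip-insert", action="store_true",
                        help="do not load documents into Typesense")
    return parser


def planned_steps(args: argparse.Namespace) -> list[str]:
    """Return the ingestion steps that remain after applying the skip flags."""
    steps = []
    if not args.skip_fetch:
        steps.append("fetch")
    if not args.skip_merge:
        steps.append("merge")
    if not args.skip_insert:
        steps.append("insert")
    return steps
```

With this layout, passing only --skip-fetch would reuse previously downloaded files and still rebuild and re-index the JSONL file.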
Requires a working installation of Typesense. A compose.yml to use with e.g. Docker Compose is provided for convenience.
insert.py deletes any existing Typesense collection, creates a new one, and inserts the documents from the JSON file into the search backend.
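The delete, create, insert sequence maps onto three calls to the Typesense HTTP API. The sketch below only builds the requests with the standard library so the sequence is visible; it is not the code in insert.py, and the function name reset_requests, the collection name and the schema are assumptions.

```python
import json
from urllib.request import Request


def reset_requests(base_url: str, api_key: str, schema: dict, jsonl: str) -> list[Request]:
    """Build the three requests that drop, recreate and refill a collection."""
    auth = {"X-TYPESENSE-API-KEY": api_key}
    name = schema["name"]
    return [
        # 1. Drop the existing collection.
        Request(f"{base_url}/collections/{name}", method="DELETE", headers=auth),
        # 2. Create it again from the schema.
        Request(f"{base_url}/collections", method="POST",
                data=json.dumps(schema).encode("utf-8"),
                headers={**auth, "Content-Type": "application/json"}),
        # 3. Bulk-import the newline-delimited documents.
        Request(f"{base_url}/collections/{name}/documents/import?action=create",
                method="POST", data=jsonl.encode("utf-8"),
                headers={**auth, "Content-Type": "text/plain"}),
    ]


# Hypothetical schema, for illustration only.
EXAMPLE_SCHEMA = {"name": "pages", "fields": [{"name": "text", "type": "string"}]}
```

Each request could then be sent with urllib.request.urlopen. On a first run the DELETE would be expected to fail with a 404 because no collection exists yet, so a real script would need to tolerate that error.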
I include a very simple demonstration (thanks to Copilot for Business) of how Typesense integration might look on the frontend. We certainly want to use snippets/highlighted "hits", which Typesense supports.
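To give an idea of what using highlighted hits involves, here is a small sketch that pulls snippets out of a search response. It assumes each hit carries a highlights list whose entries have field and snippet keys, which is how the Typesense search documentation describes the response; the helper name extract_snippets and the sample response are made up for illustration.

```python
def extract_snippets(response: dict, field: str = "text") -> list[str]:
    """Collect the highlighted snippets for one field from a search response."""
    snippets = []
    for hit in response.get("hits", []):
        for highlight in hit.get("highlights", []):
            if highlight.get("field") == field and "snippet" in highlight:
                snippets.append(highlight["snippet"])
    return snippets


# Hand-written sample in the assumed response shape, not real search output.
SAMPLE_RESPONSE = {
    "found": 1,
    "hits": [
        {
            "document": {"id": "a1", "text": "Die Zeitung von gestern"},
            "highlights": [
                {"field": "text", "snippet": "Die <mark>Zeitung</mark> von gestern"},
            ],
        }
    ],
}
```

The snippets contain <mark> tags around the matched tokens by default, so a frontend would need to render them as HTML rather than as escaped text.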
- Create a .env file (see samples below)
- Run docker-compose up from the root of the project directory
- Visit application at localhost:8100
Some useful commands
docker-compose build --no-cache [SERVICE_NAME] to rebuild images from scratch
It is important to keep track of when environment variables are being "injected" into the application. Sometimes it is done during the image build and other times it is done during runtime.
The frontend application will be available at localhost:8100
TYPESENSE_API_KEY=replace-me
TYPESENSE_HOST=localhost
TYPESENSE_PORT=8100
TYPESENSE_UPSTREAM_HOST=typesense
TYPESENSE_PROTOCOL=http
TYPESENSE_PATH=/api
TYPESENSE_FETCHER_HOST=nginx
TYPESENSE_FETCHER_PORT=80
TYPESENSE_FETCHER_PROTOCOL=http
TYPESENSE_FETCHER_PATH=/api
There are three services defined in compose.yml:
- nginx
- python-fetcher
- typesense
nginx hosts the frontend and reverse proxies /api to / in typesense.
The frontend served by nginx needs to have an API key to authenticate requests to typesense (which also needs to know this API key when launched). This is all arranged in the relevant Dockerfiles, and uses key-value pairs set in .env (which is not checked into source control).
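The key-value pairs in .env combine into a base URL for the Typesense API. The following sketch shows one way to assemble it from the environment; the helper name typesense_base_url is an assumption, and the project's own scripts and Dockerfiles may read these variables differently.

```python
import os
from collections.abc import Mapping


def typesense_base_url(env: Mapping[str, str] = os.environ,
                       prefix: str = "TYPESENSE_") -> str:
    """Assemble protocol://host:port/path from variables sharing one prefix."""
    protocol = env.get(f"{prefix}PROTOCOL", "http")
    host = env[f"{prefix}HOST"]  # required: raises KeyError if missing
    port = env.get(f"{prefix}PORT", "")
    path = env.get(f"{prefix}PATH", "")
    netloc = f"{host}:{port}" if port else host
    return f"{protocol}://{netloc}{path}".rstrip("/")
```

With the sample values from .env, the default prefix gives http://localhost:8100/api (the address used from the browser), while prefix="TYPESENSE_FETCHER_" gives http://nginx:80/api (the address used inside the Docker network).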
typesense is the database and is available over HTTP within the docker network alpha (this network name is largely irrelevant for now).
python-fetcher retrieves the data, post-processes it and inserts it into the running Typesense instance. It deletes any existing collections before doing so.
There are some things to note before deployment to AHW given the "sidecar" pattern:
- fetcher/fetcher.py assumes that the database (Typesense) is available at nginx (hard-coded for the moment)
- frontend assumes that the database API is available at ${TYPESENSE_HOST}:${TYPESENSE_PORT} under ${TYPESENSE_PATH} (typically /api)
- nginx assumes that typesense makes its HTTP API available at ${TYPESENSE_UPSTREAM_HOST} on port 8081, which is referenced in nginx.conf
  - This implies that TYPESENSE_UPSTREAM_HOST needs changing when deploying to AHW
- This implies that