tgv-prototype

A prototype application to support TGV bid

Overview

The project consists of several parts

a set of utilities to get OCR data and related metadata from a variety of holding libraries (in fetcher/)
an instance of the Typesense instant search service (docker/Dockerfile.typesense)
a single entry point that retrieves data from remote resources, gathers it into a large JSONL file, and inserts it into Typesense (fetcher/fetcher.py)
a demo frontend against the application API (frontend/)

It is supported by a number of utility functions (fetcher/utils.py)

Most functionality in the Python scripts can be used either as a module or from the command-line.

It is accompanied by files which specify a containerised application, split over ./compose.yml and the Dockerfiles in docker/.

Data and metadata retrieval

Supported sources

anno.onb.ac.at (anno.py)
api.digitale-sammlungen.de (mdz.py)
iiif.onb.ac.at/ABO (abo.py)

We also support gathering digitale-sammlungen.de item IDs from the BSB calendar pages (e.g. https://digipress.digitale-sammlungen.de/calendar/newspaper/bsbmult00000129). This functionality is in bsb.py and produces item IDs that can be used with the mdz.py script. It is currently not working correctly.

fetcher/fetcher.py takes the .txt files produced by each of these retrievers and produces a single newline delimited JSON file, which can be then loaded into the search backend.

We used requests_cache during development to help reduce the number of requests to remote servers.

fetcher/fetcher.py supports a number of command-line flags, which can be used to skip key steps in the data ingestion process.

Search backend

Requires a working installation of Typesense. A compose.yml to use with e.g. Docker Compose is provided for convenience.

insert.py deletes any existing Typesense collection, creates a new one, and inserts the documents from the JSON file into the search backend.

Search frontend

I include a very simple demonstration (thanks to Copilot for Business) of how Typesense integration might look on the frontend. We certainly want to use snippets/highlighted "hits", which Typesense supports.

Usage

Install/Setup

Create a .env file (see samples below)
Run docker-compose up from the root of the project directory
Visit application at localhost:8100

Development

Some useful commands

docker-compose build --no-cache [SERVICE_NAME] to rebuild images from scratch

It is important to keep track of when environment variables are being "injected" into the application. Sometimes it is done during the image build and other times it is done during runtime.

`.env` file for local testing

The frontend application will be available at localhost:8100

TYPESENSE_API_KEY=replace-me
TYPESENSE_HOST=localhost
TYPESENSE_PORT=8100
TYPESENSE_UPSTREAM_HOST=typesense
TYPESENSE_PROTOCOL=http
TYPESENSE_PATH=/api
TYPESENSE_FETCHER_HOST=nginx
TYPESENSE_FETCHER_PORT=80
TYPESENSE_FETCHER_PROTOCOL=http
TYPESENSE_FETCHER_PATH=/api

Design

There are three services defined in compose.yml:

nginx
python-fetcher
typesense

nginx hosts the frontend and reverse proxies /api to / in typesense.

The frontend served by nginx needs to have an API key to authenticate requests to typesense (which also needs to know this API key when launched). This is all arranged in the relevant Dockerfiles, and uses key-value pairs set in .env (which is not checked into source control).

typesense is the database and is available over HTTP within the docker network alpha (this network name is largely irrelevant for now).

python-fetcher retrieves the data, post-processes it and inserts it into the running Typesense instance. It deletes any existing collections before doing so.

Notes for deployment to AWH

There are some things to note before deployment to AHW given the "sidecar" pattern:

fetcher/fetcher.py assumes that the database (Typesense) is available at nginx (hard-coded for the moment)
frontend assumes that the database API is available at ${TYPESENSE_HOST}:${TYPESENSE_PORT} under ${TYPESENSE_PATH} (typically /api)
nginx assumes that typesense makes its HTTP API available at on $TYPESENSE_UPSTREAM_HOST} at port 8081, which is referenced in nginx.conf
- This implies that TYPESENSE_UPSTREAM_HOST needs changing when deploying to AHW

Name		Name	Last commit message	Last commit date
Latest commit History 118 Commits
.github		.github
config		config
docker		docker
fetcher		fetcher
frontend		frontend
scripts		scripts
.env		.env
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
compose.yml		compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

tgv-prototype

Overview

Data and metadata retrieval

Search backend

Search frontend

Usage

Install/Setup

Development

`.env` file for local testing

Design

Notes for deployment to AWH

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 4

Uh oh!

Languages

License

DurhamARC/tgv-prototype

Folders and files

Latest commit

History

Repository files navigation

tgv-prototype

Overview

Data and metadata retrieval

Search backend

Search frontend

Usage

Install/Setup

Development

.env file for local testing

Design

Notes for deployment to AWH

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

`.env` file for local testing

Packages