Add mergepdf microservice [minor] #501
Merged
Changes from 7 commits
Commits
17 commits
1a82ec8 Build alpaca JAR (joecorall)
f0fd796 checkout https://github.com/Islandora/Alpaca/pull/94 (joecorall)
8b1d879 fix build (joecorall)
7501bca Use image with git installed (joecorall)
9e8694b Add merge PDF (joecorall)
62b8bff update README (joecorall)
05522cf Update README (joecorall)
4779f7a term_from_term_name route is protected by auth (joecorall)
d3f9882 Update README with jwt_auth requirement (joecorall)
999029f Merge branch 'main' into alpaca-94 (seth-shaw-asu)
3b1177e Update push.yml (joecorall)
5d61f0d Enable MERGEPDF in Dockerfile (joecorall)
06f94e7 Merge branch 'main' into alpaca-94 (joecorall)
1c04fa8 Added URL decoding for '%' characters in URLs. (joecorall)
22d0647 Make the drupal URI configurable (joecorall)
3758401 standardize alpaca install with scyllaridae pattern (joecorall)
8136ada use base to access download.sh (joecorall)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| build.gradle.kts | ||
| README.md | ||
| tests | ||
| tests/**/* |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| FROM imagemagick | ||
| FROM leptonica | ||
| FROM scyllaridae | ||
|
|
||
| ARG TARGETARCH | ||
|
|
||
| EXPOSE 8080 | ||
|
|
||
| WORKDIR /app | ||
|
|
||
| ARG \ | ||
| # renovate: datasource=repology depName=alpine_3_22/tesseract-ocr | ||
| TESSERACT_VERSION=5.5.0-r2 \ | ||
| # renovate: datasource=repology depName=alpine_3_22/ghostscript | ||
| GHOSTSCRIPT_VERSION=10.05.1-r0 \ | ||
| # renovate: datasource=repology depName=alpine_3_22/poppler-utils | ||
| POPPLER_VERSION=25.04.0-r0 | ||
|
|
||
| # hadolint ignore=DL3018 | ||
| RUN --mount=type=bind,from=imagemagick,source=/packages,target=/packages \ | ||
| --mount=type=bind,from=imagemagick,source=/etc/apk/keys,target=/etc/apk/keys \ | ||
| apk add --no-cache /packages/imagemagick-*.apk | ||
|
|
||
| RUN --mount=type=bind,from=leptonica,source=/packages,target=/packages \ | ||
| --mount=type=bind,from=leptonica,source=/etc/apk/keys,target=/etc/apk/keys \ | ||
| apk update && \ | ||
| apk add --no-cache \ | ||
| /packages/leptonica-*.apk \ | ||
| ghostscript=="${GHOSTSCRIPT_VERSION}" \ | ||
| tesseract-ocr=="${TESSERACT_VERSION}" \ | ||
| tesseract-ocr-data-eng=="${TESSERACT_VERSION}" \ | ||
| tesseract-ocr-data-fra=="${TESSERACT_VERSION}" \ | ||
| tesseract-ocr-data-spa=="${TESSERACT_VERSION}" \ | ||
| tesseract-ocr-data-ita=="${TESSERACT_VERSION}" \ | ||
| tesseract-ocr-data-por=="${TESSERACT_VERSION}" \ | ||
| tesseract-ocr-data-hin=="${TESSERACT_VERSION}" \ | ||
| tesseract-ocr-data-deu=="${TESSERACT_VERSION}" \ | ||
| tesseract-ocr-data-jpn=="${TESSERACT_VERSION}" \ | ||
| tesseract-ocr-data-rus=="${TESSERACT_VERSION}" \ | ||
| poppler-utils=="${POPPLER_VERSION}" | ||
|
|
||
| ENV \ | ||
| MAX_THREADS=5 | ||
|
|
||
| COPY --link rootfs / |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| # Merge PDF | ||
|
|
||
| Docker image for mergepdf. Aggregate IIIF manifests for books/paged-content into a PDF. | ||
|
|
||
| Built from [Islandora-DevOps/isle-buildkit mergepdf](https://github.com/Islandora-DevOps/isle-buildkit/tree/main/mergepdf) | ||
|
|
||
| ## Dependencies | ||
|
|
||
| Requires the `islandora/scyllaridae` Docker image to build. Refer to the | ||
| [Scyllaridae Image README](../scyllaridae/README.md) for further details, including | ||
| additional settings, volumes, and ports. | ||
|
|
||
| ### IIIF Manifest | ||
|
|
||
| The Drupal site must expose a route at `/node/{node}/book-manifest`. The Islandora Starter Site installs this View by default through its [views.view.iiif_manifest.yml](https://github.com/Islandora-Devops/islandora-starter-site/blob/main/config/sync/views.view.iiif_manifest.yml) config. | ||
|
|
||
| ### Taxonomy Term Name to TID | ||
|
|
||
| The Drupal site must expose a route at `/term_from_term_name`. The Islandora Starter Site installs this View by default through its [views.view.term_from_term_name.yml](https://github.com/Islandora-Devops/islandora-starter-site/blob/main/config/sync/views.view.term_from_term_name.yml) config. | ||
|
|
||
| ## Settings | ||
|
|
||
| | Environment Variable | Default | Description | | ||
| | :------------------- | :-------------------------------------------------------- | :---------------------------------------------------------------------- | | ||
| | MAX_THREADS | 5 | How many images to download at once from a IIIF manifest | |
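
For orientation, here is a rough sketch of how the two routes named in the README relate to the node URL the service receives, and how an image URL from the manifest is turned back into a source file URL. It mirrors the `dirname`, `awk`, and `sed` handling in the script added by this PR, but the host `islandora.example`, the node ID `123`, the sample IIIF URL shape, and the helper names `manifest_url`, `term_lookup_url`, and `iiif_identifier_to_url` are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: hosts, IDs, and helper names are assumptions.

# Manifest route for a node URL such as https://host/node/123.
manifest_url() {
  printf '%s/book-manifest\n' "$1"
}

# Term lookup route. It sits at the site root, two path segments
# above /node/<nid>, hence the nested dirname calls.
term_lookup_url() {
  local base_url
  base_url=$(dirname "$(dirname "$1")")
  printf '%s/term_from_term_name?vocab=islandora_media_use&name=Original+File&_format=json\n' "$base_url"
}

# Read IIIF image URLs on stdin and print the source URL carried in the
# seventh "/"-separated field, decoding only %2F and %3A as the script does.
iiif_identifier_to_url() {
  awk -F '/' '{print $7}' | sed -e 's/%2F/\//g' -e 's/%3A/:/g'
}

manifest_url 'https://islandora.example/node/123'
term_lookup_url 'https://islandora.example/node/123'
printf '%s\n' 'https://islandora.example/cantaloupe/iiif/2/https%3A%2F%2Fislandora.example%2Ffile.jp2/full/full/0/default.jpg' \
  | iiif_identifier_to_url
```

The field index `7` only holds when the IIIF image URL has exactly three path segments before the identifier, as in the sample above; a deployment with a different IIIF prefix would need a different index.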
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,103 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| set -eou pipefail | ||
|
|
||
| URL="$1/book-manifest" | ||
| TMP_DIR=$(mktemp -d) | ||
| I=0 | ||
| MAX_THREADS=${MAX_THREADS:-5} | ||
| PIDS=() | ||
| RETRIES=3 | ||
|
|
||
| cleanup() { | ||
| rm -rf "$TMP_DIR" | ||
| } | ||
|
|
||
| trap cleanup EXIT | ||
|
|
||
| # Function to download and process the image with retries | ||
| download_and_process() { | ||
| local url="$1" | ||
| local output_file="$2" | ||
| local attempt=0 | ||
|
|
||
| while (( attempt < RETRIES )); do | ||
| if curl -s "$url" | magick - -resize 1000x\> "$output_file" > /dev/null 2>&1; then | ||
| return 0 | ||
| fi | ||
| attempt=$(( attempt + 1 )) | ||
| echo "Retrying ($attempt/$RETRIES) for $url..." | ||
| sleep 1 | ||
| done | ||
|
|
||
| echo "Failed to process $url after $RETRIES attempts." >&2 | ||
| return 1 | ||
| } | ||
|
|
||
| # Iterate over all images in the IIIF manifest | ||
| URLS=$(curl -sf "$URL" | jq -r '.sequences[0].canvases[].images[0].resource."@id"' | awk -F '/' '{print $7}' | sed -e 's/%2F/\//g' -e 's/%3A/:/g') | ||
| while read -r URL; do | ||
| # If we have reached the max thread limit, wait for any one job to finish | ||
| if [ "${#PIDS[@]}" -ge "$MAX_THREADS" ]; then | ||
| wait -n | ||
| NEW_PIDS=() | ||
| for pid in "${PIDS[@]}"; do | ||
| if kill -0 "$pid" 2>/dev/null; then | ||
| NEW_PIDS+=("$pid") | ||
| fi | ||
| done | ||
| PIDS=("${NEW_PIDS[@]}") | ||
| fi | ||
|
|
||
| # Run each job in the background | ||
| ( | ||
| local_img="$TMP_DIR/img_$I.jpg" | ||
|
|
||
| # Download and resize the image with retry logic | ||
| if ! download_and_process "$URL" "$local_img"; then | ||
| exit 1 | ||
| fi | ||
|
|
||
| # Make an OCR'd PDF from the image | ||
| tesseract "$local_img" "$TMP_DIR/img_$I" pdf > /dev/null 2>&1 | ||
| rm "$local_img" | ||
| ) & | ||
| PIDS+=("$!") | ||
| I="$(( I + 1))" | ||
| done <<< "$URLS" | ||
|
|
||
| FILES=() | ||
| for index in $(seq 0 $((I - 1))); do | ||
| FILES+=("$TMP_DIR/img_${index}.pdf") | ||
| done | ||
|
|
||
| wait | ||
|
|
||
| # Make the node title the title of the PDF | ||
| TITLE=$(curl -L "$1?_format=json" | jq -r '.title[0].value' | sed 's/(/\\(/g; s/)/\\)/g') | ||
| echo "[ /Title ($TITLE)/DOCINFO pdfmark" > "$TMP_DIR/metadata.txt" | ||
|
|
||
| gs -dBATCH \ | ||
| -dNOPAUSE \ | ||
| -dQUIET \ | ||
| -sDEVICE=pdfwrite \ | ||
| -dPDFA \ | ||
| -dNOOUTERSAVE \ | ||
| -dAutoRotatePages=/None \ | ||
| -sOutputFile="$TMP_DIR/ocr.pdf" \ | ||
| "${FILES[@]}" \ | ||
| "$TMP_DIR/metadata.txt" | ||
|
|
||
| # Instead of printing the PDF | ||
| # PUT it to the endpoint | ||
| NID=$(basename "$1") | ||
| BASE_URL=$(dirname "$1" | xargs dirname) | ||
| TID=$(curl "$BASE_URL/term_from_term_name?vocab=islandora_media_use&name=Original+File&_format=json" | jq '.[0].tid[0].value') | ||
|
||
| curl \ | ||
| -H "Authorization: $SCYLLARIDAE_AUTH" \ | ||
| -H "Content-Type: application/pdf" \ | ||
| -H "Content-Location: private://derivatives/pc/pdf/$NID.pdf" \ | ||
| -T "$TMP_DIR/ocr.pdf" \ | ||
| "$1/media/document/$TID" | ||
|
|
||
| echo "OK" | ||
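
One point reviewers may want to keep in mind: in bash, a bare `wait` with no arguments returns 0 whatever the background jobs returned, so a page that fails to download or OCR late in the loop would only surface when `gs` cannot find its PDF. The sketch below shows one possible way to cap concurrency while still counting failures. It swaps the script's `wait -n` sliding window for a simpler batch-and-wait-by-PID approach, and `process_item`, `run_batched`, and `MAX_JOBS` are hypothetical names standing in for the per-page work and `MAX_THREADS`.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: a batch throttle, not the script's wait -n loop.

MAX_JOBS=2   # hypothetical stand-in for MAX_THREADS

# Hypothetical stand-in for the per-page download + OCR step.
# It fails for the item named "bad" so the failure path is exercised.
process_item() {
  sleep 0.1
  [ "$1" != "bad" ]
}

# Start jobs in batches of MAX_JOBS and wait for each PID individually,
# so every job's exit status is observed.
run_batched() {
  local failed=0 count=0 pids="" item pid
  for item in "$@"; do
    process_item "$item" &
    pids="$pids $!"
    count=$(( count + 1 ))
    if [ "$count" -ge "$MAX_JOBS" ]; then
      for pid in $pids; do
        wait "$pid" || failed=$(( failed + 1 ))
      done
      pids=""
      count=0
    fi
  done
  # Drain the final, possibly partial, batch.
  for pid in $pids; do
    wait "$pid" || failed=$(( failed + 1 ))
  done
  echo "failed=$failed"
}

run_batched page1 bad page3   # prints: failed=1
```

Batching is less efficient than a sliding window because a slow job holds up its whole batch, but `wait "$pid"` reports each job's status deterministically, which makes it easy to abort before the merge step when the count is non-zero.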