Releases: adsabs/ADSfulltext
Adds a ghostscript timeout to `scripts/extract_pdf_with_pdftotext.sh`
The script `extract_pdf_with_pdftotext.sh` falls back to ghostscript when it encounters a PDF that normal processing can't handle. In some pathological cases ghostscript never returns useful output, so the pipeline workers keep processing the file and the pipeline may spawn multiple copies of the script.
This release wraps the ghostscript call in a `timeout` command, so that if the PDF fails to process within 30 seconds, the workers receive a SIGINT back from the script.
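The release itself applies the limit inside the shell script with the `timeout` command. For readers calling an external tool from Python, the same idea can be sketched with the standard library; this is an illustrative analogue, not the code in ADSfulltext. The helper name `run_with_timeout` is hypothetical, and note one difference: on expiry `subprocess.run` kills the child process rather than delivering SIGINT.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds=30):
    """Run `cmd` (a list of arguments) and return its stdout as text.

    Returns None if the command does not finish within `seconds`.
    Hypothetical helper for illustration; not part of ADSfulltext.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,  # collect stdout/stderr instead of inheriting them
            text=True,            # decode output to str
            timeout=seconds,      # raise TimeoutExpired once the limit is exceeded
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child at this point,
        # so no stray process is left behind for the caller to clean up.
        return None
    return result.stdout
```

A caller could pass the ghostscript command line as `cmd` and treat a `None` result as "give up on this PDF" instead of retrying indefinitely.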
Also includes:
- bump `spacy==2.2.4`
- bump `pip==24.0` and `setuptools==57.0` in `.github/workflow`
Maintenance release: Update adsputils
What's Changed
- Update requirements.txt by @tjacovich in #146
New Contributors
- @tjacovich made their first contribution in #146
Full Changelog: v1.4.4...v1.4.6
Maintenance release
Add confirm publish variables for RabbitMQ communication.
Maintenance release: log pruning
#143 Pruned log messages during extraction
Pdftotext extraction script extended
Improves the `extract_pdf_with_pdftotext.sh` script so that it no longer gets stuck while processing some PDFs with vector graphics. Also includes updates in BeeHive (ghostscript added to the fulltext image, since the script needs it).
Maintenance release
#142 For XML extraction, change default to translate unprintable Unicode chars
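These notes do not show the translation map ADSfulltext actually uses. As a rough sketch of the general technique, unprintable characters can be replaced with `str.translate` and a table built from Unicode categories. The function name, the choice of the `Cc` (control) category, and the space replacement are all assumptions made for illustration.

```python
import sys
import unicodedata


def build_unprintable_map(replacement=" "):
    """Build a str.translate table for control characters.

    Every code point in Unicode category "Cc" is mapped to `replacement`,
    except tab, newline and carriage return, which are left untouched.
    Hypothetical helper; the real ADSfulltext map may differ.
    """
    keep = {"\t", "\n", "\r"}
    table = {}
    for code_point in range(sys.maxunicode + 1):
        char = chr(code_point)
        if unicodedata.category(char) == "Cc" and char not in keep:
            table[code_point] = replacement
    return table
```

The table only needs to be built once and can then be reused, for example `text.translate(table)` on each extracted body.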
Bug fix: Unicode translation map
#140 Fixed Unicode translation bug, updated list of translated characters
Python 3
No new code; this release is for deploying fulltext with Python 3 instead of Python 2.
Maintenance release
#138 Add error handling to extract method
Fix Wiley XML parsing
#134 Fix for Wiley body extraction