Description
Continuation of #10, in a sense: different culprit, same pack of background tasks.
It now turns out that the old pdfdraw -tt (see also #34: this bugger has to go) locks up forever at max CPU on spurious / egregious PDFs. (🎅 isn't English language fun 🎅 ho ho ho! 🤡 )
That's the text extraction background process going b0rk b0rk b0rk on you. No way out but hard "kill process" for each of these.
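Until the old tool is replaced, the manual "kill process" step could in principle be automated with a watchdog around the extractor. The following is only a minimal sketch of that idea in Python, not Qiqqa's actual code: the function name run_with_timeout and the timeout values are hypothetical, and the demo deliberately runs a busy-looping Python child rather than a real pdfdraw invocation.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and hard-kill it if it has not finished after timeout_s seconds.

    Returns (finished, returncode, stdout_bytes). This helper is illustrative
    only; it is not part of Qiqqa or MuPDF.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return True, proc.returncode, out
    except subprocess.TimeoutExpired:
        # The child is stuck (e.g. spinning at max CPU): kill it and reap it.
        proc.kill()
        proc.communicate()
        return False, None, b""


if __name__ == "__main__":
    # Stand-in for a stuck extractor: a child process that never terminates.
    stuck = [sys.executable, "-c", "while True: pass"]
    finished, _, _ = run_with_timeout(stuck, 1)
    print("finished" if finished else "killed after timeout")
```

A document whose extraction gets killed this way could then be flagged and skipped instead of being retried forever; how that would tie into the background task queue is left open here.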
Targeted fix
Upgrade/migrate to the latest MuPDF mudraw with hOCR or JSON STEXT output. The old pdfdraw that comes with current Qiqqa installs is an antique patched MuPDF tool (#34 + #35), and a lot has changed since then, including the relevant output format for extracted text.
As I intend to support more document types (via the hOCR/HTML fundamental format), Qiqqa should grok the new pdfdraw -o *.ocr.html or similar output.
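To give an idea of what "grokking" hOCR-style output involves, here is a rough sketch in Python that pulls words and their bounding boxes out of such a file. It assumes the markup follows the hOCR convention of span elements with class ocrx_word and a title attribute containing "bbox x0 y0 x1 y1"; whether the new MuPDF output matches that exactly has not been checked here. The names parse_bbox, HocrWordParser and extract_words are made up for this example.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            try:
                return tuple(int(p) for p in parts[1:])
            except ValueError:
                return None
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every element with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._tag = None
        self._depth = 0
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._in_word:
            # Only count nested elements with the same tag name, so the
            # matching end tag of the word element can be recognised.
            if tag == self._tag:
                self._depth += 1
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            self._in_word = True
            self._tag = tag
            self._depth = 0
            self._bbox = parse_bbox(attr.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if not self._in_word or tag != self._tag:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._in_word = False


def extract_words(hocr_html):
    """Parse an hOCR-style HTML string and return a list of (text, bbox)."""
    parser = HocrWordParser()
    parser.feed(hocr_html)
    parser.close()
    return parser.words
```

Only the standard library is used, so the sketch stays independent of any particular HTML or OCR package; a real importer would also need the page and line structure, which is ignored here.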
Also keep in mind the migration from the antique (obsoleted) LuceneNET version to SOLR / ElasticSearch: that's #23 + #298 + Technology areas and their function in Qiqqa + Towards migrating the PDF viewer / renderer / text extractor